Benchmarking

The run matrix, the two workloads, and how every number on the home page is produced.

Harness

The primary runner is benchmarks/run_benchmark.py. It builds a list of named configurations (the run matrix), starts the FastAPI service with each configuration, drives it with a fixed request count and concurrency level, and writes per-run JSON plus an aggregated summary.csv.

For local development the convenience wrapper benchmarks/run_local_benchmarks.py wires up the local Docker services (bolt://127.0.0.1:7687 for Neo4j and redis://127.0.0.1:6379/0 for Redis) with sensible defaults.
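The drive-and-measure loop can be pictured as follows. This is a simplified sketch, not the actual code in benchmarks/run_benchmark.py: it fires a fixed number of requests at a bounded concurrency against a caller-supplied request function and derives qps and P95 from the recorded wall-clock latencies.

```python
# Illustrative sketch of the harness's measurement loop (hypothetical;
# the real runner lives in benchmarks/run_benchmark.py).
import time
from concurrent.futures import ThreadPoolExecutor

def drive(request_fn, total_requests: int, concurrency: int) -> dict:
    start = time.perf_counter()

    def one_call(_):
        t0 = time.perf_counter()
        request_fn()                                  # one HTTP request in the real harness
        return (time.perf_counter() - t0) * 1000.0    # latency in ms

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(one_call, range(total_requests)))

    elapsed = time.perf_counter() - start
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"qps": total_requests / elapsed, "p95_ms": p95}

# Example with a stub request standing in for the FastAPI endpoint:
stats = drive(lambda: time.sleep(0.001), total_requests=50, concurrency=8)
```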

Running the matrix locally

```bash
# LDBC SNB SF1 run
python benchmarks/run_local_benchmarks.py --output-dir benchmark_results/local_run

# SSCA-inspired run
python benchmarks/ssca_workload.py --scale 10 --edge-factor 8 --clear-first
python benchmarks/run_local_benchmarks.py \
  --workload ssca \
  --ssca-scale 10 \
  --ssca-edge-factor 8 \
  --output-dir benchmark_results/ssca_run
```

The run matrix

The matrix is built by build_run_matrix(). It produces these 12 configurations, in this order:

| # | Name | What is enabled |
|---|------|-----------------|
| 1 | baseline | Nothing. Vanilla cache key + Neo4j on every miss. |
| 2 | isolated_jitter_plain_xfetch | Only JITTER_STAMPEDE, with TSPR_REFRESH_MODE=plain. |
| 3 | isolated_jitter_topology_sensitive_xfetch | Only JITTER_STAMPEDE, with TSPR_REFRESH_MODE=topology_sensitive. |
| 4 | isolated_jitter_stampede | Only JITTER_STAMPEDE, default mode. |
| 5 | isolated_frequency_aware | Only FREQUENCY_AWARE. |
| 6 | isolated_adaptive_prefetch | Only ADAPTIVE_PREFETCH. |
| 7 | isolated_overlapping_subqueries | Only OVERLAPPING_SUBQUERIES. |
| 8 | cumulative_jitter_stampede | JITTER_STAMPEDE on top of baseline. |
| 9 | cumulative_frequency_aware | + FREQUENCY_AWARE. |
| 10 | cumulative_adaptive_prefetch | + ADAPTIVE_PREFETCH. |
| 11 | cumulative_overlapping_subqueries | + OVERLAPPING_SUBQUERIES. |
| 12 | all_enabled | Every flag on (except EXTERNAL_BFS). |

EXTERNAL_BFS is forced off in every row because it regressed on the real dataset. See External BFS for the full explanation.

build_run_matrix()

```python
from copy import deepcopy

BENCHMARK_FLAGS = [
    "JITTER_STAMPEDE",
    "FREQUENCY_AWARE",
    "ADAPTIVE_PREFETCH",
    "OVERLAPPING_SUBQUERIES",
]
DISABLED_BENCHMARK_FLAGS = {"EXTERNAL_BFS": False}

def build_run_matrix() -> list[tuple[str, dict[str, bool | str]]]:
    runs = [("baseline", {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS}})]
    runs.extend([
        ("isolated_jitter_plain_xfetch", {"JITTER_STAMPEDE": True, "TSPR_REFRESH_MODE": "plain", ...}),
        ("isolated_jitter_topology_sensitive_xfetch", {"JITTER_STAMPEDE": True, "TSPR_REFRESH_MODE": "topology_sensitive", ...}),
    ])
    for flag in BENCHMARK_FLAGS:
        toggles = {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS}}
        toggles[flag] = True
        runs.append((f"isolated_{flag.lower()}", toggles))
    progressive = {f: False for f in BENCHMARK_FLAGS}
    for flag in BENCHMARK_FLAGS:
        progressive[flag] = True
        runs.append((f"cumulative_{flag.lower()}", {**DISABLED_BENCHMARK_FLAGS, **deepcopy(progressive)}))
    runs.append(("all_enabled", {**DISABLED_BENCHMARK_FLAGS, **{f: True for f in BENCHMARK_FLAGS}}))
    return runs
```
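Each toggle dict ultimately has to reach the FastAPI service as configuration. A minimal sketch of that hand-off, assuming the flags are passed as environment variables (the exact mechanism is an assumption for illustration; the real wiring is in the harness):

```python
# Hypothetical sketch: converting one row of the run matrix into an
# environment for the service process. Flag names come from the matrix;
# the "1"/"0" encoding is an assumption for illustration.
import os

def toggles_to_env(toggles: dict) -> dict:
    env = dict(os.environ)  # inherit the parent environment
    for key, value in toggles.items():
        if isinstance(value, bool):
            env[key] = "1" if value else "0"
        else:
            env[key] = str(value)  # e.g. TSPR_REFRESH_MODE="plain"
    return env

env = toggles_to_env({
    "JITTER_STAMPEDE": True,
    "TSPR_REFRESH_MODE": "plain",
    "EXTERNAL_BFS": False,
})
```

The resulting dict would be handed to something like subprocess.Popen(..., env=env) when the service is launched for that run.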

The two workloads

LDBC SNB Interactive v1 (SF1)

The primary benchmark uses the LDBC SNB Interactive v1 dataset at scale factor SF1, serialised as CsvMergeForeign with StringDateFormatter. To keep the import lightweight and the schema relevant to the middleware story, only Person nodes from dynamic/person_0_0.csv and KNOWS relationships from dynamic/person_knows_person_0_0.csv are imported. Substitution parameters come from the matching substitution_parameters-sf1 bundle.

The harness pre-fetches per-person degrees from Neo4j via fetch_person_degrees and injects them into request params. Without this enrichment, every query collapses to degree=1 and the topology-sensitive XFetch rule degenerates to plain XFetch.
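The enrichment step itself is simple. A sketch under assumed shapes (the real helper is fetch_person_degrees; the param names here are hypothetical): degrees are fetched once per run, then merged into each request's params so the refresh rule sees a real degree.

```python
# Illustrative degree enrichment. The Cypher below is an assumed query
# over the imported Person/KNOWS subset; param names are hypothetical.
DEGREE_QUERY = (
    "MATCH (p:Person)-[:KNOWS]-() "
    "RETURN p.id AS id, count(*) AS degree"
)

def enrich_params(params: dict, degrees: dict) -> dict:
    """Merge the pre-fetched degree into one request's params.

    Falls back to degree=1 (the behaviour described above) when the
    person is unknown, which is exactly the degenerate case the
    enrichment exists to avoid.
    """
    person_id = params.get("personId")
    return {**params, "degree": degrees.get(person_id, 1)}

enriched = enrich_params({"personId": 42}, {42: 17})
```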

Dataset caveat

Because only Person and KNOWS are imported, the published numbers describe a subset of the full SNB schema, not the full heterogeneous social graph. This is an explicit choice driven by the middleware's focus on path traversal, and it is documented at every reporting point.

SSCA-inspired synthetic workload

The secondary benchmark is generated by benchmarks/ssca_workload.py, which produces an R-MAT-like directed weighted graph and loads it into Neo4j as SSCANode and LINK. The companion module build_ssca_queries emits Cypher workloads analogous to the HPCS SSCA#2 kernels: heavy-edge frontier traversals, subgraph extraction, weighted reachability, and a centrality-proxy query.

The thesis evaluation uses --scale 10 --edge-factor 8, which produces a graph that is small enough to load locally but skewed enough to exercise the topology-sensitive optimizations.
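For intuition, an R-MAT-style sampler can be sketched in a few lines. This is an illustrative stand-in, not the generator in benchmarks/ssca_workload.py; the quadrant probabilities (a, b, c) are typical R-MAT defaults and are an assumption here.

```python
# Minimal R-MAT-style edge sampler (illustrative). With scale=10 and
# edge_factor=8 it emits 8 * 2**10 directed weighted edges over 2**10
# vertices, recursively biased toward one quadrant to produce skew.
import random

def rmat_edges(scale: int, edge_factor: int,
               a: float = 0.57, b: float = 0.19, c: float = 0.19,
               seed: int = 0):
    rng = random.Random(seed)
    n = 2 ** scale
    for _ in range(edge_factor * n):
        src = dst = 0
        for level in range(scale):
            bit = 2 ** (scale - level - 1)
            r = rng.random()
            if r < a:
                pass                       # top-left quadrant
            elif r < a + b:
                dst += bit                 # top-right
            elif r < a + b + c:
                src += bit                 # bottom-left
            else:
                src += bit; dst += bit     # bottom-right
        yield src, dst, rng.randint(1, 255)  # weighted edge

edges = list(rmat_edges(scale=10, edge_factor=8))
```

The repeated bias toward one quadrant is what yields the skewed degree distribution the topology-sensitive optimizations depend on.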

Metrics

Each run reports throughput (queries per second), latency percentiles including P95 in milliseconds, and cache hit rate. These are the figures recorded in each run's JSON output and aggregated into summary.csv.

Headline numbers (this repository)

Pulled directly from benchmark_results/sf1_matrix_canonical_overlap/summary.csv and benchmark_results/ssca_run/summary.csv:

| Run | LDBC qps | LDBC P95 (ms) | SSCA qps | SSCA P95 (ms) |
|-----|----------|---------------|----------|---------------|
| baseline | 56.77 | 242.40 | 48.76 | 167.22 |
| isolated_jitter_plain_xfetch | 50.90 | 260.29 | 255.28 | 13.51 |
| isolated_jitter_topology_sensitive_xfetch | 51.49 | 252.08 | 305.22 | 9.37 |
| isolated_overlapping_subqueries | 207.55 | 53.96 | 559.58 | 7.00 |
| all_enabled | 187.03 | 51.20 | 413.61 | 7.95 |

Chart generation

benchmarks/generate_ppt_charts.py turns a summary CSV into slide-ready SVG and PNG bar charts under three subdirectories:

Generating charts

```bash
python benchmarks/generate_ppt_charts.py \
  --summary-csv benchmark_results/sf1_matrix_canonical_overlap/summary.csv

python benchmarks/generate_cross_workload_charts.py \
  --ldbc-csv  benchmark_results/sf1_matrix_canonical_overlap/summary.csv \
  --ssca-csv  benchmark_results/ssca_run/summary.csv
```

How to read the numbers

Why "all_enabled" is sometimes lower than the best isolated run

On LDBC the isolated overlap run reaches 207.55 qps while all-enabled reaches 187.03 qps. The roughly 10% gap reflects the per-request overhead of the other modules: prefetcher fan-out, frequency-hash lookups, and background refresh tasks each cost a small amount of CPU and Redis traffic. For workloads that look like LDBC, an operator who cares only about throughput could disable the other three modules and run overlap-only.

Why hit rate alone does not predict P95

LDBC's baseline hit rate is 0.68 but its baseline P95 is 242 ms. SSCA's baseline hit rate is 0.36 but the all-enabled P95 collapses to 7.95 ms. The difference is miss cost: SSCA misses are larger frontiers, so each miss saved removes a bigger latency tail, even though there are more misses overall. Hit rate is a useful diagnostic but not a target metric.
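A back-of-envelope model makes the miss-cost argument concrete. Only the two hit rates (0.68 and 0.36) come from the measurements above; the per-hit and per-miss costs below are invented round numbers chosen purely to illustrate the direction of the effect.

```python
# Expected latency as a hit/miss mixture (illustrative numbers only;
# the hit rates are from the text, the costs are assumptions).
def mean_latency(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

# LDBC-like: high hit rate, but each miss is an expensive traversal.
ldbc_like = mean_latency(0.68, hit_ms=2.0, miss_ms=120.0)

# SSCA-like after optimization: lower hit rate, but misses are cheap
# because the biggest frontiers were the ones the cache absorbed.
ssca_like = mean_latency(0.36, hit_ms=2.0, miss_ms=10.0)
```

Even with nearly twice the miss rate, the SSCA-like mixture ends up far cheaper, which is why hit rate on its own is a poor proxy for P95.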

Why SSCA gains more

Three properties amplify the runtime layer's impact on SSCA:

  1. Path shapes recur, so canonical-signature reuse fires often.
  2. Degree distribution is skewed, so topology-sensitive XFetch has a real signal.
  3. Kernel sweeps are predictable, so the adaptive prefetcher's first-order Markov model is enough.
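Point 3 above is worth making concrete. A first-order Markov predictor only needs transition counts between consecutive query signatures; the sketch below is an assumed design for illustration, not the prefetcher's actual implementation.

```python
# Sketch of a first-order Markov next-query predictor (assumed design).
# It counts observed signature-to-signature transitions and predicts
# the most frequent successor of the current signature.
from collections import Counter, defaultdict

class MarkovPrefetcher:
    def __init__(self):
        self.transitions = defaultdict(Counter)  # prev -> Counter of successors
        self.prev = None

    def observe(self, signature: str) -> None:
        """Record one query signature as it arrives."""
        if self.prev is not None:
            self.transitions[self.prev][signature] += 1
        self.prev = signature

    def predict(self, signature: str):
        """Most likely next signature after `signature`, or None."""
        successors = self.transitions.get(signature)
        if not successors:
            return None
        return successors.most_common(1)[0][0]

pf = MarkovPrefetcher()
for sig in ["q1", "q2", "q1", "q2", "q1", "q3"]:
    pf.observe(sig)
# After this stream, "q2" has followed "q1" twice and "q3" once,
# so the predicted successor of "q1" is "q2".
```

On a predictable kernel sweep the transition table converges after a single pass, which is why nothing beyond first order is needed for SSCA.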

None of these hold as strongly on the LDBC Person/KNOWS subset, which is why the all-enabled gain is 8.48× on SSCA but 3.30× on LDBC. Both numbers are real; they just measure different things.