# Benchmarking

The run matrix, the two workloads, and how every number on the home page is produced.
## Harness
The primary runner is benchmarks/run_benchmark.py. It builds a list of named configurations (the run matrix), starts the FastAPI service with each configuration, drives it with a fixed request count and concurrency level, and writes per-run JSON plus an aggregated summary.csv.
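The core of such a driver can be very small. The sketch below is illustrative, not `run_benchmark.py`'s actual code: it assumes an `httpx`-based async client and a hypothetical `/query` endpoint, and simply fires a fixed number of requests through a concurrency-limiting semaphore while recording per-request latency.

```python
# Illustrative drive loop (not the harness's real internals): fixed request
# count, fixed concurrency, raw latencies out. The /query endpoint is assumed.
import asyncio
import time
import httpx

async def drive(base_url: str, total_requests: int, concurrency: int):
    latencies_ms: list[float] = []
    sem = asyncio.Semaphore(concurrency)  # caps in-flight requests

    async def one(client: httpx.AsyncClient) -> None:
        async with sem:
            t0 = time.perf_counter()
            await client.get(f"{base_url}/query")
            latencies_ms.append((time.perf_counter() - t0) * 1000)

    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=30.0) as client:
        await asyncio.gather(*(one(client) for _ in range(total_requests)))
    wall_seconds = time.perf_counter() - start
    return latencies_ms, wall_seconds
```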
For local development the convenience wrapper benchmarks/run_local_benchmarks.py wires up the local Docker services (bolt://127.0.0.1:7687 for Neo4j and redis://127.0.0.1:6379/0 for Redis) with sensible defaults.
```bash
# LDBC SNB SF1 run
python benchmarks/run_local_benchmarks.py --output-dir benchmark_results/local_run

# SSCA-inspired run
python benchmarks/ssca_workload.py --scale 10 --edge-factor 8 --clear-first
python benchmarks/run_local_benchmarks.py \
    --workload ssca \
    --ssca-scale 10 \
    --ssca-edge-factor 8 \
    --output-dir benchmark_results/ssca_run
```
## The run matrix
The matrix is built by `build_run_matrix()`. It produces these 12 configurations, in this order:
| # | Name | What is enabled |
|---|---|---|
| 1 | baseline | Nothing. Vanilla cache key + Neo4j on every miss. |
| 2 | isolated_jitter_plain_xfetch | Only JITTER_STAMPEDE, with TSPR_REFRESH_MODE=plain. |
| 3 | isolated_jitter_topology_sensitive_xfetch | Only JITTER_STAMPEDE, with TSPR_REFRESH_MODE=topology_sensitive. |
| 4 | isolated_jitter_stampede | Only JITTER_STAMPEDE, default mode. |
| 5 | isolated_frequency_aware | Only FREQUENCY_AWARE. |
| 6 | isolated_adaptive_prefetch | Only ADAPTIVE_PREFETCH. |
| 7 | isolated_overlapping_subqueries | Only OVERLAPPING_SUBQUERIES. |
| 8 | cumulative_jitter_stampede | JITTER_STAMPEDE on top of baseline. |
| 9 | cumulative_frequency_aware | + FREQUENCY_AWARE. |
| 10 | cumulative_adaptive_prefetch | + ADAPTIVE_PREFETCH. |
| 11 | cumulative_overlapping_subqueries | + OVERLAPPING_SUBQUERIES. |
| 12 | all_enabled | Every flag on (except EXTERNAL_BFS). |
EXTERNAL_BFS is forced off in every row because it regressed on the real dataset. See External BFS for the full explanation.
```python
from copy import deepcopy

BENCHMARK_FLAGS = [
    "JITTER_STAMPEDE",
    "FREQUENCY_AWARE",
    "ADAPTIVE_PREFETCH",
    "OVERLAPPING_SUBQUERIES",
]

# EXTERNAL_BFS regressed on the real dataset, so every row forces it off.
DISABLED_BENCHMARK_FLAGS = {"EXTERNAL_BFS": False}


def build_run_matrix() -> list[tuple[str, dict[str, bool | str]]]:
    # Row 1: baseline with every flag off.
    runs: list[tuple[str, dict[str, bool | str]]] = [
        ("baseline", {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS}})
    ]
    # Rows 2-3: JITTER_STAMPEDE alone, under each XFetch refresh mode.
    runs.extend([
        ("isolated_jitter_plain_xfetch",
         {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS},
          "JITTER_STAMPEDE": True, "TSPR_REFRESH_MODE": "plain"}),
        ("isolated_jitter_topology_sensitive_xfetch",
         {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS},
          "JITTER_STAMPEDE": True, "TSPR_REFRESH_MODE": "topology_sensitive"}),
    ])
    # Rows 4-7: each flag in isolation (default refresh mode).
    for flag in BENCHMARK_FLAGS:
        toggles = {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS}}
        toggles[flag] = True
        runs.append((f"isolated_{flag.lower()}", toggles))
    # Rows 8-11: flags enabled cumulatively, in BENCHMARK_FLAGS order.
    progressive = {f: False for f in BENCHMARK_FLAGS}
    for flag in BENCHMARK_FLAGS:
        progressive[flag] = True
        runs.append((f"cumulative_{flag.lower()}",
                     {**DISABLED_BENCHMARK_FLAGS, **deepcopy(progressive)}))
    # Row 12: everything on except EXTERNAL_BFS.
    runs.append(("all_enabled", {**DISABLED_BENCHMARK_FLAGS, **{f: True for f in BENCHMARK_FLAGS}}))
    return runs
```
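A quick sanity check ties the snippet to the table: the matrix yields exactly these 12 names, in order.

```python
# The twelve run names, in the order build_run_matrix() emits them.
expected = [
    "baseline",
    "isolated_jitter_plain_xfetch",
    "isolated_jitter_topology_sensitive_xfetch",
    "isolated_jitter_stampede",
    "isolated_frequency_aware",
    "isolated_adaptive_prefetch",
    "isolated_overlapping_subqueries",
    "cumulative_jitter_stampede",
    "cumulative_frequency_aware",
    "cumulative_adaptive_prefetch",
    "cumulative_overlapping_subqueries",
    "all_enabled",
]
assert [name for name, _ in build_run_matrix()] == expected
```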
## The two workloads
### LDBC SNB Interactive v1 (SF1)
The primary benchmark uses the LDBC SNB Interactive v1 dataset at scale factor SF1, serialised as CsvMergeForeign with StringDateFormatter. To keep the import lightweight and the schema relevant to the middleware story, only Person nodes from dynamic/person_0_0.csv and KNOWS relationships from dynamic/person_knows_person_0_0.csv are imported. Substitution parameters come from the matching substitution_parameters-sf1 bundle.
The harness pre-fetches per-person degrees from Neo4j via fetch_person_degrees and injects them into request params. Without this enrichment, every query collapses to degree=1 and the topology-sensitive XFetch rule degenerates to plain XFetch.
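A minimal sketch of what that pre-fetch can look like, assuming the `neo4j` Python driver and the Person/KNOWS subset described above. The Cypher shape, function name, and credentials are illustrative, not `fetch_person_degrees`'s actual code:

```python
# Illustrative degree pre-fetch (not the real fetch_person_degrees):
# count KNOWS edges per Person and return {person_id: degree}.
from neo4j import GraphDatabase

def fetch_degrees(uri: str = "bolt://127.0.0.1:7687") -> dict[int, int]:
    with GraphDatabase.driver(uri, auth=("neo4j", "neo4j")) as driver:  # auth assumed
        with driver.session() as session:
            result = session.run(
                "MATCH (p:Person)-[:KNOWS]-() RETURN p.id AS id, count(*) AS degree"
            )
            return {record["id"]: record["degree"] for record in result}
```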
Because only Person and KNOWS are imported, the published numbers describe a subset of the full SNB schema, not the full heterogeneous social graph. This is an explicit choice driven by the middleware's focus on path traversal, and it is documented at every reporting point.
### SSCA-inspired synthetic workload
The secondary benchmark is generated by benchmarks/ssca_workload.py, which produces an R-MAT-like directed weighted graph and loads it into Neo4j as SSCANode and LINK. The companion module build_ssca_queries emits Cypher workloads analogous to the HPCS SSCA#2 kernels: heavy-edge frontier traversals, subgraph extraction, weighted reachability, and a centrality-proxy query.
The thesis evaluation uses --scale 10 --edge-factor 8, which produces a graph that is small enough to load locally but skewed enough to exercise the topology-sensitive optimizations.
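For intuition, R-MAT generation is just a recursive quadrant descent over the adjacency matrix. The sketch below is a generic R-MAT generator, not `ssca_workload.py`'s code; the partition probabilities (`a`, `b`, `c`) and the weight range are illustrative. With `--scale 10 --edge-factor 8` it yields 2^10 = 1,024 vertices and 1,024 × 8 = 8,192 directed weighted edges.

```python
# Generic R-MAT edge generator (illustrative; not ssca_workload.py).
# Each edge picks one quadrant of the adjacency matrix per bit of the
# vertex id, which is what produces the skewed degree distribution.
import random

def rmat_edges(scale: int, edge_factor: int, a=0.57, b=0.19, c=0.19, seed=42):
    rng = random.Random(seed)
    n = 1 << scale
    for _ in range(n * edge_factor):
        u = v = 0
        for _ in range(scale):
            r = rng.random()
            u, v = u << 1, v << 1
            if r < a:
                pass                     # top-left quadrant
            elif r < a + b:
                v |= 1                   # top-right
            elif r < a + b + c:
                u |= 1                   # bottom-left
            else:
                u, v = u | 1, v | 1      # bottom-right
        yield u, v, rng.randint(1, 100)  # (source, target, weight)
```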
## Metrics
Each run reports the following:
- throughput_qps — completed queries per second.
- p50_latency_ms / p95_latency_ms / p99_latency_ms — latency percentiles.
- cache_hit_rate — fraction of requests served from Redis.
- subquery_reuse_count — overlap-cache hit count.
- prefetch_hits_total / prefetch_waste_total — accuracy of the prefetcher.
- stampede_events_total / single_flight_hits — stampede protection activity.
- hot_key_hits — fraction of hits on keys above the frequency threshold.
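Deriving the aggregates is straightforward. The sketch below shows one way to compute the latency and hit-rate metrics from raw per-request records; the `latency_ms` and `cache_hit` field names are assumptions, not the harness's actual record schema:

```python
# Illustrative aggregation from raw per-request records to summary metrics.
def percentile(sorted_values: list[float], p: float) -> float:
    idx = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def aggregate(records: list[dict], wall_seconds: float) -> dict:
    lat = sorted(r["latency_ms"] for r in records)
    hits = sum(1 for r in records if r["cache_hit"])
    return {
        "throughput_qps": len(records) / wall_seconds,
        "p50_latency_ms": percentile(lat, 0.50),
        "p95_latency_ms": percentile(lat, 0.95),
        "p99_latency_ms": percentile(lat, 0.99),
        "cache_hit_rate": hits / len(records),
    }
```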
## Headline numbers (this repository)
Pulled directly from benchmark_results/sf1_matrix_canonical_overlap/summary.csv and benchmark_results/ssca_run/summary.csv:
| Run | LDBC qps | LDBC P95 (ms) | SSCA qps | SSCA P95 (ms) |
|---|---|---|---|---|
| baseline | 56.77 | 242.40 | 48.76 | 167.22 |
| isolated_jitter_plain_xfetch | 50.90 | 260.29 | 255.28 | 13.51 |
| isolated_jitter_topology_sensitive_xfetch | 51.49 | 252.08 | 305.22 | 9.37 |
| isolated_overlapping_subqueries | 207.55 | 53.96 | 559.58 | 7.00 |
| all_enabled | 187.03 | 51.20 | 413.61 | 7.95 |
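The all-enabled speedups quoted later fall straight out of these two files. A small helper to recompute them; the `run` and `throughput_qps` column names are assumptions about the CSV layout:

```python
# Recompute the headline speedups from the summary CSVs.
# Column names ("run", "throughput_qps") are assumed.
import csv

def qps(path: str, run: str) -> float:
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["run"] == run:
                return float(row["throughput_qps"])
    raise KeyError(run)

for label, path in [
    ("LDBC", "benchmark_results/sf1_matrix_canonical_overlap/summary.csv"),
    ("SSCA", "benchmark_results/ssca_run/summary.csv"),
]:
    speedup = qps(path, "all_enabled") / qps(path, "baseline")
    print(f"{label}: {speedup:.2f}x")  # ~3.3x for LDBC, ~8.5x for SSCA
```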
## Chart generation
benchmarks/generate_ppt_charts.py turns a summary CSV into slide-ready SVG and PNG bar charts under three subdirectories:
- charts/overview/ — overall throughput and latency summaries.
- charts/pairwise/ — separate baseline vs isolated charts per technique.
- charts/combined/ — side-by-side baseline vs all_enabled comparisons.
```bash
python benchmarks/generate_ppt_charts.py \
    --summary-csv benchmark_results/sf1_matrix_canonical_overlap/summary.csv

python benchmarks/generate_cross_workload_charts.py \
    --ldbc-csv benchmark_results/sf1_matrix_canonical_overlap/summary.csv \
    --ssca-csv benchmark_results/ssca_run/summary.csv
```
## How to read the numbers
Why "all_enabled" is sometimes lower than the best isolated run
On LDBC the isolated overlap run reaches 207.55 qps while all-enabled reaches 187.03 qps. The roughly 10% gap reflects the per-request overhead of the other modules: prefetcher fan-out, frequency-hash lookups, and background refresh tasks each cost a small amount of CPU and Redis traffic. For workloads that look like LDBC, an operator who only cares about throughput could disable the other three modules and run overlap-only.
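Assuming the flags are exposed to the service as environment variables with the same names as the run-matrix toggles (the mechanism is an assumption here), an overlap-only deployment would look like:

```bash
# Overlap-only configuration sketch; variable names mirror the run-matrix
# flags, but the env-var mechanism itself is assumed.
export OVERLAPPING_SUBQUERIES=1
export JITTER_STAMPEDE=0
export FREQUENCY_AWARE=0
export ADAPTIVE_PREFETCH=0
export EXTERNAL_BFS=0
```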
### Why hit rate alone does not predict P95
LDBC's baseline hit rate is 0.68 but its baseline P95 is 242 ms. SSCA's baseline hit rate is 0.36 but the all-enabled P95 collapses to 7.95 ms. The difference is miss cost: SSCA misses are larger frontiers, so each miss saved removes a bigger latency tail, even though there are more misses overall. Hit rate is a useful diagnostic but not a target metric.
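A stylized model makes the point concrete (the costs below are illustrative, not measured): mean latency ≈ h · c_hit + (1 − h) · c_miss. With h = 0.68 but c_miss in the hundreds of milliseconds, the (1 − h) · c_miss term still dominates the tail; with h = 0.36 but c_miss cut by an order of magnitude, the whole expression collapses. The effective lever is c_miss, not h.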
### Why SSCA gains more
Three properties amplify the runtime layer's impact on SSCA:
- Path shapes recur, so canonical-signature reuse fires often.
- Degree distribution is skewed, so topology-sensitive XFetch has a real signal.
- Kernel sweeps are predictable, so the adaptive prefetcher's first-order Markov is enough.
None of these hold as strongly on the LDBC Person/KNOWS subset, which is why the all-enabled gain is 8.48× on SSCA but 3.30× on LDBC. Both numbers are real; they just measure different things.
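For reference, a first-order Markov prefetcher of the kind the last bullet describes is tiny: a transition table over successive query signatures that predicts the most frequent successor. The sketch below is illustrative, not the middleware's implementation:

```python
# Illustrative first-order Markov prefetcher: learn which query signature
# tends to follow which, and predict the most common successor.
from collections import Counter, defaultdict

class MarkovPrefetcher:
    def __init__(self) -> None:
        self.transitions: dict[str, Counter] = defaultdict(Counter)
        self.last_sig: str | None = None

    def observe(self, sig: str) -> str | None:
        """Record the transition from the previous signature and
        return the predicted next signature (or None if unseen)."""
        if self.last_sig is not None:
            self.transitions[self.last_sig][sig] += 1
        self.last_sig = sig
        successors = self.transitions[sig]
        return successors.most_common(1)[0][0] if successors else None
```

On SSCA's repetitive kernel sweeps such a table converges after a handful of passes; on the LDBC subset the successor distribution is flatter, which is consistent with the smaller gain.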