# Benchmarking

The run matrix, the two workloads, and how every number on the home page is produced.
## Harness
The primary runner is benchmarks/run_benchmark.py. It builds a list of named configurations (the run matrix), starts the FastAPI service with each configuration, drives it with a fixed request count and concurrency level, and writes per-run JSON plus an aggregated summary.csv.
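The core of such a driver can be very small. The sketch below is illustrative, not `run_benchmark.py`'s actual code: it assumes an `httpx`-based async client and a hypothetical `/query` endpoint, and simply fires a fixed number of requests through a concurrency-limiting semaphore while recording per-request latency.

```python
# Illustrative drive loop (not the harness's real internals): fixed request
# count, fixed concurrency, raw latencies out. The /query endpoint is assumed.
import asyncio
import time
import httpx

async def drive(base_url: str, total_requests: int, concurrency: int):
    latencies_ms: list[float] = []
    sem = asyncio.Semaphore(concurrency)  # caps in-flight requests

    async def one(client: httpx.AsyncClient) -> None:
        async with sem:
            t0 = time.perf_counter()
            await client.get(f"{base_url}/query")
            latencies_ms.append((time.perf_counter() - t0) * 1000)

    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=30.0) as client:
        await asyncio.gather(*(one(client) for _ in range(total_requests)))
    wall_seconds = time.perf_counter() - start
    return latencies_ms, wall_seconds
```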
For local development the convenience wrapper benchmarks/run_local_benchmarks.py wires up the local Docker services (bolt://127.0.0.1:7687 for Neo4j and redis://127.0.0.1:6379/0 for Redis) with sensible defaults.
```bash
# LDBC SNB SF1 run
python benchmarks/run_local_benchmarks.py --output-dir benchmark_results/local_run

# SSCA-inspired run
python benchmarks/ssca_workload.py --scale 10 --edge-factor 8 --clear-first
python benchmarks/run_local_benchmarks.py \
    --workload ssca \
    --ssca-scale 10 \
    --ssca-edge-factor 8 \
    --output-dir benchmark_results/ssca_run
```
## The run matrix
The matrix is built by `build_run_matrix()`. It produces these 12 configurations, in this order:
| # | Name | What is enabled |
|---|---|---|
| 1 | baseline | Nothing. Vanilla cache key + Neo4j on every miss. |
| 2 | isolated_jitter_plain_xfetch | Only JITTER_STAMPEDE, with TSPR_REFRESH_MODE=plain. |
| 3 | isolated_jitter_topology_sensitive_xfetch | Only JITTER_STAMPEDE, with TSPR_REFRESH_MODE=topology_sensitive. |
| 4 | isolated_jitter_stampede | Only JITTER_STAMPEDE, default mode. |
| 5 | isolated_frequency_aware | Only FREQUENCY_AWARE. |
| 6 | isolated_adaptive_prefetch | Only ADAPTIVE_PREFETCH. |
| 7 | isolated_overlapping_subqueries | Only OVERLAPPING_SUBQUERIES. |
| 8 | cumulative_jitter_stampede | JITTER_STAMPEDE on top of baseline. |
| 9 | cumulative_frequency_aware | + FREQUENCY_AWARE. |
| 10 | cumulative_adaptive_prefetch | + ADAPTIVE_PREFETCH. |
| 11 | cumulative_overlapping_subqueries | + OVERLAPPING_SUBQUERIES. |
| 12 | all_enabled | Every flag on (except EXTERNAL_BFS). |
EXTERNAL_BFS is forced off in every row because it regressed on the real dataset. See External BFS for the full explanation.
```python
from copy import deepcopy

BENCHMARK_FLAGS = [
    "JITTER_STAMPEDE",
    "FREQUENCY_AWARE",
    "ADAPTIVE_PREFETCH",
    "OVERLAPPING_SUBQUERIES",
]

# EXTERNAL_BFS regressed on the real dataset, so every row forces it off.
DISABLED_BENCHMARK_FLAGS = {"EXTERNAL_BFS": False}


def build_run_matrix() -> list[tuple[str, dict[str, bool | str]]]:
    # Row 1: baseline with every flag off.
    runs: list[tuple[str, dict[str, bool | str]]] = [
        ("baseline", {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS}})
    ]
    # Rows 2-3: JITTER_STAMPEDE alone, under each XFetch refresh mode.
    runs.extend([
        ("isolated_jitter_plain_xfetch",
         {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS},
          "JITTER_STAMPEDE": True, "TSPR_REFRESH_MODE": "plain"}),
        ("isolated_jitter_topology_sensitive_xfetch",
         {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS},
          "JITTER_STAMPEDE": True, "TSPR_REFRESH_MODE": "topology_sensitive"}),
    ])
    # Rows 4-7: each flag in isolation (default refresh mode).
    for flag in BENCHMARK_FLAGS:
        toggles = {**DISABLED_BENCHMARK_FLAGS, **{f: False for f in BENCHMARK_FLAGS}}
        toggles[flag] = True
        runs.append((f"isolated_{flag.lower()}", toggles))
    # Rows 8-11: flags enabled cumulatively, in BENCHMARK_FLAGS order.
    progressive = {f: False for f in BENCHMARK_FLAGS}
    for flag in BENCHMARK_FLAGS:
        progressive[flag] = True
        runs.append((f"cumulative_{flag.lower()}",
                     {**DISABLED_BENCHMARK_FLAGS, **deepcopy(progressive)}))
    # Row 12: everything on except EXTERNAL_BFS.
    runs.append(("all_enabled", {**DISABLED_BENCHMARK_FLAGS, **{f: True for f in BENCHMARK_FLAGS}}))
    return runs
```
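A quick sanity check ties the snippet to the table: the matrix yields exactly these 12 names, in order.

```python
# The twelve run names, in the order build_run_matrix() emits them.
expected = [
    "baseline",
    "isolated_jitter_plain_xfetch",
    "isolated_jitter_topology_sensitive_xfetch",
    "isolated_jitter_stampede",
    "isolated_frequency_aware",
    "isolated_adaptive_prefetch",
    "isolated_overlapping_subqueries",
    "cumulative_jitter_stampede",
    "cumulative_frequency_aware",
    "cumulative_adaptive_prefetch",
    "cumulative_overlapping_subqueries",
    "all_enabled",
]
assert [name for name, _ in build_run_matrix()] == expected
```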
## The two workloads
### LDBC SNB Interactive v1 (SF1)
The primary benchmark uses the LDBC SNB Interactive v1 dataset at scale factor SF1, serialised as CsvMergeForeign with StringDateFormatter. To keep the import lightweight and the schema relevant to the middleware story, only Person nodes from dynamic/person_0_0.csv and KNOWS relationships from dynamic/person_knows_person_0_0.csv are imported. Substitution parameters come from the matching substitution_parameters-sf1 bundle.
The harness pre-fetches per-person degrees from Neo4j via fetch_person_degrees and injects them into request params. Without this enrichment, every query collapses to degree=1 and the topology-sensitive XFetch rule degenerates to plain XFetch.
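A minimal sketch of what that pre-fetch can look like, assuming the `neo4j` Python driver and the Person/KNOWS subset described above. The Cypher shape, function name, and credentials are illustrative, not `fetch_person_degrees`'s actual code:

```python
# Illustrative degree pre-fetch (not the real fetch_person_degrees):
# count KNOWS edges per Person and return {person_id: degree}.
from neo4j import GraphDatabase

def fetch_degrees(uri: str = "bolt://127.0.0.1:7687") -> dict[int, int]:
    with GraphDatabase.driver(uri, auth=("neo4j", "neo4j")) as driver:  # auth assumed
        with driver.session() as session:
            result = session.run(
                "MATCH (p:Person)-[:KNOWS]-() RETURN p.id AS id, count(*) AS degree"
            )
            return {record["id"]: record["degree"] for record in result}
```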
Because only Person and KNOWS are imported, the published numbers describe a subset of the full SNB schema, not the full heterogeneous social graph. This is an explicit choice driven by the middleware's focus on path traversal, and it is documented at every reporting point.
### SSCA-inspired synthetic workload
The secondary benchmark is generated by benchmarks/ssca_workload.py, which produces an R-MAT-like directed weighted graph and loads it into Neo4j as SSCANode and LINK. The companion module build_ssca_queries emits Cypher workloads analogous to the HPCS SSCA#2 kernels: heavy-edge frontier traversals, subgraph extraction, weighted reachability, and a centrality-proxy query.
The thesis evaluation uses --scale 10 --edge-factor 8, which produces a graph that is small enough to load locally but skewed enough to exercise the topology-sensitive optimizations.
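For intuition, R-MAT generation is just a recursive quadrant descent over the adjacency matrix. The sketch below is a generic R-MAT generator, not `ssca_workload.py`'s code; the partition probabilities (`a`, `b`, `c`) and the weight range are illustrative. With `--scale 10 --edge-factor 8` it yields 2^10 = 1,024 vertices and 1,024 × 8 = 8,192 directed weighted edges.

```python
# Generic R-MAT edge generator (illustrative; not ssca_workload.py).
# Each edge picks one quadrant of the adjacency matrix per bit of the
# vertex id, which is what produces the skewed degree distribution.
import random

def rmat_edges(scale: int, edge_factor: int, a=0.57, b=0.19, c=0.19, seed=42):
    rng = random.Random(seed)
    n = 1 << scale
    for _ in range(n * edge_factor):
        u = v = 0
        for _ in range(scale):
            r = rng.random()
            u, v = u << 1, v << 1
            if r < a:
                pass                     # top-left quadrant
            elif r < a + b:
                v |= 1                   # top-right
            elif r < a + b + c:
                u |= 1                   # bottom-left
            else:
                u, v = u | 1, v | 1      # bottom-right
        yield u, v, rng.randint(1, 100)  # (source, target, weight)
```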
## Metrics
Each run reports the following:
- throughput_qps — completed queries per second.
- p50_latency_ms / p95_latency_ms / p99_latency_ms — latency percentiles.
- cache_hit_rate — fraction of requests served from Redis.
- subquery_reuse_count — overlap-cache hit count.
- prefetch_hits_total / prefetch_waste_total — accuracy of the prefetcher.
- stampede_events_total / single_flight_hits — stampede protection activity.
- hot_key_hits — fraction of hits on keys above the frequency threshold.
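Deriving the aggregates is straightforward. The sketch below shows one way to compute the latency and hit-rate metrics from raw per-request records; the `latency_ms` and `cache_hit` field names are assumptions, not the harness's actual record schema:

```python
# Illustrative aggregation from raw per-request records to summary metrics.
def percentile(sorted_values: list[float], p: float) -> float:
    idx = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def aggregate(records: list[dict], wall_seconds: float) -> dict:
    lat = sorted(r["latency_ms"] for r in records)
    hits = sum(1 for r in records if r["cache_hit"])
    return {
        "throughput_qps": len(records) / wall_seconds,
        "p50_latency_ms": percentile(lat, 0.50),
        "p95_latency_ms": percentile(lat, 0.95),
        "p99_latency_ms": percentile(lat, 0.99),
        "cache_hit_rate": hits / len(records),
    }
```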
## Headline numbers (this repository)
Pulled directly from benchmark_results/sf1_matrix_canonical_overlap/summary.csv and benchmark_results/ssca_run/summary.csv:
| Run | LDBC qps | LDBC P95 (ms) | SSCA qps | SSCA P95 (ms) |
|---|---|---|---|---|
| baseline | 56.77 | 242.40 | 48.76 | 167.22 |
| isolated_jitter_plain_xfetch | 50.90 | 260.29 | 255.28 | 13.51 |
| isolated_jitter_topology_sensitive_xfetch | 51.49 | 252.08 | 305.22 | 9.37 |
| isolated_overlapping_subqueries | 207.55 | 53.96 | 559.58 | 7.00 |
| all_enabled | 187.03 | 51.20 | 413.61 | 7.95 |
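The all-enabled speedups quoted later fall straight out of these two files. A small helper to recompute them; the `run` and `throughput_qps` column names are assumptions about the CSV layout:

```python
# Recompute the headline speedups from the summary CSVs.
# Column names ("run", "throughput_qps") are assumed.
import csv

def qps(path: str, run: str) -> float:
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["run"] == run:
                return float(row["throughput_qps"])
    raise KeyError(run)

for label, path in [
    ("LDBC", "benchmark_results/sf1_matrix_canonical_overlap/summary.csv"),
    ("SSCA", "benchmark_results/ssca_run/summary.csv"),
]:
    speedup = qps(path, "all_enabled") / qps(path, "baseline")
    print(f"{label}: {speedup:.2f}x")  # ~3.3x for LDBC, ~8.5x for SSCA
```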
## Chart generation
benchmarks/generate_ppt_charts.py turns a summary CSV into slide-ready SVG and PNG bar charts under three subdirectories:
- charts/overview/ — overall throughput and latency summaries.
- charts/pairwise/ — separate baseline vs isolated charts per technique.
- charts/combined/ — side-by-side baseline vs all_enabled comparisons.
```bash
python benchmarks/generate_ppt_charts.py \
    --summary-csv benchmark_results/sf1_matrix_canonical_overlap/summary.csv

python benchmarks/generate_cross_workload_charts.py \
    --ldbc-csv benchmark_results/sf1_matrix_canonical_overlap/summary.csv \
    --ssca-csv benchmark_results/ssca_run/summary.csv
```
## How to read the numbers
Why "all_enabled" is sometimes lower than the best isolated run
On LDBC the isolated overlap run reaches 207.55 qps while all-enabled reaches 187.03 qps. The roughly 10% gap reflects the per-request overhead of the other modules: prefetcher fan-out, frequency-hash lookups, and background refresh tasks each cost a small amount of CPU and Redis traffic. For workloads that look like LDBC, an operator who only cares about throughput could disable the other three modules and run overlap-only.
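Assuming the flags are exposed to the service as environment variables with the same names as the run-matrix toggles (the mechanism is an assumption here), an overlap-only deployment would look like:

```bash
# Overlap-only configuration sketch; variable names mirror the run-matrix
# flags, but the env-var mechanism itself is assumed.
export OVERLAPPING_SUBQUERIES=1
export JITTER_STAMPEDE=0
export FREQUENCY_AWARE=0
export ADAPTIVE_PREFETCH=0
export EXTERNAL_BFS=0
```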
### Why hit rate alone does not predict P95
LDBC's baseline hit rate is 0.68 but its baseline P95 is 242 ms. SSCA's baseline hit rate is 0.36 but the all-enabled P95 collapses to 7.95 ms. The difference is miss cost: SSCA misses are larger frontiers, so each miss saved removes a bigger latency tail, even though there are more misses overall. Hit rate is a useful diagnostic but not a target metric.
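A stylized model makes the point concrete (the costs below are illustrative, not measured): mean latency ≈ h · c_hit + (1 − h) · c_miss. With h = 0.68 but c_miss in the hundreds of milliseconds, the (1 − h) · c_miss term still dominates the tail; with h = 0.36 but c_miss cut by an order of magnitude, the whole expression collapses. The effective lever is c_miss, not h.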
### Why SSCA gains more
Three properties amplify the runtime layer's impact on SSCA:
- Path shapes recur, so canonical-signature reuse fires often.
- Degree distribution is skewed, so topology-sensitive XFetch has a real signal.
- Kernel sweeps are predictable, so the adaptive prefetcher's first-order Markov is enough.
None of these hold as strongly on the LDBC Person/KNOWS subset, which is why the all-enabled gain is 8.48× on SSCA but 3.30× on LDBC. Both numbers are real; they just measure different things.
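For reference, a first-order Markov prefetcher of the kind the last bullet describes is tiny: a transition table over successive query signatures that predicts the most frequent successor. The sketch below is illustrative, not the middleware's implementation:

```python
# Illustrative first-order Markov prefetcher: learn which query signature
# tends to follow which, and predict the most common successor.
from collections import Counter, defaultdict

class MarkovPrefetcher:
    def __init__(self) -> None:
        self.transitions: dict[str, Counter] = defaultdict(Counter)
        self.last_sig: str | None = None

    def observe(self, sig: str) -> str | None:
        """Record the transition from the previous signature and
        return the predicted next signature (or None if unseen)."""
        if self.last_sig is not None:
            self.transitions[self.last_sig][sig] += 1
        self.last_sig = sig
        successors = self.transitions[sig]
        return successors.most_common(1)[0][0] if successors else None
```

On SSCA's repetitive kernel sweeps such a table converges after a handful of passes; on the LDBC subset the successor distribution is flatter, which is consistent with the smaller gain.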