Overlapping Subquery Reuse
Two Cypher queries that look syntactically different may describe the same graph traversal. This module finds those cases and reuses the cached result.
In isolated mode this optimizer alone reaches 207.55 qps on LDBC SF1 and 559.58 qps on the SSCA-inspired workload. On both benchmarks it explains most of the all-enabled improvement.
The idea
Consider two Cypher queries:
```cypher
MATCH (a:Person)-[:KNOWS*1..3]->(b:Person)
WHERE a.id = $src
RETURN b.id
```

```cypher
MATCH (x:Person)-[:KNOWS*1..3]->(y:Person)
WHERE x.id = $src
RETURN y.id
```
These are byte-different but graph-identical. A hash of the raw text would treat them as separate keys; a richer signature can recognise that they describe the same one-to-three-hop traversal between two Person nodes and serve the same result.
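To see why a raw-text hash is the wrong key, hash the two queries above (collapsed to one line for brevity): the alias change alone is enough to split the cache.

```python
from hashlib import sha256

QUERY_A = "MATCH (a:Person)-[:KNOWS*1..3]->(b:Person) WHERE a.id = $src RETURN b.id"
QUERY_B = "MATCH (x:Person)-[:KNOWS*1..3]->(y:Person) WHERE x.id = $src RETURN y.id"

# A raw-text hash gives every spelling its own cache entry,
# even though both queries describe the same traversal.
raw_keys = {sha256(q.encode("utf-8")).hexdigest() for q in (QUERY_A, QUERY_B)}
print(len(raw_keys))  # 2 distinct keys for one traversal
```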
Two modes
The OVERLAP_CACHE_MODE flag chooses how reuse keys are constructed. The default is canonical_hybrid, which combines both schemes for maximum coverage.
fragment_only
Match by raw query wording. Hashes any substring like [:KNOWS*1..3] and uses it as a key. Cheap but brittle: rename a relationship type and reuse breaks.
canonical_hybrid
Extract a structural signature (left label, relationship type, right label, direction, length range, degree buckets) and hash that. Falls back to fragment matching for safety.
The regexes that drive matching
```python
import re

# Matches a variable-length path fragment, e.g. [:KNOWS*1..3]
PATH_PATTERN_RE = re.compile(r"\[[^\]]*\*[^\]]*\]")

# Matches an entire path segment with two endpoints and a relationship in between.
PATH_SEGMENT_RE = re.compile(
    r"\((?P<left>[^)]*)\)\s*-\[(?P<rel>[^\]]*\*[^\]]*)\]-(?P<direction>>)?\s*\((?P<right>[^)]*)\)"
)

LABEL_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
REL_TYPE_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
LENGTH_RE = re.compile(r"\*(?P<range>\d+\.\.\d+|\d+\.\.|\.\.\d+|\d+)?")
```
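Running the segment and length regexes against the first example query shows what each named group captures:

```python
import re

PATH_SEGMENT_RE = re.compile(
    r"\((?P<left>[^)]*)\)\s*-\[(?P<rel>[^\]]*\*[^\]]*)\]-(?P<direction>>)?\s*\((?P<right>[^)]*)\)"
)
LENGTH_RE = re.compile(r"\*(?P<range>\d+\.\.\d+|\d+\.\.|\.\.\d+|\d+)?")

query = "MATCH (a:Person)-[:KNOWS*1..3]->(b:Person) WHERE a.id = $src RETURN b.id"
m = PATH_SEGMENT_RE.search(query)
print(m.group("left"))       # a:Person
print(m.group("rel"))        # :KNOWS*1..3
print(m.group("direction"))  # > -- the segment is directed
print(LENGTH_RE.search(m.group("rel")).group("range"))  # 1..3
```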
Building the canonical signature
For each matched path segment the module extracts five structural components (left label, relationship type, right label, direction, length range), plus two degree buckets pulled from the request parameters. Together with a leading path marker, these fields are joined with | separators and hashed.
```python
def canonical_signatures(self, query: str, params=None):
    params = params or {}
    signatures = []
    for match in PATH_SEGMENT_RE.finditer(query):
        left_label = self._extract_label(match.group("left"))
        right_label = self._extract_label(match.group("right"))
        rel_type = self._extract_rel_type(match.group("rel"))
        length_range = self._extract_length_range(match.group("rel"))
        direction = "directed" if match.group("direction") else "undirected"
        source_bucket, target_bucket = self._degree_buckets(params)
        signatures.append(
            "|".join([
                "path",
                f"left={left_label}",
                f"rel={rel_type}",
                f"right={right_label}",
                f"direction={direction}",
                f"length={length_range}",
                f"source_bucket={source_bucket}",
                f"target_bucket={target_bucket}",
            ])
        )
    return signatures
```
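The `_extract_*` helpers are not shown above. A plausible sketch, built on the regexes already defined — the function names here are standalone stand-ins, and the `any` fallback for unlabeled endpoints is an assumption:

```python
import re

LABEL_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
REL_TYPE_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
LENGTH_RE = re.compile(r"\*(?P<range>\d+\.\.\d+|\d+\.\.|\.\.\d+|\d+)?")

def extract_label(endpoint: str) -> str:
    """First label in an endpoint like 'a:Person'; 'any' when unlabeled."""
    m = LABEL_RE.search(endpoint)
    return m.group(1) if m else "any"

def extract_rel_type(rel: str) -> str:
    """Relationship type in a bracket body like ':KNOWS*1..3'."""
    m = REL_TYPE_RE.search(rel)
    return m.group(1) if m else "any"

def extract_length_range(rel: str) -> str:
    """Length range after '*', e.g. '1..3'; 'any' for a bare '*'."""
    m = LENGTH_RE.search(rel)
    return (m.group("range") or "any") if m else "any"
```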
An example signature for the queries above (with $src at a high-degree node) looks like:
path|left=Person|rel=KNOWS|right=Person|direction=directed|length=1..3|source_bucket=high|target_bucket=unknown
Degree buckets
Node degree is a continuous number, but two queries anchored at similar-degree nodes (say degrees 60 and 180) should produce the same key rather than fragmenting the cache. The module quantises degree into four classes, plus an unknown sentinel:
| Bucket | Range | Typical example |
|---|---|---|
| low | 0–9 | leaf nodes, isolated entities |
| medium | 10–49 | typical social-graph user |
| high | 50–199 | active users, popular entities |
| super | 200+ | celebrities, hubs, brand pages |
| unknown | — | parameters did not provide a degree |
The benchmark harness pre-fetches per-person degrees from Neo4j and injects them into request params, so the buckets are populated under realistic conditions. Without this enrichment, every query collapses to unknown and the topology-aware matching loses signal.
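A minimal sketch of the quantisation the table implies — the parameter names `source_degree` and `target_degree` are assumptions about what the harness injects:

```python
def degree_bucket(degree):
    """Quantise a node degree into the table's classes; None -> 'unknown'."""
    if degree is None:
        return "unknown"
    if degree < 10:
        return "low"
    if degree < 50:
        return "medium"
    if degree < 200:
        return "high"
    return "super"

def degree_buckets(params):
    # Hypothetical parameter keys; missing degrees collapse to 'unknown'.
    return (
        degree_bucket(params.get("source_degree")),
        degree_bucket(params.get("target_degree")),
    )
```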
From signature to Redis key
signature_keys takes both kinds of signatures, hashes each with SHA-256, and prefixes them appropriately. In canonical_hybrid mode it returns canonical keys first (preferred lookup order), then fragment keys as fallback.
```python
def signature_keys(self, query, params=None):
    keys, seen = [], set()
    if self.cache_mode in {"canonical", "canonical_hybrid"}:
        for signature in self.canonical_signatures(query, params):
            digest = sha256(signature.encode("utf-8")).hexdigest()
            key = f"{self.settings.SUBQUERY_PREFIX}signature:{digest}"
            if key not in seen:
                seen.add(key)
                keys.append(key)
    if self.cache_mode in {"fragment_only", "canonical_hybrid"}:
        for fragment in self.detect_fragments(query):
            digest = sha256(fragment.encode("utf-8")).hexdigest()
            key = f"{self.settings.SUBQUERY_PREFIX}fragment:{digest}"
            if key not in seen:
                seen.add(key)
                keys.append(key)
    return keys
```
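Key derivation is deterministic: any two queries that yield the same signature string yield the same Redis key. A sketch using the example signature, with a hypothetical `subq:` prefix standing in for `settings.SUBQUERY_PREFIX`:

```python
from hashlib import sha256

signature = (
    "path|left=Person|rel=KNOWS|right=Person|direction=directed"
    "|length=1..3|source_bucket=high|target_bucket=unknown"
)
digest = sha256(signature.encode("utf-8")).hexdigest()
key = f"subq:signature:{digest}"  # "subq:" is a placeholder prefix
```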
Lookup and store
The orchestrator calls lookup before going to Neo4j; on success it returns the cached result and a metadata dict with subquery_reuse: True. On a cache miss the orchestrator runs the query and calls store, which writes the result under every signature key.
```python
async def lookup(self, query, params=None):
    for key in self.signature_keys(query, params):
        cached = await self.redis.get(key)
        if cached is not None:
            SUBQUERY_REUSE_COUNT.inc()
            return json.loads(cached), {"subquery_reuse": True, "subquery_key": key}
    return None, {"subquery_reuse": False}

async def store(self, query, result, params=None):
    encoded = json.dumps(result, default=str)
    for key in self.signature_keys(query, params):
        await self.redis.set(key, encoded, ex=self.settings.DEFAULT_CACHE_TTL_SECONDS)
```
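The miss-then-hit round trip can be exercised without a live Redis. The `FakeRedis` stub below is purely illustrative, standing in for the async client, and the single hard-coded key stands in for the shared canonical key both queries would produce:

```python
import asyncio
import json

class FakeRedis:
    """In-memory stand-in for the async Redis client (illustration only)."""
    def __init__(self):
        self._data = {}

    async def get(self, key):
        return self._data.get(key)

    async def set(self, key, value, ex=None):
        self._data[key] = value  # TTL ignored in this stub

async def demo():
    redis = FakeRedis()
    key = "subq:signature:abc"  # pretend both queries map to this key
    # First query: cache miss -> run against the graph, then store.
    await redis.set(key, json.dumps([{"b.id": 42}]), ex=300)
    # Second, structurally identical query: served from the cache.
    cached = await redis.get(key)
    return json.loads(cached)

result = asyncio.run(demo())
print(result)  # [{'b.id': 42}]
```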
Why the canonical signature wins
The unit test test_overlap_fragment_only_does_not_reuse_alias_changed_query shows the difference directly. Fragment keys are hashes of raw query text, so any byte-level change inside the matched span — a renamed alias or label — produces a different key and the queries miss each other. With canonical_hybrid, the test test_overlap_reuses_canonical_path_signature_across_query_aliases confirms the second query reads from the cache populated by the first.
The canonical signature is intentionally coarse. Two queries with different WHERE filters or different RETURN projections will collide on the same key, which is wrong. The mitigation is ordering: the orchestrator only consults overlap reuse after checking the primary key, which is derived from the full query text. Overlap is therefore a fallback that fires only when no exact match exists, and collisions on the canonical key are a small risk in practice because the workload usually issues either the exact same query or a structurally identical one, not a structurally identical query with a different RETURN clause.