Overlapping Subquery Reuse
Two Cypher queries that look syntactically different may describe the same graph traversal. This module finds those cases and reuses the cached result.
In isolated mode this optimizer alone reaches 207.55 qps on LDBC SF1 and 559.58 qps on the SSCA-inspired workload. On both benchmarks it explains most of the all-enabled improvement.
The idea
Consider two Cypher queries:
```cypher
MATCH (a:Person)-[:KNOWS*1..3]->(b:Person)
WHERE a.id = $src
RETURN b.id
```

```cypher
MATCH (x:Person)-[:KNOWS*1..3]->(y:Person)
WHERE x.id = $src
RETURN y.id
```
These are byte-different but graph-identical. A hash of the raw text would treat them as separate keys; a richer signature can recognise that they describe the same one-to-three-hop traversal between two Person nodes and serve the same result.
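To see why a raw-text hash is the wrong key, hash the two queries above (collapsed to one line for brevity): the alias change alone is enough to split the cache.

```python
from hashlib import sha256

QUERY_A = "MATCH (a:Person)-[:KNOWS*1..3]->(b:Person) WHERE a.id = $src RETURN b.id"
QUERY_B = "MATCH (x:Person)-[:KNOWS*1..3]->(y:Person) WHERE x.id = $src RETURN y.id"

# A raw-text hash gives every spelling its own cache entry,
# even though both queries describe the same traversal.
raw_keys = {sha256(q.encode("utf-8")).hexdigest() for q in (QUERY_A, QUERY_B)}
print(len(raw_keys))  # 2 distinct keys for one traversal
```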
Two modes
The OVERLAP_CACHE_MODE flag chooses how reuse keys are constructed. The default is canonical_hybrid, which combines both schemes for maximum coverage.
fragment_only
Match by raw query wording. Hashes any substring like [:KNOWS*1..3] and uses it as a key. Cheap but brittle: rename a relationship type and reuse breaks.
canonical_hybrid
Extract a structural signature (left label, relationship type, right label, direction, length range, degree buckets) and hash that. Falls back to fragment matching for safety.
The regexes that drive matching
```python
import re

# Matches a variable-length path fragment, e.g. [:KNOWS*1..3]
PATH_PATTERN_RE = re.compile(r"\[[^\]]*\*[^\]]*\]")

# Matches an entire path segment with two endpoints and a relationship in between.
PATH_SEGMENT_RE = re.compile(
    r"\((?P<left>[^)]*)\)\s*-\[(?P<rel>[^\]]*\*[^\]]*)\]-(?P<direction>>)?\s*\((?P<right>[^)]*)\)"
)

LABEL_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
REL_TYPE_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
LENGTH_RE = re.compile(r"\*(?P<range>\d+\.\.\d+|\d+\.\.|\.\.\d+|\d+)?")
```
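Running the segment and length regexes against the first example query shows what each named group captures:

```python
import re

PATH_SEGMENT_RE = re.compile(
    r"\((?P<left>[^)]*)\)\s*-\[(?P<rel>[^\]]*\*[^\]]*)\]-(?P<direction>>)?\s*\((?P<right>[^)]*)\)"
)
LENGTH_RE = re.compile(r"\*(?P<range>\d+\.\.\d+|\d+\.\.|\.\.\d+|\d+)?")

query = "MATCH (a:Person)-[:KNOWS*1..3]->(b:Person) WHERE a.id = $src RETURN b.id"
m = PATH_SEGMENT_RE.search(query)
print(m.group("left"))       # a:Person
print(m.group("rel"))        # :KNOWS*1..3
print(m.group("direction"))  # > -- the segment is directed
print(LENGTH_RE.search(m.group("rel")).group("range"))  # 1..3
```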
Building the canonical signature
For each matched path segment the module extracts five structural components (left label, relationship type, right label, direction, length range), plus two degree buckets pulled from the request parameters. Together with a leading path marker, these fields are joined with | separators and hashed.
```python
def canonical_signatures(self, query: str, params=None):
    params = params or {}
    signatures = []
    for match in PATH_SEGMENT_RE.finditer(query):
        left_label = self._extract_label(match.group("left"))
        right_label = self._extract_label(match.group("right"))
        rel_type = self._extract_rel_type(match.group("rel"))
        length_range = self._extract_length_range(match.group("rel"))
        direction = "directed" if match.group("direction") else "undirected"
        source_bucket, target_bucket = self._degree_buckets(params)
        signatures.append(
            "|".join([
                "path",
                f"left={left_label}",
                f"rel={rel_type}",
                f"right={right_label}",
                f"direction={direction}",
                f"length={length_range}",
                f"source_bucket={source_bucket}",
                f"target_bucket={target_bucket}",
            ])
        )
    return signatures
```
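The `_extract_*` helpers are not shown above. A plausible sketch, built on the regexes already defined — the function names here are standalone stand-ins, and the `any` fallback for unlabeled endpoints is an assumption:

```python
import re

LABEL_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
REL_TYPE_RE = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
LENGTH_RE = re.compile(r"\*(?P<range>\d+\.\.\d+|\d+\.\.|\.\.\d+|\d+)?")

def extract_label(endpoint: str) -> str:
    """First label in an endpoint like 'a:Person'; 'any' when unlabeled."""
    m = LABEL_RE.search(endpoint)
    return m.group(1) if m else "any"

def extract_rel_type(rel: str) -> str:
    """Relationship type in a bracket body like ':KNOWS*1..3'."""
    m = REL_TYPE_RE.search(rel)
    return m.group(1) if m else "any"

def extract_length_range(rel: str) -> str:
    """Length range after '*', e.g. '1..3'; 'any' for a bare '*'."""
    m = LENGTH_RE.search(rel)
    return (m.group("range") or "any") if m else "any"
```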
An example signature for the queries above (with $src at a high-degree node) looks like:
path|left=Person|rel=KNOWS|right=Person|direction=directed|length=1..3|source_bucket=high|target_bucket=unknown
Degree buckets
Node degree is a continuous number, but two queries anchored at similar-degree nodes (say degrees 60 and 180) should produce the same key rather than fragmenting the cache. The module quantises degree into four classes, plus an unknown sentinel:
| Bucket | Range | Typical example |
|---|---|---|
| low | 0–9 | leaf nodes, isolated entities |
| medium | 10–49 | typical social-graph user |
| high | 50–199 | active users, popular entities |
| super | 200+ | celebrities, hubs, brand pages |
| unknown | — | parameters did not provide a degree |
The benchmark harness pre-fetches per-person degrees from Neo4j and injects them into request params, so the buckets are populated under realistic conditions. Without this enrichment, every query collapses to unknown and the topology-aware matching loses signal.
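A minimal sketch of the quantisation the table implies — the parameter names `source_degree` and `target_degree` are assumptions about what the harness injects:

```python
def degree_bucket(degree):
    """Quantise a node degree into the table's classes; None -> 'unknown'."""
    if degree is None:
        return "unknown"
    if degree < 10:
        return "low"
    if degree < 50:
        return "medium"
    if degree < 200:
        return "high"
    return "super"

def degree_buckets(params):
    # Hypothetical parameter keys; missing degrees collapse to 'unknown'.
    return (
        degree_bucket(params.get("source_degree")),
        degree_bucket(params.get("target_degree")),
    )
```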
From signature to Redis key
signature_keys takes both kinds of signatures, hashes each with SHA-256, and prefixes them appropriately. In canonical_hybrid mode it returns canonical keys first (preferred lookup order), then fragment keys as fallback.
```python
def signature_keys(self, query, params=None):
    keys, seen = [], set()
    if self.cache_mode in {"canonical", "canonical_hybrid"}:
        for signature in self.canonical_signatures(query, params):
            digest = sha256(signature.encode("utf-8")).hexdigest()
            key = f"{self.settings.SUBQUERY_PREFIX}signature:{digest}"
            if key not in seen:
                seen.add(key)
                keys.append(key)
    if self.cache_mode in {"fragment_only", "canonical_hybrid"}:
        for fragment in self.detect_fragments(query):
            digest = sha256(fragment.encode("utf-8")).hexdigest()
            key = f"{self.settings.SUBQUERY_PREFIX}fragment:{digest}"
            if key not in seen:
                seen.add(key)
                keys.append(key)
    return keys
```
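Key derivation is deterministic: any two queries that yield the same signature string yield the same Redis key. A sketch using the example signature, with a hypothetical `subq:` prefix standing in for `settings.SUBQUERY_PREFIX`:

```python
from hashlib import sha256

signature = (
    "path|left=Person|rel=KNOWS|right=Person|direction=directed"
    "|length=1..3|source_bucket=high|target_bucket=unknown"
)
digest = sha256(signature.encode("utf-8")).hexdigest()
key = f"subq:signature:{digest}"  # "subq:" is a placeholder prefix
```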
Lookup and store
The orchestrator calls lookup before going to Neo4j; on success it returns the cached result and a metadata dict with subquery_reuse: True. On a cache miss the orchestrator runs the query and calls store, which writes the result under every signature key.
```python
async def lookup(self, query, params=None):
    for key in self.signature_keys(query, params):
        cached = await self.redis.get(key)
        if cached is not None:
            SUBQUERY_REUSE_COUNT.inc()
            return json.loads(cached), {"subquery_reuse": True, "subquery_key": key}
    return None, {"subquery_reuse": False}

async def store(self, query, result, params=None):
    encoded = json.dumps(result, default=str)
    for key in self.signature_keys(query, params):
        await self.redis.set(key, encoded, ex=self.settings.DEFAULT_CACHE_TTL_SECONDS)
```
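The miss-then-hit round trip can be exercised without a live Redis. The `FakeRedis` stub below is purely illustrative, standing in for the async client, and the single hard-coded key stands in for the shared canonical key both queries would produce:

```python
import asyncio
import json

class FakeRedis:
    """In-memory stand-in for the async Redis client (illustration only)."""
    def __init__(self):
        self._data = {}

    async def get(self, key):
        return self._data.get(key)

    async def set(self, key, value, ex=None):
        self._data[key] = value  # TTL ignored in this stub

async def demo():
    redis = FakeRedis()
    key = "subq:signature:abc"  # pretend both queries map to this key
    # First query: cache miss -> run against the graph, then store.
    await redis.set(key, json.dumps([{"b.id": 42}]), ex=300)
    # Second, structurally identical query: served from the cache.
    cached = await redis.get(key)
    return json.loads(cached)

result = asyncio.run(demo())
print(result)  # [{'b.id': 42}]
```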
Why the canonical signature wins
The unit test test_overlap_fragment_only_does_not_reuse_alias_changed_query shows the difference directly. Fragment keys are hashes of raw query text, so any byte-level change inside the matched span — a renamed alias or label — produces a different key and the queries miss each other. With canonical_hybrid, the test test_overlap_reuses_canonical_path_signature_across_query_aliases confirms the second query reads from the cache populated by the first.
The canonical signature is intentionally coarse. Two queries with different WHERE filters or different RETURN projections will collide on the same key, which is wrong. The mitigation is ordering: the orchestrator only consults overlap reuse after checking the primary key, which is derived from the full query text. Overlap is therefore a fallback that fires only when no exact match exists, and collisions on the canonical key are a small risk in practice because the workload usually issues either the exact same query or a structurally identical one, not a structurally identical query with a different RETURN clause.