# Query Normalization
Turning a Cypher string and a parameter dictionary into a deterministic cache key. The whole optimization stack depends on this being stable.
## Why this matters
Two clients can issue the same query in subtly different ways: extra whitespace, parameter dictionaries with different key insertion order, integers stringified into JSON inconsistently. If the cache key depended on any of those surface details, two semantically identical requests would miss each other and the cache would be useless. The normaliser removes all such variation.
## The whole function in one screen

It is a small piece of code, but it underpins every optimizer that follows. The full implementation lives in `middleware/query_parser.py`:

```python
import json
import re
from hashlib import sha256
from typing import Any

WHITESPACE_RE = re.compile(r"\s+")


def _canonicalize(value: Any) -> Any:
    if isinstance(value, dict):
        return {key: _canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [_canonicalize(item) for item in value]
    return value


def normalize_query(query: str, params: dict) -> str:
    normalized_query = WHITESPACE_RE.sub(" ", query.strip())
    canonical_params = _canonicalize(params or {})
    serialized = json.dumps(
        canonical_params,
        separators=(",", ":"),
        sort_keys=True,
        default=str,
    )
    params_hash = sha256(serialized.encode("utf-8")).hexdigest()
    return f"{normalized_query}|params:{params_hash}"
```
## Step by step
### 1. Whitespace collapse
The regex `\s+` matches one or more whitespace characters and replaces each run with a single space; `.strip()` removes leading and trailing whitespace. After this step, two queries that differ only in indentation, line breaks, or stray tabs become identical.
This step does not normalise variable names, label aliases, or the shape of the query. Those are intentionally left alone here, because the cache key needs to discriminate between genuinely different queries. The richer matching for path fragments lives in the overlapping subqueries module, which extracts a structural signature on top of this normalised string.
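As a quick illustration of the collapse step, here is a standalone sketch using the same regex as the listing above (the two example queries are made up for demonstration):

```python
import re

WHITESPACE_RE = re.compile(r"\s+")

# Hypothetical example: the same query written by two different clients.
q1 = "MATCH (n:User)\n\tWHERE n.id = $id\n\tRETURN n"
q2 = "MATCH (n:User)   WHERE n.id = $id RETURN n"

collapsed_1 = WHITESPACE_RE.sub(" ", q1.strip())
collapsed_2 = WHITESPACE_RE.sub(" ", q2.strip())

# Indentation, tabs, and line breaks no longer matter.
assert collapsed_1 == collapsed_2
```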
### 2. Parameter canonicalisation
Python dictionaries preserve insertion order. If client A sends `{"id": 1, "limit": 10}` and client B sends `{"limit": 10, "id": 1}`, both should produce the same cache key. `_canonicalize` recursively rebuilds dictionaries with sorted keys and walks lists element by element, so nested objects of arbitrary depth are also canonicalised.
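To see the recursion at work, here is the `_canonicalize` helper from the listing above applied to a nested structure (the example values are made up):

```python
from typing import Any


def _canonicalize(value: Any) -> Any:
    # Rebuild dicts with sorted keys; recurse into lists element by element.
    if isinstance(value, dict):
        return {key: _canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [_canonicalize(item) for item in value]
    return value


# Top-level and nested keys both come out in sorted insertion order.
canonical = _canonicalize({"b": 1, "a": {"z": 2, "y": [{"q": 3, "p": 4}]}})
assert list(canonical) == ["a", "b"]
assert list(canonical["a"]) == ["y", "z"]
assert list(canonical["a"]["y"][0]) == ["p", "q"]
```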
### 3. Deterministic JSON serialisation
The canonicalised params are serialised with `separators=(",", ":")` (no extraneous whitespace), `sort_keys=True` (a belt-and-braces guarantee on top of canonicalisation), and `default=str` so that values like `datetime` objects do not crash the call.
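A small sketch of what those three arguments buy (the parameter values here are hypothetical): `default=str` falls back to `str()` for non-JSON types, so a `datetime` serialises instead of raising `TypeError`, and the output is compact and key-sorted.

```python
import json
from datetime import datetime

params = {"since": datetime(2024, 1, 1, 12, 0), "limit": 10}

serialized = json.dumps(
    params,
    separators=(",", ":"),  # no spaces after "," or ":"
    sort_keys=True,         # "limit" before "since", always
    default=str,            # datetime -> "2024-01-01 12:00:00"
)
```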
### 4. SHA-256 hash
Hashing the params keeps the cache key bounded in length even when the parameters are large. The full normalised query plus the hash forms the final string, separated by a literal `|params:`.
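The bound is easy to demonstrate: a SHA-256 hex digest is always 64 characters, regardless of input size. A sketch with a deliberately large (made-up) parameter payload:

```python
from hashlib import sha256

# A params payload with ten thousand IDs still hashes to a fixed-size digest.
big_params = '{"ids":[' + ",".join(str(i) for i in range(10_000)) + "]}"
digest = sha256(big_params.encode("utf-8")).hexdigest()

assert len(big_params) > 40_000
assert len(digest) == 64  # hex digest length is fixed
```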
## Cache key prefixing
The output of `normalize_query` is wrapped one more time by `build_cache_key` in `utils/key_generator.py`, which prepends the configured `CACHE_PREFIX`:

```python
def build_cache_key(prefix: str, normalized: str) -> str:
    return f"{prefix}{normalized}"
```

Different optimizers use different prefixes so they share Redis without collisions:
| Prefix | Owner | Purpose |
|---|---|---|
| `kgcache:` | GraphCache | Primary result cache, keyed by normalised query. |
| `subquery:fragment:` | OverlappingSubqueryCache | SHA-256 of a raw variable-length path fragment. |
| `subquery:signature:` | OverlappingSubqueryCache | SHA-256 of a canonical path signature (label, rel, length, degree bucket). |
| `freq:` | FrequencyAwareCache | Per-key access count, last-seen, and update-frequency counters. |
| `bfs:` | ExternalBFS | Cached shortest-path results. |
| `<key>:lock` | JitterStampede | Single-flight UUID token, guards stampede recompute. |
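For instance, two optimizers keying on the same normalised query land in distinct namespaces. A sketch using `build_cache_key` from above (the `deadbeef` digest is a placeholder, not a real hash):

```python
def build_cache_key(prefix: str, normalized: str) -> str:
    return f"{prefix}{normalized}"


normalized = "MATCH (n) RETURN n|params:deadbeef"

result_key = build_cache_key("kgcache:", normalized)
freq_key = build_cache_key("freq:", normalized)

# Same normalised query, disjoint Redis namespaces.
assert result_key != freq_key
assert result_key.startswith("kgcache:")
assert freq_key.startswith("freq:")
```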
## A unit test that pins this down
The file `tests/test_query_parser.py` guarantees the function is deterministic regardless of dict ordering:

```python
def test_normalize_query_is_deterministic():
    q1 = normalize_query("MATCH (n) RETURN n", {"a": 1, "b": 2})
    q2 = normalize_query("  MATCH (n)\n  RETURN n  ", {"b": 2, "a": 1})
    assert q1 == q2
```
If you ever extend the normaliser, this is the test that will catch a regression.
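A hypothetical companion test for nested parameters might look like the sketch below (the test name and the inlined copy of `normalize_query` are for standalone illustration only; in the real suite you would import from `middleware.query_parser`):

```python
import json
import re
from hashlib import sha256
from typing import Any

WHITESPACE_RE = re.compile(r"\s+")


def _canonicalize(value: Any) -> Any:
    if isinstance(value, dict):
        return {key: _canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [_canonicalize(item) for item in value]
    return value


def normalize_query(query: str, params: dict) -> str:
    normalized_query = WHITESPACE_RE.sub(" ", query.strip())
    serialized = json.dumps(
        _canonicalize(params or {}),
        separators=(",", ":"),
        sort_keys=True,
        default=str,
    )
    return f"{normalized_query}|params:{sha256(serialized.encode('utf-8')).hexdigest()}"


def test_nested_params_are_order_insensitive():
    # Reordered keys at every nesting level must not change the key.
    q1 = normalize_query("MATCH (n) RETURN n", {"f": {"x": 1, "tags": [1, 2]}, "limit": 5})
    q2 = normalize_query("  MATCH (n)\n  RETURN n  ", {"limit": 5, "f": {"tags": [1, 2], "x": 1}})
    assert q1 == q2
```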