# Query Normalization
Turning a Cypher string and a parameter dictionary into a deterministic cache key. The whole optimization stack depends on this being stable.
## Why this matters
Two clients can issue the same query in subtly different ways: extra whitespace, parameter dictionaries with different key insertion order, integers stringified into JSON inconsistently. If the cache key depended on any of those surface details, two semantically identical requests would miss each other and the cache would be useless. The normaliser removes all such variation.
## The whole function in one screen

It is a small piece of code, but it underpins every optimizer that follows. The full implementation lives in `middleware/query_parser.py`:

```python
import json
import re
from hashlib import sha256
from typing import Any

WHITESPACE_RE = re.compile(r"\s+")


def _canonicalize(value: Any) -> Any:
    if isinstance(value, dict):
        return {key: _canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [_canonicalize(item) for item in value]
    return value


def normalize_query(query: str, params: dict) -> str:
    normalized_query = WHITESPACE_RE.sub(" ", query.strip())
    canonical_params = _canonicalize(params or {})
    serialized = json.dumps(
        canonical_params,
        separators=(",", ":"),
        sort_keys=True,
        default=str,
    )
    params_hash = sha256(serialized.encode("utf-8")).hexdigest()
    return f"{normalized_query}|params:{params_hash}"
```
## Step by step
### 1. Whitespace collapse
The regex `\s+` matches one or more whitespace characters and replaces each run with a single space; `.strip()` removes leading and trailing whitespace. After this step, two queries that differ only in indentation, line breaks, or stray tabs become identical.
This step does not normalise variable names, label aliases, or the shape of the query. Those are intentionally left alone here, because the cache key needs to discriminate between genuinely different queries. The richer matching for path fragments lives in the overlapping subqueries module, which extracts a structural signature on top of this normalised string.
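As a quick illustration of the collapse step, here is a standalone sketch using the same regex as the listing above (the two example queries are made up for demonstration):

```python
import re

WHITESPACE_RE = re.compile(r"\s+")

# Hypothetical example: the same query written by two different clients.
q1 = "MATCH (n:User)\n\tWHERE n.id = $id\n\tRETURN n"
q2 = "MATCH (n:User)   WHERE n.id = $id RETURN n"

collapsed_1 = WHITESPACE_RE.sub(" ", q1.strip())
collapsed_2 = WHITESPACE_RE.sub(" ", q2.strip())

# Indentation, tabs, and line breaks no longer matter.
assert collapsed_1 == collapsed_2
```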
### 2. Parameter canonicalisation
Python dictionaries preserve insertion order. If client A sends `{"id": 1, "limit": 10}` and client B sends `{"limit": 10, "id": 1}`, both should produce the same cache key. `_canonicalize` recursively rebuilds dictionaries with sorted keys and walks lists element by element, so nested objects of arbitrary depth are also canonicalised.
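To see the recursion at work, here is the `_canonicalize` helper from the listing above applied to a nested structure (the example values are made up):

```python
from typing import Any


def _canonicalize(value: Any) -> Any:
    # Rebuild dicts with sorted keys; recurse into lists element by element.
    if isinstance(value, dict):
        return {key: _canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [_canonicalize(item) for item in value]
    return value


# Top-level and nested keys both come out in sorted insertion order.
canonical = _canonicalize({"b": 1, "a": {"z": 2, "y": [{"q": 3, "p": 4}]}})
assert list(canonical) == ["a", "b"]
assert list(canonical["a"]) == ["y", "z"]
assert list(canonical["a"]["y"][0]) == ["p", "q"]
```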
### 3. Deterministic JSON serialisation
The canonicalised params are serialised with `separators=(",", ":")` (no extraneous whitespace), `sort_keys=True` (a belt-and-braces guarantee on top of canonicalisation), and `default=str` so that values like `datetime` objects do not crash the call.
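A small sketch of what those three arguments buy (the parameter values here are hypothetical): `default=str` falls back to `str()` for non-JSON types, so a `datetime` serialises instead of raising `TypeError`, and the output is compact and key-sorted.

```python
import json
from datetime import datetime

params = {"since": datetime(2024, 1, 1, 12, 0), "limit": 10}

serialized = json.dumps(
    params,
    separators=(",", ":"),  # no spaces after "," or ":"
    sort_keys=True,         # "limit" before "since", always
    default=str,            # datetime -> "2024-01-01 12:00:00"
)
```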
### 4. SHA-256 hash
Hashing the params keeps the cache key bounded in length even when the parameters are large. The full normalised query plus the hash forms the final string, separated by a literal `|params:`.
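The bound is easy to demonstrate: a SHA-256 hex digest is always 64 characters, regardless of input size. A sketch with a deliberately large (made-up) parameter payload:

```python
from hashlib import sha256

# A params payload with ten thousand IDs still hashes to a fixed-size digest.
big_params = '{"ids":[' + ",".join(str(i) for i in range(10_000)) + "]}"
digest = sha256(big_params.encode("utf-8")).hexdigest()

assert len(big_params) > 40_000
assert len(digest) == 64  # hex digest length is fixed
```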
## Cache key prefixing
The output of `normalize_query` is wrapped one more time by `build_cache_key` in `utils/key_generator.py`, which prepends the configured `CACHE_PREFIX`:

```python
def build_cache_key(prefix: str, normalized: str) -> str:
    return f"{prefix}{normalized}"
```

Different optimizers use different prefixes so they share Redis without collisions:
| Prefix | Owner | Purpose |
|---|---|---|
| `kgcache:` | GraphCache | Primary result cache, keyed by normalised query. |
| `subquery:fragment:` | OverlappingSubqueryCache | SHA-256 of a raw variable-length path fragment. |
| `subquery:signature:` | OverlappingSubqueryCache | SHA-256 of a canonical path signature (label, rel, length, degree bucket). |
| `freq:` | FrequencyAwareCache | Per-key access count, last-seen, and update-frequency counters. |
| `bfs:` | ExternalBFS | Cached shortest-path results. |
| `<key>:lock` | JitterStampede | Single-flight UUID token, guards stampede recompute. |
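For instance, two optimizers keying on the same normalised query land in distinct namespaces. A sketch using `build_cache_key` from above (the `deadbeef` digest is a placeholder, not a real hash):

```python
def build_cache_key(prefix: str, normalized: str) -> str:
    return f"{prefix}{normalized}"


normalized = "MATCH (n) RETURN n|params:deadbeef"

result_key = build_cache_key("kgcache:", normalized)
freq_key = build_cache_key("freq:", normalized)

# Same normalised query, disjoint Redis namespaces.
assert result_key != freq_key
assert result_key.startswith("kgcache:")
assert freq_key.startswith("freq:")
```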
## A unit test that pins this down
The file `tests/test_query_parser.py` guarantees the function is deterministic regardless of dict ordering:

```python
def test_normalize_query_is_deterministic():
    q1 = normalize_query("MATCH (n) RETURN n", {"a": 1, "b": 2})
    q2 = normalize_query("  MATCH (n)\n  RETURN n  ", {"b": 2, "a": 1})
    assert q1 == q2
```
If you ever extend the normaliser, this is the test that will catch a regression.
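A hypothetical companion test for nested parameters might look like the sketch below (the test name and the inlined copy of `normalize_query` are for standalone illustration only; in the real suite you would import from `middleware.query_parser`):

```python
import json
import re
from hashlib import sha256
from typing import Any

WHITESPACE_RE = re.compile(r"\s+")


def _canonicalize(value: Any) -> Any:
    if isinstance(value, dict):
        return {key: _canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [_canonicalize(item) for item in value]
    return value


def normalize_query(query: str, params: dict) -> str:
    normalized_query = WHITESPACE_RE.sub(" ", query.strip())
    serialized = json.dumps(
        _canonicalize(params or {}),
        separators=(",", ":"),
        sort_keys=True,
        default=str,
    )
    return f"{normalized_query}|params:{sha256(serialized.encode('utf-8')).hexdigest()}"


def test_nested_params_are_order_insensitive():
    # Reordered keys at every nesting level must not change the key.
    q1 = normalize_query("MATCH (n) RETURN n", {"f": {"x": 1, "tags": [1, 2]}, "limit": 5})
    q2 = normalize_query("  MATCH (n)\n  RETURN n  ", {"limit": 5, "f": {"tags": [1, 2], "x": 1}})
    assert q1 == q2
```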