Skip to content

fix: correct query/explain/path resolution failures on natural-language input#1658

Open
nokternol wants to merge 8 commits into
Graphify-Labs:v8from
nokternol:fix/seed-selection-single-term-collision
Open

fix: correct query/explain/path resolution failures on natural-language input#1658
nokternol wants to merge 8 commits into
Graphify-Labs:v8from
nokternol:fix/seed-selection-single-term-collision

Conversation

@nokternol

Copy link
Copy Markdown
Contributor

Summary

query, explain, and path are designed to accept natural-language input, but several parts of the
resolution and ranking pipeline only worked correctly for short, exact identifier-style input. Multi-word
questions and phrases could silently resolve to the wrong node, an unrelated node, or no node at all, with
no indication that anything had gone wrong. This PR is a set of targeted fixes to that resolution pipeline —
seed selection, ranking, ambiguity handling, and fallback behavior — so that natural-language queries
against graph.json reliably reach the node(s) they describe, or fail visibly instead of silently.

Changes

Each change below targets a distinct, reproducible failure mode in how a natural-language query resolves to
graph nodes.

1. Guarantee per-term BFS seed diversity in query (_pick_seeds, graphify/serve.py)
_pick_seeds selected BFS seeds by combined score and discarded any candidate scoring below 20% of the top
score. On a multi-word query, one term's incidental exact match on an unrelated node (e.g. a common word
that also happens to be an identifier or field name elsewhere in the corpus) can outscore every other term's
weaker match by orders of magnitude, so the 20%-gap cutoff discarded every seed for the query's actually
relevant terms — the BFS then only ever explores the neighborhood of one unrelated node. Fix: guarantee at
least one seed per distinct query term that has any match at all, tie-broken by node degree. This does not
change single-term or already-unambiguous queries; it only recovers seeds that were previously discarded by
the gap cutoff.

2. Reattach cached semantic content on incremental AST-only rebuilds (_rebuild_code, graphify/watch.py)
The incremental code-only rebuild path evicts a re-extracted file's prior semantic nodes/edges
unconditionally and never regenerates them (no LLM call on that path by design), even when the semantic
cache already holds a valid, content-hash-matched result for the file's current bytes. Fix: check the
content-hash-keyed semantic cache for every re-extracted file before eviction and reattach any hit alongside
the fresh AST nodes. This adds no LLM cost (cache lookup only) and is byte-identical to a full rebuild's
cached output for files with a cache hit; files with no cache entry keep the existing AST-only behavior
unchanged.

3. Rank query output by relevance instead of raw degree (_subgraph_to_text, _query_graph_text,
graphify/serve.py)

_subgraph_to_text has always accepted a seeds parameter to render seed nodes first, but its only caller,
_query_graph_text, never passed it — every query result fell back to whole-graph degree-sort, so the
most-imported/highest-degree node in the corpus (test setup, DB schema, config, logger, etc.) led the output
regardless of relevance to the query, and the nodes the seeding logic actually selected could be pushed far
down or truncated out by the token budget. Fix: _query_graph_text now passes both seeds=start_nodes
(seeds render first) and scores=scored (the same per-node relevance scores already computed to select
seeds, now also used to rank the non-seed expansion). Nodes with no term overlap fall back to degree exactly
as before, so output ordering for a query with no scoring signal is unchanged.

4. Surface ambiguous or precedence-collapsed matches in explain instead of guessing (_find_node_tiers,
graphify/serve.py; graphify/__main__.py)

explain's single-node resolution already computed a full tier-ordered candidate list (exact
source-path match, exact match, prefix match, substring match) but took the first candidate unconditionally
and discarded the rest, with no signal when other candidates existed. Two related failure shapes: (a)
multiple candidates tied in the same top-precedence tier, and (b) exactly one top-tier candidate that is
weakly connected (degree ≤ 1), fully hiding a more relevant candidate one tier down purely due to tier
precedence. Fix: _find_node_tiers exposes tier boundaries so both shapes can be detected; explain now
lists all tied/hidden candidates instead of picking one, with a degree-dominance exception (top candidate's
degree at least ~3x the runner-up's resolves directly, since that is not a real ambiguity) and a --force
flag to bypass the guard and reproduce the prior single-candidate behavior.

5. Surface ambiguous endpoint matches in path instead of a warning that is ignored
(graphify/__main__.py)

path already computed a top-vs-runner-up score gap for each endpoint but only printed a stderr warning and
then silently proceeded with the top-scored candidate regardless. Fix: the same close-score-gap condition is
now a hard ambiguity gate — a numbered candidate list (label, source, degree, score) is shown instead of a
warning, with the same degree-dominance exception as (4) so a genuinely dominant symbol still resolves
directly, and the same --force flag to bypass.

6. Fall back to term-overlap candidates in explain instead of failing silently on a multi-word phrase
(graphify/__main__.py)

_find_node_tiers requires the query's tokens, rejoined into a single string, to match/prefix/substring a
single node's label as a whole. A genuine multi-word natural-language phrase (e.g. three or more words)
almost never appears as a literal substring of any one node's label, so this returns "No node matching
found" even when every individual word in the phrase exists on a real, relevant node — with no fallback and
no signal that anything went wrong. Fix: when the tiered lookup finds nothing and the phrase has more than
one token, fall back to the same per-token relevance scoring query already uses (_score_nodes) and list
the top candidates by term overlap, in the same candidate-list format used by (4). A genuine single-word
miss is unaffected — a one-word probe would score identically to the substring tier already tried and has
nothing new to find, so the fallback is gated on token count.

7. Gate _score_nodes' full-query bonus to multi-token queries (graphify/serve.py)
_score_nodes has a "joined" full-query tier so a multi-word query that equals or prefixes a whole
multi-word label wins outright, since no single token in a per-token score sum could otherwise equal that
label. This tier compares the query's tokenized form (punctuation stripped) against a node's tokenized
label. For a single-token query, this can falsely promote a node whose only word-character content matches
that one token — e.g. a bare method call like .search(), which tokenizes to "search" — to the same
tier as a genuine exact match, even though the same node correctly fails the ordinary per-token exact check
a few lines below (raw label ".search" is not equal to raw query "search"). This specifically affects
_pick_seeds's per-term seed-diversity check from (1), which probes each distinct query term in isolation
via _score_nodes(G, [term]): a short method name repeated across multiple unrelated files can win that
single-term probe outright and prevent a genuinely relevant multi-word file (which only reaches the prefix
tier for the same bare word) from ever being selected as a seed. Fix: gate the joined-tier bonus on
len(norm_terms) > 1. A single-token probe has no "multi-word phrase vs. per-token bag-of-words"
distinction to make — the per-token loop immediately below already fully and correctly handles single-term
exact/prefix/substring matching via raw, non-tokenized comparison. The combined multi-word query path used
everywhere else is unaffected, since it always has more than one token.

8. Cap explain's term-overlap fallback to avoid noise floods (graphify/__main__.py)
The fallback added in (6) has no floor on how broad a "match" is useful. A query whose only vocabulary
overlap with the corpus is one generic word (e.g. a word that is also a top-level directory name in the
codebase being queried) can match a large fraction of the entire graph at the weakest possible scoring tier,
and the fallback would present an arbitrary slice of that as though it were a considered answer, when in
fact none of the matched nodes are meaningfully distinguished from one another. Fix: after scoring, if the
candidate count exceeds both an absolute floor (50 — so small graphs are never affected) and 15% of the
graph's total node count, the fallback is treated as a noise flood and the plain zero-match message is shown
instead of a misleadingly specific candidate list. Genuine large-but-real candidate lists stay well under
this threshold and are unaffected.

Testing

Full test suite: 2768 passing, 28 skipped, 1 pre-existing failure unrelated to these changes
(tests/test_cache.py::test_semantic_prune_removes_orphan_entries, a test-order-dependent flake reproducible
identically on v8 before any commit in this PR). ruff check passes with no findings.

Each change above has dedicated regression tests in tests/test_serve.py, tests/test_watch.py,
tests/test_explain_cli.py, and tests/test_path_cli.py, covering both the failure case being fixed and a
non-regression pin for the prior, still-correct behavior it must not disturb.

nokternol added 8 commits July 1, 2026 21:17
…abs#1445)

_pick_seeds' gap_ratio cutoff discards any candidate scoring below 20%
of the top score. On a multi-term natural-language query, one term's
incidental EXACT label match on a node that is otherwise unrelated to
the query's intent (e.g. a common word also used as a field name or
identifier elsewhere in the corpus) scores ~1000x higher than any
SUBSTRING match on the query's other, actually-relevant terms
(_EXACT_MATCH_BONUS vs _SUBSTRING_MATCH_BONUS). The cutoff then
silently discards every one of those substring-tier candidates as BFS
seeds, so the traversal only ever explores the neighborhood of the one
unrelated exact match, and `query` returns confidently-wrong results
with no signal that anything went wrong.

This matches Graphify-Labs#1445's reproduction exactly: a vague query that doesn't
name a target symbol seeds from unrelated "concept-dense" nodes
instead, even though the target node is present in the graph.

_pick_seeds now optionally accepts the graph and the tokenized query
terms; when supplied, it guarantees at least one seed per distinct
term that has any match at all, so one term's collision cannot starve
out the others. Ties within a term are broken by node degree, so an
isolated incidental match doesn't out-rank a real, well-connected hub
for that term. The parameters default to None and existing callers
that don't pass them see byte-identical behavior (see
test_pick_seeds_without_diversity_args_is_unchanged).

Adds a regression test reproducing the exact failure shape from
Graphify-Labs#1445 and confirms the previously-starved target node is recovered
as a seed once G/terms are supplied.

Full test suite (74 tests) and ruff both pass.
Graphify-Labs#1610)

_rebuild_code's incremental path unconditionally evicts a re-extracted
file's prior semantic nodes/edges and never regenerates them, since
this path makes no LLM call by design. That's correct when the file's
content genuinely changed, but the semantic cache is keyed on the
file's current content hash: a cache hit means this exact content was
already LLM-processed and the cached result is still valid verbatim,
so dropping it on every touch of the file is a pure availability bug,
not a correctness safeguard.

_rebuild_code now checks the semantic cache for every re-extracted
file before the eviction step and reattaches any cache hits alongside
the fresh AST nodes. This costs zero additional LLM calls (cache
lookup only) and is deterministic (reattached content is
byte-identical to what a full rebuild would have cached). Files with
no matching cache entry are unaffected — they keep today's AST-only
behavior exactly.

Full suite (2745 tests, 1 pre-existing unrelated failure reproduced on
main) and ruff pass.
_subgraph_to_text has always supported rendering a traversal's seed
nodes first, ahead of a degree-sorted expansion — but _query_graph_text
was its only caller, and never passed `seeds`, so every `graphify
query` result silently fell back to pure whole-graph degree-sort: the
most-imported/highest-degree node in the corpus (test setup, DB
schema, DI container, config, logger) led the output regardless of
relevance, and the nodes the traversal actually seeded from (the ones
_pick_seeds and Graphify-Labs#1445's per-term guarantee worked to select) could be
buried arbitrarily far down or truncated out by the token budget.

_query_graph_text now passes both `seeds=start_nodes` and
`scores=scored` through. Seeds render first, as designed. Non-seed
nodes are now ranked by the same per-node query-relevance score
_score_nodes already computes to pick seeds, falling back to degree
only for nodes with zero term overlap (exactly today's behavior for
that case) — so a node that matched a query term but has an ordinary
degree now outranks an unrelated high-degree hub, instead of losing to
it outright.

Strictly additive: `scores` defaults to None, and the "no scores"
ordering is byte-identical to before (verified with a dedicated
regression test). Full suite (2762 tests, same 1 pre-existing
unrelated failure as prior commits) and ruff pass.

Verified live against a real repo's graph.json: the same
natural-language query that previously surfaced vitest.ts/schema.ts/
container.ts/config.ts/logger.ts ahead of anything relevant now
surfaces plexProvider.ts, identityJobFactory.ts, identityResolutionJob.ts,
IdentityJobFactory, PlexProvider, mediaIdentity.test.ts — every visible
node on-topic, none of the previous generic-hub noise in the first 20
lines.
…ify-Labs#1613)

`explain` resolved to `_find_node`'s first match unconditionally,
discarding every other candidate with no signal — the exact failure
mode Graphify-Labs#1445 fixed for `query`'s BFS seeding, present all along in
`explain`'s own resolution step but never addressed. Confirmed live:
`explain "genres"` silently returned an unrelated, degree-1 Storybook
literal while `GenresResponse` (the actually relevant node per this
repo's own Trial 1 ground truth) sat one precedence tier down,
discarded — not a same-tier tie but a lone weak exact match masking a
real prefix match, since `_find_node`'s tiers apply strict precedence
(exact beats prefix beats substring) with nothing to catch a weak
winner.

`_find_node`'s internal tiers are now exposed via `_find_node_tiers`
(source_exact, exact, prefix, substring, plus whether the existing
file-level-node heuristic already resolved a multi-entry source_exact
tier). `explain` uses this to catch two ambiguity shapes instead of
picking `matches[0]` blindly:

- same-tier ties: more than one candidate in the top precedence tier.
- precedence collapse: exactly one candidate in the top tier, but it is
  weakly connected (degree <= 1) and a lower tier has candidates that
  are fully hidden by tier precedence alone.

Either case prints a numbered candidate list (label, source, degree)
instead of guessing, and exits cleanly; `--force` reproduces the old
behavior for scripts that want it. A degree-dominance escape hatch
(mirrors `path`'s existing top-vs-runner-up gap check, applied to
degree) keeps same-tier ties from over-firing: when one candidate's
degree dominates the runner-up, it resolves directly with no prompt,
found necessary during live verification against a real repo where a
DI container type name legitimately appears as 3 distinct exact-tier
nodes (the real definition plus 2 unrelated per-file type-annotation
references) — a real, dominant winner, not a genuine tie.

Full suite (2764 tests, same 1 pre-existing unrelated failure as prior
commits) and ruff pass. Verified live against a real repo's graph.json:
`explain "genres"` now surfaces both candidates; `explain "Cradle"` and
`explain "identityResolutionJob"` (previously-clean cases) are
unaffected; `explain "identity linking"` (a genuine no-match) is
unaffected.
…raphify-Labs#1614)

Port of Graphify-Labs#1613's `explain` fix into `path`'s endpoint resolution.
`path` already detected a close top-vs-runner-up score gap on either
endpoint, but only printed a stderr warning and then silently
proceeded with the top-scored candidate regardless — the same
silently-wrong-with-no-actionable-signal failure Graphify-Labs#1445 and Graphify-Labs#1613 fixed
elsewhere. Confirmed live against the real repro:
`path "filterRegistry" "useMediaLookups"` has `filterRegistry.ts`
(degree 21, the real module) and `filterRegistry.test.ts` (degree 8)
tied at the exact same score — a real ambiguity a stable sort tiebreak
happened to resolve "correctly" this time, with nothing to say it
might not next time. Both endpoints were individually resolving fine
in isolation; the misleading result came from silently accepting a
coin-flip on one of them.

Endpoint resolution now uses the same two-part gate as Graphify-Labs#1613: a close
score gap is treated as ambiguous and prints a numbered candidate list
(label, source, degree, score) instead of guessing, unless one
candidate's degree dominates the runner-up (mirrors Graphify-Labs#1613's
degree-dominance escape hatch, applied here to the pre-existing
score-gap check rather than tier membership). `--force` bypasses the
guard entirely, reproducing pre-Graphify-Labs#1614 behavior.

Full suite (2768 tests, same 1 pre-existing unrelated failure as prior
commits) and ruff pass. Verified live: the repro now lists both tied
candidates instead of silently picking one; `--force` reproduces the
old misleading result for comparison; a genuinely unambiguous path
(identityResolutionJob -> plexProvider) resolves directly with no
false-positive prompt.

Scope: this fixes endpoint disambiguation, not the separate issue that
a correctly-resolved shortest path can still route through irrelevant
test-infrastructure edges when that happens to be graph-theoretically
shortest — that's a distinct, unaddressed problem in path's traversal
itself, not its endpoint resolution.
…iling (Graphify-Labs#1616)

`graphify explain "<phrase>"` treats its whole argument as one string that
must match/prefix/substring a single node's label as a whole — so a genuine
natural-language phrase (e.g. "critic score aggregation") returns "No node
matching found" even when every individual word exists on a real, relevant
node, because no node label ever literally contains the entire multi-word
phrase. This silently dead-ends on exactly the query shape `explain` is
otherwise suggested for, with no fallback and no signal that anything went
wrong (worse than noise: a hard, silent zero).

When the tiered lookup finds nothing and the phrase has more than one token,
`explain` now falls back to the same per-token bag-of-words scoring `query`
already uses (`_score_nodes`) and lists the top candidates by term overlap,
in the same numbered-candidate format the existing ambiguity guard (Graphify-Labs#1613)
uses, instead of a bare dead end. A genuine single-word miss is unaffected —
gated on token count, since a one-word probe would score identically to the
substring tier already tried and has nothing new to find.

Regression tests: multi-word phrase with real term overlap surfaces
candidates and excludes unrelated nodes; multi-word phrase with zero overlap
still gets the honest original message; single-word miss is byte-identical
to prior behavior. Full suite (2766 tests, 1 pre-existing unrelated failure)
and ruff pass. Verified live against a real repo's graph.json: both
previously-zero `explain` queries now surface their real target
(`ratingsAggregation.ts`, `backdrops.handler.ts`) instead of nothing.
…hify-Labs#1617)

_score_nodes' "joined" full-query tier exists so a multi-word query that
equals (or prefixes) a whole multi-word label wins outright, since no single
token in a bag-of-words sum could otherwise equal that label. For a
single-token probe, this degenerates: `joined` equals the lone term, and any
node whose *tokenized* label (punctuation stripped) happens to reduce to
exactly that one word - e.g. a bare method call like `.search()`, whose only
word character content is "search" - gets promoted to the EXACT tier via the
`label_tokens` comparison, even though the same node correctly fails the
per-token loop's own raw `t == norm_label or t == bare_label` exact check a
few lines below (raw ".search" != "search").

This matters most inside `_pick_seeds`' per-term seed-diversity guarantee
(Graphify-Labs#1445), which probes each distinct query term in isolation via
`_score_nodes(G, [term])`: a short, same-named method repeated across
several unrelated files (three metadata providers each define their own
`.search()`) can win that single-term probe's EXACT tier outright and starve
out the actually-relevant multi-word file, which only reaches the PREFIX
tier for the same bare word. Reproduced live: `graphify query "how does a
change in provider settings affect what shows up in search results"` seeded
on one provider's unrelated `.search()` method and never surfaced
`search.handler.ts` at all, despite that file scoring far higher (494 vs 7)
under the query's full multi-word sentence - the bug is specific to the
single-term isolation probe, not the combined-query scoring path.

Fix: gate the joined-tier block on `len(norm_terms) > 1`. A single-token
probe has no "multi-word phrase vs per-token bag-of-words" distinction to
make in the first place - the per-token loop directly below already fully
and correctly handles single-term exact/prefix/substring matching via raw,
non-tokenized label comparison, so the bonus is both redundant and (as
shown) actively harmful when only one token is being scored. The combined
multi-word query path is unchanged, since len(norm_terms) > 1 there.

Regression tests: an isolated single-token probe now ranks the real
multi-word file above the same-named bare method; `_pick_seeds`' per-term
diversity guarantee no longer seeds the bare method over the relevant file
end-to-end. Full suite (2766 tests, 1 pre-existing unrelated failure) and
ruff pass. Verified live: search.handler.ts and its exported symbols now
appear in the traversal for the exact query that previously missed them
entirely.
…ify-Labs#1618)

Graphify-Labs#1616's term-overlap fallback (this same session) fixed `explain` hard-
failing to zero on multi-word natural-language phrases, but has its own
failure mode: when a query's only shared vocabulary with the corpus is one
generic word, every node containing that word ties at the weakest possible
bonus tier, and the fallback presents an arbitrary top-10 slice of that tie
as though it were a considered answer. Live repro: "server startup error
handling" matched 1,765 of this repo's 3,491 nodes (51%) — "server" is also
this repo's top-level backend directory name — with the real target buried
past rank 800, tied with 1,627 other nodes at the exact same floor score.
That's not a useful answer, it's close to a coin flip dressed up as one.

Fix: after scoring, if the candidate count exceeds both an absolute floor
(50) and 15% of the graph's total node count, treat it as a noise flood and
fall back to the same honest zero-match message a genuine miss gets, instead
of printing a misleadingly specific candidate list. The floor keeps this
from firing on small graphs/fixtures, where even "most of the graph matched"
can be a small, legitimate list. Genuine large-but-real candidate lists
(e.g. 31 candidates on this repo's ~3,491-node graph, an earlier fix's
verified-good case) stay well under the threshold and are unaffected.

Regression tests: a 60-of-61-node noise flood on one generic token now gets
the honest no-match message; a 20-of-21-node case (below this graph size's
threshold) still shows its candidate list normally, confirming the guard is
for degenerate floods specifically, not just "more than 10 results." Full
suite (2769 tests, all passing this run — the one known pre-existing
test-order flake did not trigger) and ruff pass. Verified live: the exact
1,765-candidate flood from earlier now returns the honest no-match message;
smaller legitimate fallbacks (critic score aggregation, backdrop image
selection) are unaffected.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant