fix: correct query/explain/path resolution failures on natural-language input#1658
Open
nokternol wants to merge 8 commits into
Open
fix: correct query/explain/path resolution failures on natural-language input#1658nokternol wants to merge 8 commits into
nokternol wants to merge 8 commits into
Conversation
…abs#1445) _pick_seeds' gap_ratio cutoff discards any candidate scoring below 20% of the top score. On a multi-term natural-language query, one term's incidental EXACT label match on a node that is otherwise unrelated to the query's intent (e.g. a common word also used as a field name or identifier elsewhere in the corpus) scores ~1000x higher than any SUBSTRING match on the query's other, actually-relevant terms (_EXACT_MATCH_BONUS vs _SUBSTRING_MATCH_BONUS). The cutoff then silently discards every one of those substring-tier candidates as BFS seeds, so the traversal only ever explores the neighborhood of the one unrelated exact match, and `query` returns confidently-wrong results with no signal that anything went wrong. This matches Graphify-Labs#1445's reproduction exactly: a vague query that doesn't name a target symbol seeds from unrelated "concept-dense" nodes instead, even though the target node is present in the graph. _pick_seeds now optionally accepts the graph and the tokenized query terms; when supplied, it guarantees at least one seed per distinct term that has any match at all, so one term's collision cannot starve out the others. Ties within a term are broken by node degree, so an isolated incidental match doesn't out-rank a real, well-connected hub for that term. The parameters default to None and existing callers that don't pass them see byte-identical behavior (see test_pick_seeds_without_diversity_args_is_unchanged). Adds a regression test reproducing the exact failure shape from Graphify-Labs#1445 and confirms the previously-starved target node is recovered as a seed once G/terms are supplied. Full test suite (74 tests) and ruff both pass.
Graphify-Labs#1610) _rebuild_code's incremental path unconditionally evicts a re-extracted file's prior semantic nodes/edges and never regenerates them, since this path makes no LLM call by design. That's correct when the file's content genuinely changed, but the semantic cache is keyed on the file's current content hash: a cache hit means this exact content was already LLM-processed and the cached result is still valid verbatim, so dropping it on every touch of the file is a pure availability bug, not a correctness safeguard. _rebuild_code now checks the semantic cache for every re-extracted file before the eviction step and reattaches any cache hits alongside the fresh AST nodes. This costs zero additional LLM calls (cache lookup only) and is deterministic (reattached content is byte-identical to what a full rebuild would have cached). Files with no matching cache entry are unaffected — they keep today's AST-only behavior exactly. Full suite (2745 tests, 1 pre-existing unrelated failure reproduced on main) and ruff pass.
_subgraph_to_text has always supported rendering a traversal's seed nodes first, ahead of a degree-sorted expansion — but _query_graph_text was its only caller, and never passed `seeds`, so every `graphify query` result silently fell back to pure whole-graph degree-sort: the most-imported/highest-degree node in the corpus (test setup, DB schema, DI container, config, logger) led the output regardless of relevance, and the nodes the traversal actually seeded from (the ones _pick_seeds and Graphify-Labs#1445's per-term guarantee worked to select) could be buried arbitrarily far down or truncated out by the token budget. _query_graph_text now passes both `seeds=start_nodes` and `scores=scored` through. Seeds render first, as designed. Non-seed nodes are now ranked by the same per-node query-relevance score _score_nodes already computes to pick seeds, falling back to degree only for nodes with zero term overlap (exactly today's behavior for that case) — so a node that matched a query term but has an ordinary degree now outranks an unrelated high-degree hub, instead of losing to it outright. Strictly additive: `scores` defaults to None, and the "no scores" ordering is byte-identical to before (verified with a dedicated regression test). Full suite (2762 tests, same 1 pre-existing unrelated failure as prior commits) and ruff pass. Verified live against a real repo's graph.json: the same natural-language query that previously surfaced vitest.ts/schema.ts/ container.ts/config.ts/logger.ts ahead of anything relevant now surfaces plexProvider.ts, identityJobFactory.ts, identityResolutionJob.ts, IdentityJobFactory, PlexProvider, mediaIdentity.test.ts — every visible node on-topic, none of the previous generic-hub noise in the first 20 lines.
…ify-Labs#1613) `explain` resolved to `_find_node`'s first match unconditionally, discarding every other candidate with no signal — the exact failure mode Graphify-Labs#1445 fixed for `query`'s BFS seeding, present all along in `explain`'s own resolution step but never addressed. Confirmed live: `explain "genres"` silently returned an unrelated, degree-1 Storybook literal while `GenresResponse` (the actually relevant node per this repo's own Trial 1 ground truth) sat one precedence tier down, discarded — not a same-tier tie but a lone weak exact match masking a real prefix match, since `_find_node`'s tiers apply strict precedence (exact beats prefix beats substring) with nothing to catch a weak winner. `_find_node`'s internal tiers are now exposed via `_find_node_tiers` (source_exact, exact, prefix, substring, plus whether the existing file-level-node heuristic already resolved a multi-entry source_exact tier). `explain` uses this to catch two ambiguity shapes instead of picking `matches[0]` blindly: - same-tier ties: more than one candidate in the top precedence tier. - precedence collapse: exactly one candidate in the top tier, but it is weakly connected (degree <= 1) and a lower tier has candidates that are fully hidden by tier precedence alone. Either case prints a numbered candidate list (label, source, degree) instead of guessing, and exits cleanly; `--force` reproduces the old behavior for scripts that want it. A degree-dominance escape hatch (mirrors `path`'s existing top-vs-runner-up gap check, applied to degree) keeps same-tier ties from over-firing: when one candidate's degree dominates the runner-up, it resolves directly with no prompt, found necessary during live verification against a real repo where a DI container type name legitimately appears as 3 distinct exact-tier nodes (the real definition plus 2 unrelated per-file type-annotation references) — a real, dominant winner, not a genuine tie. Full suite (2764 tests, same 1 pre-existing unrelated failure as prior commits) and ruff pass. Verified live against a real repo's graph.json: `explain "genres"` now surfaces both candidates; `explain "Cradle"` and `explain "identityResolutionJob"` (previously-clean cases) are unaffected; `explain "identity linking"` (a genuine no-match) is unaffected.
…raphify-Labs#1614) Port of Graphify-Labs#1613's `explain` fix into `path`'s endpoint resolution. `path` already detected a close top-vs-runner-up score gap on either endpoint, but only printed a stderr warning and then silently proceeded with the top-scored candidate regardless — the same silently-wrong-with-no-actionable-signal failure Graphify-Labs#1445 and Graphify-Labs#1613 fixed elsewhere. Confirmed live against the real repro: `path "filterRegistry" "useMediaLookups"` has `filterRegistry.ts` (degree 21, the real module) and `filterRegistry.test.ts` (degree 8) tied at the exact same score — a real ambiguity a stable sort tiebreak happened to resolve "correctly" this time, with nothing to say it might not next time. Both endpoints were individually resolving fine in isolation; the misleading result came from silently accepting a coin-flip on one of them. Endpoint resolution now uses the same two-part gate as Graphify-Labs#1613: a close score gap is treated as ambiguous and prints a numbered candidate list (label, source, degree, score) instead of guessing, unless one candidate's degree dominates the runner-up (mirrors Graphify-Labs#1613's degree-dominance escape hatch, applied here to the pre-existing score-gap check rather than tier membership). `--force` bypasses the guard entirely, reproducing pre-Graphify-Labs#1614 behavior. Full suite (2768 tests, same 1 pre-existing unrelated failure as prior commits) and ruff pass. Verified live: the repro now lists both tied candidates instead of silently picking one; `--force` reproduces the old misleading result for comparison; a genuinely unambiguous path (identityResolutionJob -> plexProvider) resolves directly with no false-positive prompt. Scope: this fixes endpoint disambiguation, not the separate issue that a correctly-resolved shortest path can still route through irrelevant test-infrastructure edges when that happens to be graph-theoretically shortest — that's a distinct, unaddressed problem in path's traversal itself, not its endpoint resolution.
…iling (Graphify-Labs#1616) `graphify explain "<phrase>"` treats its whole argument as one string that must match/prefix/substring a single node's label as a whole — so a genuine natural-language phrase (e.g. "critic score aggregation") returns "No node matching found" even when every individual word exists on a real, relevant node, because no node label ever literally contains the entire multi-word phrase. This silently dead-ends on exactly the query shape `explain` is otherwise suggested for, with no fallback and no signal that anything went wrong (worse than noise: a hard, silent zero). When the tiered lookup finds nothing and the phrase has more than one token, `explain` now falls back to the same per-token bag-of-words scoring `query` already uses (`_score_nodes`) and lists the top candidates by term overlap, in the same numbered-candidate format the existing ambiguity guard (Graphify-Labs#1613) uses, instead of a bare dead end. A genuine single-word miss is unaffected — gated on token count, since a one-word probe would score identically to the substring tier already tried and has nothing new to find. Regression tests: multi-word phrase with real term overlap surfaces candidates and excludes unrelated nodes; multi-word phrase with zero overlap still gets the honest original message; single-word miss is byte-identical to prior behavior. Full suite (2766 tests, 1 pre-existing unrelated failure) and ruff pass. Verified live against a real repo's graph.json: both previously-zero `explain` queries now surface their real target (`ratingsAggregation.ts`, `backdrops.handler.ts`) instead of nothing.
…hify-Labs#1617) _score_nodes' "joined" full-query tier exists so a multi-word query that equals (or prefixes) a whole multi-word label wins outright, since no single token in a bag-of-words sum could otherwise equal that label. For a single-token probe, this degenerates: `joined` equals the lone term, and any node whose *tokenized* label (punctuation stripped) happens to reduce to exactly that one word - e.g. a bare method call like `.search()`, whose only word character content is "search" - gets promoted to the EXACT tier via the `label_tokens` comparison, even though the same node correctly fails the per-token loop's own raw `t == norm_label or t == bare_label` exact check a few lines below (raw ".search" != "search"). This matters most inside `_pick_seeds`' per-term seed-diversity guarantee (Graphify-Labs#1445), which probes each distinct query term in isolation via `_score_nodes(G, [term])`: a short, same-named method repeated across several unrelated files (three metadata providers each define their own `.search()`) can win that single-term probe's EXACT tier outright and starve out the actually-relevant multi-word file, which only reaches the PREFIX tier for the same bare word. Reproduced live: `graphify query "how does a change in provider settings affect what shows up in search results"` seeded on one provider's unrelated `.search()` method and never surfaced `search.handler.ts` at all, despite that file scoring far higher (494 vs 7) under the query's full multi-word sentence - the bug is specific to the single-term isolation probe, not the combined-query scoring path. Fix: gate the joined-tier block on `len(norm_terms) > 1`. A single-token probe has no "multi-word phrase vs per-token bag-of-words" distinction to make in the first place - the per-token loop directly below already fully and correctly handles single-term exact/prefix/substring matching via raw, non-tokenized label comparison, so the bonus is both redundant and (as shown) actively harmful when only one token is being scored. The combined multi-word query path is unchanged, since len(norm_terms) > 1 there. Regression tests: an isolated single-token probe now ranks the real multi-word file above the same-named bare method; `_pick_seeds`' per-term diversity guarantee no longer seeds the bare method over the relevant file end-to-end. Full suite (2766 tests, 1 pre-existing unrelated failure) and ruff pass. Verified live: search.handler.ts and its exported symbols now appear in the traversal for the exact query that previously missed them entirely.
…ify-Labs#1618) Graphify-Labs#1616's term-overlap fallback (this same session) fixed `explain` hard- failing to zero on multi-word natural-language phrases, but has its own failure mode: when a query's only shared vocabulary with the corpus is one generic word, every node containing that word ties at the weakest possible bonus tier, and the fallback presents an arbitrary top-10 slice of that tie as though it were a considered answer. Live repro: "server startup error handling" matched 1,765 of this repo's 3,491 nodes (51%) — "server" is also this repo's top-level backend directory name — with the real target buried past rank 800, tied with 1,627 other nodes at the exact same floor score. That's not a useful answer, it's close to a coin flip dressed up as one. Fix: after scoring, if the candidate count exceeds both an absolute floor (50) and 15% of the graph's total node count, treat it as a noise flood and fall back to the same honest zero-match message a genuine miss gets, instead of printing a misleadingly specific candidate list. The floor keeps this from firing on small graphs/fixtures, where even "most of the graph matched" can be a small, legitimate list. Genuine large-but-real candidate lists (e.g. 31 candidates on this repo's ~3,491-node graph, an earlier fix's verified-good case) stay well under the threshold and are unaffected. Regression tests: a 60-of-61-node noise flood on one generic token now gets the honest no-match message; a 20-of-21-node case (below this graph size's threshold) still shows its candidate list normally, confirming the guard is for degenerate floods specifically, not just "more than 10 results." Full suite (2769 tests, all passing this run — the one known pre-existing test-order flake did not trigger) and ruff pass. Verified live: the exact 1,765-candidate flood from earlier now returns the honest no-match message; smaller legitimate fallbacks (critic score aggregation, backdrop image selection) are unaffected.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
query,explain, andpathare designed to accept natural-language input, but several parts of theresolution and ranking pipeline only worked correctly for short, exact identifier-style input. Multi-word
questions and phrases could silently resolve to the wrong node, an unrelated node, or no node at all, with
no indication that anything had gone wrong. This PR is a set of targeted fixes to that resolution pipeline —
seed selection, ranking, ambiguity handling, and fallback behavior — so that natural-language queries
against
graph.jsonreliably reach the node(s) they describe, or fail visibly instead of silently.Changes
Each change below targets a distinct, reproducible failure mode in how a natural-language query resolves to
graph nodes.
1. Guarantee per-term BFS seed diversity in
query(_pick_seeds,graphify/serve.py)_pick_seedsselected BFS seeds by combined score and discarded any candidate scoring below 20% of the topscore. On a multi-word query, one term's incidental exact match on an unrelated node (e.g. a common word
that also happens to be an identifier or field name elsewhere in the corpus) can outscore every other term's
weaker match by orders of magnitude, so the 20%-gap cutoff discarded every seed for the query's actually
relevant terms — the BFS then only ever explores the neighborhood of one unrelated node. Fix: guarantee at
least one seed per distinct query term that has any match at all, tie-broken by node degree. This does not
change single-term or already-unambiguous queries; it only recovers seeds that were previously discarded by
the gap cutoff.
2. Reattach cached semantic content on incremental AST-only rebuilds (
_rebuild_code,graphify/watch.py)The incremental code-only rebuild path evicts a re-extracted file's prior semantic nodes/edges
unconditionally and never regenerates them (no LLM call on that path by design), even when the semantic
cache already holds a valid, content-hash-matched result for the file's current bytes. Fix: check the
content-hash-keyed semantic cache for every re-extracted file before eviction and reattach any hit alongside
the fresh AST nodes. This adds no LLM cost (cache lookup only) and is byte-identical to a full rebuild's
cached output for files with a cache hit; files with no cache entry keep the existing AST-only behavior
unchanged.
3. Rank
queryoutput by relevance instead of raw degree (_subgraph_to_text,_query_graph_text,graphify/serve.py)_subgraph_to_texthas always accepted aseedsparameter to render seed nodes first, but its only caller,_query_graph_text, never passed it — everyqueryresult fell back to whole-graph degree-sort, so themost-imported/highest-degree node in the corpus (test setup, DB schema, config, logger, etc.) led the output
regardless of relevance to the query, and the nodes the seeding logic actually selected could be pushed far
down or truncated out by the token budget. Fix:
_query_graph_textnow passes bothseeds=start_nodes(seeds render first) and
scores=scored(the same per-node relevance scores already computed to selectseeds, now also used to rank the non-seed expansion). Nodes with no term overlap fall back to degree exactly
as before, so output ordering for a query with no scoring signal is unchanged.
4. Surface ambiguous or precedence-collapsed matches in
explaininstead of guessing (_find_node_tiers,graphify/serve.py;graphify/__main__.py)explain's single-node resolution already computed a full tier-ordered candidate list (exactsource-path match, exact match, prefix match, substring match) but took the first candidate unconditionally
and discarded the rest, with no signal when other candidates existed. Two related failure shapes: (a)
multiple candidates tied in the same top-precedence tier, and (b) exactly one top-tier candidate that is
weakly connected (degree ≤ 1), fully hiding a more relevant candidate one tier down purely due to tier
precedence. Fix:
_find_node_tiersexposes tier boundaries so both shapes can be detected;explainnowlists all tied/hidden candidates instead of picking one, with a degree-dominance exception (top candidate's
degree at least ~3x the runner-up's resolves directly, since that is not a real ambiguity) and a
--forceflag to bypass the guard and reproduce the prior single-candidate behavior.
5. Surface ambiguous endpoint matches in
pathinstead of a warning that is ignored(
graphify/__main__.py)pathalready computed a top-vs-runner-up score gap for each endpoint but only printed a stderr warning andthen silently proceeded with the top-scored candidate regardless. Fix: the same close-score-gap condition is
now a hard ambiguity gate — a numbered candidate list (label, source, degree, score) is shown instead of a
warning, with the same degree-dominance exception as (4) so a genuinely dominant symbol still resolves
directly, and the same
--forceflag to bypass.6. Fall back to term-overlap candidates in
explaininstead of failing silently on a multi-word phrase(
graphify/__main__.py)_find_node_tiersrequires the query's tokens, rejoined into a single string, to match/prefix/substring asingle node's label as a whole. A genuine multi-word natural-language phrase (e.g. three or more words)
almost never appears as a literal substring of any one node's label, so this returns "No node matching
found" even when every individual word in the phrase exists on a real, relevant node — with no fallback and
no signal that anything went wrong. Fix: when the tiered lookup finds nothing and the phrase has more than
one token, fall back to the same per-token relevance scoring
queryalready uses (_score_nodes) and listthe top candidates by term overlap, in the same candidate-list format used by (4). A genuine single-word
miss is unaffected — a one-word probe would score identically to the substring tier already tried and has
nothing new to find, so the fallback is gated on token count.
7. Gate
_score_nodes' full-query bonus to multi-token queries (graphify/serve.py)_score_nodeshas a "joined" full-query tier so a multi-word query that equals or prefixes a wholemulti-word label wins outright, since no single token in a per-token score sum could otherwise equal that
label. This tier compares the query's tokenized form (punctuation stripped) against a node's tokenized
label. For a single-token query, this can falsely promote a node whose only word-character content matches
that one token — e.g. a bare method call like
.search(), which tokenizes to"search"— to the sametier as a genuine exact match, even though the same node correctly fails the ordinary per-token exact check
a few lines below (raw label
".search"is not equal to raw query"search"). This specifically affects_pick_seeds's per-term seed-diversity check from (1), which probes each distinct query term in isolationvia
_score_nodes(G, [term]): a short method name repeated across multiple unrelated files can win thatsingle-term probe outright and prevent a genuinely relevant multi-word file (which only reaches the prefix
tier for the same bare word) from ever being selected as a seed. Fix: gate the joined-tier bonus on
len(norm_terms) > 1. A single-token probe has no "multi-word phrase vs. per-token bag-of-words"distinction to make — the per-token loop immediately below already fully and correctly handles single-term
exact/prefix/substring matching via raw, non-tokenized comparison. The combined multi-word query path used
everywhere else is unaffected, since it always has more than one token.
8. Cap
explain's term-overlap fallback to avoid noise floods (graphify/__main__.py)The fallback added in (6) has no floor on how broad a "match" is useful. A query whose only vocabulary
overlap with the corpus is one generic word (e.g. a word that is also a top-level directory name in the
codebase being queried) can match a large fraction of the entire graph at the weakest possible scoring tier,
and the fallback would present an arbitrary slice of that as though it were a considered answer, when in
fact none of the matched nodes are meaningfully distinguished from one another. Fix: after scoring, if the
candidate count exceeds both an absolute floor (50 — so small graphs are never affected) and 15% of the
graph's total node count, the fallback is treated as a noise flood and the plain zero-match message is shown
instead of a misleadingly specific candidate list. Genuine large-but-real candidate lists stay well under
this threshold and are unaffected.
Testing
Full test suite: 2768 passing, 28 skipped, 1 pre-existing failure unrelated to these changes
(
tests/test_cache.py::test_semantic_prune_removes_orphan_entries, a test-order-dependent flake reproducibleidentically on
v8before any commit in this PR).ruff checkpasses with no findings.Each change above has dedicated regression tests in
tests/test_serve.py,tests/test_watch.py,tests/test_explain_cli.py, andtests/test_path_cli.py, covering both the failure case being fixed and anon-regression pin for the prior, still-correct behavior it must not disturb.