fix: guarantee per-term BFS seed diversity in query (fixes #1445)#1596
Closed
nokternol wants to merge 1 commit into
Closed
fix: guarantee per-term BFS seed diversity in query (fixes #1445)#1596nokternol wants to merge 1 commit into
nokternol wants to merge 1 commit into
Conversation
…abs#1445) _pick_seeds' gap_ratio cutoff discards any candidate scoring below 20% of the top score. On a multi-term natural-language query, one term's incidental EXACT label match on a node that is otherwise unrelated to the query's intent (e.g. a common word also used as a field name or identifier elsewhere in the corpus) scores ~1000x higher than any SUBSTRING match on the query's other, actually-relevant terms (_EXACT_MATCH_BONUS vs _SUBSTRING_MATCH_BONUS). The cutoff then silently discards every one of those substring-tier candidates as BFS seeds, so the traversal only ever explores the neighborhood of the one unrelated exact match, and `query` returns confidently-wrong results with no signal that anything went wrong. This matches Graphify-Labs#1445's reproduction exactly: a vague query that doesn't name a target symbol seeds from unrelated "concept-dense" nodes instead, even though the target node is present in the graph. _pick_seeds now optionally accepts the graph and the tokenized query terms; when supplied, it guarantees at least one seed per distinct term that has any match at all, so one term's collision cannot starve out the others. Ties within a term are broken by node degree, so an isolated incidental match doesn't out-rank a real, well-connected hub for that term. The parameters default to None and existing callers that don't pass them see byte-identical behavior (see test_pick_seeds_without_diversity_args_is_unchanged). Adds a regression test reproducing the exact failure shape from Graphify-Labs#1445 and confirms the previously-starved target node is recovered as a seed once G/terms are supplied. Full test suite (74 tests) and ruff both pass.
|
Sorry, I'll get Claude to redo this PR tonight... I'm not sure how the patch got so messed up. |
Collaborator
|
Merged into |
safishamsi
added a commit
that referenced
this pull request
Jul 3, 2026
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1445.
_pick_seeds'gap_ratiocutoff discards any BFS-seed candidatescoring below 20% of the top score. On a multi-term natural-language query,
one term's incidental exact label match on a node that is otherwise
unrelated to the query's intent (e.g. a common word that also happens to be
used as a field name or identifier elsewhere in the corpus) scores ~1000x
higher than any substring match on the query's other, actually-relevant
terms (
_EXACT_MATCH_BONUSvs_SUBSTRING_MATCH_BONUS). The cutoff thensilently discards every one of those substring-tier candidates as seeds, so
the traversal only ever explores the neighborhood of the one unrelated exact
match — and
queryreturns confidently-wrong results with no signal thatanything went wrong.
This is the same shape as #1445's reproduction: a vague query that doesn't
name a target symbol seeds from unrelated "concept-dense" nodes instead, even
though the target node is present in the graph and would have matched on a
weaker (substring) tier.
Fix
_pick_seedsnow optionally accepts the graph and the tokenized query terms.When supplied, it guarantees at least one seed per distinct query term that
has any match at all, so one term's incidental collision cannot starve out
the others. Ties within a term are broken by node degree (structural
centrality), so an isolated incidental match doesn't out-rank a real,
well-connected hub for that term.
Both new parameters default to
None; existing callers that don't pass themsee byte-identical behavior — verified by a new test
(
test_pick_seeds_without_diversity_args_is_unchanged) alongside the existing#897test suite, which is unmodified and still passing.Testing
test_pick_seeds_diversity_recovers_starved_term, reproducing theexact failure shape from BFS path drift on vague Chinese queries: 40-node corpus, 12 unrelated nodes returned for "rate_limit_check 怎么用?" #1445 with a minimal graph (one node that's an
exact match for an unrelated term, one node that's a substring match for
the actually-relevant term via an edge from a third node) — confirms the
target is discarded without the fix and recovered with it.
74 passed(uv run --group dev pytest tests/test_serve.py -q).ruff checkclean on both changed files.Scope note
While investigating this, I found (and locally patched, but am not
including here) two related but separable issues on the same code path,
happy to open as follow-ups if useful:
_bfs(max(50, p99_degree)) that, once seeddiversity is fixed and BFS starts from more seeds, lets shared
infrastructure nodes below that floor flood the output. The right general
fix isn't obvious (a repo-specific floor is a blunt hack), so I left it out
of this PR rather than guess at the right heuristic.
keywords-style node metadata field so enrichment can attachparaphrase/alias surface beyond a bare identifier label — this is a real
feature addition (plus a matching change to the trigram prefilter in
_node_search_text, which silently excludes any field not indexed there),not a bug fix, so it felt out of scope for a PR against this issue
specifically.
Happy to adjust the approach if you'd prefer a different fix shape (e.g.
tuning
gap_ratioper-term instead of a hard per-term guarantee, or a cap ontotal diversity seeds for very long queries).