Skip to content

fix: guarantee per-term BFS seed diversity in query (fixes #1445)#1596

Closed
nokternol wants to merge 1 commit into
Graphify-Labs:v8from
nokternol:fix/seed-selection-single-term-collision
Closed

fix: guarantee per-term BFS seed diversity in query (fixes #1445)#1596
nokternol wants to merge 1 commit into
Graphify-Labs:v8from
nokternol:fix/seed-selection-single-term-collision

Conversation

@nokternol

Copy link
Copy Markdown
Contributor

Summary

Fixes #1445. _pick_seeds' gap_ratio cutoff discards any BFS-seed candidate
scoring below 20% of the top score. On a multi-term natural-language query,
one term's incidental exact label match on a node that is otherwise
unrelated to the query's intent (e.g. a common word that also happens to be
used as a field name or identifier elsewhere in the corpus) scores ~1000x
higher than any substring match on the query's other, actually-relevant
terms (_EXACT_MATCH_BONUS vs _SUBSTRING_MATCH_BONUS). The cutoff then
silently discards every one of those substring-tier candidates as seeds, so
the traversal only ever explores the neighborhood of the one unrelated exact
match — and query returns confidently-wrong results with no signal that
anything went wrong.

This is the same shape as #1445's reproduction: a vague query that doesn't
name a target symbol seeds from unrelated "concept-dense" nodes instead, even
though the target node is present in the graph and would have matched on a
weaker (substring) tier.

Fix

_pick_seeds now optionally accepts the graph and the tokenized query terms.
When supplied, it guarantees at least one seed per distinct query term that
has any match at all, so one term's incidental collision cannot starve out
the others. Ties within a term are broken by node degree (structural
centrality), so an isolated incidental match doesn't out-rank a real,
well-connected hub for that term.

Both new parameters default to None; existing callers that don't pass them
see byte-identical behavior — verified by a new test
(test_pick_seeds_without_diversity_args_is_unchanged) alongside the existing
#897 test suite, which is unmodified and still passing.

Testing

Scope note

While investigating this, I found (and locally patched, but am not
including here) two related but separable issues on the same code path,
happy to open as follow-ups if useful:

  1. A hub-degree threshold in _bfs (max(50, p99_degree)) that, once seed
    diversity is fixed and BFS starts from more seeds, lets shared
    infrastructure nodes below that floor flood the output. The right general
    fix isn't obvious (a repo-specific floor is a blunt hack), so I left it out
    of this PR rather than guess at the right heuristic.
  2. A keywords-style node metadata field so enrichment can attach
    paraphrase/alias surface beyond a bare identifier label — this is a real
    feature addition (plus a matching change to the trigram prefilter in
    _node_search_text, which silently excludes any field not indexed there),
    not a bug fix, so it felt out of scope for a PR against this issue
    specifically.

Happy to adjust the approach if you'd prefer a different fix shape (e.g.
tuning gap_ratio per-term instead of a hard per-term guarantee, or a cap on
total diversity seeds for very long queries).

…abs#1445)

_pick_seeds' gap_ratio cutoff discards any candidate scoring below 20%
of the top score. On a multi-term natural-language query, one term's
incidental EXACT label match on a node that is otherwise unrelated to
the query's intent (e.g. a common word also used as a field name or
identifier elsewhere in the corpus) scores ~1000x higher than any
SUBSTRING match on the query's other, actually-relevant terms
(_EXACT_MATCH_BONUS vs _SUBSTRING_MATCH_BONUS). The cutoff then
silently discards every one of those substring-tier candidates as BFS
seeds, so the traversal only ever explores the neighborhood of the one
unrelated exact match, and `query` returns confidently-wrong results
with no signal that anything went wrong.

This matches Graphify-Labs#1445's reproduction exactly: a vague query that doesn't
name a target symbol seeds from unrelated "concept-dense" nodes
instead, even though the target node is present in the graph.

_pick_seeds now optionally accepts the graph and the tokenized query
terms; when supplied, it guarantees at least one seed per distinct
term that has any match at all, so one term's collision cannot starve
out the others. Ties within a term are broken by node degree, so an
isolated incidental match doesn't out-rank a real, well-connected hub
for that term. The parameters default to None and existing callers
that don't pass them see byte-identical behavior (see
test_pick_seeds_without_diversity_args_is_unchanged).

Adds a regression test reproducing the exact failure shape from
Graphify-Labs#1445 and confirms the previously-starved target node is recovered
as a seed once G/terms are supplied.

Full test suite (74 tests) and ruff both pass.
@mark-butterworth-lookinglocal

Copy link
Copy Markdown

Sorry, I'll get Claude to redo this PR tonight... I'm not sure how the patch got so messed up.

@nokternol nokternol changed the base branch from main to v8 July 2, 2026 19:23
@safishamsi

Copy link
Copy Markdown
Collaborator

Merged into v8 as d56ee83 (your authorship). Verified with a #1445-shaped graph: a multi-term query that used to collapse to a single incidental leaf (['home']) now seeds the whole relevant cluster (['home','alert','dash','grid','kpi']). Backward-compatible (G/terms are optional kwargs), and the degree tie-break is a good call so an isolated incidental match doesn't out-rank a real hub. Also partially addresses the seed-hijack in #1602 (this is the seed-selection half; the scoring/term-coverage half is still open there). Full suite 2855. Ships next release — thanks.

safishamsi added a commit that referenced this pull request Jul 3, 2026
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@safishamsi safishamsi closed this Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BFS path drift on vague Chinese queries: 40-node corpus, 12 unrelated nodes returned for "rate_limit_check 怎么用?"

3 participants