fix: guarantee per-term BFS seed diversity in query (fixes #1445) by nokternol · Pull Request #1596 · Graphify-Labs/graphify

nokternol · 2026-07-01T20:17:59Z

Summary

Fixes #1445. _pick_seeds' gap_ratio cutoff discards any BFS-seed candidate
scoring below 20% of the top score. On a multi-term natural-language query,
one term's incidental exact label match on a node that is otherwise
unrelated to the query's intent (e.g. a common word that also happens to be
used as a field name or identifier elsewhere in the corpus) scores ~1000x
higher than any substring match on the query's other, actually-relevant
terms (_EXACT_MATCH_BONUS vs _SUBSTRING_MATCH_BONUS). The cutoff then
silently discards every one of those substring-tier candidates as seeds, so
the traversal only ever explores the neighborhood of the one unrelated exact
match — and query returns confidently-wrong results with no signal that
anything went wrong.

This is the same shape as #1445's reproduction: a vague query that doesn't
name a target symbol seeds from unrelated "concept-dense" nodes instead, even
though the target node is present in the graph and would have matched on a
weaker (substring) tier.

Fix

_pick_seeds now optionally accepts the graph and the tokenized query terms.
When supplied, it guarantees at least one seed per distinct query term that
has any match at all, so one term's incidental collision cannot starve out
the others. Ties within a term are broken by node degree (structural
centrality), so an isolated incidental match doesn't out-rank a real,
well-connected hub for that term.

Both new parameters default to None; existing callers that don't pass them
see byte-identical behavior — verified by a new test
(test_pick_seeds_without_diversity_args_is_unchanged) alongside the existing
#897 test suite, which is unmodified and still passing.

Testing

Added test_pick_seeds_diversity_recovers_starved_term, reproducing the
exact failure shape from BFS path drift on vague Chinese queries: 40-node corpus, 12 unrelated nodes returned for "rate_limit_check 怎么用？" #1445 with a minimal graph (one node that's an
exact match for an unrelated term, one node that's a substring match for
the actually-relevant term via an edge from a third node) — confirms the
target is discarded without the fix and recovered with it.
Full suite: 74 passed (uv run --group dev pytest tests/test_serve.py -q).
ruff check clean on both changed files.

Scope note

While investigating this, I found (and locally patched, but am not
including here) two related but separable issues on the same code path,
happy to open as follow-ups if useful:

A hub-degree threshold in _bfs (max(50, p99_degree)) that, once seed
diversity is fixed and BFS starts from more seeds, lets shared
infrastructure nodes below that floor flood the output. The right general
fix isn't obvious (a repo-specific floor is a blunt hack), so I left it out
of this PR rather than guess at the right heuristic.
A keywords-style node metadata field so enrichment can attach
paraphrase/alias surface beyond a bare identifier label — this is a real
feature addition (plus a matching change to the trigram prefilter in
_node_search_text, which silently excludes any field not indexed there),
not a bug fix, so it felt out of scope for a PR against this issue
specifically.

Happy to adjust the approach if you'd prefer a different fix shape (e.g.
tuning gap_ratio per-term instead of a hard per-term guarantee, or a cap on
total diversity seeds for very long queries).

…abs#1445) _pick_seeds' gap_ratio cutoff discards any candidate scoring below 20% of the top score. On a multi-term natural-language query, one term's incidental EXACT label match on a node that is otherwise unrelated to the query's intent (e.g. a common word also used as a field name or identifier elsewhere in the corpus) scores ~1000x higher than any SUBSTRING match on the query's other, actually-relevant terms (_EXACT_MATCH_BONUS vs _SUBSTRING_MATCH_BONUS). The cutoff then silently discards every one of those substring-tier candidates as BFS seeds, so the traversal only ever explores the neighborhood of the one unrelated exact match, and `query` returns confidently-wrong results with no signal that anything went wrong. This matches Graphify-Labs#1445's reproduction exactly: a vague query that doesn't name a target symbol seeds from unrelated "concept-dense" nodes instead, even though the target node is present in the graph. _pick_seeds now optionally accepts the graph and the tokenized query terms; when supplied, it guarantees at least one seed per distinct term that has any match at all, so one term's collision cannot starve out the others. Ties within a term are broken by node degree, so an isolated incidental match doesn't out-rank a real, well-connected hub for that term. The parameters default to None and existing callers that don't pass them see byte-identical behavior (see test_pick_seeds_without_diversity_args_is_unchanged). Adds a regression test reproducing the exact failure shape from Graphify-Labs#1445 and confirms the previously-starved target node is recovered as a seed once G/terms are supplied. Full test suite (74 tests) and ruff both pass.

mark-butterworth-lookinglocal · 2026-07-02T05:17:29Z

Sorry, I'll get Claude to redo this PR tonight... I'm not sure how the patch got so messed up.

safishamsi · 2026-07-03T09:18:52Z

Merged into v8 as d56ee83 (your authorship). Verified with a #1445-shaped graph: a multi-term query that used to collapse to a single incidental leaf (['home']) now seeds the whole relevant cluster (['home','alert','dash','grid','kpi']). Backward-compatible (G/terms are optional kwargs), and the degree tie-break is a good call so an isolated incidental match doesn't out-rank a real hub. Also partially addresses the seed-hijack in #1602 (this is the seed-selection half; the scoring/term-coverage half is still open there). Full suite 2855. Ships next release — thanks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nokternol mentioned this pull request Jul 1, 2026

docs(experiments): does a knowledge graph beat plain docs for AI-assisted planning? nokternol/maintainarr#19

Open

nokternol changed the base branch from main to v8 July 2, 2026 19:23

safishamsi added a commit that referenced this pull request Jul 3, 2026

docs(changelog): note per-term BFS seed diversity (#1596)

412a29d

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

safishamsi closed this Jul 3, 2026

safishamsi mentioned this pull request Jul 3, 2026

query: single generic term exact-matching a short leaf label hijacks seed selection in multi-term queries #1602

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix: guarantee per-term BFS seed diversity in query (fixes #1445)#1596

fix: guarantee per-term BFS seed diversity in query (fixes #1445)#1596
nokternol wants to merge 1 commit into
Graphify-Labs:v8from
nokternol:fix/seed-selection-single-term-collision

nokternol commented Jul 1, 2026

Uh oh!

mark-butterworth-lookinglocal commented Jul 2, 2026

Uh oh!

safishamsi commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

Conversation

nokternol commented Jul 1, 2026

Summary

Fix

Testing

Scope note

Uh oh!

mark-butterworth-lookinglocal commented Jul 2, 2026

Uh oh!

safishamsi commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants