Summary
Pass 2 fuzzy dedup in graphify/dedup.py merges two distinct symbols when their normalized labels are long (>= 12 chars) and differ by a small edit, e.g. a trailing plural s, or one name being a strict prefix-extension of the other. The guards that prevent this for short labels (_is_variant_pair, _short_label_blocked) both early-out for labels >= 12 chars, so long near-twins are never protected. The absorbed symbol disappears from graph.json and its edges (callers, imports, etc.) are silently reattached to the surviving node, corrupting call/dependency data.
Minimal repro (uses graphify's own dedup functions)
from rapidfuzz.distance import JaroWinkler
from graphify.dedup import _norm, _is_variant_pair, _short_label_blocked, _MERGE_THRESHOLD
a = _norm("getActiveSession()") # 'getactivesession' (16 chars)
b = _norm("getActiveSessions()") # 'getactivesessions' (17) - a distinct batch/plural function
score = JaroWinkler.normalized_similarity(a, b) * 100
print("JW:", round(score, 2)) # 98.82
print("MERGE_THRESHOLD:", _MERGE_THRESHOLD) # 92.0 -> over threshold, merges
print("is_variant_pair blocks?:", _is_variant_pair(a, b)) # False (guard is < 12 chars only)
print("short_label_blocked?:", _short_label_blocked(a, b, score)) # False (guard is < 12 chars only)
The score (98.82) is above the merge threshold (92), and neither guard applies (both return early when max(len(a), len(b)) >= 12), so the pair is unioned in Pass 2. The same-file partition check (the if norm_label == neighbor_norm: branch) only fires when the normalized labels are identical, so it does not catch this case either, even when both symbols are defined in the same file.
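To make the gap concrete without depending on graphify internals, here is a minimal sketch of a length-gated guard. The function name, the constant, and the "block when lengths differ" rule are all hypothetical stand-ins; the only thing taken from the description above is the early-out at >= 12 chars.

```python
# Hypothetical stand-in for a length-gated guard; this is NOT graphify's code.
SHORT_LABEL_LIMIT = 12  # assumed from the ">= 12 chars" early-out described above


def length_gated_guard(a: str, b: str) -> bool:
    """Return True when the merge should be blocked."""
    if max(len(a), len(b)) >= SHORT_LABEL_LIMIT:
        # Early-out: long labels are never blocked, whatever their difference.
        return False
    # Assumed rule for illustration: short labels of different length are blocked.
    return len(a) != len(b)


# Short near-twins are protected ...
print(length_gated_guard("getuser", "getusers"))  # True
# ... but long near-twins fall through the early-out.
print(length_gated_guard("getactivesession", "getactivesessions"))  # False
```

Any guard shaped like this leaves the long-label case to the similarity score alone, which is where the 98.82 above becomes a merge.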
Effect
Two genuinely different functions in the same file, e.g. a single-item function and its batch sibling whose names differ only by a trailing s, collapse into one node. The plural is absent from graph.json: queries for it return nothing, and the singular's callers/dependents wrongly include the plural's call sites. The merge is silent. Only an aggregate Deduplicated N node(s) (... M fuzzy) line is printed, so an unknown fraction of those M fuzzy merges may be false positives of this kind.
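The edge corruption can be illustrated with a toy union-find merge. This is only a sketch of the general technique, not graphify's implementation; merge_nodes, the caller names renderOne and renderAll, and the edge tuples are made up for the example.

```python
def find(parent, x):
    """Return the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge_nodes(nodes, edges, pairs):
    """Union each pair, then rewrite every edge endpoint to its representative."""
    parent = {n: n for n in nodes}
    for a, b in pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra  # b's group is absorbed into a's
    surviving = sorted({find(parent, n) for n in nodes})
    rewritten = sorted({(find(parent, s), find(parent, t)) for s, t in edges})
    return surviving, rewritten


nodes = ["getActiveSession", "getActiveSessions", "renderOne", "renderAll"]
edges = [
    ("renderOne", "getActiveSession"),   # caller of the singular
    ("renderAll", "getActiveSessions"),  # caller of the plural
]
surviving, rewritten = merge_nodes(
    nodes, edges, [("getActiveSession", "getActiveSessions")]
)
print(surviving)  # the plural node is gone
print(rewritten)  # renderAll now appears to call the singular
```

After the merge, nothing in the output distinguishes renderAll's real call target from the one it was reattached to.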
Expected
Distinct symbols whose normalized names differ in length, especially where one is a strict prefix-extension of the other (getActiveSession is a prefix of getActiveSessions; parseConfig vs parseConfigFile), should not be merged. Such pairs are almost never duplicates.
Suggested fix
Either:
- Apply the length-difference guard to all label lengths, not only < 12 (drop the if max(len(a), len(b)) >= 12: return False early-out in _short_label_blocked); or
- Add a targeted check in Pass 2 before the union that skips the merge when one normalized label is a strict prefix-extension of the other:
lo, hi = sorted((norm_label, neighbor_norm), key=len)
if hi != lo and hi.startswith(lo):
    continue
Prefix-extension pairs are essentially never duplicates, so this is low-risk.
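For reference, the same check can be written as a standalone predicate and exercised outside the Pass 2 loop. The helper name is_prefix_extension is hypothetical, and the inputs are assumed to be already-normalized labels.

```python
def is_prefix_extension(a: str, b: str) -> bool:
    """True when the longer label starts with the shorter one and they differ."""
    lo, hi = sorted((a, b), key=len)
    return hi != lo and hi.startswith(lo)


cases = [
    ("getactivesession", "getactivesessions"),  # trailing plural s
    ("parseconfig", "parseconfigfile"),         # suffix word added
    ("getactivesession", "getactivesession"),   # identical labels
    ("getactivesession", "setactivesession"),   # same length, one char differs
]
for a, b in cases:
    print(a, b, is_prefix_extension(a, b))
```

Identical labels are deliberately not blocked, so exact duplicates still merge. Same-length near-twins such as the last case are also left alone, which means this check narrows the false positives rather than eliminating them.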
Workaround status
There appears to be no config knob to disable or tune fuzzy dedup. --dedup-llm only adjudicates pairs scoring in [75, 92); pairs like this score ~98 and are auto-merged in Pass 2 before the LLM tiebreaker runs, so it does not help. Affected users currently have no workaround short of patching dedup.py.
Environment