Summary
Cross-file type-annotation (and other bare-name) references create phantom duplicate nodes instead of resolving to the single canonical definition. A class defined once but referenced via type annotations in N other files appears as 1 + N nodes in the graph, inflating node counts and god-node/centrality rankings, and splitting edges across duplicates.
Observed on graphifyy==0.8.44 (AST/structural extraction, no LLM key set), Python, but the faulty code path is shared across all 6 language extractors.
Minimal reproduction
pkg/thing.py
    class Thing:
        def run(self):
            return 1
pkg/a.py
    from pkg.thing import Thing
    def use_a(obj: Thing) -> Thing:
        return obj
pkg/b.py
    from pkg.thing import Thing
    def use_b(obj: Thing) -> Thing:
        return obj
Extraction script (run from the directory that contains pkg/):
    from graphify.extract import extract
    from pathlib import Path
    r = extract([Path('pkg/thing.py'), Path('pkg/a.py'), Path('pkg/b.py')], cache_root=Path('.'))
    print(sorted(n['id'] for n in r['nodes'] if n['label'] == 'Thing'))
Actual (0.8.44):
['pkg_a_py_thing', 'pkg_b_py_thing', 'pkg_thing_thing'] # 3 nodes
Note pkg_a_py_thing and pkg_b_py_thing: the using file's path, extension included, is baked into the id. That is the signature of the bug.
Expected:
['pkg_thing_thing'] # 1 canonical node; the annotation refs become edges to it
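To look for the same symptom in a larger graph, one option is to group node ids by label and list the labels that map to more than one node. The sketch below assumes only what the repro shows, namely that each node is a dict with "id" and "label" keys. duplicate_labels is a hypothetical helper, not part of graphify, and the use_a entry in the sample data is made up for contrast.

```python
from collections import defaultdict


def duplicate_labels(nodes):
    """Map each label that has more than one node to its sorted node ids."""
    ids_by_label = defaultdict(set)
    for node in nodes:
        ids_by_label[node["label"]].add(node["id"])
    return {
        label: sorted(ids)
        for label, ids in ids_by_label.items()
        if len(ids) > 1
    }


# The three Thing ids are copied from the 0.8.44 repro output above.
# The last entry is a made-up, non-duplicated node for contrast.
nodes = [
    {"id": "pkg_thing_thing", "label": "Thing"},
    {"id": "pkg_a_py_thing", "label": "Thing"},
    {"id": "pkg_b_py_thing", "label": "Thing"},
    {"id": "pkg_a_use_a", "label": "use_a"},
]

print(duplicate_labels(nodes))
# {'Thing': ['pkg_a_py_thing', 'pkg_b_py_thing', 'pkg_thing_thing']}
```

The result is only a candidate list: names that are legitimately defined in several files also show up, so ids that embed a file extension (the _py_ part) are the stronger hint.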
Root cause
ensure_named_node (defined identically as a nested closure in all 6 per-language extractors) builds the bare cross-file reference node via add_node, which always stamps source_file = str_path (the using file):
    def ensure_named_node(name: str, line: int) -> str:
        nid = _make_id(stem, name)
        if nid in seen_ids:
            return nid
        nid = _make_id(name)
        if nid not in seen_ids:
            add_node(nid, name, line)  # <-- stamps source_file = using file
        return nid
A sourced stub looks like a real definition, which breaks the two passes that are supposed to clean it up:
_disambiguate_colliding_node_ids sees the same bare id (thing) emitted from two files with two different source_files, treats them as distinct same-named symbols, and scatters them into per-file ids _make_id(source_key, old_id) → pkg_a_py_thing, pkg_b_py_thing.
_rewire_unique_stub_nodes — which would merge a stub into the unique real definition by label — only considers no-source stubs (if node.get("source_file"): ... continue), so the now-sourced phantoms are never merged.
This contradicts the documented design intent in _resolve_cross_file_* (extract.py ~9038-9046): "Cross-file type references resolve by bare name and fall back to a no-source 'shadow' stub. _rewire_unique_stub_nodes repairs that." The fallback simply isn't sourceless.
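The interaction of the two passes can be reproduced with a toy model. This is a simplified sketch, not graphify's code: make_id, toy_disambiguate and toy_rewire are hypothetical stand-ins whose behaviour is inferred from the description above, and running disambiguation before rewiring is an assumption.

```python
import re
from collections import defaultdict


def make_id(*parts):
    """Toy stand-in for _make_id: join, lowercase, squash non-alphanumerics to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", "_".join(parts).lower()).strip("_")


def toy_disambiguate(nodes):
    """Rename ids that were emitted with two or more distinct source_file values."""
    sources_by_id = defaultdict(set)
    for node in nodes:
        sources_by_id[node["id"]].add(node["source_file"])
    renamed = []
    for node in nodes:
        if len(sources_by_id[node["id"]]) >= 2:
            node = {**node, "id": make_id(node["source_file"], node["id"])}
        renamed.append(node)
    return renamed


def toy_rewire(nodes):
    """Drop a no-source stub when exactly one sourced node shares its label."""
    defs_by_label = defaultdict(list)
    for node in nodes:
        if node["source_file"]:
            defs_by_label[node["label"]].append(node)
    kept = []
    for node in nodes:
        if node["source_file"]:
            kept.append(node)  # sourced nodes are never merge candidates
            continue
        if len(defs_by_label[node["label"]]) == 1:
            continue  # stub is absorbed by the unique real definition
        kept.append(node)
    return kept


# What the 0.8.44 closure effectively emits for the repro: the bare
# 'thing' stubs carry the using file as source_file.
nodes = [
    {"id": "pkg_thing_thing", "label": "Thing", "source_file": "pkg/thing.py"},
    {"id": "thing", "label": "Thing", "source_file": "pkg/a.py"},
    {"id": "thing", "label": "Thing", "source_file": "pkg/b.py"},
]

result = toy_rewire(toy_disambiguate(nodes))
print(sorted(n["id"] for n in result))
# ['pkg_a_py_thing', 'pkg_b_py_thing', 'pkg_thing_thing']
```

In this model the stubs are renamed per file because they look like two distinct symbols, and the rewire step then skips them because they carry a source_file, which mirrors the three ids seen in the repro.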
Suggested fix
Make ensure_named_node's bare fallback emit a no-source shadow stub (matching the documented intent), so disambiguation leaves it alone (source_keys == {""}, len < 2) and _rewire_unique_stub_nodes merges it into the unique canonical def:
    def ensure_named_node(name: str, line: int) -> str:
        nid = _make_id(stem, name)
        if nid in seen_ids:
            return nid
        nid = _make_id(name)
        if nid not in seen_ids:
            seen_ids.add(nid)
            nodes.append({
                "id": nid,
                "label": name,
                "file_type": "code",
                "source_file": "",
                "source_location": "",
            })
        return nid
Applied to all 6 closures, this fixes the repro (3 → 1 node). On a real 348-file Python project it dropped the graph from 3771 → 3574 nodes (−197 phantoms merged repo-wide) with no over-merge: multi-definition names (main(), __init__, etc.) stay distinct because _rewire_unique_stub_nodes only merges when exactly one real def exists. Orphan count stayed low (8) and top god-node rankings were unchanged except that the previously split classes consolidated to their correct single node.
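The "exactly one real def" rule is what keeps the fix from over-merging. The toy model below illustrates it under stated assumptions: toy_rewire is a hypothetical stand-in for _rewire_unique_stub_nodes based only on the description in this issue, and the main() node ids are invented for the example.

```python
from collections import defaultdict


def toy_rewire(nodes):
    """Return (kept_nodes, remap) after merging uniquely resolvable stubs."""
    defs_by_label = defaultdict(list)
    for node in nodes:
        if node["source_file"]:
            defs_by_label[node["label"]].append(node["id"])
    kept, remap = [], {}
    for node in nodes:
        candidates = defs_by_label[node["label"]]
        if not node["source_file"] and len(candidates) == 1:
            remap[node["id"]] = candidates[0]  # edges to the stub move here
            continue
        kept.append(node)
    return kept, remap


nodes = [
    # One real definition of Thing, plus the no-source stub the fix emits.
    {"id": "pkg_thing_thing", "label": "Thing", "source_file": "pkg/thing.py"},
    {"id": "thing", "label": "Thing", "source_file": ""},
    # Two real definitions of main() (hypothetical ids), plus a stub.
    {"id": "pkg_a_main", "label": "main()", "source_file": "pkg/a.py"},
    {"id": "pkg_b_main", "label": "main()", "source_file": "pkg/b.py"},
    {"id": "main", "label": "main()", "source_file": ""},
]

kept, remap = toy_rewire(nodes)
print(sorted(n["id"] for n in kept))
# ['main', 'pkg_a_main', 'pkg_b_main', 'pkg_thing_thing']
print(remap)
# {'thing': 'pkg_thing_thing'}
```

The Thing stub is absorbed because its label resolves to a single definition, while the main() stub is left alone because two definitions match and picking either one would be a guess.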
Related (not duplicates)