
Cross-file type-annotation refs create phantom duplicate nodes (sourced stub blocks _rewire_unique_stub_nodes) #1402

@ZedUserdesign

Summary

Cross-file type-annotation (and other bare-name) references create phantom duplicate nodes instead of resolving to the single canonical definition. A class defined once but referenced via type annotations in N other files appears as 1 + N nodes in the graph, inflating node counts and god-node/centrality rankings, and splitting edges across duplicates.

Observed on graphifyy==0.8.44 with Python sources (AST/structural extraction, no LLM key set). The faulty code path is shared across all 6 language extractors.

Minimal reproduction

pkg/thing.py
  class Thing:
      def run(self):
          return 1

pkg/a.py
  from pkg.thing import Thing
  def use_a(obj: Thing) -> Thing:
      return obj

pkg/b.py
  from pkg.thing import Thing
  def use_b(obj: Thing) -> Thing:
      return obj

Run from the directory that contains pkg/:

from graphify.extract import extract
from pathlib import Path
r = extract([Path('pkg/thing.py'), Path('pkg/a.py'), Path('pkg/b.py')], cache_root=Path('.'))
print(sorted(n['id'] for n in r['nodes'] if n['label'] == 'Thing'))

Actual (0.8.44):

['pkg_a_py_thing', 'pkg_b_py_thing', 'pkg_thing_thing']   # 3 nodes

Note that pkg_a_py_thing and pkg_b_py_thing have the using file's path, extension included, baked into the id. That is the signature of the bug.

Expected:

['pkg_thing_thing']   # 1 canonical node; the annotation refs become edges to it
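
To look for the same symptom in a larger graph, one option is to group nodes by label and flag any label that maps to more than one id. The following is a minimal sketch that assumes only the node shape visible in the repro (dicts with `id` and `label` keys). `find_duplicate_labels` is a hypothetical helper, not part of graphify, and the `use_a` entry in the sample is illustrative. Legitimately repeated names (several `main` functions, for example) are flagged too, so the result is a list of candidates rather than confirmed phantoms.

```python
from collections import defaultdict


def find_duplicate_labels(nodes):
    """Return {label: sorted ids} for every label that maps to more than one node id."""
    ids_by_label = defaultdict(set)
    for node in nodes:
        ids_by_label[node["label"]].add(node["id"])
    return {
        label: sorted(ids)
        for label, ids in ids_by_label.items()
        if len(ids) > 1
    }


# Hand-written sample mirroring the 0.8.44 output above (not real extractor output).
sample_nodes = [
    {"id": "pkg_thing_thing", "label": "Thing"},
    {"id": "pkg_a_py_thing", "label": "Thing"},
    {"id": "pkg_b_py_thing", "label": "Thing"},
    {"id": "pkg_a_use_a", "label": "use_a"},  # illustrative, single-id label
]

duplicates = find_duplicate_labels(sample_nodes)
print(duplicates)
# {'Thing': ['pkg_a_py_thing', 'pkg_b_py_thing', 'pkg_thing_thing']}
```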

Root cause

ensure_named_node (defined identically as a nested closure in all 6 per-language extractors) builds the bare cross-file reference node via add_node, which always stamps source_file = str_path (the using file):

def ensure_named_node(name: str, line: int) -> str:
    nid = _make_id(stem, name)
    if nid in seen_ids:
        return nid
    nid = _make_id(name)
    if nid not in seen_ids:
        add_node(nid, name, line)   # <-- stamps source_file = using file
    return nid

A sourced stub looks like a real definition, which breaks the two passes that are supposed to clean it up:

  1. _disambiguate_colliding_node_ids sees the same bare id (thing) emitted from two files with two different source_files, treats them as distinct same-named symbols, and scatters them into per-file ids: _make_id(source_key, old_id) → pkg_a_py_thing, pkg_b_py_thing.
  2. _rewire_unique_stub_nodes — which would merge a stub into the unique real definition by label — only considers no-source stubs (if node.get("source_file"): ... continue), so the now-sourced phantoms are never merged.

This contradicts the documented design intent in _resolve_cross_file_* (extract.py ~9038-9046): "Cross-file type references resolve by bare name and fall back to a no-source 'shadow' stub. _rewire_unique_stub_nodes repairs that." The fallback simply isn't sourceless.
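
The interaction between the two passes can be illustrated with a toy model. This sketch is not graphify's code: `disambiguate`, `dedupe_by_id`, `rewire_unique_stubs`, and `build_nodes` are simplified, hypothetical stand-ins that only mimic the behaviour described above (split an id seen with two or more source files; merge a no-source stub into the unique sourced node with the same label). It compares a stub stamped with the using file against a no-source stub.

```python
from collections import defaultdict


def disambiguate(nodes):
    """Toy pass 1: an id seen with 2+ distinct source files is split into per-file ids."""
    sources_by_id = defaultdict(set)
    for node in nodes:
        sources_by_id[node["id"]].add(node["source_file"])
    out = []
    for node in nodes:
        if len(sources_by_id[node["id"]]) >= 2:
            key = node["source_file"].replace("/", "_").replace(".", "_")
            node = {**node, "id": f"{key}_{node['id']}"}
        out.append(node)
    return out


def dedupe_by_id(nodes):
    """Keep the first node emitted for each id."""
    first_by_id = {}
    for node in nodes:
        first_by_id.setdefault(node["id"], node)
    return list(first_by_id.values())


def rewire_unique_stubs(nodes):
    """Toy pass 2: drop a no-source stub when exactly one sourced node shares its label."""
    real_by_label = defaultdict(list)
    for node in nodes:
        if node["source_file"]:
            real_by_label[node["label"]].append(node)
    kept = []
    for node in nodes:
        if node["source_file"]:
            kept.append(node)  # sourced nodes are never considered stubs
        elif len(real_by_label[node["label"]]) != 1:
            kept.append(node)  # zero or several definitions: leave the stub alone
        # otherwise the stub is merged into the unique definition (dropped here)
    return kept


def run_passes(nodes):
    return sorted(n["id"] for n in rewire_unique_stubs(dedupe_by_id(disambiguate(nodes))))


def build_nodes(stub_source_for):
    """One real definition plus a bare `thing` stub emitted from a.py and b.py."""
    nodes = [{"id": "pkg_thing_thing", "label": "Thing", "source_file": "pkg/thing.py"}]
    for using_file in ("pkg/a.py", "pkg/b.py"):
        nodes.append({"id": "thing", "label": "Thing",
                      "source_file": stub_source_for(using_file)})
    return nodes


sourced = run_passes(build_nodes(lambda using_file: using_file))  # stub stamped with using file
sourceless = run_passes(build_nodes(lambda using_file: ""))       # no-source shadow stub
print(sourced)     # ['pkg_a_py_thing', 'pkg_b_py_thing', 'pkg_thing_thing']
print(sourceless)  # ['pkg_thing_thing']
```

In the toy model the only difference between the two runs is the stub's source_file, which is enough to flip the result between three nodes and one.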

Suggested fix

Make ensure_named_node's bare fallback emit a no-source shadow stub (matching the documented intent), so disambiguation leaves it alone (source_keys == {""}, len < 2) and _rewire_unique_stub_nodes merges it into the unique canonical def:

def ensure_named_node(name: str, line: int) -> str:
    nid = _make_id(stem, name)
    if nid in seen_ids:
        return nid
    nid = _make_id(name)
    if nid not in seen_ids:
        seen_ids.add(nid)
        nodes.append({
            "id": nid,
            "label": name,
            "file_type": "code",
            "source_file": "",
            "source_location": "",
        })
    return nid

Applied to all 6 closures, this fixes the repro (3 → 1 node). On a real 348-file Python project it reduced the graph from 3771 to 3574 nodes (197 phantoms merged repo-wide) with no over-merging: multi-definition names (main(), __init__, etc.) stay distinct because _rewire_unique_stub_nodes only merges when exactly one real definition exists. The orphan count stayed low (8), and the top god-node rankings were unchanged except that the previously split classes consolidated into their correct single node.
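
A regression check for the repro could look roughly like the sketch below. It writes the three files into a temporary directory, changes into it so the paths stay relative as in the repro, and returns the ids of the nodes labelled Thing. The extractor is passed in as a parameter so the block runs without graphify installed. `thing_node_ids`, `REPRO_FILES`, and `fake_extract` are hypothetical names, and the only assumption about the real API is the call shape used in the repro: `extract(paths, cache_root=...)` returning a dict with a `nodes` list.

```python
import os
import tempfile
from pathlib import Path

# The three files from the minimal reproduction.
REPRO_FILES = {
    "pkg/thing.py": "class Thing:\n    def run(self):\n        return 1\n",
    "pkg/a.py": "from pkg.thing import Thing\n\n"
                "def use_a(obj: Thing) -> Thing:\n    return obj\n",
    "pkg/b.py": "from pkg.thing import Thing\n\n"
                "def use_b(obj: Thing) -> Thing:\n    return obj\n",
}


def thing_node_ids(extract_fn):
    """Build the repro tree in a temp dir, run extract_fn, return sorted ids labelled 'Thing'."""
    previous_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            paths = []
            for rel_path, text in REPRO_FILES.items():
                path = Path(rel_path)
                path.parent.mkdir(parents=True, exist_ok=True)
                path.write_text(text)
                paths.append(path)
            result = extract_fn(paths, cache_root=Path("."))
            return sorted(n["id"] for n in result["nodes"] if n["label"] == "Thing")
        finally:
            # Leave the temp dir before it is deleted.
            os.chdir(previous_cwd)


def fake_extract(paths, cache_root):
    """Stand-in for the real extractor so this sketch runs on its own."""
    return {"nodes": [{"id": "pkg_thing_thing", "label": "Thing"}]}


ids = thing_node_ids(fake_extract)
print(ids)

# With graphify installed, the same check would be:
#   from graphify.extract import extract
#   assert len(thing_node_ids(extract)) == 1
```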

Related (not duplicates)
