Summary
Incremental --update leaves ghost nodes when a file is deleted from the corpus. The deleted file's nodes survive the prune and remain in graph.json indefinitely, because build_merge's prune-matching compares absolute paths (from detect_incremental) against the relative source_file keys stored on nodes.
Version: graphifyy==0.9.3
Root cause
The --update runbook (in the shipped skill, references/update.md) calls:
G = build_merge(
    [new_extraction],
    graph_path='graphify-out/graph.json',
    prune_sources=prune,  # <-- no root= passed
)
prune comes from detect_incremental(...)['deleted_files'], which returns absolute paths (e.g. /repo/HANDOFF.md). Stored nodes keep a relative source_file (e.g. HANDOFF.md).
In build_merge, the prune set is built as {p, _norm_source_file(p, _root_str)}. But because root was never threaded through, _root_str is None, and _norm_source_file(abs, None) returns the absolute path unchanged — so neither entry matches the relative node key:
from graphify.build import _norm_source_file
_norm_source_file('/repo/HANDOFF.md', None) # -> '/repo/HANDOFF.md' (no match vs 'HANDOFF.md')
_norm_source_file('/repo/HANDOFF.md', '/repo') # -> 'HANDOFF.md' (matches)
The console still prints Pruned N node(s) from M deleted source file(s) (M counts the inputs, not actual matches), so the failure is silent.
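The mismatch can be illustrated without the library. The sketch below uses a stand-in `norm_source_file` that only mimics the behaviour described above (path returned unchanged when the root is `None`, relativized otherwise). It is a hypothetical helper for illustration, not graphify's `_norm_source_file`, and `prune_set` is likewise an assumed name.

```python
import posixpath


def norm_source_file(path, root):
    """Hypothetical stand-in for the normalization described above."""
    if root is None:
        return path  # no root: the absolute path comes back unchanged
    return posixpath.relpath(path, root)


def prune_set(deleted, root):
    """Mirror the described construction {p, norm(p, root)} for each deleted path."""
    keys = set()
    for p in deleted:
        keys.add(p)
        keys.add(norm_source_file(p, root))
    return keys


stored_key = "HANDOFF.md"        # relative key, as stored on nodes
deleted = ["/repo/HANDOFF.md"]   # absolute path, as reported for deleted files

print(stored_key in prune_set(deleted, None))     # False: nothing is pruned
print(stored_key in prune_set(deleted, "/repo"))  # True: the node key matches
```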
Reproduction
- Build a graph on a repo containing a doc file, e.g. HANDOFF.md.
- Delete HANDOFF.md.
- Run /graphify <repo> --update (or the build_merge call from references/update.md).
- Observe: graph.json still contains all HANDOFF.md-sourced nodes; [n for n in nodes if n['source_file']=='HANDOFF.md'] is non-empty.
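The final check can be scripted roughly as follows. This is a sketch that assumes graph.json has a top-level `nodes` list whose entries carry `source_file`; that layout is an assumption here, and `load_graph` and `nodes_from_source` are hypothetical helper names.

```python
import json


def load_graph(path="graphify-out/graph.json"):
    """Load the graph file; the path matches the one used in the runbook call."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def nodes_from_source(graph, source_file):
    """Return nodes whose source_file equals the given relative key.

    Assumes a top-level "nodes" list (an assumption about the file layout).
    """
    return [n for n in graph.get("nodes", []) if n.get("source_file") == source_file]


# In-memory example standing in for a graph left behind after --update.
sample = {
    "nodes": [
        {"id": "a", "source_file": "HANDOFF.md"},
        {"id": "b", "source_file": "README.md"},
    ]
}
ghosts = nodes_from_source(sample, "HANDOFF.md")
print(len(ghosts))  # 1: the deleted file still has a node
```

Against a real run, `nodes_from_source(load_graph(), "HANDOFF.md")` being non-empty after the delete and update would confirm the bug.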
Suggested fix
Either (or both):
- Runbook: pass root in references/update.md's build_merge call, consistent with the rest of the runbook which already threads root='INPUT_PATH':
  G = build_merge([new_extraction], graph_path='graphify-out/graph.json',
                  prune_sources=prune, root='INPUT_PATH')
- Library hardening: in build_merge, when root is None, default the prune-normalization root to the graph's scan root (e.g. graph_path.parent.parent or the saved .graphify_root) so absolute prune_sources always relativize to the stored key form.
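One possible shape for the library-hardening option is sketched below. `build_prune_keys` and its `fallback_root` parameter are hypothetical names, not part of graphify, and how the fallback is derived (for example from graph_path or a saved root) is left to the caller.

```python
import posixpath


def build_prune_keys(prune_sources, root=None, fallback_root=None):
    """Hypothetical helper: build the set of keys used to match nodes for pruning.

    When root is None, fall back to fallback_root so that absolute deleted
    paths are still relativized to the stored (relative) key form.
    """
    base = root if root is not None else fallback_root
    keys = set()
    for p in prune_sources:
        keys.add(p)  # keep the raw form in case nodes store it verbatim
        if base is not None and posixpath.isabs(p):
            keys.add(posixpath.relpath(p, base))
    return keys


# root was not passed, but the fallback still yields the relative key.
keys = build_prune_keys(["/repo/HANDOFF.md"], root=None, fallback_root="/repo")
print("HANDOFF.md" in keys)  # True
```

Keeping the raw path in the set alongside the relativized one preserves the existing two-entry construction, so graphs that happen to store absolute keys would still match.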
Impact
Any --update that deletes files accumulates stale nodes/edges across runs, silently degrading graph accuracy over time. Workaround for now: after an --update with deletions, manually drop nodes whose source_file no longer exists on disk and rebuild.
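The manual workaround might look like the sketch below. It assumes graph.json follows a node-link layout (a `nodes` list with `id` and `source_file`, and an optional `links` list with `source` and `target`); that layout and the helper name `drop_missing_sources` are assumptions, so adjust the keys to whatever the file actually contains.

```python
import os


def drop_missing_sources(graph, root):
    """Remove nodes whose source_file no longer exists under root.

    Assumes a node-link layout: graph["nodes"] entries have "id" and
    "source_file"; graph["links"] entries (if present) have "source" and
    "target". Returns the set of dropped node ids.
    """
    keep, dropped = [], set()
    for node in graph.get("nodes", []):
        src = node.get("source_file")
        if src and not os.path.exists(os.path.join(root, src)):
            dropped.add(node.get("id"))
        else:
            keep.append(node)
    graph["nodes"] = keep
    if "links" in graph:
        # Also drop edges that touch a removed node.
        graph["links"] = [
            e for e in graph["links"]
            if e.get("source") not in dropped and e.get("target") not in dropped
        ]
    return dropped
```

After loading graph.json with `json.load`, calling `drop_missing_sources(graph, '/repo')` and writing the result back would remove the ghost nodes before the rebuild.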