Symptom
After running graphify on a docs corpus from the repo root with no explicit path argument (so graphify defaults to .), the resulting graph.json contains a roughly 50/50 mix of absolute and relative source_file values on nodes, links, and hyperedges.
Concrete numbers from one of our runs (2,587-node graph over a markdown/docs corpus):
| Element | Absolute | Relative |
| --- | --- | --- |
| nodes | 1,292 | 1,295 |
| links | 1,479 | - |
| hyperedges | 55 | - |
Absolute samples (note: these are not paths we passed in — graphify was invoked from the repo root with no path arg, so the cwd is the only thing it knew about):
/Users/dennis/Documents/projects/documents/document-linking-investigation.md
/Users/dennis/Documents/projects/documents/CLAUDE.md
All affected nodes are file_type: document (.md files) — i.e., they go through the semantic-extraction subagent path, not the AST path. AST-extracted nodes appear to be consistently relative.
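For anyone who wants to check their own output, here is a rough sketch of how counts like the ones above could be gathered. It assumes graph.json has top-level `nodes`, `links`, and `hyperedges` arrays whose entries carry a `source_file` string; the function name `count_path_styles` is ours, not part of graphify.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_path_styles(graph: dict) -> dict:
    """Count absolute vs. relative source_file values per element kind.

    Assumes the graph dict has optional "nodes", "links", and "hyperedges"
    lists of dicts (an assumption about the graph.json layout).
    """
    counts = {}
    for kind in ("nodes", "links", "hyperedges"):
        tally = Counter()
        for item in graph.get(kind, []):
            source = item.get("source_file")
            if not source:
                tally["missing"] += 1
            elif PurePosixPath(source).is_absolute():
                tally["absolute"] += 1
            else:
                tally["relative"] += 1
        counts[kind] = dict(tally)
    return counts


# Usage (path to graph.json is whatever your build produced):
#   import json
#   with open("graph.json", encoding="utf-8") as fh:
#       print(count_path_styles(json.load(fh)))
```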
Impact
Downstream consumers that key off source_file for exact-string matching break silently. For us, the MCP traversal seeder matches chunk paths (which are always relative) against source_file, so every absolute-path node/link/hyperedge becomes invisible to graph traversal — roughly half the graph in our case.
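To make the failure mode concrete, here is a toy version of exact-string seeding. It is not our actual seeder; `seed_nodes`, the `id` field, and the sample chunk paths are made up for illustration.

```python
def seed_nodes(nodes: list[dict], chunk_paths: set[str]) -> list[str]:
    """Return ids of nodes whose source_file exactly matches a chunk path."""
    return [n["id"] for n in nodes if n.get("source_file") in chunk_paths]


# Hypothetical data: one node with a relative path, one with an absolute path.
nodes = [
    {"id": "a", "source_file": "document-linking-investigation.md"},
    {"id": "b", "source_file": "/Users/dennis/Documents/projects/documents/CLAUDE.md"},
]
chunk_paths = {"document-linking-investigation.md", "CLAUDE.md"}

seeds = seed_nodes(nodes, chunk_paths)
print(seeds)  # ['a']
```

Node `b` refers to the same file as the chunk path `CLAUDE.md`, but the string comparison never matches, and nothing raises or logs.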
Likely cause
The subagent prompt in the SKILL/README instructs each chunk worker to emit \"source_file\":\"relative/path\", but nothing enforces it, so the LLM is free to echo whatever path it received in FILE_LIST. In practice it does so about half the time. The fix probably belongs in the extractor write path (normalize source_file to be relative to the corpus root before serializing into graph.json) rather than in relying on the subagent to obey the instruction.
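A minimal sketch of what that normalization could look like, assuming the extractor knows the corpus root at write time. `to_relative` is a hypothetical helper, not an existing graphify function, and it does not resolve symlinks.

```python
import os
from pathlib import Path


def to_relative(source_file: str, corpus_root: str) -> str:
    """Return source_file relative to corpus_root, with forward slashes.

    Relative inputs are passed through (only separators are cleaned up).
    Absolute inputs outside corpus_root are returned unchanged rather
    than guessed at.
    """
    path = Path(source_file)
    if not path.is_absolute():
        return path.as_posix()
    root = Path(os.path.abspath(corpus_root))
    try:
        # relative_to compares whole path components, so "/repo" does not
        # accidentally match "/repository/...".
        return Path(os.path.abspath(path)).relative_to(root).as_posix()
    except ValueError:
        return source_file
```

Calling this on every `source_file` just before serialization would make the output independent of what the subagent echoed back.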
Workaround we're using
A post-processing script that rewrites graph.json after every build, stripping the docs-root prefix. Effective but obviously a band-aid — happy to upstream if useful, but a guarantee at write time would be the right fix.
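Roughly, the workaround has this shape. This is a simplified sketch rather than our exact script; it assumes `source_file` lives on entries of the top-level `nodes`, `links`, and `hyperedges` arrays, and the function names are illustrative.

```python
import json
import os
import tempfile


def strip_root_prefix(value: str, docs_root: str) -> str:
    """Drop a leading docs_root directory prefix from value, if present."""
    prefix = docs_root.rstrip("/") + "/"
    return value[len(prefix):] if value.startswith(prefix) else value


def rewrite_graph(graph_path: str, docs_root: str) -> int:
    """Rewrite graph.json in place; return how many values were changed."""
    with open(graph_path, encoding="utf-8") as fh:
        graph = json.load(fh)

    changed = 0
    for kind in ("nodes", "links", "hyperedges"):
        for item in graph.get(kind, []):
            old = item.get("source_file")
            if isinstance(old, str):
                new = strip_root_prefix(old, docs_root)
                if new != old:
                    item["source_file"] = new
                    changed += 1

    # Write to a temp file in the same directory, then swap it in, so a
    # crash mid-write cannot leave a truncated graph.json behind.
    directory = os.path.dirname(os.path.abspath(graph_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(graph, fh)
    os.replace(tmp_path, graph_path)
    return changed
```

Note that this re-serializes the file with `json.dump` defaults, so whitespace and formatting may differ from what graphify wrote.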
Let me know if a repro corpus would help.