Semantic extractors emit mixed absolute/relative source_file paths #932

Description

@cornwe19

Symptom

After running graphify on a docs corpus from the repo root with no explicit path argument (so graphify defaults to .), the resulting graph.json contains a roughly 50/50 mix of absolute and relative source_file values on nodes, links, and hyperedges.

Concrete numbers from one of our runs (2,587-node graph over a markdown/docs corpus):

Element Absolute Relative
nodes 1,292 1,295
links 1,479
hyperedges 55

Absolute samples (note: these are not paths we passed in — graphify was invoked from the repo root with no path arg, so the cwd is the only thing it knew about):

/Users/dennis/Documents/projects/documents/document-linking-investigation.md
/Users/dennis/Documents/projects/documents/CLAUDE.md

All affected nodes are file_type: document (.md files) — i.e., they go through the semantic-extraction subagent path, not the AST path. AST-extracted nodes appear to be consistently relative.
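For anyone who wants to check their own graph, the counts above can be gathered with a small tally like the sketch below. It assumes graph.json is a single JSON object with top-level `nodes`, `links`, and `hyperedges` arrays whose items carry a string `source_file`; adjust the keys if your output nests them differently. The function name `count_path_kinds` is ours, not part of graphify.

```python
import json
import sys
from collections import Counter
from pathlib import PurePosixPath


def count_path_kinds(graph: dict) -> dict:
    """Tally absolute vs. relative source_file values per element type.

    Assumes top-level "nodes", "links", and "hyperedges" lists (an assumption
    about the graph.json layout, not a documented schema).
    """
    counts = {}
    for key in ("nodes", "links", "hyperedges"):
        tally = Counter()
        for item in graph.get(key, []):
            src = item.get("source_file")
            if not src:
                tally["missing"] += 1
            elif PurePosixPath(src).is_absolute():
                tally["absolute"] += 1
            else:
                tally["relative"] += 1
        counts[key] = dict(tally)
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_paths.py path/to/graph.json
    with open(sys.argv[1], encoding="utf-8") as fh:
        print(json.dumps(count_path_kinds(json.load(fh)), indent=2))
```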

Impact

Downstream consumers that key off source_file for exact-string matching break silently. For us, the MCP traversal seeder matches chunk paths (which are always relative) against source_file, so every absolute-path node/link/hyperedge becomes invisible to graph traversal — roughly half the graph in our case.
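To illustrate the failure mode, here is a simplified stand-in for that kind of exact-match seeding. It is not our actual seeder; `seed_nodes` and the sample paths are hypothetical. The point is only that a node whose `source_file` is absolute never matches a relative chunk path, and nothing raises an error.

```python
def seed_nodes(nodes: list, chunk_paths: list) -> list:
    """Return the nodes whose source_file exactly matches one of the chunk paths."""
    wanted = set(chunk_paths)
    return [node for node in nodes if node.get("source_file") in wanted]


# Two nodes extracted from the same (hypothetical) docs tree.
nodes = [
    {"id": "a", "source_file": "docs/guide.md"},          # relative: matches
    {"id": "b", "source_file": "/repo/docs/linking.md"},  # absolute: silently dropped
]
seeds = seed_nodes(nodes, ["docs/guide.md", "docs/linking.md"])
print([node["id"] for node in seeds])
```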

Likely cause

The subagent prompt in the SKILL/README instructs each chunk worker to emit "source_file":"relative/path", but nothing enforces it, so the LLM is free to echo whatever path it received in FILE_LIST. In practice it does so roughly half the time. The fix probably belongs in the extractor write path (normalize source_file to be relative to the corpus root before serializing into graph.json) rather than relying on the subagent to obey the instruction.
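A rough sketch of what that write-time normalization could look like, assuming POSIX-style paths and that the extractor knows the absolute corpus root at serialization time. `normalize_source_file` is a hypothetical helper, not an existing graphify function, and the `/repo` paths are placeholders.

```python
import posixpath


def normalize_source_file(source_file: str, corpus_root: str) -> str:
    """Return source_file relative to corpus_root (POSIX-style paths assumed).

    Relative inputs are only tidied (e.g. "./docs/a.md" -> "docs/a.md").
    Absolute inputs outside corpus_root are returned unchanged rather than
    rewritten into "../" form, so they stay easy to spot.
    """
    path = posixpath.normpath(source_file)
    if not posixpath.isabs(path):
        return path
    rel = posixpath.relpath(path, posixpath.normpath(corpus_root))
    if rel == ".." or rel.startswith("../"):
        return path
    return rel


print(normalize_source_file("/repo/docs/CLAUDE.md", "/repo"))
print(normalize_source_file("./docs/CLAUDE.md", "/repo"))
```

Applying this to every node, link, and hyperedge just before graph.json is written would make the output independent of what the subagent echoes back.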

Workaround we're using

A post-processing script that rewrites graph.json after every build, stripping the docs-root prefix. Effective but obviously a band-aid — happy to upstream if useful, but a guarantee at write time would be the right fix.
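For reference, the approach is roughly the following. This is a simplified sketch rather than our exact script, and it assumes the same top-level `nodes` / `links` / `hyperedges` layout with string `source_file` values; the function names are hypothetical.

```python
import json
import posixpath


def relativize_graph(graph: dict, docs_root: str) -> int:
    """Strip the docs_root prefix from absolute source_file values, in place.

    Returns the number of values rewritten. Values that do not start with
    docs_root (already relative, or under another root) are left untouched.
    """
    prefix = posixpath.normpath(docs_root).rstrip("/") + "/"
    rewritten = 0
    for key in ("nodes", "links", "hyperedges"):
        for item in graph.get(key, []):
            src = item.get("source_file")
            if isinstance(src, str) and src.startswith(prefix):
                item["source_file"] = src[len(prefix):]
                rewritten += 1
    return rewritten


def rewrite_graph_file(path: str, docs_root: str) -> int:
    """Load graph.json from path, relativize it, and write it back."""
    with open(path, encoding="utf-8") as fh:
        graph = json.load(fh)
    count = relativize_graph(graph, docs_root)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(graph, fh, indent=2)
    return count
```

It has to run after every build, which is exactly why we'd rather see this handled at write time.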

Let me know if a repro corpus would help.
