Semantic extractors emit mixed absolute/relative source_file paths

## Symptom

After running graphify on a docs corpus from the repo root with no explicit path argument (so graphify defaults to `.`), the resulting `graph.json` contains a roughly 50/50 mix of absolute and relative `source_file` values on nodes, links, and hyperedges.

Concrete numbers from one of our runs (2,587-node graph over a markdown/docs corpus):

| Element     | Absolute | Relative |
|-------------|---------:|---------:|
| nodes       |    1,292 |    1,295 |
| links       |    1,479 |        — |
| hyperedges  |       55 |        — |

Absolute samples (note: these are not paths we passed in — graphify was invoked from the repo root with no path arg, so the cwd is the only thing it knew about):

```
/Users/dennis/Documents/projects/documents/document-linking-investigation.md
/Users/dennis/Documents/projects/documents/CLAUDE.md
```

All affected nodes are `file_type: document` (`.md` files) — i.e., they go through the semantic-extraction subagent path, not the AST path. AST-extracted nodes appear to be consistently relative.
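
For anyone who wants to check their own graph, here is a minimal sketch of how a breakdown like the one above can be tallied. It assumes `graph.json` has top-level `nodes`, `links`, and `hyperedges` arrays whose items carry a `source_file` string; that layout is inferred from our output, and `tally_source_files` is a hypothetical helper, not part of graphify.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_source_files(graph: dict) -> dict:
    """Count absolute vs. relative source_file values per element type.

    Assumes top-level "nodes", "links" and "hyperedges" lists (an assumption
    about the graph.json layout, not a documented schema).
    """
    counts = {}
    for key in ("nodes", "links", "hyperedges"):
        tally = Counter()
        for item in graph.get(key, []):
            source = item.get("source_file")
            if not source:
                tally["missing"] += 1
            elif PurePosixPath(source).is_absolute():
                tally["absolute"] += 1
            else:
                tally["relative"] += 1
        counts[key] = dict(tally)
    return counts


# Tiny illustrative graph with made-up paths.
sample_graph = {
    "nodes": [
        {"id": "a", "source_file": "docs/a.md"},
        {"id": "b", "source_file": "/home/user/repo/docs/b.md"},
    ],
    "links": [{"source": "a", "target": "b", "source_file": "/home/user/repo/docs/b.md"}],
}
sample_counts = tally_source_files(sample_graph)
```

To run it against a real build, load the file with `json.load` and pass the resulting dict to `tally_source_files`.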

## Impact

Downstream consumers that key off `source_file` for exact-string matching break silently. For us, the MCP traversal seeder matches chunk paths (which are always relative) against `source_file`, so every absolute-path node/link/hyperedge becomes invisible to graph traversal — roughly half the graph in our case.
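
The failure mode can be illustrated with a small sketch. `seed_nodes` is a hypothetical stand-in for our seeder, and the paths are made up; the only point is that an exact string comparison never matches an absolute `source_file` against a relative chunk path, and nothing raises.

```python
def seed_nodes(nodes, chunk_paths):
    """Return nodes whose source_file exactly matches one of the chunk paths."""
    wanted = set(chunk_paths)
    return [node for node in nodes if node.get("source_file") in wanted]


nodes = [
    {"id": "a", "source_file": "docs/a.md"},                  # relative: matches
    {"id": "b", "source_file": "/home/user/repo/docs/b.md"},  # absolute: silently skipped
]

matched = seed_nodes(nodes, ["docs/a.md", "docs/b.md"])
matched_ids = [node["id"] for node in matched]
print(matched_ids)  # prints ['a']
```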

## Likely cause

The subagent prompt in the SKILL/README instructs each chunk worker to emit `"source_file":"relative/path"`, but nothing enforces that instruction, so the LLM is free to echo whatever path it received in `FILE_LIST`. In practice it does so about half the time. The fix probably belongs in the extractor write path: normalize `source_file` to be relative to the corpus root before serializing into `graph.json`, rather than relying on the subagent to obey the instruction.
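
A minimal sketch of what write-time normalization could look like, assuming the extractor knows the corpus root when it serializes. `to_corpus_relative` is a hypothetical name, not an existing graphify function, and the choice to leave paths outside the root untouched is our assumption.

```python
import os
from pathlib import Path


def to_corpus_relative(source_file: str, corpus_root: str) -> str:
    """Return source_file as a POSIX-style path relative to corpus_root.

    Relative inputs are only normalized (e.g. "./docs/a.md" -> "docs/a.md").
    Absolute inputs outside corpus_root are returned unchanged rather than
    rewritten with "..", so nothing is silently pointed at the wrong file.
    """
    root = Path(os.path.abspath(corpus_root))
    if not os.path.isabs(source_file):
        return Path(os.path.normpath(source_file)).as_posix()
    path = Path(os.path.normpath(source_file))
    try:
        return path.relative_to(root).as_posix()
    except ValueError:
        # Not under the corpus root: leave it for the caller to decide.
        return path.as_posix()
```

Calling this on every `source_file` just before serialization would make the output independent of what the subagent chose to emit. The normalization here is purely lexical; resolving symlinks would be a separate decision.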

## Workaround we're using

A post-processing script that rewrites `graph.json` after every build, stripping the docs-root prefix. Effective but obviously a band-aid — happy to upstream if useful, but a guarantee at write time would be the right fix.
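
For reference, the workaround is roughly the following. This is a simplified sketch rather than our exact script; it assumes top-level `nodes`, `links`, and `hyperedges` arrays in `graph.json`, and the function names are ours.

```python
import json


def strip_root_prefix(graph: dict, docs_root: str) -> int:
    """Strip the docs-root prefix from source_file values in place.

    Returns the number of values rewritten. Values that do not start with
    the prefix (already relative, or outside the root) are left alone.
    """
    prefix = docs_root.rstrip("/") + "/"
    changed = 0
    for key in ("nodes", "links", "hyperedges"):
        for item in graph.get(key, []):
            source = item.get("source_file")
            if isinstance(source, str) and source.startswith(prefix):
                item["source_file"] = source[len(prefix):]
                changed += 1
    return changed


def rewrite_graph_file(graph_path: str, docs_root: str) -> int:
    """Load graph.json, strip the prefix, and write the file back."""
    with open(graph_path, encoding="utf-8") as f:
        graph = json.load(f)
    changed = strip_root_prefix(graph, docs_root)
    with open(graph_path, "w", encoding="utf-8") as f:
        json.dump(graph, f, indent=2)
    return changed
```

We call the equivalent of `rewrite_graph_file` after every build, passing the absolute docs root.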

Let me know if a repro corpus would help.
