Version: graphifyy 0.8.51 · macOS · Python 3.14
Summary
The semantic extraction cache (graphify-out/cache/semantic/) accumulates orphaned entries without bound. When a document changes, extract writes a new {file_hash}.json and leaves the old one behind forever. The AST cache has a cleanup pass (_cleanup_stale_ast_entries), but there is no equivalent for the semantic cache, and the only semantic-cache removal available is clear_cache, which wipes everything and forces a full cold re-extraction.
Why it matters
This is invisible to users who gitignore the cache, but it is a real problem for the documented pattern of committing the content-addressed semantic cache (publish graph.json + cache, exclude manifest.json) so that a fresh clone or CI run gets warm rebuilds; #769 describes exactly that setup. The committed cache then grows indefinitely as docs are edited. Observed: 152 cache/semantic/ entries for 124 live docs after a few editing rounds (28 orphans), with nothing to reclaim them short of a full wipe.
Reproduction
- Run graphify extract . --backend <backend> on a corpus with documents → note the cache/semantic/ file count.
- Edit one document; re-extract.
- cache/semantic/ now holds both the old and new {hash}.json for that document; the old entry is never removed.
Root cause
In cache.py:
- save_semantic_cache writes one entry per file keyed by file_hash(content + relpath).
- _cleanup_stale_ast_entries (called from the AST path) has no semantic counterpart.
- clear_cache is all-or-nothing (deletes ast/, semantic/, and legacy entries).
So a changed document's prior entry is orphaned, and nothing selectively reclaims it.
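To make the orphaning concrete, here is a toy illustration of a content-addressed cache. It does not use graphify: toy_file_hash and toy_save are hypothetical stand-ins, and hashing the content followed by the relative path with SHA-256 is only an assumption about the general shape of the key, not the actual file_hash implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def toy_file_hash(path: Path, root: Path) -> str:
    """Hypothetical stand-in for cache.file_hash: content plus relative path."""
    h = hashlib.sha256()
    h.update(path.read_bytes())
    h.update(str(path.relative_to(root)).encode())
    return h.hexdigest()


def toy_save(cache: Path, doc: Path, root: Path) -> Path:
    """Write one cache entry named after the document's current hash."""
    entry = cache / f"{toy_file_hash(doc, root)}.json"
    entry.write_text("{}")
    return entry


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    cache.mkdir()
    doc = root / "notes.md"

    doc.write_text("v1")
    first = toy_save(cache, doc, root)

    # Editing the document changes its hash, so a second entry is written
    # and nothing removes the first one.
    doc.write_text("v2")
    second = toy_save(cache, doc, root)

    entries = sorted(p.name for p in cache.glob("*.json"))
    print(len(entries))  # 2: one live entry, one orphan
```

Because the key depends on the content, the old entry can never be overwritten by the new one; only an explicit cleanup step can remove it.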
Suggested fix
A selective semantic prune, analogous to the AST cleanup: given the current detected document set, delete cache/semantic/*.json whose stem isn't in the live hash set:
    {
        file_hash(doc, root)
        for kind in ("document", "paper", "image", "video")
        for doc in detect(root)["files"].get(kind, [])
    }
It could run at the end of extract (the live set is already known there) or as a graphify cache-prune verb. The failure mode is benign — deleting a still-live entry just re-extracts that one document on the next run.
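A minimal sketch of the prune step itself, for discussion rather than as a finished patch. The function name prune_semantic_cache and the dry_run flag are hypothetical and not part of graphify. The live hash set is passed in as plain strings, so the snippet makes no assumptions about the signatures of detect() or file_hash; in graphify it would be the set described above.

```python
from pathlib import Path
from typing import Iterable


def prune_semantic_cache(
    cache_dir: Path,
    live_hashes: Iterable[str],
    dry_run: bool = False,
) -> list[Path]:
    """Delete *.json entries whose stem is not in live_hashes.

    Returns the orphaned paths (deleted, or merely reported when dry_run
    is True). Files that are not *.json are left untouched.
    """
    live = set(live_hashes)
    removed: list[Path] = []
    if not cache_dir.is_dir():
        # Nothing to prune if the cache has never been written.
        return removed
    for entry in sorted(cache_dir.glob("*.json")):
        if entry.stem not in live:
            if not dry_run:
                entry.unlink()
            removed.append(entry)
    return removed
```

Running it with dry_run=True first gives the orphan count (28 in the case reported above) without touching the cache, which would be a cheap way to confirm the live set is computed correctly before deleting anything.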
We implemented exactly this downstream using detect() + cache.file_hash() + cache.cache_dir(root, "semantic"), and it keeps the committed cache equal to the live document set. Happy to send a PR if that's welcome.