Semantic cache is never pruned: orphan entries accumulate unbounded (AST cache is pruned, semantic isn't) #1527

**Version:** graphifyy 0.8.51 · macOS · Python 3.14

## Summary

The semantic extraction cache (`graphify-out/cache/semantic/`) accumulates orphaned entries without bound. When a document changes, `extract` writes a new `{file_hash}.json` and never removes the old one. The AST cache has a cleanup pass (`_cleanup_stale_ast_entries`), but the semantic cache has no equivalent. The only way to remove semantic-cache entries is `clear_cache`, which wipes everything and forces a full cold re-extraction.

## Why it matters

This is invisible for users who gitignore the cache, but it's a real problem for the documented pattern of **committing the content-addressed semantic cache** (publish `graph.json` + cache, exclude `manifest.json`) so a fresh clone/CI gets warm rebuilds — see #769, which describes exactly that setup. The committed cache then grows indefinitely as docs are edited. Observed: 152 `cache/semantic/` entries for 124 live docs after a few editing rounds (28 orphans), with nothing to reclaim them short of a full wipe.

## Reproduction

1. `graphify extract . --backend <backend>` on a corpus with documents → note the `cache/semantic/` file count.
2. Edit one document; re-extract.
3. `cache/semantic/` now holds both the old and new `{hash}.json` for that document; the old entry is never removed.
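To compare the entry count before and after the edit, a small helper along these lines could be used. It is only a sketch: it assumes the default output location `graphify-out/cache/semantic/` under the project root, and `count_semantic_entries` is a hypothetical name, not part of graphify.

```python
from pathlib import Path


def count_semantic_entries(root: str = ".") -> int:
    """Count *.json entries in the semantic cache under root (hypothetical helper)."""
    # Assumed layout: <root>/graphify-out/cache/semantic/<hash>.json
    cache = Path(root) / "graphify-out" / "cache" / "semantic"
    if not cache.is_dir():
        return 0
    return sum(1 for _ in cache.glob("*.json"))
```

Running it after step 1 and again after step 2 should show the count rising by one for each edited document.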

## Root cause

In `cache.py`:
- `save_semantic_cache` writes one entry per file keyed by `file_hash(content + relpath)`.
- `_cleanup_stale_ast_entries` (called from the AST path) has no semantic counterpart.
- `clear_cache` is all-or-nothing (deletes `ast/`, `semantic/`, and legacy entries).

So a changed document's prior entry is orphaned, and nothing selectively reclaims it.
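The mechanism can be illustrated with a toy model. This is not graphify's code: `toy_file_hash` stands in for `file_hash` on the assumption that any hash over content plus relative path behaves the same way, and a plain dict stands in for the `cache/semantic/` directory.

```python
import hashlib


def toy_file_hash(content: bytes, relpath: str) -> str:
    # Stand-in for the real file_hash: keyed by content + relative path (assumption).
    return hashlib.sha256(content + relpath.encode("utf-8")).hexdigest()


# Models cache/semantic/: one entry per key, never deleted.
cache: dict[str, dict] = {}


def toy_extract(content: bytes, relpath: str) -> str:
    """Write an entry for the current content and return its key."""
    key = toy_file_hash(content, relpath)
    cache.setdefault(key, {"source": relpath})
    return key


old_key = toy_extract(b"first version", "docs/a.md")
new_key = toy_extract(b"edited version", "docs/a.md")

# Only new_key is live; the entry under old_key is now an orphan.
orphans = set(cache) - {new_key}
```

Because the key depends on the content, every edit produces a new key, and without a pruning step the entry under the old key stays in place.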

## Suggested fix

A selective semantic prune, analogous to the AST cleanup: given the current detected document set, delete every `cache/semantic/*.json` entry whose stem isn't in the live hash set:

```python
# Hashes of every document that currently exists under root.
live_hashes = {
    file_hash(doc, root)
    for kind in ("document", "paper", "image", "video")
    for doc in detect(root)["files"].get(kind, [])
}
```

It could run at the end of `extract` (the live set is already known there) or as a `graphify cache-prune` verb. The failure mode is benign — deleting a still-live entry just re-extracts that one document on the next run.
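The deletion step itself might look like the sketch below. It is an illustration under stated assumptions, not graphify's API: `prune_semantic_cache` and its `dry_run` flag are hypothetical, the caller is expected to pass in the semantic cache directory and the live hash set computed as described above, and entries are assumed to be named `{hash}.json`.

```python
from pathlib import Path


def prune_semantic_cache(cache_dir, live_hashes, dry_run=False):
    """Delete *.json entries whose stem is not in live_hashes.

    Returns the orphaned paths (removed, or that would be removed
    when dry_run is True). Hypothetical helper.
    """
    removed = []
    cache_path = Path(cache_dir)
    if not cache_path.is_dir():
        return removed
    for entry in sorted(cache_path.glob("*.json")):
        if entry.stem in live_hashes:
            continue  # still backs a live document
        if not dry_run:
            # missing_ok tolerates a concurrent run deleting the same file
            entry.unlink(missing_ok=True)
        removed.append(entry)
    return removed
```

A dry-run mode is included so that a `cache-prune` verb could report what it would delete before touching a committed cache. Non-JSON files in the directory are left alone.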

We implemented exactly this downstream using `detect()` + `cache.file_hash()` + `cache.cache_dir(root, "semantic")`, and it keeps the committed cache equal to the live document set. Happy to send a PR if that's welcome.
