
Semantic cache is never pruned — orphan entries accumulate unbounded (AST cache is pruned, semantic isn't) #1527

Description

@mwolter805

Version: graphifyy 0.8.51 · macOS · Python 3.14

Summary

The semantic extraction cache (graphify-out/cache/semantic/) accumulates orphaned entries without bound. When a document changes, extract writes a new {file_hash}.json and leaves the old one behind forever. The AST cache has a cleanup pass (_cleanup_stale_ast_entries), but there is no equivalent for the semantic cache, and the only semantic-cache removal available is clear_cache, which wipes everything and forces a full cold re-extraction.

Why it matters

This is invisible for users who gitignore the cache, but it's a real problem for the documented pattern of committing the content-addressed semantic cache (publish graph.json + cache, exclude manifest.json) so a fresh clone/CI gets warm rebuilds — see #769, which describes exactly that setup. The committed cache then grows indefinitely as docs are edited. Observed: 152 cache/semantic/ entries for 124 live docs after a few editing rounds (28 orphans), with nothing to reclaim them short of a full wipe.

Reproduction

  1. graphify extract . --backend <backend> on a corpus with documents → note the cache/semantic/ file count.
  2. Edit one document; re-extract.
  3. cache/semantic/ now holds both the old and new {hash}.json for that document; the old entry is never removed.

Root cause

In cache.py:

  • save_semantic_cache writes one entry per file keyed by file_hash(content + relpath).
  • _cleanup_stale_ast_entries (called from the AST path) has no semantic counterpart.
  • clear_cache is all-or-nothing (deletes ast/, semantic/, and legacy entries).

So a changed document's prior entry is orphaned, and nothing selectively reclaims it.
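The orphaning can be illustrated with a toy model of a content-addressed cache. This is only a sketch: `toy_file_hash` and `toy_save` are hypothetical stand-ins (SHA-256 over content plus relative path), not graphify's actual `file_hash` or `save_semantic_cache`. The only thing assumed from cache.py is the keying scheme described above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def toy_file_hash(content: bytes, relpath: str) -> str:
    """Hypothetical stand-in for a cache key derived from content + relpath."""
    return hashlib.sha256(content + b"\0" + relpath.encode()).hexdigest()


def toy_save(cache_dir: Path, content: bytes, relpath: str) -> Path:
    """Write one entry named after the key; existing entries are left alone."""
    entry = cache_dir / f"{toy_file_hash(content, relpath)}.json"
    entry.write_text(json.dumps({"source": relpath}))
    return entry


def demo_orphan() -> tuple[int, int]:
    """Return (entry count after first save, entry count after an edit)."""
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = Path(tmp)
        toy_save(cache_dir, b"version 1", "docs/guide.md")
        before = len(list(cache_dir.glob("*.json")))
        toy_save(cache_dir, b"version 2", "docs/guide.md")  # edited document
        after = len(list(cache_dir.glob("*.json")))
        return before, after
```

In this model one document ends up with two entries after a single edit, because the write path only ever adds files and nothing maps the old key back to a live document.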

Suggested fix

A selective semantic prune, analogous to the AST cleanup: given the current detected document set, delete cache/semantic/*.json whose stem isn't in the live hash set:

{ file_hash(doc, root)
  for kind in ("document", "paper", "image", "video")
  for doc in detect(root)["files"].get(kind, []) }

It could run at the end of extract (the live set is already known there) or as a graphify cache-prune verb. The failure mode is benign — deleting a still-live entry just re-extracts that one document on the next run.

We implemented exactly this downstream using detect() + cache.file_hash() + cache.cache_dir(root, "semantic"), and it keeps the committed cache equal to the live document set. Happy to send a PR if that's welcome.
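For discussion, a minimal sketch of what the prune step itself might look like. It is not the downstream implementation mentioned above and not existing graphify code: `prune_semantic_cache` and its `dry_run` parameter are hypothetical names. It assumes the caller passes the semantic cache directory (for example the result of cache.cache_dir(root, "semantic")) and a live hash set built as in the comprehension above.

```python
from pathlib import Path


def prune_semantic_cache(
    cache_dir: Path, live_hashes: set[str], dry_run: bool = False
) -> list[Path]:
    """Remove *.json entries whose stem is not a live hash.

    Hypothetical helper: returns the orphaned paths, and deletes them
    unless dry_run is True. Non-JSON files are never touched.
    """
    orphans = sorted(
        entry
        for entry in cache_dir.glob("*.json")
        if entry.stem not in live_hashes
    )
    if not dry_run:
        for entry in orphans:
            # missing_ok guards against a concurrent run removing it first
            entry.unlink(missing_ok=True)
    return orphans
```

Returning the orphan list keeps the function usable both at the end of extract and behind a cache-prune verb, where a dry run could report what would be reclaimed before anything is deleted.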
