fix(detect): detect Office source edits and stop re-parsing unchanged files (#1649, #1656) #1660

Open
TPAteeq wants to merge 2 commits into Graphify-Labs:v8 from TPAteeq:fix/office-source-change-detection

TPAteeq commented Jul 4, 2026

Contributor

Fixes #1649
Fixes #1656

Summary

#1649 and #1656 both live in convert_office_file and pull in opposite
directions, but a single change — a source-content fingerprint — resolves both.

The single-fingerprint approach

convert_office_file now fingerprints the source by its raw bytes (md5).
Hashing the raw bytes does not unzip/parse the OOXML container, so it stays cheap —
that is the whole point. The fingerprint is recorded in the sidecar header:

<!-- converted from report.docx | source-md5: 3b1e… -->

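On a later run:

  • Fingerprint differs (the source was edited): the source is re-parsed and the
    sidecar is rewritten, so the edited .docx/.xlsx re-enters --update (#1649).
  • Fingerprint matches: the source is not re-parsed and the sidecar is not
    rewritten, so its mtime never churns (#1656, preserving the #1226 no-churn
    guarantee).
  • Legacy sidecar without a fingerprint: upgraded once, then reused.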

This threads the needle that a naïve "always re-convert" would miss: it neither freezes
edits (#1649) nor re-parses unchanged sources (#1656/#1226).
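
A minimal sketch of the fingerprint check described above, not the code in this PR. The helper names (`source_fingerprint`, `sidecar_header`, `needs_reconvert`), the regex, and the assumption that the header is the sidecar's first line are all illustrative; only the header layout itself is taken from the example above.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical pattern: the fingerprint is whatever sits between the trailing
# " | source-md5: " delimiter and the closing " -->" of the header line.
_FP_RE = re.compile(r" \| source-md5: ([0-9a-f]{32}) -->\s*$")


def source_fingerprint(source: Path) -> str:
    """md5 of the raw file bytes; the OOXML zip container is never opened."""
    return hashlib.md5(source.read_bytes()).hexdigest()


def sidecar_header(source: Path) -> str:
    """Header line recording the source name and its fingerprint."""
    fp = source_fingerprint(source)
    return f"<!-- converted from {source.name} | source-md5: {fp} -->"


def needs_reconvert(source: Path, sidecar: Path) -> bool:
    """True if the sidecar is missing, is legacy (no fingerprint), or is stale."""
    if not sidecar.exists():
        return True
    lines = sidecar.read_text(encoding="utf-8").splitlines()
    match = _FP_RE.search(lines[0] if lines else "")
    if match is None:
        return True  # legacy sidecar: upgrade once
    return match.group(1) != source_fingerprint(source)
```

Under these assumptions a caller would invoke the real converter and rewrite the sidecar only when `needs_reconvert` returns True, which leaves an unchanged source both unparsed and unwritten.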

PDF / word-count re-parse (#1656)

Word counts are now cached in the manifest entry and reused for unchanged files, so
PDFs are no longer re-parsed for their word count on every incremental run:

  • save_manifest seeds/refreshes each entry's word_count, reusing the previous
    value whenever the content hash is unchanged and only (re)parsing a genuinely
    new/changed file (video files are skipped, mirroring detect()).
  • detect() accepts the previous run's per-file counts and reuses the cached count for
    any file whose mtime is unchanged instead of re-parsing.
  • detect_incremental loads the manifest first, hands those cached counts down to
    detect(), and proceeds as before.
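
The mtime-based reuse in detect() could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the function names, the `{"mtime": ..., "word_count": ...}` entry shape, and keying the cache by the path string are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, Mapping


def word_count_for(
    path: Path,
    cached: Mapping[str, dict],
    count_words: Callable[[Path], int],
) -> int:
    """Return the cached count when the file's mtime is unchanged, else re-count."""
    entry = cached.get(str(path))
    if (
        entry is not None
        and "word_count" in entry
        and entry.get("mtime") == path.stat().st_mtime
    ):
        return entry["word_count"]  # unchanged file: no parse
    return count_words(path)  # new or modified file: parse once


def total_words(
    paths: Iterable[Path],
    cached: Mapping[str, dict],
    count_words: Callable[[Path], int],
) -> int:
    """Sum of cached counts for unchanged files and fresh counts for the rest."""
    return sum(word_count_for(p, cached, count_words) for p in paths)
```

Here `count_words` stands in for whatever expensive parser applies to the file type (for a PDF, the text extraction step), so the saving comes from never calling it for files whose mtime matches the cached entry.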

total_words stays correct in every path (unchanged files contribute their cached
count; changed files are recomputed), so the benchmark command's corpus_words
keeps working — including via the skill's --update flow, which propagates
total_words into .graphify_detect.json.

Backward compatible: legacy manifests without word_count load unchanged (the existing
_normalise_entry dict passthrough preserves the new field, and old entries simply
recompute once on the next run).
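
To make the backward-compatibility claim concrete, here is a hedged sketch of a save_manifest-style refresh. The function name `refresh_entry` and the `"hash"` field are hypothetical; the real manifest entry layout is not shown in this description.

```python
from typing import Callable, Optional


def refresh_entry(
    prev: Optional[dict],
    content_hash: str,
    count_words: Callable[[], int],
) -> dict:
    """Build a manifest entry, reusing prev's word_count when the hash is unchanged."""
    unchanged = prev is not None and prev.get("hash") == content_hash
    if unchanged and "word_count" in prev:
        words = prev["word_count"]
    else:
        # New file, changed content, or a legacy entry with no cached count.
        words = count_words()
    return {"hash": content_hash, "word_count": words}
```

A legacy entry such as `{"hash": "abc"}` takes the recompute branch exactly once; the entry returned from that call carries word_count, so the following run reuses it.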

Tests

Added to tests/test_detect.py:

  • Edited vs. unchanged .docx and .xlsx via convert_office_file — asserts the
    converter re-parses after an edit and is not re-invoked when the source is unchanged.
  • Legacy (pre-fingerprint) sidecar is upgraded once, then reused.
  • End-to-end: an edited Office source re-enters detect_incremental().new_files, while
    an unchanged one stays in unchanged_files.
  • An unchanged PDF is not re-parsed (extract_pdf_text not called again) on a second
    incremental run.

Test results

uv run pytest tests/test_detect.py tests/test_incremental.py tests/test_office_limits.py tests/test_manifest_ingest.py
# 153 passed

Full suite is green except one pre-existing, environment-only failure
(test_extract.py::test_collect_files_skips_hidden) caused solely by running inside a
.claude/worktrees/… checkout — it fails identically on the base commit with this
branch's changes stashed and is unrelated to this diff.

TPAteeq and others added 2 commits July 5, 2026 00:01
…word counts (Graphify-Labs#1649, Graphify-Labs#1656)

convert_office_file now fingerprints the SOURCE by its raw bytes (md5,
which does NOT unzip/parse the OOXML container, so it stays cheap) and
records it in the sidecar header. On a later run it re-parses and
rewrites the sidecar only when that fingerprint differs — so an edited
.docx/.xlsx re-enters --update (Graphify-Labs#1649), while an unchanged one is never
re-parsed and its mtime never churns (Graphify-Labs#1656, preserving the Graphify-Labs#1226
no-churn guarantee). The single fingerprint resolves both issues, which
previously pulled in opposite directions (the Graphify-Labs#1226 early-return skipped
the write but still parsed every run, and keyed on the source PATH, not
its CONTENT, so edits were silently frozen).

Word counts are now cached in the manifest entry and reused for
unchanged files, so PDFs are no longer re-parsed for their word count on
every incremental run (Graphify-Labs#1656). save_manifest seeds/refreshes the count
(reusing the previous value when the content hash is unchanged); detect()
reuses it when the file's mtime is unchanged. total_words stays correct,
so the benchmark command's corpus_words keeps working. Legacy manifests
without word_count recompute once (backward compatible via the existing
dict passthrough in _normalise_entry).

Tests: edited/unchanged .docx and .xlsx via convert_office_file; legacy
sidecar upgrade; edited Office file re-entering detect_incremental; and
an unchanged PDF not re-parsed on a second incremental run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aphify-Labs#1649, Graphify-Labs#1656)

Address code-review findings on the source-fingerprint / word-count-cache
work without altering the reviewed-correct core design:

- Anchor the sidecar fingerprint regex to the trailing
  ` | source-md5: <fp> -->` delimiter/terminator so a source filename that
  itself contains a "source-md5: <hex>" substring can no longer be captured
  as the fingerprint (which would make the real fingerprint never match, so
  the file re-parsed + rewrote + re-queued on every run). Regression test
  with a pathological filename asserts an unchanged source is parsed once.
- Future-proof the save_manifest word_count cache: content_unchanged now
  keys off the prior hash of the matching kind, so a semantic-only manifest
  (ast_hash never populated) can actually reuse its cached count. The
  kind="both"/"ast" paths are unchanged (still key off ast_hash), so a real
  content change still recomputes.
- Add the missing CHANGELOG `## Unreleased` entry covering both issues.

Preserves the Graphify-Labs#1226 no-churn guarantee and one-time-double-parse behavior.
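
The filename collision fixed by the first bullet above can be reproduced with a short sketch. Both patterns below are hypothetical stand-ins, since the actual regex is not quoted in this message; the filename and fingerprint are made-up test values.

```python
import re

# Unanchored: takes the first "source-md5: <hex>" it finds anywhere in the header.
NAIVE = re.compile(r"source-md5: ([0-9a-f]+)")
# Anchored to the trailing " | source-md5: <fp> -->" delimiter/terminator.
ANCHORED = re.compile(r" \| source-md5: ([0-9a-f]+) -->\s*$")

real_fp = "0123456789abcdef0123456789abcdef"
# Pathological source filename that itself contains a fingerprint-like substring.
name = "notes source-md5: deadbeef.docx"
header = f"<!-- converted from {name} | source-md5: {real_fp} -->"

naive_fp = NAIVE.search(header).group(1)        # captured from the filename
anchored_fp = ANCHORED.search(header).group(1)  # captured from the real field

print(naive_fp == real_fp, anchored_fp == real_fp)
```

With the unanchored pattern the captured value never equals the real fingerprint, which is the condition that would make such a file re-parse, rewrite, and re-queue on every run.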

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>