Optimize ingest processing #1349
Open
navels wants to merge 2 commits into
- Ignore stale ingest events for files already moved or deleted after processing.
- Deduplicate repeated watcher events so polling/NFS fallback does not reprocess the same path indefinitely.
- Fast-exit missing ingest paths before initializing the full application stack.
- Defer database reconnect, duplicate-cache invalidation, and duplicate-scan scheduling until the import batch has gone quiet.
- Harden internal post-import endpoint calls with bounded retries and clearer failure logging.
- Add unit, smoke, and shell coverage for stale events, batch follow-up, and internal endpoint behavior.
- Increase the recent-event TTL used by the shell test so immediate duplicate assertions do not race one-second timestamp boundaries.
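The stale-event and deduplication items above can be illustrated with a minimal Python sketch. It is not the code from this PR (the actual watcher logic lives in the s6 service script); the class name `RecentEventFilter`, the `ttl` default, and the injectable `clock` are assumptions made for illustration only.

```python
import os
import time


class RecentEventFilter:
    """Hypothetical helper: drop watcher events that are stale or repeated."""

    def __init__(self, ttl=5.0, clock=time.monotonic):
        self.ttl = ttl          # seconds a path counts as "recently seen" (assumed value)
        self.clock = clock      # injectable so tests do not depend on wall time
        self._seen = {}         # path -> time the event was last accepted

    def should_process(self, path):
        now = self.clock()
        # Forget entries older than the TTL so the map stays small.
        self._seen = {p: t for p, t in self._seen.items() if now - t < self.ttl}
        # Stale event: the file was already moved or deleted.
        if not os.path.exists(path):
            return False
        # Duplicate event: the same path was accepted within the TTL window.
        if path in self._seen:
            return False
        self._seen[path] = now
        return True
```

A TTL that is too close to the timestamp resolution makes "immediate duplicate" checks flaky, which is presumably why the shell test raises its TTL; injecting the clock avoids that race in a unit test.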
new-usemame added a commit to new-usemame/Calibre-Web-NextGen that referenced this pull request May 18, 2026
…follow-up (#231)

Backport of upstream crocodilestick/Calibre-Web-Automated#1349 by @navels into our community-maintained CWA build. Original PR has been open since 2026-05-13 with no upstream review activity; users get the ingest-reliability improvements today rather than waiting for upstream pace.

The PR adds five orthogonal improvements:
(1) stale ingest events for already-moved/deleted files fast-exit before the heavy startup path, eliminating the NFS/polling-fallback re-emit loop;
(2) the s6 ingest-service script tracks per-file processing markers so the watcher can't reprocess the same path while it's in-flight;
(3) module-import-time work in scripts/ingest_processor.py defers behind initialize_runtime() so the script can fast-exit without paying cps/CWA_DB/EPUBFixer/audiobook/requests import cost;
(4) per-book refresh_cwa_session + invalidate_duplicate_cache + schedule_debounced_duplicate_scan calls consolidate into a single run_post_batch_follow_up() that fires once after the dirty-marker batch goes quiet;
(5) bounded retries on the cwa-internal POST endpoints replace single-shot best-effort calls.

Reconciliation against fork-original work was non-trivial; full per-file rationale is in the fixup commit body.
- The /cwa-internal/reconnect-db endpoint keeps fork PR #199's synchronous CalibreDB.refresh_for_new_data() (going back to TaskReconnectDatabase via WorkerThread.add would re-introduce the engine-disposal race that caused fork #192).
- ingest_processor.py keeps PR #199's _run_calibredb_add_with_retry + metadata_db_write_lock() and PR #208's deferred lock-acquire.
- The s6 service script preserves PR #37/#56's s6-setuidgid abc privilege drop in the new run_processor_with_timeout() helper.
- PR #210's debounce-default tests updated to no-op the per-script pins for ingest_processor (the value lives server-side now at /cwa-internal/queue-duplicate-scan).
Live-verified end-to-end on cwn-local: missing-target fast-exit doesn't initialize runtime or acquire lock; post-batch follow-up fires exactly once after the dirty-marker batch quiets and refreshes the duplicate cache cleanly (groups=2 max_book_id=12 logged); real ingest of alice-in-wonderland.epub triggered the s6 service Post-batch follow-up exactly once (not per book), no errors. 6 new unit tests + 35 smoke tests + 165 shell tests added by @navels all pass; 65 adjacent regression tests stay green. Credit: @navels — both upstream commits authored by Lee Nave <navels@gmail.com> are preserved on the squash; the reconciliation commit landed under new-usemame identity. Dual-audience credit comment will be posted on CWA#1349 after the release tag publishes.
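Item (5) in the commit message, bounded retries replacing single-shot best-effort calls, follows a common pattern that can be sketched in a few lines of Python. This is an illustrative sketch rather than the PR's implementation: the function name `call_with_bounded_retries`, the attempt count, the linear backoff, and the log format are all assumptions.

```python
import time


def call_with_bounded_retries(call, attempts=3, delay=0.5, sleep=time.sleep, log=print):
    """Run `call` up to `attempts` times; return True on success, False otherwise."""
    for attempt in range(1, attempts + 1):
        try:
            call()
            return True
        except Exception as exc:  # a real caller would catch narrower errors
            # Log every failure so a lost follow-up is visible in the service log.
            log(f"internal call failed (attempt {attempt}/{attempts}): {exc}")
            if attempt < attempts:
                sleep(delay * attempt)  # simple linear backoff between tries
    return False
```

Here `call` would wrap a single POST to an internal endpoint. Capping the attempts keeps the ingest script from hanging indefinitely when the web process is down, while the returned boolean lets the caller decide whether to leave the batch marked dirty.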
Backported into Calibre-Web-NextGen v4.0.74 at
Small follow-up in v4.0.75 at
Summary
Improved ingest reliability by ignoring stale watcher events, avoiding repeated processing of already-moved files, and batching post-import refresh work until imports have settled.
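The "batch until imports have settled" idea in the summary can be sketched as a quiet-period check. This is a hypothetical illustration, not the PR's code: `QuietBatch`, `mark_dirty`, `poll`, and the `quiet_seconds` default are invented names and values, and the real change uses dirty markers handled by the s6 service.

```python
import time


class QuietBatch:
    """Hypothetical helper: run one follow-up after imports stop arriving."""

    def __init__(self, follow_up, quiet_seconds=10.0, clock=time.monotonic):
        self.follow_up = follow_up          # e.g. reconnect DB, refresh duplicate cache
        self.quiet_seconds = quiet_seconds  # assumed quiet window, not the PR's value
        self.clock = clock                  # injectable for deterministic tests
        self._last_import = None            # None means no pending work

    def mark_dirty(self):
        # Called once per imported book; only records a timestamp.
        self._last_import = self.clock()

    def poll(self):
        # Called periodically; fires the follow-up once the batch is quiet.
        if self._last_import is None:
            return False
        if self.clock() - self._last_import < self.quiet_seconds:
            return False
        self._last_import = None
        self.follow_up()
        return True
```

With this shape, importing many books costs one timestamp write per book and a single follow-up at the end, instead of a database reconnect and cache invalidation per book.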
Changes
Testing
Added unit and smoke tests
Manual NFS-backed import test with 10000 EPUBs; ingest drained, UI stayed responsive, and post-batch follow-up ran once.
Same test without this patch: various "read timed out", BusyError, "database is locked", and disk I/O errors.
Related Issues
Likely helps with:
Disclaimer: I used Codex (gpt-5.5) and Claude (claude-sonnet-4-6) for various parts of this work and I reviewed it myself before submitting this PR.