Optimize ingest processing #1349

Open: navels wants to merge 2 commits into crocodilestick:main from navels:fix/optimize-ingest-handling

@navels commented May 13, 2026

Summary

Improved ingest reliability by ignoring stale watcher events, avoiding repeated processing of already-moved files, and batching post-import refresh work until imports have settled.

Changes

  • Ignore stale ingest events for files already moved or deleted after processing.
  • Deduplicate repeated watcher events so polling/NFS fallback does not reprocess the same path indefinitely.
  • Fast-exit missing ingest paths before initializing the full application stack.
  • Defer database reconnect, duplicate-cache invalidation, and duplicate-scan scheduling until the import batch has gone quiet.
  • Harden internal post-import endpoint calls with bounded retries and clearer failure logging.
  • Add unit, smoke, and shell coverage for stale events, batch follow-up, and internal endpoint behavior.
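
The bounded-retry hardening listed above could look roughly like the following Python sketch. The helper name `call_with_bounded_retries`, its parameters, and the log wording are illustrative assumptions, not code from this PR; the real calls target the internal post-import endpoints, which are abstracted here as a plain callable so the sketch runs without a server.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("ingest.sketch")


def call_with_bounded_retries(
    action: Callable[[], bool],
    label: str,
    attempts: int = 3,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `action` at most `attempts` times; return True on the first success.

    `action` is a hypothetical stand-in for one internal endpoint call and
    should return True when the call succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            if action():
                return True
            log.warning("%s: attempt %d/%d reported failure", label, attempt, attempts)
        except Exception as exc:  # a real caller would catch narrower errors
            log.warning("%s: attempt %d/%d raised %r", label, attempt, attempts, exc)
        if attempt < attempts:
            sleep(delay_seconds)  # fixed pause between attempts, none after the last
    log.error("%s: giving up after %d attempts", label, attempts)
    return False
```

Because the attempt count is fixed, a failing endpoint delays the ingest loop by a bounded amount and leaves one clear error line in the log instead of failing silently.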

Testing

Disclaimer: I used Codex (gpt-5.5) and Claude (claude-sonnet-4-6) for various parts of this work and I reviewed it myself before submitting this PR.

navels added 2 commits May 12, 2026 18:00
- Ignore stale ingest events for files already moved or deleted after processing.
- Deduplicate repeated watcher events so polling/NFS fallback does not reprocess the same path indefinitely.
- Fast-exit missing ingest paths before initializing the full application stack.
- Defer database reconnect, duplicate-cache invalidation, and duplicate-scan scheduling until the import batch has gone quiet.
- Harden internal post-import endpoint calls with bounded retries and clearer failure logging.
- Add unit, smoke, and shell coverage for stale events, batch follow-up, and internal endpoint behavior.
- Increase the recent-event TTL used by the shell test so immediate duplicate assertions do not race one-second timestamp boundaries.
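
As a rough illustration of the deduplication and recent-event TTL mentioned in these commits, here is a minimal Python sketch. The class name `RecentEventFilter`, the default TTL, and the in-memory dictionary are all assumptions for illustration; the PR's own watcher-side logic is exercised by its shell tests and is not reproduced here.

```python
import os
import time
from typing import Callable, Dict


class RecentEventFilter:
    """Remember recently handled paths so repeated watcher events are dropped."""

    def __init__(
        self,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def should_process(self, path: str) -> bool:
        now = self._clock()
        # Forget entries older than the TTL so the table stays small.
        self._seen = {p: t for p, t in self._seen.items() if now - t < self._ttl}
        if path in self._seen:
            return False  # duplicate event inside the TTL window
        if not os.path.exists(path):
            return False  # stale event: the file was already moved or deleted
        self._seen[path] = now
        return True
```

The sketch uses a monotonic floating-point clock, so two back-to-back events compare by elapsed time rather than by whole-second timestamps.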
@navels changed the title from "Fix stale ingest events and batch post-import refreshes" to "Optimize ingest processing" May 14, 2026
new-usemame added a commit to new-usemame/Calibre-Web-NextGen that referenced this pull request May 18, 2026
…follow-up (#231)

Backport of upstream crocodilestick/Calibre-Web-Automated#1349 by @navels into our community-maintained CWA build. The original PR has been open since 2026-05-13 with no upstream review activity, so users get the ingest-reliability improvements today rather than waiting on upstream's pace.

The PR adds five orthogonal improvements:

- Stale ingest events for already-moved/deleted files fast-exit before the heavy startup path, eliminating the NFS/polling-fallback re-emit loop.
- The s6 ingest-service script tracks per-file processing markers so the watcher can't reprocess the same path while it's in-flight.
- Module-import-time work in scripts/ingest_processor.py defers behind initialize_runtime() so the script can fast-exit without paying cps/CWA_DB/EPUBFixer/audiobook/requests import cost.
- Per-book refresh_cwa_session + invalidate_duplicate_cache + schedule_debounced_duplicate_scan calls consolidate into a single run_post_batch_follow_up() that fires once after the dirty-marker batch goes quiet.
- Bounded retries on the cwa-internal POST endpoints replace single-shot best-effort calls.
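
The "fire once after the batch goes quiet" behaviour described above can be sketched in Python as follows. `QuietPeriodBatcher`, `mark_dirty`, `poll`, and the quiet interval are hypothetical names and values chosen for illustration; the fork's run_post_batch_follow_up() is only represented here by an injected callback.

```python
import time
from typing import Callable, Optional


class QuietPeriodBatcher:
    """Run one follow-up after a burst of imports has been quiet long enough."""

    def __init__(
        self,
        follow_up: Callable[[], None],
        quiet_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._follow_up = follow_up
        self._quiet = quiet_seconds
        self._clock = clock
        self._last_dirty: Optional[float] = None  # None means nothing pending

    def mark_dirty(self) -> None:
        """Record that a book was imported; restarts the quiet timer."""
        self._last_dirty = self._clock()

    def poll(self) -> bool:
        """Call periodically; returns True when the follow-up was run."""
        if self._last_dirty is None:
            return False
        if self._clock() - self._last_dirty < self._quiet:
            return False  # imports are still arriving, keep waiting
        self._last_dirty = None  # clear first so the follow-up runs once per batch
        self._follow_up()
        return True
```

With this shape, ten imports in quick succession call `mark_dirty()` ten times but trigger the refresh work a single time, once `poll()` observes the quiet interval has elapsed.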

Reconciliation against fork-original work was non-trivial; the full per-file rationale is in the fixup commit body.

- The /cwa-internal/reconnect-db endpoint keeps fork PR #199's synchronous CalibreDB.refresh_for_new_data(). Going back to TaskReconnectDatabase via WorkerThread.add would re-introduce the engine-disposal race that caused fork #192.
- ingest_processor.py keeps PR #199's _run_calibredb_add_with_retry + metadata_db_write_lock() and PR #208's deferred lock-acquire.
- The s6 service script preserves PR #37/#56's s6-setuidgid abc privilege drop in the new run_processor_with_timeout() helper.
- PR #210's debounce-default tests were updated to no-op the per-script pins for ingest_processor, since the value now lives server-side at /cwa-internal/queue-duplicate-scan.

Live-verified end-to-end on cwn-local:

- The missing-target fast-exit doesn't initialize the runtime or acquire the lock.
- The post-batch follow-up fires exactly once after the dirty-marker batch quiets and refreshes the duplicate cache cleanly (groups=2 max_book_id=12 logged).
- A real ingest of alice-in-wonderland.epub triggered the s6 service Post-batch follow-up exactly once (not per book), with no errors.

6 new unit tests + 35 smoke tests + 165 shell tests added by @navels all pass; 65 adjacent regression tests stay green.

Credit: @navels. Both upstream commits authored by Lee Nave <navels@gmail.com> are preserved on the squash; the reconciliation commit landed under the new-usemame identity. A dual-audience credit comment will be posted on CWA#1349 after the release tag publishes.
@new-usemame

Backported into Calibre-Web-NextGen v4.0.74 at f1ae6b1. Drop-in image for users hitting this: ghcr.io/new-usemame/calibre-web-nextgen:latest. Thanks @navels.

@new-usemame

Small follow-up in v4.0.75 at be94b62: our fork-original cps.services.* imports (a metadata write-lock + plugin loader) were at module top in the backport — they triggered cps/__init__.py even on the missing-target path. Moved both into _load_optional_cps_modules() so the fast-exit stays genuinely fast. Verified on a deployed instance: missing-target invocation now logs only Skipping missing ingest target, no Flask app init.
