feat(duplicates): incremental index - backport CWA #1353 (@navels) #232
Merged
Replaces the O(N) full-library duplicate scan with an O(1) maintained index. Imports, metadata edits, and deletions update the index incrementally instead of triggering a full re-scan; the existing Duplicates page becomes the user-facing entry point for the one-time post-upgrade baseline scan.

Backport of CWA upstream #1353 by @navels.

Reconciliation against fork-divergent work:
- Preserves PR #199 metadata.db flock (`_run_calibredb_add_with_retry` + `metadata_db_write_lock()`), required for fork #192 on mergerfs/SMB/NFS.
- Preserves PR #210 debounce default 30 + clamp floor 10 (vs upstream's 60/5), locked in by 13 regression pins.
- Preserves PR #212 SQLite UDF connect-event listener (no `TaskReconnectDatabase` re-import; `STAT_FINISH_SUCCESS` et al. taken for the new `_duplicate_full_scan_running()` helper that powers the manual-scan vs ingest interlock).
- Preserves v4.0.74 fast-exit pattern (`_is_missing_ingest_target` before lock acquisition) and v4.0.75 polish (`_load_fork_cps_imports` deferred cps imports for fast missing-target exit).
- Preserves PR #100 batched cache invalidation semantics: switches the mechanism from per-batch `invalidate_duplicate_cache()` to per-batch `_queue_duplicate_scan_after_change(book_ids)` (still exactly one call per batch, just scoped to the affected book IDs now).

New in ingest_processor.py:
- `mark_ingest_batch_active()` / `clear_ingest_batch_active()`: active-marker file (CWA_INGEST_BATCH_ACTIVE_FILE) the duplicate-cache path consults to block manual scans during ingest.
- `duplicate_full_scan_running()` + `wait_for_duplicate_full_scan_to_finish()`: ingest waits for any in-flight full scan before touching the library; prevents read-during-write races between manual scans and import.
- `run_duplicate_scan_for_books([book_id])`: per-book incremental scan via the new /cwa-internal/run-duplicate-scan endpoint, called synchronously during `add_book_to_library` and `add_format_to_book`. Each book's duplicate-key row is upserted in O(1) instead of triggering a full scan.
- `run_post_batch_follow_up()` simplified to /cwa-internal/reconnect-db only: the incremental updates above mean the old per-batch invalidate-cache + queue-debounced-scan calls are redundant.

173 new + updated unit tests:
- 21 tests in test_duplicate_index.py covering fingerprint stability, key-part normalization, upsert/delete semantics, group queries with dismissed filtering, rebuild logic, baseline checks across pending/active states.
- 4 tests in test_duplicate_delete_index_maintenance.py for the indexed delete path (key removal + cache refresh + dismissed merge).
- 17 tests in test_duplicate_scan_index_rewire.py for the new scan task using the index instead of full-library iteration.
- 11 tests in test_duplicate_scan_queue_settings.py for the queue-duplicate-scan endpoint + debounce semantics.
- 3 updated tests in test_ingest_batch_dirty.py for the active marker + wait-for-full-scan path.
- 8 updated tests in test_helper.py for new helper paths.
- 3 existing PR #100 regression pins in test_duplicate_manager_race_fix.py updated to check `_queue_duplicate_scan_after_change` (the new mechanism preserving the same one-call-per-batch invariant).

Schema migration: new `cwa_duplicate_book_keys` table with composite index on (criteria_fingerprint, duplicate_key). Created idempotently via CREATE TABLE IF NOT EXISTS; re-runs are no-ops.

Inspired-by @navels in crocodilestick/Calibre-Web-Automated#1353.
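
For readers who want a feel for the index described in this commit message, here is a minimal sketch of the idea, not the code in `cps/duplicate_index.py`. Only the table name and the `criteria_fingerprint` / `duplicate_key` columns come from the text above; the `book_id` column, the index name, the key format, and every function name below are assumptions made for illustration.

```python
import hashlib
import json
import sqlite3


def criteria_fingerprint(criteria):
    """Hash the active match criteria so rows built under old settings can be told apart.

    Sorting first makes the fingerprint independent of criteria order.
    """
    payload = json.dumps(sorted(criteria)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def ensure_schema(conn):
    """Create the table and composite index if missing; re-runs are no-ops."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS cwa_duplicate_book_keys (
               book_id INTEGER PRIMARY KEY,
               criteria_fingerprint TEXT NOT NULL,
               duplicate_key TEXT NOT NULL
           )"""
    )
    # Index name is hypothetical; the column pair matches the commit message.
    conn.execute(
        """CREATE INDEX IF NOT EXISTS idx_cwa_duplicate_book_keys
           ON cwa_duplicate_book_keys (criteria_fingerprint, duplicate_key)"""
    )


def upsert_book_key(conn, book_id, fingerprint, key):
    """Write or overwrite the single row for one book (no library-wide work)."""
    conn.execute(
        "INSERT OR REPLACE INTO cwa_duplicate_book_keys "
        "(book_id, criteria_fingerprint, duplicate_key) VALUES (?, ?, ?)",
        (book_id, fingerprint, key),
    )


def delete_book_key(conn, book_id):
    """Drop the row for a deleted book."""
    conn.execute("DELETE FROM cwa_duplicate_book_keys WHERE book_id = ?", (book_id,))


def duplicate_groups(conn, fingerprint):
    """Read duplicate groups straight from indexed rows: keys shared by 2+ books."""
    rows = conn.execute(
        """SELECT duplicate_key, GROUP_CONCAT(book_id)
           FROM cwa_duplicate_book_keys
           WHERE criteria_fingerprint = ?
           GROUP BY duplicate_key
           HAVING COUNT(*) > 1""",
        (fingerprint,),
    ).fetchall()
    return {key: sorted(int(i) for i in ids.split(",")) for key, ids in rows}


conn = sqlite3.connect(":memory:")
ensure_schema(conn)
ensure_schema(conn)  # second call changes nothing
fp = criteria_fingerprint(["title", "author"])
upsert_book_key(conn, 1, fp, "dune|frank herbert")
upsert_book_key(conn, 2, fp, "dune|frank herbert")
upsert_book_key(conn, 3, fp, "emma|jane austen")
groups = duplicate_groups(conn, fp)
```

In this sketch an import, edit, or delete touches exactly one row, which is the property the commit relies on; a settings change would alter the fingerprint and require rebuilding the rows, which corresponds to the one-time baseline scan mentioned above.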
new-usemame added a commit that referenced this pull request on May 18, 2026
This was referenced May 19, 2026
Symptom
On libraries with 10k+ books, opening the Duplicates page or letting an after-import duplicate scan run held the SQLite connection through a full-library scan. With our deadlock vector eliminated in v4.0.66 (PR #212), the scan was no longer crashing — but it was still slow, blocking other requests while it ran, and rebuilding the entire duplicate group set on every change.
Root cause
Stock CWA's duplicate detection is O(N) over the book table: every scan rebuilds groups from scratch by re-normalising title/author/series/publisher/format for every book and bucketing into duplicate groups. The cache is invalidated on every import/edit/delete and the next view triggers another full scan.
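
To make the cost concrete, the following is a rough sketch of what a from-scratch scan of this kind looks like. It is not stock CWA's code: the `normalise` rules, the `full_scan` name, the dict-shaped book records, and the two-field criteria are all assumptions chosen to keep the example small.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def full_scan(books, criteria=("title", "author")):
    """Rebuild every duplicate group from scratch on each call."""
    buckets = defaultdict(list)
    for book in books:  # visits every book, even if only one changed
        key = tuple(normalise(book.get(field, "")) for field in criteria)
        buckets[key].append(book["id"])
    # Only buckets holding two or more books are duplicate groups.
    return [ids for ids in buckets.values() if len(ids) > 1]


books = [
    {"id": 1, "title": "Dune", "author": "Frank Herbert"},
    {"id": 2, "title": "DUNE!", "author": "frank  herbert"},
    {"id": 3, "title": "Emma", "author": "Jane Austen"},
]
groups = full_scan(books)
```

Every call re-normalises and re-buckets the whole list, so adding a single book to a 10k-book library repeats all 10k normalisations. That repeated work is what the invalidate-then-rescan cycle described above pays on each import, edit, or delete.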
Fix
@navels' upstream CWA #1353 replaces the O(N) scan with an O(1) maintained index:
- `cwa_duplicate_book_keys` SQL table: one row per book with normalized keys + a criteria-aware fingerprint.
- `cps/duplicate_index.py` (+578 LOC): upsert/delete book keys, rebuild on settings change, query duplicate groups directly from indexed rows.
- Ingest path (`add_book_to_library` / `add_format_to_book`) now calls `run_duplicate_scan_for_books([book_id])` inline: a single-book index upsert in milliseconds instead of a full-library scan.
- `cwa_ingest_batch_active` marker file + `wait_for_duplicate_full_scan_to_finish()`: ingest waits for any running full scan; manual full scans block during active ingest.
- […] Resolution Preview, success/error modals).

Fork reconciliation
Backported on top of significant fork divergence in the ingest pipeline:
- PR #199 metadata.db flock (`metadata_db_write_lock()` + `_run_calibredb_add_with_retry`): required for fork issue "[bug] Auto-import locks database, crashes web service" #192 on mergerfs/SMB/NFS, where SQLite's fcntl locks silently fail.
- PR #212 SQLite UDF connect-event listener (no `TaskReconnectDatabase` re-import). Adopts `STAT_FINISH_SUCCESS`, `STAT_FAIL`, `STAT_ENDED`, `STAT_CANCELLED` to power @navels' new `_duplicate_full_scan_running()` worker-state check.
- `s6-setuidgid abc` privilege drop on the ingest worker.
- v4.0.74 / v4.0.75 fast-exit pattern (`_is_missing_ingest_target` before lock acquisition + `_load_fork_cps_imports` deferred cps imports). Reordered active-marker calls in `add_book_to_library` so flock acquisition still wraps the calibredb call, with the active marker + wait-for-full-scan check happening first.
- PR #100 batched cache invalidation: switched `delete_selected_books` / `merge_list_book` from per-batch `invalidate_duplicate_cache()` to per-batch `_queue_duplicate_scan_after_change(book_ids)`; still exactly one call per batch (not per-book), just scoped to the affected book IDs now.
- `delete_book_from_table`'s skip-flag guard tightened to `if not skip_cache_invalidation and not refreshed_duplicate_cache` to avoid a double invalidate when the new indexed refresh has already handled it.
- Simplified `run_post_batch_follow_up()` to `/cwa-internal/reconnect-db` only: the deferred `invalidate-cache` + `queue-duplicate-scan` calls from #1349 are redundant in the new architecture (per-book incremental scans happen inline during ingest).

Verification
173 unit tests pass (21 new + 4 + 17 + 11 in dedicated index test files, 3 updated in `test_ingest_batch_dirty.py` for the active marker + wait-for-full-scan path, 8 updated in `test_helper.py`, 3 updated PR #100 regression pins in `test_duplicate_manager_race_fix.py`).

Live container exercise on cwn-local:
- `cwa_duplicate_book_keys` table + index created on first container start; subsequent restarts are no-ops.
- […] `alert()`) renders KEEP/DELETE per-book breakdown with timestamps + formats in 0.02s for 2 groups.
- `POST /cwa-internal/run-duplicate-scan` with `{"book_ids":[5,12,14,15,16]}` returns `{"message":"Duplicate scan completed: 1 new groups","result_count":1,"success":true}`.
- `POST /cwa-internal/duplicate-scan-status` returns `{"full_scan_running":false,"success":true}`.

Cross-cutting sweep:
- `metadata_db_write_lock`, `_run_calibredb_add_with_retry`, `s6-setuidgid abc`, debounce 30 + clamp 10 across all 5 sites.

Confidence
~95% — verified end-to-end on cwn-local with the user-visible flow (baseline scan, Duplicates page, Preview modal) working as designed. The architectural change is internal to duplicate detection; adjacent subsystems (Kobo sync, OPDS, edit-book flow) verified unaffected via route smoke + log scan.
Inspired-by @navels in crocodilestick/Calibre-Web-Automated#1353