feat(duplicates): incremental index — backport CWA #1353 (@navels) by new-usemame · Pull Request #232 · new-usemame/Calibre-Web-NextGen

new-usemame · 2026-05-18T05:25:43Z

Symptom

On libraries with 10k+ books, opening the Duplicates page or letting an after-import duplicate scan run held the SQLite connection through a full-library scan. With our deadlock vector eliminated in v4.0.66 (PR #212), the scan was no longer crashing — but it was still slow, blocking other requests while it ran, and rebuilding the entire duplicate group set on every change.

Root cause

Stock CWA's duplicate detection is O(N) over the book table: every scan rebuilds groups from scratch by re-normalising title/author/series/publisher/format for every book and bucketing into duplicate groups. The cache is invalidated on every import/edit/delete and the next view triggers another full scan.
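To make the cost concrete, here is a minimal sketch of a rebuild-from-scratch scan. It is not the stock CWA code: `normalize` and `full_scan` are hypothetical names, and only title and author are used as criteria to keep the example short.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for the real normalisation)."""
    return " ".join(value.lower().split())


def full_scan(books: list[dict]) -> dict[tuple, list[int]]:
    """Rebuild every duplicate group from scratch: one pass over all N books."""
    buckets = defaultdict(list)
    for book in books:
        key = (normalize(book["title"]), normalize(book["author"]))
        buckets[key].append(book["id"])
    # Only buckets holding more than one book count as duplicate groups.
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


books = [
    {"id": 1, "title": "The Republic", "author": "Plato"},
    {"id": 2, "title": "the  republic", "author": "PLATO"},
    {"id": 3, "title": "Alice's Adventures in Wonderland", "author": "Lewis Carroll"},
]
groups = full_scan(books)
```

Every call touches every book, so changing one book costs as much as scanning the whole library.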

Fix

@navels' upstream CWA #1353 replaces the O(N) scan with a maintained index that is updated in O(1) per changed book:

New cwa_duplicate_book_keys SQL table — one row per book with normalized keys + a criteria-aware fingerprint.
New cps/duplicate_index.py (+578 LOC) — upsert/delete book keys, rebuild on settings change, query duplicate groups directly from indexed rows.
Each book import (add_book_to_library / add_format_to_book) now calls run_duplicate_scan_for_books([book_id]) inline — single-book index upsert in milliseconds instead of full-library scan.
Manual scans and ingest interlock via a new cwa_ingest_batch_active marker file + wait_for_duplicate_full_scan_to_finish() — ingest waits for any running full scan; manual full scans block during active ingest.
Duplicates page renders a one-time-baseline-scan notice on fresh installs; replaces alert dialogs with Bootstrap-modal feedback (Resolution Preview, success/error modals).
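The index idea above can be sketched with the standard-library `sqlite3` module. Only the table name `cwa_duplicate_book_keys` and the `criteria_fingerprint` / `duplicate_key` columns come from this PR; the function names, the fingerprint scheme, and the key layout below are assumptions for illustration and do not reproduce cps/duplicate_index.py.

```python
import hashlib
import sqlite3


def normalize(value: str) -> str:
    return " ".join(value.lower().split())


def criteria_fingerprint(criteria: list[str]) -> str:
    """Hypothetical fingerprint: changes whenever the set of criteria changes."""
    joined = ",".join(sorted(criteria))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def open_index() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cwa_duplicate_book_keys ("
        " book_id INTEGER PRIMARY KEY,"
        " criteria_fingerprint TEXT NOT NULL,"
        " duplicate_key TEXT NOT NULL)"
    )
    return conn


def upsert_book_key(conn, book_id: int, fields: dict, criteria: list[str]) -> None:
    """Write or overwrite the single row belonging to one book."""
    key = "|".join(normalize(fields.get(name, "")) for name in sorted(criteria))
    conn.execute(
        "INSERT OR REPLACE INTO cwa_duplicate_book_keys VALUES (?, ?, ?)",
        (book_id, criteria_fingerprint(criteria), key),
    )


def delete_book_key(conn, book_id: int) -> None:
    conn.execute("DELETE FROM cwa_duplicate_book_keys WHERE book_id = ?", (book_id,))


def duplicate_groups(conn, criteria: list[str]) -> list[list[int]]:
    """Read groups straight from the indexed rows; no per-book re-normalisation."""
    fp = criteria_fingerprint(criteria)
    rows = conn.execute(
        "SELECT duplicate_key, book_id FROM cwa_duplicate_book_keys"
        " WHERE criteria_fingerprint = ? AND duplicate_key IN ("
        "   SELECT duplicate_key FROM cwa_duplicate_book_keys"
        "   WHERE criteria_fingerprint = ?"
        "   GROUP BY duplicate_key HAVING COUNT(*) > 1)"
        " ORDER BY duplicate_key, book_id",
        (fp, fp),
    ).fetchall()
    groups: dict[str, list[int]] = {}
    for key, book_id in rows:
        groups.setdefault(key, []).append(book_id)
    return list(groups.values())


criteria = ["title", "author"]
conn = open_index()
upsert_book_key(conn, 1, {"title": "The Republic", "author": "Plato"}, criteria)
upsert_book_key(conn, 2, {"title": "the  republic", "author": "PLATO"}, criteria)
upsert_book_key(conn, 3, {"title": "Alice's Adventures in Wonderland", "author": "Lewis Carroll"}, criteria)
before = duplicate_groups(conn, criteria)
delete_book_key(conn, 2)
after = duplicate_groups(conn, criteria)
```

An import, edit, or delete now touches one row, and a change of criteria changes the fingerprint, which is what would prompt a rebuild in this sketch.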

Fork reconciliation

Backported on top of significant fork divergence in the ingest pipeline:

Preserves the PR #199 ("fix(ingest): coordinate metadata.db writes; stop concurrent-ingest crash (#192)") metadata.db flock (metadata_db_write_lock() + _run_calibredb_add_with_retry), required for fork issue #192 ("[bug] Auto-import locks database, crashes web service") on mergerfs/SMB/NFS, where SQLite's fcntl locks silently fail.
Preserves the PR #210 ("fix(duplicate-scan): revert debounce default 5 -> 30 (CWA fe60df7 regression)") debounce default 30 + clamp floor 10 across all 5 mirrored sites (schema, 3 Python fallbacks, template). Upstream's 60/5 is strictly safer in stock CWA, but our PR #212 ("fix(db): register SQLite UDFs once via connect event listener (eliminate deadlock vector)") already eliminated the deadlock vector that motivated the conservative default; existing fork users on 30 stay at 30, with no new migration.
Preserves the PR #212 SQLite UDF connect-event listener (no TaskReconnectDatabase re-import). Adopts STAT_FINISH_SUCCESS, STAT_FAIL, STAT_ENDED, STAT_CANCELLED to power @navels' new _duplicate_full_scan_running() worker-state check.
Preserves the PR #37 ("Backport upstream #1290: Run ingest service python workers as abc, not root") / PR #56 ("validate + re-land #37: ingest workers as abc + chown migration") s6-setuidgid abc privilege drop on the ingest worker.
Preserves v4.0.74 / v4.0.75 fast-exit pattern (_is_missing_ingest_target before lock acquisition + _load_fork_cps_imports deferred cps imports). Reordered active-marker calls in add_book_to_library so flock acquisition still wraps the calibredb call, with the active marker + wait-for-full-scan check happening first.
Preserves the PR #100 ("Backport upstream #1095: Fix race condition in Duplicate Manager causing UI freeze/crash") batched cache invalidation semantics. Switches the mechanism in delete_selected_books / merge_list_book from per-batch invalidate_duplicate_cache() to per-batch _queue_duplicate_scan_after_change(book_ids): still exactly one call per batch (not per-book), just scoped to the affected book IDs now. delete_book_from_table's skip-flag guard is tightened to `if not skip_cache_invalidation and not refreshed_duplicate_cache` to avoid a double invalidate when the new indexed refresh has already handled it.
Simplifies run_post_batch_follow_up() to /cwa-internal/reconnect-db only — the deferred invalidate-cache + queue-duplicate-scan calls from #1349 are redundant in the new architecture (per-book incremental scans happen inline during ingest).
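The ordering described for add_book_to_library (active marker and wait-for-full-scan first, then the write lock around the import, then the per-book index update) can be sketched as follows. All names are hypothetical, a `threading.Lock` stands in for the metadata.db flock, and the marker is cleared at the end of each call for simplicity, whereas the real marker is a per-batch file.

```python
import contextlib
import os
import tempfile
import threading
import time

# Hypothetical marker location; the real path is defined by the fork.
MARKER = os.path.join(tempfile.mkdtemp(), "cwa_ingest_batch_active")
write_lock = threading.Lock()  # stand-in for the metadata.db flock


def mark_active(path: str) -> None:
    open(path, "w").close()


def clear_active(path: str) -> None:
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


def wait_for_full_scan(is_running, timeout: float = 5.0, poll: float = 0.01) -> bool:
    """Poll until no full scan is running; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


def add_book(book_id: int, is_running, events: list[str]) -> bool:
    mark_active(MARKER)  # 1. announce ingest so manual full scans hold off
    try:
        if not wait_for_full_scan(is_running):  # 2. let any running scan finish
            events.append("timeout")
            return False
        events.append("scan-idle")
        with write_lock:  # 3. the lock wraps only the import itself
            events.append(f"import:{book_id}")
        events.append(f"index:{book_id}")  # 4. single-book index update
        return True
    finally:
        clear_active(MARKER)


events: list[str] = []
pending = iter([True, True, False])  # a full scan that finishes after two polls
ok = add_book(7, lambda: next(pending), events)
```

Taking the lock only after the wait keeps the import from holding it while a full scan is still reading the library.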

Verification

173 unit tests pass (21 new + 4 + 17 + 11 in dedicated index test files, 3 updated in test_ingest_batch_dirty.py for the active marker + wait-for-full-scan path, 8 updated in test_helper.py, 3 updated PR #100 regression pins in test_duplicate_manager_race_fix.py).

Live container exercise on cwn-local:

Schema migration applied idempotently: cwa_duplicate_book_keys table + index created on first container start; subsequent restarts are no-ops.
Baseline full scan: 15 books indexed with normalized title + composite duplicate_key + criteria fingerprint.
Duplicates page: identifies "Alice's Adventures in Wonderland" (5 duplicates) + "The Republic" (2 duplicates) from the index.
"Run Full Duplicate Scan" baseline notice shows correctly on fresh installs; disappears after first scan.
Resolution Preview modal (replaces alert()) renders KEEP/DELETE per-book breakdown with timestamps + formats in 0.02s for 2 groups.
Sidebar "Duplicates: 2" badge updates from the indexed cache.
Per-book incremental scan via POST /cwa-internal/run-duplicate-scan {"book_ids":[5,12,14,15,16]} returns {"message":"Duplicate scan completed: 1 new groups","result_count":1,"success":true}.
POST /cwa-internal/duplicate-scan-status returns {"full_scan_running":false,"success":true}.
No console errors, no traceback in container logs.
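For reference, the per-book scan request from the list above could be built like this. The endpoint path, payload shape, and sample response are taken from the exercise above; the base URL, the Content-Type header, and the absence of authentication are assumptions, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_scan_request(base_url: str, book_ids: list[int]) -> urllib.request.Request:
    """Build (but do not send) a POST to the per-book duplicate scan endpoint."""
    body = json.dumps({"book_ids": book_ids}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/cwa-internal/run-duplicate-scan",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan_succeeded(raw_response: bytes) -> bool:
    """Read the `success` flag from a response body shaped like the one above."""
    return bool(json.loads(raw_response).get("success"))


# Assumed local address; adjust to wherever the container is listening.
request = build_scan_request("http://localhost:8083", [5, 12, 14, 15, 16])
sample = b'{"message":"Duplicate scan completed: 1 new groups","result_count":1,"success":true}'
succeeded = scan_succeeded(sample)
```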

Cross-cutting sweep:

OPDS (200), /health (200), /tasks (302), /admin/view (302), /duplicates (200 authed), /duplicates/status (200 authed) all responding.
Fork-critical patterns preserved (verified via grep): metadata_db_write_lock, _run_calibredb_add_with_retry, s6-setuidgid abc, debounce 30 + clamp 10 across all 5 sites.

Confidence

~95% — verified end-to-end on cwn-local with the user-visible flow (baseline scan, Duplicates page, Preview modal) working as designed. The architectural change is internal to duplicate detection; adjacent subsystems (Kobo sync, OPDS, edit-book flow) verified unaffected via route smoke + log scan.

Inspired-by @navels in crocodilestick/Calibre-Web-Automated#1353

Replaces the O(N) full-library duplicate scan with an O(1) maintained index. Imports, metadata edits, and deletions update the index incrementally instead of triggering a full re-scan; the existing Duplicates page becomes the user-facing entry point for the one-time post-upgrade baseline scan.

Backport of CWA upstream #1353 by @navels.

Reconciliation against fork-divergent work:
- Preserves PR #199 metadata.db flock (`_run_calibredb_add_with_retry` + `metadata_db_write_lock()`), required for fork #192 on mergerfs/SMB/NFS.
- Preserves PR #210 debounce default 30 + clamp floor 10 (vs upstream's 60/5), locked in by 13 regression pins.
- Preserves PR #212 SQLite UDF connect-event listener (no `TaskReconnectDatabase` re-import; STAT_FINISH_SUCCESS et al. taken for the new `_duplicate_full_scan_running()` helper that powers the manual-scan vs ingest interlock).
- Preserves v4.0.74 fast-exit pattern (`_is_missing_ingest_target` before lock acquisition) and v4.0.75 polish (`_load_fork_cps_imports` deferred cps imports for fast missing-target exit).
- Preserves PR #100 batched cache invalidation semantics: switches the mechanism from per-batch `invalidate_duplicate_cache()` to per-batch `_queue_duplicate_scan_after_change(book_ids)` (still exactly one call per batch, just scoped to the affected book IDs now).

New in ingest_processor.py:
- `mark_ingest_batch_active()` / `clear_ingest_batch_active()`: active-marker file (CWA_INGEST_BATCH_ACTIVE_FILE) the duplicate-cache path consults to block manual scans during ingest.
- `duplicate_full_scan_running()` + `wait_for_duplicate_full_scan_to_finish()`: ingest waits for any in-flight full scan before touching the library; prevents read-during-write races between manual scans and import.
- `run_duplicate_scan_for_books([book_id])`: per-book incremental scan via the new /cwa-internal/run-duplicate-scan endpoint, called synchronously during `add_book_to_library` and `add_format_to_book`. Each book's duplicate-key row is upserted in O(1) instead of triggering a full scan.
- `run_post_batch_follow_up()` simplified to /cwa-internal/reconnect-db only; the incremental updates above mean the old per-batch invalidate-cache + queue-debounced-scan calls are redundant.

173 new + updated unit tests:
- 21 tests in test_duplicate_index.py covering fingerprint stability, key-part normalization, upsert/delete semantics, group queries with dismissed filtering, rebuild logic, baseline checks across pending/active states.
- 4 tests in test_duplicate_delete_index_maintenance.py for the indexed delete path (key removal + cache refresh + dismissed merge).
- 17 tests in test_duplicate_scan_index_rewire.py for the new scan task using the index instead of full-library iteration.
- 11 tests in test_duplicate_scan_queue_settings.py for the queue-duplicate-scan endpoint + debounce semantics.
- 3 updated tests in test_ingest_batch_dirty.py for the active marker + wait-for-full-scan path.
- 8 updated tests in test_helper.py for new helper paths.
- 3 existing PR #100 regression pins in test_duplicate_manager_race_fix.py updated to check `_queue_duplicate_scan_after_change` (the new mechanism preserving the same one-call-per-batch invariant).

Schema migration: new `cwa_duplicate_book_keys` table with composite index on (criteria_fingerprint, duplicate_key). Created idempotently via CREATE TABLE IF NOT EXISTS; re-runs are no-ops.

Inspired-by @navels in crocodilestick/Calibre-Web-Automated#1353.
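A minimal sketch of why the schema migration described above is safe to re-run, assuming a `book_id` primary key and a hypothetical index name (neither is confirmed by this PR; only the table name and the two indexed columns are):

```python
import sqlite3

# Column set and index name are assumptions; the composite index columns
# (criteria_fingerprint, duplicate_key) follow the commit message.
MIGRATION = """
CREATE TABLE IF NOT EXISTS cwa_duplicate_book_keys (
    book_id INTEGER PRIMARY KEY,
    criteria_fingerprint TEXT NOT NULL,
    duplicate_key TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_cwa_duplicate_book_keys_lookup
    ON cwa_duplicate_book_keys (criteria_fingerprint, duplicate_key);
"""


def migrate(conn: sqlite3.Connection) -> None:
    """Apply the migration; IF NOT EXISTS makes repeated runs a no-op."""
    conn.executescript(MIGRATION)


def schema_objects(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """List the schema objects attached to the duplicate-key table."""
    return conn.execute(
        "SELECT type, name FROM sqlite_master"
        " WHERE tbl_name = 'cwa_duplicate_book_keys' ORDER BY type, name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
migrate(conn)
first = schema_objects(conn)
migrate(conn)  # simulates a container restart
second = schema_objects(conn)
```

Comparing `first` and `second` shows the second run neither fails nor creates anything new.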

new-usemame added the needs-review Operator merges after manual review label May 18, 2026

new-usemame merged commit 3047deb into main May 18, 2026
9 of 10 checks passed

new-usemame deleted the backport/cwa-1353-duplicate-index branch May 18, 2026 05:34

new-usemame added a commit that referenced this pull request May 18, 2026

docs(changes): record #232 squash SHA + add row for v4.0.76 (CWA #1353 …

bf75bd2

…@navels backport)

This was referenced May 19, 2026

feat(health): probe s6 services + bound healthcheck curl + close gevent keepalives #251

Merged

Upstream PR merge request: Imrpove duplicate scanning performance #207

Closed
