feat(duplicates): incremental index — backport CWA #1353 (@navels) #232

Merged
new-usemame merged 1 commit into main from backport/cwa-1353-duplicate-index
May 18, 2026

@new-usemame
Owner

Symptom

On libraries with 10k+ books, opening the Duplicates page or letting an after-import duplicate scan run held the SQLite connection through a full-library scan. With our deadlock vector eliminated in v4.0.66 (PR #212), the scan was no longer crashing — but it was still slow, blocking other requests while it ran, and rebuilding the entire duplicate group set on every change.

Root cause

Stock CWA's duplicate detection is O(N) over the book table: every scan rebuilds groups from scratch by re-normalising title/author/series/publisher/format for every book and bucketing into duplicate groups. The cache is invalidated on every import/edit/delete and the next view triggers another full scan.
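A minimal sketch of that full-scan shape, for illustration only. The `normalize` and `full_scan` helpers and the two-field key are assumptions, not stock CWA code; the point is that every call walks all N books again.

```python
from collections import defaultdict

def normalize(text):
    # Assumed normalisation: trim, collapse whitespace, case-fold.
    return " ".join((text or "").split()).casefold()

def full_scan(books):
    """Rebuild every duplicate group from scratch in one pass over all books."""
    buckets = defaultdict(list)
    for book in books:
        key = (normalize(book["title"]), normalize(book["author"]))
        buckets[key].append(book["id"])
    # Only buckets holding more than one book count as duplicate groups.
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}

books = [
    {"id": 1, "title": "The Republic", "author": "Plato"},
    {"id": 2, "title": "the  republic ", "author": "PLATO"},
    {"id": 3, "title": "Emma", "author": "Jane Austen"},
]
groups = full_scan(books)
```

Adding a fourth book under this scheme means calling `full_scan` on the whole list again, which is the cost the index removes.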

Fix

@navels' upstream CWA #1353 replaces the O(N) full scan with an incrementally maintained index whose per-book updates are O(1):

  1. New cwa_duplicate_book_keys SQL table — one row per book with normalized keys + a criteria-aware fingerprint.
  2. New cps/duplicate_index.py (+578 LOC) — upsert/delete book keys, rebuild on settings change, query duplicate groups directly from indexed rows.
  3. Each book import (add_book_to_library / add_format_to_book) now calls run_duplicate_scan_for_books([book_id]) inline — single-book index upsert in milliseconds instead of full-library scan.
  4. Manual scans and ingest interlock via a new cwa_ingest_batch_active marker file + wait_for_duplicate_full_scan_to_finish() — ingest waits for any running full scan; manual full scans block during active ingest.
  5. Duplicates page renders a one-time-baseline-scan notice on fresh installs; replaces alert dialogs with Bootstrap-modal feedback (Resolution Preview, success/error modals).
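The index idea in items 1 to 3 can be sketched with the standard `sqlite3` module. The table name and the `criteria_fingerprint` / `duplicate_key` columns come from this PR; the `book_id` column, the index name, and the three helper functions are assumptions made for the example and are not the contents of cps/duplicate_index.py.

```python
import sqlite3

# Assumed minimal schema; the real table carries more normalized key columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cwa_duplicate_book_keys (
    book_id INTEGER PRIMARY KEY,
    criteria_fingerprint TEXT NOT NULL,
    duplicate_key TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_dup_keys_demo
    ON cwa_duplicate_book_keys (criteria_fingerprint, duplicate_key);
"""

def upsert_book_key(conn, book_id, fingerprint, duplicate_key):
    # One row per book: inserting an existing book_id replaces its row.
    conn.execute(
        "INSERT OR REPLACE INTO cwa_duplicate_book_keys "
        "(book_id, criteria_fingerprint, duplicate_key) VALUES (?, ?, ?)",
        (book_id, fingerprint, duplicate_key),
    )

def delete_book_key(conn, book_id):
    conn.execute("DELETE FROM cwa_duplicate_book_keys WHERE book_id = ?", (book_id,))

def duplicate_groups(conn, fingerprint):
    # Read groups straight from indexed rows; no book metadata is re-normalised.
    rows = conn.execute(
        """
        SELECT duplicate_key, book_id
        FROM cwa_duplicate_book_keys
        WHERE criteria_fingerprint = ?
          AND duplicate_key IN (
              SELECT duplicate_key FROM cwa_duplicate_book_keys
              WHERE criteria_fingerprint = ?
              GROUP BY duplicate_key HAVING COUNT(*) > 1)
        ORDER BY duplicate_key, book_id
        """,
        (fingerprint, fingerprint),
    )
    groups = {}
    for key, book_id in rows:
        groups.setdefault(key, []).append(book_id)
    return groups

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript(SCHEMA)  # running the migration twice is harmless
upsert_book_key(conn, 1, "fp1", "the republic|plato")
upsert_book_key(conn, 2, "fp1", "the republic|plato")
upsert_book_key(conn, 3, "fp1", "emma|jane austen")
before = duplicate_groups(conn, "fp1")
upsert_book_key(conn, 3, "fp1", "the republic|plato")  # metadata edit on book 3
delete_book_key(conn, 1)                                # book 1 removed
after = duplicate_groups(conn, "fp1")
```

Each import, edit, or delete touches exactly one row here, which is where the millisecond-scale per-book update in item 3 would come from.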

Fork reconciliation

Backported on top of significant fork divergence in the ingest pipeline. The commit message below lists each reconciled item.

Verification

173 unit tests pass (21 + 4 + 17 + 11 new tests in dedicated index test files, 3 updated in test_ingest_batch_dirty.py for the active marker + wait-for-full-scan path, 8 updated in test_helper.py, and 3 updated PR #100 regression pins in test_duplicate_manager_race_fix.py).

Live container exercise on cwn-local:

  • Schema migration applied idempotently: cwa_duplicate_book_keys table + index created on first container start; subsequent restarts are no-ops.
  • Baseline full scan: 15 books indexed with normalized title + composite duplicate_key + criteria fingerprint.
  • Duplicates page: identifies "Alice's Adventures in Wonderland" (5 duplicates) + "The Republic" (2 duplicates) from the index.
  • "Run Full Duplicate Scan" baseline notice shows correctly on fresh installs; disappears after first scan.
  • Resolution Preview modal (replaces alert()) renders KEEP/DELETE per-book breakdown with timestamps + formats in 0.02s for 2 groups.
  • Sidebar "Duplicates: 2" badge updates from the indexed cache.
  • Per-book incremental scan via POST /cwa-internal/run-duplicate-scan {"book_ids":[5,12,14,15,16]} returns {"message":"Duplicate scan completed: 1 new groups","result_count":1,"success":true}.
  • POST /cwa-internal/duplicate-scan-status returns {"full_scan_running":false,"success":true}.
  • No console errors, no traceback in container logs.

Cross-cutting sweep:

  • OPDS (200), /health (200), /tasks (302), /admin/view (302), /duplicates (200 authed), /duplicates/status (200 authed) all responding.
  • Fork-critical patterns preserved (verified via grep): metadata_db_write_lock, _run_calibredb_add_with_retry, s6-setuidgid abc, debounce 30 + clamp 10 across all 5 sites.

Confidence

~95% — verified end-to-end on cwn-local with the user-visible flow (baseline scan, Duplicates page, Preview modal) working as designed. The architectural change is internal to duplicate detection; adjacent subsystems (Kobo sync, OPDS, edit-book flow) verified unaffected via route smoke + log scan.

Inspired-by @navels in crocodilestick/Calibre-Web-Automated#1353

Replaces the O(N) full-library duplicate scan with an O(1) maintained index.
Imports, metadata edits, and deletions update the index incrementally instead
of triggering a full re-scan; the existing Duplicates page becomes the
user-facing entry point for the one-time post-upgrade baseline scan.

Backport of CWA upstream #1353 by @navels. Reconciliation against fork-
divergent work:
- Preserves PR #199 metadata.db flock (`_run_calibredb_add_with_retry` +
  `metadata_db_write_lock()`) — required for fork #192 on mergerfs/SMB/NFS.
- Preserves PR #210 debounce default 30 + clamp floor 10 (vs upstream's
  60/5) — locked in by 13 regression pins.
- Preserves PR #212 SQLite UDF connect-event listener (no `TaskReconnectDatabase`
  re-import; STAT_FINISH_SUCCESS et al. taken for the new
  `_duplicate_full_scan_running()` helper that powers the manual-scan
  vs ingest interlock).
- Preserves v4.0.74 fast-exit pattern (`_is_missing_ingest_target` before
  lock acquisition) and v4.0.75 polish (`_load_fork_cps_imports` deferred
  cps imports for fast missing-target exit).
- Preserves PR #100 batched cache invalidation semantics — switches the
  mechanism from per-batch `invalidate_duplicate_cache()` to per-batch
  `_queue_duplicate_scan_after_change(book_ids)` (still exactly one call
  per batch, just scoped to the affected book IDs now).
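The debounce rule preserved from PR #210 (default 30, floor 10) could look roughly like this. The function name, the constant names, and the fallback to the default on missing or unparsable input are assumptions made for the sketch; only the two numbers come from this commit message.

```python
DEFAULT_DEBOUNCE_SECONDS = 30  # fork default (upstream uses 60)
MIN_DEBOUNCE_SECONDS = 10      # fork clamp floor (upstream uses 5)

def effective_debounce(raw_value):
    """Return the debounce to apply for a configured value (hypothetical helper)."""
    try:
        seconds = int(raw_value)
    except (TypeError, ValueError):
        # Assumed behaviour: missing or unparsable settings fall back to the default.
        return DEFAULT_DEBOUNCE_SECONDS
    # Values below the floor are raised to it; larger values pass through.
    return max(MIN_DEBOUNCE_SECONDS, seconds)
```

For example, `effective_debounce(5)` yields 10 and `effective_debounce(None)` yields 30 under these assumptions.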

New in ingest_processor.py:
- `mark_ingest_batch_active()` / `clear_ingest_batch_active()` — active-
  marker file (CWA_INGEST_BATCH_ACTIVE_FILE) the duplicate-cache path
  consults to block manual scans during ingest.
- `duplicate_full_scan_running()` + `wait_for_duplicate_full_scan_to_finish()`
  — ingest waits for any in-flight full scan before touching the library;
  prevents read-during-write races between manual scans and import.
- `run_duplicate_scan_for_books([book_id])` — per-book incremental scan
  via the new /cwa-internal/run-duplicate-scan endpoint, called synchronously
  during `add_book_to_library` and `add_format_to_book`. Each book's
  duplicate-key row is upserted in O(1) instead of triggering a full
  scan.
- `run_post_batch_follow_up()` simplified to /cwa-internal/reconnect-db
  only — the incremental updates above mean the old per-batch
  invalidate-cache + queue-debounced-scan calls are redundant.
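The marker-file interlock described above can be sketched as follows. The function names mirror those in this commit, but the signatures, the `ingest_batch_active` helper, the polling interval, and the timeout handling are assumptions, not the code in ingest_processor.py.

```python
import os
import tempfile
import time

def mark_ingest_batch_active(path):
    """Create the marker file; its existence means an ingest batch is running."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(str(os.getpid()))

def clear_ingest_batch_active(path):
    """Remove the marker; a missing file just means it was already cleared."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass

def ingest_batch_active(path):
    # Hypothetical check a manual-scan handler could make before starting.
    return os.path.exists(path)

def wait_for_duplicate_full_scan_to_finish(is_running, timeout=5.0, poll=0.05):
    """Poll is_running() until it returns False; return False on timeout."""
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True

# Demo: a fake full scan that reports "running" twice, then finishes.
marker = os.path.join(tempfile.mkdtemp(), "cwa_ingest_batch_active")
scan_states = iter([True, True, False])
active_during_ingest = False
if wait_for_duplicate_full_scan_to_finish(lambda: next(scan_states), poll=0.01):
    mark_ingest_batch_active(marker)
    try:
        # A manual full scan requested now would see the marker and be refused.
        active_during_ingest = ingest_batch_active(marker)
    finally:
        clear_ingest_batch_active(marker)
active_after_ingest = ingest_batch_active(marker)
```

Clearing the marker in a `finally` block is a design choice for the sketch: a crashed batch should not leave manual scans blocked indefinitely.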

173 new + updated unit tests:
- 21 tests in test_duplicate_index.py covering fingerprint stability,
  key-part normalization, upsert/delete semantics, group queries with
  dismissed filtering, rebuild logic, baseline checks across pending/active
  states.
- 4 tests in test_duplicate_delete_index_maintenance.py for the indexed
  delete path (key removal + cache refresh + dismissed merge).
- 17 tests in test_duplicate_scan_index_rewire.py for the new scan task
  using the index instead of full-library iteration.
- 11 tests in test_duplicate_scan_queue_settings.py for the
  queue-duplicate-scan endpoint + debounce semantics.
- 3 updated tests in test_ingest_batch_dirty.py for the active marker +
  wait-for-full-scan path.
- 8 updated tests in test_helper.py for new helper paths.
- 3 existing PR #100 regression pins in test_duplicate_manager_race_fix.py
  updated to check `_queue_duplicate_scan_after_change` (the new mechanism
  preserving the same one-call-per-batch invariant).

Schema migration: new `cwa_duplicate_book_keys` table with composite
index on (criteria_fingerprint, duplicate_key). Created idempotently via
CREATE TABLE IF NOT EXISTS — re-runs are no-ops.
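Leading the composite index with criteria_fingerprint presumably lets a group query narrow to rows built under the current settings before grouping by duplicate_key. A sketch of how such a fingerprint and key might be derived is below; the hashing scheme, the `|` separator, and all function names are assumptions for illustration, not the upstream implementation.

```python
import hashlib
import json

def normalize(value):
    # Assumed normalisation: trim, collapse whitespace, case-fold.
    return " ".join(str(value or "").split()).casefold()

def enabled_fields(criteria):
    # Sorting makes the result independent of dict ordering.
    return sorted(name for name, on in criteria.items() if on)

def criteria_fingerprint(criteria):
    """Short, stable digest of which match criteria are switched on."""
    payload = json.dumps(enabled_fields(criteria))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def duplicate_key(book, criteria):
    """Join the normalized values of the enabled fields into one key."""
    return "|".join(normalize(book.get(f)) for f in enabled_fields(criteria))

by_title_author = {"title": True, "author": True, "series": False}
reordered = {"series": False, "author": True, "title": True}
with_series = {"title": True, "author": True, "series": True}

book_a = {"title": "The Republic", "author": "Plato", "series": "Dialogues"}
book_b = {"title": " the  REPUBLIC", "author": "plato", "series": None}
```

Under this scheme a settings change produces a new fingerprint, so rows stored under the old one no longer match queries and a rebuild is needed, which is consistent with the "rebuild on settings change" behaviour the PR describes.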

Inspired-by @navels in crocodilestick/Calibre-Web-Automated#1353.
@new-usemame added the needs-review label (Operator merges after manual review) on May 18, 2026
@new-usemame merged commit 3047deb into main on May 18, 2026
9 of 10 checks passed
@new-usemame deleted the backport/cwa-1353-duplicate-index branch on May 18, 2026 at 05:34