Skip to content

feat(embeddings): storage layer — state.db schema + vector search + query cache#38

Merged
plind-junior merged 18 commits into
vouchdev:mainfrom
dripsmvcp:feat/semantic-search-phase-2-storage
May 21, 2026
Merged

feat(embeddings): storage layer — state.db schema + vector search + query cache#38
plind-junior merged 18 commits into
vouchdev:mainfrom
dripsmvcp:feat/semantic-search-phase-2-storage

Conversation

@dripsmvcp

@dripsmvcp dripsmvcp commented May 20, 2026

Copy link
Copy Markdown
Contributor

Stacked on Phase 1 (feat/semantic-search). Base the PR against that branch, not against main, so the diff stays focused on storage. Once Phase 1 merges, GitHub will retarget this automatically.

Summary

Phase 2 of the semantic search feature: the storage layer. Adds three new tables to .vouch/state.db (embedding_index, query_embedding_cache, embedding_dupes), the put/get/search helpers in vouch.index_db, and a query-embedding LRU cache. Still no user-visible behavior change — nothing in KBStore.put_* or kb.search reaches this code yet. Phase 3 wires the write hooks; Phase 4 wires the read path.

Tracks docs/superpowers/specs/2026-05-20-semantic-search-design.md §4 (Storage layout) and Tasks 7-10 of docs/superpowers/plans/2026-05-20-semantic-search.md.

What's in this PR

5 commits, one per task plus a small style cleanup:

Commit Purpose
2d8ffa5 Schema additions to state.db + put_embedding / get_embedding / set_embedding_meta / get_embedding_meta helpers
cc2a7e0 search_embedding(kb_dir, query_vec, kinds, limit, min_score) — pure NumPy brute-force cosine path
3480507 sqlite-vec ANN path layered on top, with idempotent _load_sqlite_vec(conn) extension loader and graceful NumPy fallback when the extension is unavailable
35b7549 vouch.embeddings.cachecache_query_vec / lookup_query_vec / query_cache_size / query_cache_clear / query_cache_stats; bounded LRU eviction by last_used_at
5177248 style: cleanup for ruff E402 / SIM105 / E702 / I001 violations introduced by the verbatim plan code
File Change
src/vouch/index_db.py Extended SCHEMA (3 new tables, 1 index); appended ~140 LOC of embedding helpers (put/get/search/meta + sqlite-vec loader)
src/vouch/embeddings/cache.py (new) Query-embedding cache, ~80 LOC
tests/embeddings/test_storage.py (new) 12 tests covering schema, idempotent put, brute-force search, sqlite-vec/NumPy parity, cache round-trip, LRU eviction

Design notes

One ANN path with a NumPy escape hatch. _load_sqlite_vec(conn) does best-effort extension loading and is idempotent — repeat calls on the same connection return cached state. If the extension is present, search_embedding uses vec_distance_cosine directly in SQL with ORDER BY + LIMIT pushdown; if not (or if it raises OperationalError), the function falls back to loading all candidate rows and computing cosine in NumPy. The result shape is identical: list[tuple[kind, id, snippet, score]] to match index_db.search (FTS5) for drop-in compatibility.

Content-hash idempotency. put_embedding uses INSERT OR REPLACE keyed on (kind, id) — re-encoding the same artifact text is a no-op at the storage layer (Phase 3 will add the skip-if-unchanged decision based on content_hash before even calling encode).

Model identity is first-class. set_embedding_meta / get_embedding_meta persist the model name, version, and dim into the existing index_meta key/value table. Phase 6's migration module reads these to detect a model swap and trigger a backfill warning.

Query cache bounded by last_used_at. cache_query_vec(..., max_entries=1024) runs an explicit eviction sweep after each insert. Cheap and predictable; no background task or LRU dict-in-memory.

Test Plan

  • .venv/bin/pytest tests/embeddings -v20 passed, 4 deselected (5 schema + 3 brute-force + 2 sqlite-vec + 2 cache = 12 new; 8 from Phase 1 still pass)
  • .venv/bin/python -m ruff check src/vouch/index_db.py src/vouch/embeddings/cache.py tests/embeddings/test_storage.pyclean
  • .venv/bin/pytest --ignore=tests/test_sessions.py -q → no regression in pre-existing tests
  • sqlite-vec ANN code path is exercised in CI when the extension is installed (the _load_sqlite_vec test asserts only loader idempotency; the actual ANN cosine vs NumPy parity test passes under both paths via the same search_embedding entry point)

Summary by CodeRabbit

  • New Features

    • Embedding-based semantic search with multi-model support, hybrid/rerank modes, and a query embedding cache for faster repeated queries
  • Documentation

    • Detailed semantic-search design spec covering integration modes, storage/migration/backfill, CLI/tool parity, rollout, and acceptance criteria
  • Chores

    • Optional dependency groups for embeddings/reranking; integration tests skipped by default; typing overrides for optional libs
  • Tests

    • Extensive unit and integration tests for embedders, storage, search, cache, and migration
  • Bug Fixes

    • Skip schema validation for raw source content to avoid spurious failures

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an embedding-based semantic search subsystem: design spec and packaging, a pluggable Embedder registry with three adapters, SQLite-backed query caching, extended index DB embedding tables and search APIs (sqlite_vec or Python fallback), and unit/integration tests.

Changes

Embedding-Based Semantic Search

Layer / File(s) Summary
Design specification and dependency configuration
docs/superpowers/specs/2026-05-20-semantic-search-design.md, pyproject.toml
Design spec outlines embedding-based retrieval with FTS5 fallback, storage schema, CLI/MCP/JSONL parity, and rollout plan. Dependency groups (embeddings, embeddings-fast, rerank) and pytest/mypy config enable optional model installation and test filtering.
Embedder foundation and adapter registry
src/vouch/embeddings/base.py
Abstract Embedder base class with encode(text) and encode_batch(texts) contracts; content_hash() for deduplication; and register()/get_embedder() factory registry.
Package init & adapter wiring
src/vouch/embeddings/__init__.py
Re-exports registry API and conditionally imports available adapters to trigger auto-registration without hard dependency failures.
Concrete embedding adapter implementations
src/vouch/embeddings/st_mpnet.py, src/vouch/embeddings/st_minilm.py, src/vouch/embeddings/fastembed_bge.py
Three pluggable adapters: ST-MPNet (768-dim, default), ST-MiniLM (384-dim), and FastEmbed BGE (384-dim). Each implements lazy model loading, per-text and batch encoding with L2 normalization, and auto-registers with the adapter factory.
Query embedding cache layer
src/vouch/embeddings/cache.py
SQLite-backed LRU cache for query embeddings keyed by SHA-256 hash. Provides cache_query_vec() with automatic eviction of least-recently-used entries, lookup_query_vec() with hit-count tracking, and cache stats/clear utilities.
Storage and semantic search integration
src/vouch/index_db.py
Extends .vouch/state.db schema with embedding_index, query_embedding_cache, and embedding_dupes tables. Implements put_embedding() and get_embedding() for vector persistence; search_embedding() performs cosine-similarity search via sqlite_vec extension (with Python fallback); metadata and loader helpers complete the API.
Unit tests for embeddings core API
tests/embeddings/__init__.py, tests/embeddings/_fakes.py, tests/embeddings/test_core.py
MockEmbedder test double generates deterministic vectors from SHA-256 hashing. Unit tests validate content_hash() stability, Embedder abstract enforcement, registry resolution, and mock encoder determinism.
Integration and storage tests
tests/embeddings/test_integration.py, tests/embeddings/test_storage.py, tests/embeddings/conftest.py
Integration tests load and verify real adapters (ST-MPNet semantic ranking, ST-MiniLM shape, FastEmbed output). Storage tests validate schema creation, embedding round-trips, metadata persistence, top-k search ordering, kind filtering, and LRU cache bounds. conftest.py skips the directory when numpy is absent.
Bundle validation tweak
src/vouch/bundle.py
Skip schema/model validation for opaque sources/<sha>/content artifacts while continuing to validate sources/<sha>/meta.yaml.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant IndexDB
  participant SQLite
  participant sqlite_vec
  Client->>IndexDB: search_embedding(query_vec, kinds, limit)
  IndexDB->>SQLite: attempt load sqlite_vec extension
  alt sqlite_vec available
    IndexDB->>sqlite_vec: SQL cosine scoring(query_vec)
    sqlite_vec-->>IndexDB: top-k rows
    IndexDB-->>Client: results with scores
  else fallback
    IndexDB->>SQLite: fetch embedding_index candidate rows
    IndexDB->>IndexDB: compute cosine scores in Python
    IndexDB-->>Client: sorted results
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Poem

🐇 I hashed and normalized, line by line,
Vectors tucked in SQLite, snug and fine,
Adapters wake, the cache hums low,
Queries fly, and results now know,
Hopping search paths, I sing—vouch, go!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: implementation of the storage layer for embeddings with three key components (state.db schema, vector search, query cache).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
tests/embeddings/test_storage.py (1)

167-172: ⚡ Quick win

Assert eviction correctness, not just cache size.

This currently validates only capacity, so a non-LRU implementation could still pass. Add key-level checks to prove oldest entries are evicted.

Proposed test tightening
 def test_query_cache_lru_eviction(kb_dir: Path) -> None:
-    from vouch.embeddings.cache import cache_query_vec, query_cache_size
+    from vouch.embeddings.cache import cache_query_vec, lookup_query_vec, query_cache_size
     v = np.ones(4, dtype=np.float32)
     for i in range(5):
         cache_query_vec(kb_dir, query=f"q{i}", vec=v, max_entries=3)
     assert query_cache_size(kb_dir) <= 3
+    assert lookup_query_vec(kb_dir, query="q0") is None
+    assert lookup_query_vec(kb_dir, query="q1") is None
+    assert lookup_query_vec(kb_dir, query="q2") is not None
+    assert lookup_query_vec(kb_dir, query="q3") is not None
+    assert lookup_query_vec(kb_dir, query="q4") is not None
src/vouch/index_db.py (1)

40-50: 🏗️ Heavy lift

This sqlite-vec path is interim brute-force SQL; vec0 indexed search is already planned.

embedding_index is currently a plain table using vec_distance_cosine(...) ORDER BY, which is the manual KNN pattern—equivalent to O(n) full scan. However, the approved semantic search design spec (section 4) already calls for migrating to sqlite-vec vec0 virtual tables (CREATE VIRTUAL TABLE <kind>_vecs USING vec0(embedding float[768])), which enable indexed ANN. This is tracked as part of the broader embedding integration milestone; the current code is a transitional fallback until that refactor lands.

Also applies to: 294-300

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 40 - 50, The embedding_index table
currently uses a brute-force KNN pattern (vec_distance_cosine ORDER BY) causing
O(n) scans; replace this interim table with sqlite-vec indexed ANN when ready by
adding vec0 virtual tables per kind (e.g., CREATE VIRTUAL TABLE <kind>_vecs
USING vec0(embedding float[dim])) and migrate inserts/queries to write/read from
those virtual tables instead of embedding_index; update any code paths that
reference embedding_index and vec_distance_cosine (search logic and
insert/update routines) to use the new <kind>_vecs table and vec0 query API,
while keeping embedding_index only as a transitional fallback until full
migration is complete.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/specs/2026-05-20-semantic-search-design.md`:
- Around line 33-47: The fenced code block that lists "src/vouch/embeddings/"
lacks a language tag which triggers markdownlint MD040; update the opening fence
from ``` to ```text (or another appropriate language) for the block containing
"src/vouch/embeddings/" so the linter recognizes it and CI noise is avoided.
Locate the triple-backtick block that begins immediately before the
"src/vouch/embeddings/" listing and add the language identifier to the opening
fence.

In `@src/vouch/embeddings/__init__.py`:
- Around line 20-33: Replace the bare try/except/pass import blocks with
contextlib.suppress(ImportError): blocks to satisfy Ruff SIM105; import
contextlib (or from contextlib import suppress) at the top of the module and
wrap each import of st_mpnet, st_minilm, and fastembed_bge inside a
suppress(ImportError) context manager so the imports are quietly ignored but the
pattern no longer uses empty except clauses.

In `@src/vouch/embeddings/cache.py`:
- Around line 15-16: The cache key currently uses only the query hash in
_query_key, which can return embeddings from a different
model/version/dimension; update key construction to incorporate the embedder
identity by including model, model_version (or version), and dim (embedding
dimensionality) alongside the query hash (e.g., include these values in the
string hashed by _query_key or create a new _cache_key(query, model,
model_version, dim)). Alternatively, persist model/model_version/dim as separate
columns when storing embeddings and add explicit checks in the lookup code (the
retrieval path around lines 19-21) to reject/ignore cached rows whose
model/version/dim do not match the current embedder before returning results.
Ensure you reference and update the functions that compute and use the cache key
(_query_key and the lookup/store code) so mismatched shapes are never returned.
- Line 24: The code sets last_used_at using
dt.datetime.now(dt.UTC).isoformat(timespec="seconds"), which loses sub-second
precision and breaks LRU eviction ordering; update all occurrences (e.g., the
assignment to now and any places referenced by last_used_at) to include
sub-second precision—either use isoformat(timespec="microseconds") or store an
integer epoch with microsecond resolution
(int(datetime.now(tz=dt.UTC).timestamp() * 1_000_000)) consistently across the
functions that set last_used_at so eviction order remains deterministic under
bursty access.

In `@src/vouch/index_db.py`:
- Around line 40-47: search_embedding() is currently scoring all rows for
requested kinds and ignores stored embedding metadata, which risks
mismatched-dimension errors and polluted rankings; update both search paths in
search_embedding() to JOIN or WHERE against the active embedding metadata
(model, model_version, dim) stored for the DB and add WHERE clauses like model =
?, model_version = ?, dim = ? so q @ vec only runs on matching-dim vectors. Also
update set_embedding_meta() to either delete/purge stale embedding_index rows
whose model/model_version/dim do not match the newly set metadata or mark them
invalid, so subsequent searches remain constrained to current metadata;
reference the embedding_index table and the functions search_embedding() and
set_embedding_meta() when making these changes.

In `@tests/embeddings/_fakes.py`:
- Around line 32-42: The mock vector generator unpacks raw SHA256 bytes as
float32 (struct.unpack("<f", ...)) which can produce extremely large/NaN
intermediate values; change the numeric mapping to a bounded, stable range
instead: for each 4-byte chunk (the loop using seed, h, j and writing into
out[i]), convert the 4 bytes to an unsigned 32-bit integer (e.g., via
int.from_bytes(..., "little")), then scale that integer into a stable float
range like [-1.0, 1.0] or [0.0, 1.0] before assigning into out[i] (cast to
np.float32 if needed); keep the existing normalization step and loop structure
(references: seed, h, j, out, i, self.dim, np.linalg.norm) but replace the
struct.unpack("<f", ...) usage with the bounded scaling to avoid overflow
warnings during norm computation.

---

Nitpick comments:
In `@src/vouch/index_db.py`:
- Around line 40-50: The embedding_index table currently uses a brute-force KNN
pattern (vec_distance_cosine ORDER BY) causing O(n) scans; replace this interim
table with sqlite-vec indexed ANN when ready by adding vec0 virtual tables per
kind (e.g., CREATE VIRTUAL TABLE <kind>_vecs USING vec0(embedding float[dim]))
and migrate inserts/queries to write/read from those virtual tables instead of
embedding_index; update any code paths that reference embedding_index and
vec_distance_cosine (search logic and insert/update routines) to use the new
<kind>_vecs table and vec0 query API, while keeping embedding_index only as a
transitional fallback until full migration is complete.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 81d92f16-988c-4fbf-a818-b1ffdd4f15a4

📥 Commits

Reviewing files that changed from the base of the PR and between 056e88c and 5177248.

📒 Files selected for processing (15)
  • docs/superpowers/plans/2026-05-20-semantic-search.md
  • docs/superpowers/specs/2026-05-20-semantic-search-design.md
  • pyproject.toml
  • src/vouch/embeddings/__init__.py
  • src/vouch/embeddings/base.py
  • src/vouch/embeddings/cache.py
  • src/vouch/embeddings/fastembed_bge.py
  • src/vouch/embeddings/st_minilm.py
  • src/vouch/embeddings/st_mpnet.py
  • src/vouch/index_db.py
  • tests/embeddings/__init__.py
  • tests/embeddings/_fakes.py
  • tests/embeddings/test_core.py
  • tests/embeddings/test_integration.py
  • tests/embeddings/test_storage.py

Comment on lines +33 to +47
```
src/vouch/embeddings/
__init__.py # public API: encode, search, register
base.py # Embedder ABC + adapter registry
st_mpnet.py # default impl (sentence-transformers all-mpnet-base-v2)
st_minilm.py # alternative impl
fastembed_bge.py # alternative impl (no-torch path via fastembed)
cache.py # query embedding LRU + content-hash skip cache
rerank.py # cross-encoder reranker (ms-marco-MiniLM-L6-v2)
hyde.py # Hypothetical Document Embedding query expansion
dedup.py # cosine-threshold duplicate detection at ingest
fusion.py # RRF, weighted-sum, normalized-cosine fusion strategies
eval.py # recall@k / MRR / nDCG harness
migration.py # model-identity check + backfill orchestration
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the fenced code block.

Line 33 triggers markdownlint MD040 (fenced-code-language). This can cause doc-lint CI noise/failures depending on pipeline strictness.

Proposed fix
-```
+```text
 src/vouch/embeddings/
   __init__.py                # public API: encode, search, register
   base.py                    # Embedder ABC + adapter registry
   st_mpnet.py                # default impl (sentence-transformers all-mpnet-base-v2)
   st_minilm.py               # alternative impl
   fastembed_bge.py           # alternative impl (no-torch path via fastembed)
   cache.py                   # query embedding LRU + content-hash skip cache
   rerank.py                  # cross-encoder reranker (ms-marco-MiniLM-L6-v2)
   hyde.py                    # Hypothetical Document Embedding query expansion
   dedup.py                   # cosine-threshold duplicate detection at ingest
   fusion.py                  # RRF, weighted-sum, normalized-cosine fusion strategies
   eval.py                    # recall@k / MRR / nDCG harness
   migration.py               # model-identity check + backfill orchestration
</details>

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>

[warning] 33-33: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/superpowers/specs/2026-05-20-semantic-search-design.md around lines 33

  • 47, The fenced code block that lists "src/vouch/embeddings/" lacks a language
    tag which triggers markdownlint MD040; update the opening fence from ``` to
"src/vouch/embeddings/" so the linter recognizes it and CI noise is avoided.
Locate the triple-backtick block that begins immediately before the
"src/vouch/embeddings/" listing and add the language identifier to the opening
fence.

Comment thread src/vouch/embeddings/__init__.py Outdated
Comment on lines +15 to +16
def _query_key(query: str) -> str:
return hashlib.sha256(query.encode("utf-8")).hexdigest()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Namespace cache entries by model/version/dimension.

Right now the cache key is only the query hash. If the configured embedder changes model, version, or dimensionality, the same query will reuse a stale vector; that can either fail downstream on shape mismatches or silently rank against the wrong embedding space. Include (model, model_version, dim) in the key, or persist those columns and reject mismatches on lookup.

Also applies to: 19-21

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The cache key currently
uses only the query hash in _query_key, which can return embeddings from a
different model/version/dimension; update key construction to incorporate the
embedder identity by including model, model_version (or version), and dim
(embedding dimensionality) alongside the query hash (e.g., include these values
in the string hashed by _query_key or create a new _cache_key(query, model,
model_version, dim)). Alternatively, persist model/model_version/dim as separate
columns when storing embeddings and add explicit checks in the lookup code (the
retrieval path around lines 19-21) to reject/ignore cached rows whose
model/version/dim do not match the current embedder before returning results.
Ensure you reference and update the functions that compute and use the cache key
(_query_key and the lookup/store code) so mismatched shapes are never returned.

) -> None:
h = _query_key(query)
blob = np.asarray(vec, dtype=np.float32).tobytes()
now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use sub-second precision for last_used_at.

timespec="seconds" gives every access in the same second the same sort key, so eviction order becomes arbitrary under normal bursty usage. Switching to microseconds or an integer epoch timestamp preserves the LRU contract much more reliably.

Suggested change
-    now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds")
+    now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds")
...
-    now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds")
+    now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds")

Also applies to: 37-43, 48-59

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` at line 24, The code sets last_used_at using
dt.datetime.now(dt.UTC).isoformat(timespec="seconds"), which loses sub-second
precision and breaks LRU eviction ordering; update all occurrences (e.g., the
assignment to now and any places referenced by last_used_at) to include
sub-second precision—either use isoformat(timespec="microseconds") or store an
integer epoch with microsecond resolution
(int(datetime.now(tz=dt.UTC).timestamp() * 1_000_000)) consistently across the
functions that set last_used_at so eviction order remains deterministic under
bursty access.

Comment thread src/vouch/index_db.py
Comment on lines +40 to +47
CREATE TABLE IF NOT EXISTS embedding_index (
kind TEXT NOT NULL,
id TEXT NOT NULL,
vec BLOB NOT NULL,
content_hash TEXT NOT NULL,
model TEXT NOT NULL,
model_version TEXT NOT NULL,
dim INTEGER NOT NULL,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Filter searches to the active embedding metadata.

You persist model, model_version, and dim, but search_embedding() ignores all three and scores every row of the requested kinds. After a model upgrade or partial reindex, mixed rows will either blow up on q @ vec when dims differ or pollute ranking when dims happen to match. Constrain both search paths to the active metadata, or purge stale rows when set_embedding_meta() changes it.

Also applies to: 228-245, 290-313

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 40 - 47, search_embedding() is currently
scoring all rows for requested kinds and ignores stored embedding metadata,
which risks mismatched-dimension errors and polluted rankings; update both
search paths in search_embedding() to JOIN or WHERE against the active embedding
metadata (model, model_version, dim) stored for the DB and add WHERE clauses
like model = ?, model_version = ?, dim = ? so q @ vec only runs on matching-dim
vectors. Also update set_embedding_meta() to either delete/purge stale
embedding_index rows whose model/model_version/dim do not match the newly set
metadata or mark them invalid, so subsequent searches remain constrained to
current metadata; reference the embedding_index table and the functions
search_embedding() and set_embedding_meta() when making these changes.

Comment thread tests/embeddings/_fakes.py Outdated
@dripsmvcp dripsmvcp force-pushed the feat/semantic-search-phase-2-storage branch from 5177248 to 13e709d Compare May 20, 2026 05:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pyproject.toml (2)

21-22: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove duplicate click dependency declaration.

Both lines declare click with different version constraints (>=8.4.0,<9 vs >=8.1,<9). This duplication can cause build-tool errors or unpredictable dependency resolution.

🔧 Proposed fix
-    "click>=8.4.0,<9",
     "click>=8.1,<9",

Keep the less restrictive constraint (>=8.1) unless there's a specific requirement for 8.4.0+.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` around lines 21 - 22, Remove the duplicate click dependency
entry in pyproject.toml: keep a single "click" requirement and remove the other
line (choose the less restrictive constraint ">=8.1,<9" unless you have a hard
requirement for ">=8.4.0,<9"); update the dependency list so only one "click"
entry remains to avoid conflicting constraints.

37-37: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Relax sentence-transformers version constraint to allow v4+ and v5+.

The constraint >=2.7,<4 blocks v4.x and v5.x (currently at 5.5.0). The codebase uses only the basic API (SentenceTransformer() initialization and .encode() with standard parameters), which is stable across v4 and v5. The v5.0 breaking changes (sparse encoding support, new encode_query/encode_document methods) do not affect this usage. Update the constraint to at least >=2.7,<6 to allow users to benefit from newer releases, after confirming compatibility with v4.x through testing.

Also applies to: 48-48

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` at line 37, Update the sentence-transformers version
constraint in pyproject.toml from "sentence-transformers>=2.7,<4" to
"sentence-transformers>=2.7,<6" (and update the other occurrence on the file,
referenced in the comment) so v4/v5 releases are allowed; keep the minimum at
2.7 and run quick compatibility checks against SentenceTransformer()
initialization and .encode() usage to confirm no changes are needed.
🧹 Nitpick comments (3)
src/vouch/index_db.py (2)

61-67: 💤 Low value

Consider adding a primary key or unique constraint to embedding_dupes.

The embedding_dupes table lacks a primary key, allowing duplicate rows for the same (kind, id, near_id) pair. If this is intentional for audit logging, this is fine. Otherwise, consider adding PRIMARY KEY (kind, id, near_id) to prevent redundant records.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 61 - 67, The embedding_dupes table
currently has no primary key, which allows duplicate (kind,id,near_id) rows;
update the CREATE TABLE for embedding_dupes to enforce uniqueness by adding a
PRIMARY KEY (kind, id, near_id) or a UNIQUE(kind, id, near_id) constraint so
duplicates are prevented; if audit logging of duplicates is intended instead,
add an explicit autoincrement id column (e.g., dup_id) as PRIMARY KEY and keep
(kind, id, near_id) indexed or made UNIQUE as appropriate.

248-271: 💤 Low value

Connection-id caching can give false positives after connection reuse.

_sqlite_vec_loaded tracks connections by id(conn). If a connection is closed and Python reuses that object id for a new connection, the cache will incorrectly report the extension as loaded. This is unlikely with the short-lived open_db context manager pattern but could surface under different usage patterns.

Consider clearing the id from _sqlite_vec_loaded when the connection closes, or using a connection attribute instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 248 - 271, The current cache
`_sqlite_vec_loaded` uses id(conn) which can be reused and give false positives;
change the cache to track the actual connection object (e.g., use
weakref.WeakSet or set an attribute on the connection) and update
`_load_sqlite_vec` to check that connection-specific marker instead of id(conn);
specifically replace uses of `_sqlite_vec_loaded` and `id(conn)` with a WeakSet
of sqlite3.Connection objects or `setattr(conn, "_sqlite_vec_loaded", True)`,
test presence with `getattr(conn, "_sqlite_vec_loaded", False)`, and set it
after successful `sqlite_vec.load(conn)` while leaving the existing extension
enable/disable and exception handling around `conn.enable_load_extension` and
`sqlite_vec.load` intact.
tests/embeddings/test_storage.py (1)

85-92: ⚡ Quick win

Clarify vector normalization expectations.

The test normalizes near explicitly (line 92) but leaves query and far as-is. While query and far happen to be unit vectors already, the inconsistency may confuse readers about whether the search API requires pre-normalized vectors.

♻️ Proposed fix: normalize all vectors explicitly
 def test_search_embedding_returns_topk(kb_dir: Path) -> None:
     from vouch.embeddings.base import content_hash as ch
 
     query = np.zeros(8, dtype=np.float32)
     query[0] = 1.0
+    query /= float(np.linalg.norm(query))
     near = np.zeros(8, dtype=np.float32)
     near[0] = 0.99
     near[1] = 0.05
     far = np.zeros(8, dtype=np.float32)
     far[7] = 1.0
+    far /= float(np.linalg.norm(far))
     near /= float(np.linalg.norm(near))

Alternatively, add a comment clarifying that query and far are already unit vectors:

# query and far are already unit vectors (magnitude=1.0); near requires normalization
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/embeddings/test_storage.py` around lines 85 - 92, The test
inconsistently normalizes vectors: only near is normalized while query and far
are left implicit; update the test to explicitly normalize query, near, and far
(e.g., divide each by float(np.linalg.norm(...))) so all three vectors are
clearly unit vectors before use, referencing the variables query, near, and far
and replacing the single near normalization with normalization calls for all
three vectors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/embeddings/test_storage.py`:
- Line 7: The test module currently does a hard import of numpy ("import numpy
as np") which causes CI failures when numpy isn't installed; replace that
top-level import with a runtime skip using pytest.importorskip('numpy') (or wrap
in try/except and call pytest.skip) so tests are skipped when numpy is
unavailable, or alternatively add a module-level pytest marker like pytestmark =
pytest.mark.requires_embeddings and ensure CI installs the embeddings extra;
target the import in tests/embeddings/test_storage.py (the current "import numpy
as np" line) and adjust it to use pytest.importorskip('numpy') or add the marker
as described.

---

Outside diff comments:
In `@pyproject.toml`:
- Around line 21-22: Remove the duplicate click dependency entry in
pyproject.toml: keep a single "click" requirement and remove the other line
(choose the less restrictive constraint ">=8.1,<9" unless you have a hard
requirement for ">=8.4.0,<9"); update the dependency list so only one "click"
entry remains to avoid conflicting constraints.
- Line 37: Update the sentence-transformers version constraint in pyproject.toml
from "sentence-transformers>=2.7,<4" to "sentence-transformers>=2.7,<6" (and
update the other occurrence on the file, referenced in the comment) so v4/v5
releases are allowed; keep the minimum at 2.7 and run quick compatibility checks
against SentenceTransformer() initialization and .encode() usage to confirm no
changes are needed.

---

Nitpick comments:
In `@src/vouch/index_db.py`:
- Around line 61-67: The embedding_dupes table currently has no primary key,
which allows duplicate (kind,id,near_id) rows; update the CREATE TABLE for
embedding_dupes to enforce uniqueness by adding a PRIMARY KEY (kind, id,
near_id) or a UNIQUE(kind, id, near_id) constraint so duplicates are prevented;
if audit logging of duplicates is intended instead, add an explicit
autoincrement id column (e.g., dup_id) as PRIMARY KEY and keep (kind, id,
near_id) indexed or made UNIQUE as appropriate.
- Around line 248-271: The current cache `_sqlite_vec_loaded` uses id(conn)
which can be reused and give false positives; change the cache to track the
actual connection object (e.g., use weakref.WeakSet or set an attribute on the
connection) and update `_load_sqlite_vec` to check that connection-specific
marker instead of id(conn); specifically replace uses of `_sqlite_vec_loaded`
and `id(conn)` with a WeakSet of sqlite3.Connection objects or `setattr(conn,
"_sqlite_vec_loaded", True)`, test presence with `getattr(conn,
"_sqlite_vec_loaded", False)`, and set it after successful
`sqlite_vec.load(conn)` while leaving the existing extension enable/disable and
exception handling around `conn.enable_load_extension` and `sqlite_vec.load`
intact.

In `@tests/embeddings/test_storage.py`:
- Around line 85-92: The test inconsistently normalizes vectors: only near is
normalized while query and far are left implicit; update the test to explicitly
normalize query, near, and far (e.g., divide each by float(np.linalg.norm(...)))
so all three vectors are clearly unit vectors before use, referencing the
variables query, near, and far and replacing the single near normalization with
normalization calls for all three vectors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 537d08ea-48fd-4476-ab92-cd562b11073e

📥 Commits

Reviewing files that changed from the base of the PR and between 5177248 and 13e709d.

📒 Files selected for processing (5)
  • pyproject.toml
  • src/vouch/embeddings/__init__.py
  • src/vouch/embeddings/cache.py
  • src/vouch/index_db.py
  • tests/embeddings/test_storage.py


from pathlib import Path

import numpy as np

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: numpy import causes test collection failures in CI.

All pipeline failures show ModuleNotFoundError: No module named 'numpy'. Since the PR introduces optional dependency groups and embeddings are optional, numpy is likely not installed in the base CI environment.

🔧 Suggested fixes (choose one approach)

Option 1: Skip tests when numpy unavailable (recommended if embeddings remain optional)

+pytest.importorskip("numpy", reason="embeddings tests require numpy")
+
 from pathlib import Path
 
 import numpy as np

Option 2: Update CI workflow to install optional embeddings dependencies

Ensure CI installs the embeddings extra when running this test suite:

pip install -e ".[embeddings,test]"

Option 3: Mark tests requiring optional dependencies

Add a pytest marker at module or test level:

pytestmark = pytest.mark.requires_embeddings

and configure CI to install embeddings dependencies for tests with that marker.

🧰 Tools
🪛 GitHub Actions: ci / 1_test (py3.13).txt

[error] 7-7: Pytest test collection failed due to ImportError: ModuleNotFoundError: No module named 'numpy'.

🪛 GitHub Actions: ci / 3_test (py3.11).txt

[error] 7-7: Test collection failed (ImportError). Module 'numpy' is missing: ModuleNotFoundError: No module named 'numpy'.

🪛 GitHub Actions: ci / test (py3.11)

[error] 7-7: Pytest import error during test collection: ModuleNotFoundError: No module named 'numpy'.

🪛 GitHub Actions: ci / test (py3.12)

[error] 7-7: pytest collection failed due to ImportError: ModuleNotFoundError: No module named 'numpy'.

🪛 GitHub Actions: ci / test (py3.13)

[error] 7-7: pytest collection failed due to ImportError: ModuleNotFoundError: No module named 'numpy'

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/embeddings/test_storage.py` at line 7, The test module currently does a
hard import of numpy ("import numpy as np") which causes CI failures when numpy
isn't installed; replace that top-level import with a runtime skip using
pytest.importorskip('numpy') (or wrap in try/except and call pytest.skip) so
tests are skipped when numpy is unavailable, or alternatively add a module-level
pytest marker like pytestmark = pytest.mark.requires_embeddings and ensure CI
installs the embeddings extra; target the import in
tests/embeddings/test_storage.py (the current "import numpy as np" line) and
adjust it to use pytest.importorskip('numpy') or add the marker as described.

@plind-junior

Copy link
Copy Markdown
Collaborator

Schema looks good. Storing model identity per row + raw float32 bytes is exactly right (way better than the JSON approach in #36). The NumPy fallback when sqlite-vec isn't loadable is a nice touch.

One thing I want to flag before merging:

In search_embedding your SQL does WHERE ... AND score >= ? where score is an alias from the SELECT. That works on SQLite 3.39+ but older versions reject it, and since you catch OperationalError and fall back to NumPy, you'd get a silent perf cliff on older SQLite. Either inline the expression in the WHERE clause, or pin a minimum SQLite version in pyproject. Mostly I just want to make sure you've tested on the lowest SQLite you support.

Smaller stuff:

  • _load_sqlite_vec caches based on id(conn), but Python recycles object IDs after GC, so this could short-circuit a load on a brand new connection. Probably safer to drop the cache, re-loading is cheap.
  • reset() only clears the FTS tables, not the new embedding tables. Either rename it reset_fts() or extend it.
  • cache_query_vec bumps hit_count on inserts too, which mixes "encoded N times" with "served from cache N times." Doc which one you meant.
  • query_embedding_cache could use an index on last_used_at if the cap ever grows beyond 1024.

Tests are solid. Ship it once the SQLite version thing is confirmed.

@dripsmvcp dripsmvcp force-pushed the feat/semantic-search-phase-2-storage branch from 13e709d to 1d20dc7 Compare May 20, 2026 13:04
dripsmvcp added 11 commits May 20, 2026 22:08
Approved design for feat/semantic-search. Embedding (sentence-
transformers all-mpnet-base-v2) becomes the primary search backend
with FTS5 as fallback. Synchronous-at-write indexing across all six
artifact types (claim, page, source, entity, relation, evidence).

Maximally functional scope (~3000 LOC): pluggable model adapter
registry, sqlite-vec ANN with NumPy fallback, cross-encoder rerank,
HyDE query expansion, ingest-time duplicate detection, model-identity
migration, recall/MRR/nDCG eval harness, and full CLI/MCP/JSONL parity.

The writing-plans step will turn the rollout order in section 16
into concrete phased tasks.
32-task TDD plan for the semantic-search feature: foundation, storage,
write hooks across all six artifact types, semantic-primary search
integration in MCP/JSONL/CLI, RRF fusion + hybrid, cross-encoder rerank,
HyDE expansion, ingest-time duplicate detection, model-identity
migration, recall/MRR/nDCG scorer, end-to-end integration test, and
user docs.

Each task: failing test, minimal implementation, passing test, commit.
MockEmbedder test double keeps the unit suite fast; the real model is
exercised only under @pytest.mark.integration.

The evaluation module is named scorer.py rather than eval.py to avoid
shadowing the Python builtin and to keep static analysers quiet; the
user-facing CLI subcommand remains `vouch eval embedding` (the Click
group is registered under the name "eval", with the Python identifier
eval_group).
CI ran ruff which flagged SIM105 on the three try/except ImportError
guard blocks in src/vouch/embeddings/__init__.py. Replace each with
`contextlib.suppress(ImportError)` -- same semantics, satisfies ruff.

Also add mypy overrides for numpy / sqlite_vec / sentence_transformers /
fastembed so the type check passes in the base [dev] CI install where
those optional extras aren't present. The embedding code paths that
import these libraries are only reached when the extras are installed
at runtime; for the static type check the missing stubs are noise.
CI runs `pip install -e '.[dev]'` which deliberately excludes the
optional `[embeddings]` extras. tests/embeddings/test_*.py modules
import numpy at top level, so pytest collection fails with
ModuleNotFoundError before any test can be deselected.

Add tests/embeddings/conftest.py with `pytest.importorskip("numpy")`
so the entire embeddings test directory skips gracefully when numpy
is absent. Once `pip install vouch[embeddings]` (or numpy itself) is
present, the skip is a no-op and the tests run normally.
`_validate_content` in bundle.py uses the path's first directory
component to look up a Pydantic validator. For sources, both
`sources/<sha>/meta.yaml` (the Source model) and
`sources/<sha>/content` (raw opaque bytes) hit the same "sources"
key, so the validator was being run on the raw content bytes and
failing with "1 validation error for Source".

Add an early return for any non-meta.yaml path under `sources/` so
opaque content bytes are not Pydantic-validated. Only the Source
metadata file is checked, which matches the original intent of the
PR vouchdev#13 validation.

Unblocks three pre-existing test_bundle.py failures inherited from
upstream main.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/vouch/index_db.py (1)

89-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

reset() does not match its “Drop everything” contract.

Line 90 says “Drop everything”, but only FTS tables are cleared. Embedding tables (embedding_index, query_embedding_cache, embedding_dupes) remain populated, which leaves stale semantic state after reset.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 89 - 94, The reset() function currently
only clears FTS tables; update it (in src/vouch/index_db.py inside reset and
using open_db) to also remove all embedding-related state by executing DELETE
(or DROP/CREATE) for embedding_index, query_embedding_cache, and embedding_dupes
so no stale semantic state remains; ensure the same connection.executescript
call includes those table names (and any other non-FTS tables that must be
cleared) so the function truly “drops everything.”
♻️ Duplicate comments (4)
src/vouch/embeddings/cache.py (2)

24-24: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use sub-second precision for last_used_at to preserve LRU ordering.

timespec="seconds" causes ties during bursty access, making eviction order non-deterministic.

Also applies to: 48-48

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` at line 24, The current timestamp generation
uses dt.datetime.now(dt.UTC).isoformat(timespec="seconds") which drops
sub-second precision and can break deterministic LRU eviction; update both
occurrences (the now assignment used to set last_used_at) to include sub-second
precision, e.g., use timespec="microseconds" (or "milliseconds" if you prefer)
so last_used_at preserves ordering across bursty accesses; locate the
dt.datetime.now(dt.UTC).isoformat(...) calls (the now variable assignments that
populate last_used_at) and replace timespec="seconds" with
timespec="microseconds" in both places.

15-16: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Namespace cache keys by embedding space, not query text alone.

Using only query hash allows stale vectors to be reused across model/version/dim changes. That can silently corrupt ranking or break on shape mismatch downstream. Include model identity in the key (or persist model/version/dim in cache rows and validate on lookup).

Also applies to: 22-33, 47-53

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The current _query_key
function only hashes the query text, which can cause reuse of stale embeddings
across model/version/dimension changes; update the key strategy to include model
identity (e.g., model name/version and embedding dimension) when generating
cache keys—either change _query_key to accept model_id/model_name and dim and
incorporate them into the SHA256 input, or persist model/version/dim alongside
cached vectors and validate those fields on lookup before returning a cached
embedding; ensure all places that call _query_key (and other cache key builders
indicated around lines 22-33 and 47-53) are updated to pass the model
identity/dim so keys are namespaced per embedding space.
tests/embeddings/test_storage.py (1)

7-8: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard optional dependency import to avoid test-collection failures.

Top-level import numpy as np will fail collection in environments without embeddings extras. Use pytest.importorskip("numpy") (or equivalent marker strategy) at module import time so the suite skips cleanly when optional deps are absent.

#!/usr/bin/env bash
set -euo pipefail

# Verify whether numpy is declared in runtime/test extras and whether CI installs those extras.
rg -n "numpy|embeddings|optional-dependencies|extras_require" pyproject.toml setup.py
rg -n "pip install|\\.[^\"']*embeddings|pytest" .github/workflows
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/embeddings/test_storage.py` around lines 7 - 8, Top-level "import numpy
as np" in the tests/embeddings module causes test collection to fail when the
optional embeddings extras aren't installed; replace the direct import with a
guarded import using pytest.importorskip("numpy") at module import time (e.g.,
call pytest.importorskip("numpy") before any use of numpy) so the whole module
is skipped cleanly when numpy is absent and avoids collection errors.
src/vouch/index_db.py (1)

295-299: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Constrain embedding search to active model/version/dim metadata.

Both SQL and NumPy paths score all rows of selected kinds, ignoring stored embedding metadata. Mixed embedding spaces can degrade ranking and can throw runtime errors on dimension mismatch in q @ vec. Filter search rows to active (model, model_version, dim) (or purge stale rows when meta changes).

Also applies to: 306-314

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 295 - 299, The SQL and NumPy search paths
currently score all rows for the requested kinds and ignore embedding metadata;
restrict both to only rows matching the active embedding metadata (model,
model_version, dim). Update the SQL query that uses "FROM embedding_index ...
vec_distance_cosine" to add WHERE conditions for model = ? AND model_version = ?
AND dim = ? and pass the active model/model_version/dim as parameters (in
addition to the existing placeholders and score/limit), and in the NumPy path
(the code that computes q @ vec) filter the retrieved rows by their stored
model/model_version/dim before computing dot products (or drop/stale-purge rows
that don't match). Ensure you reference the embedding_index rows'
model/model_version/dim fields in both code paths so mixed embedding spaces
cannot be compared.
🧹 Nitpick comments (1)
src/vouch/embeddings/cache.py (1)

27-33: ⚡ Quick win

Clarify/adjust hit_count semantics (insert vs cache-hit).

hit_count is incremented both on write and on lookup-hit, so query_cache_stats()["hits"] currently mixes “encoded+stored” with “served-from-cache”. If this metric is intended to represent cache effectiveness, increment only on lookup hits (or split into two counters).

Also applies to: 57-58, 79-83

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` around lines 27 - 33, The INSERT/REPLACE SQL
for the query_embedding_cache currently adds +1 to hit_count on writes, causing
query_cache_stats()["hits"] to conflate store operations with cache hits; change
the upsert SQL in src/vouch/embeddings/cache.py (the statement referencing
"INSERT OR REPLACE INTO query_embedding_cache" and the query_embedding_cache
table) to preserve existing hit_count on write (e.g., COALESCE((SELECT hit_count
FROM query_embedding_cache WHERE query_hash=?), 0) without +1) or initialize to
0 for new rows, and ensure only the lookup path (the code that currently
increments hit_count on cache lookups — referenced by the lookup/update block
around the other occurrences at lines noted 57-58 and 79-83) performs the +1
increment; update the logic used by query_cache_stats() to rely on the
lookup-only increment so "hits" reflects served-from-cache only.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vouch/index_db.py`:
- Around line 253-254: The current id()-based cache (_sqlite_vec_loaded) is
unsafe; replace it with a WeakSet or remove it: change the declaration of
_sqlite_vec_loaded from set[int] to weakref.WeakSet(), import weakref, and
update checks from "if id(conn) in _sqlite_vec_loaded" to "if conn in
_sqlite_vec_loaded" and additions from "_sqlite_vec_loaded.add(id(conn))" to
"_sqlite_vec_loaded.add(conn)"; alternatively, remove all mentions of
_sqlite_vec_loaded (lines checking/adding) so the loader code in the function
that accepts conn always runs (loader is idempotent), ensuring behavior is
correct without relying on object id reuse.
- Around line 295-300: The SQL uses the SELECT alias "score" in the WHERE clause
inside the query built in src/vouch/index_db.py (the call that passes
(q.tobytes(), *kinds, min_score, limit) and calls vec_distance_cosine(vec, ?)),
which is non-standard and can fail in some SQLite configs and triggers the slow
Python fallback; fix it by rewriting the query to compute the score in a
subquery/CTE or by inlining the cosine expression in the WHERE clause (e.g.,
SELECT ... FROM (SELECT ..., 1.0 - vec_distance_cosine(vec, ?) AS score FROM
embedding_index WHERE kind IN (...) ) AS sub WHERE score >= ? ORDER BY score
DESC LIMIT ?), keeping the same parameter ordering for q, kinds, min_score,
limit and still using vec_distance_cosine(vec, ?) so the DB does the filtering
instead of Python post-filtering.

---

Outside diff comments:
In `@src/vouch/index_db.py`:
- Around line 89-94: The reset() function currently only clears FTS tables;
update it (in src/vouch/index_db.py inside reset and using open_db) to also
remove all embedding-related state by executing DELETE (or DROP/CREATE) for
embedding_index, query_embedding_cache, and embedding_dupes so no stale semantic
state remains; ensure the same connection.executescript call includes those
table names (and any other non-FTS tables that must be cleared) so the function
truly “drops everything.”

---

Duplicate comments:
In `@src/vouch/embeddings/cache.py`:
- Line 24: The current timestamp generation uses
dt.datetime.now(dt.UTC).isoformat(timespec="seconds") which drops sub-second
precision and can break deterministic LRU eviction; update both occurrences (the
now assignment used to set last_used_at) to include sub-second precision, e.g.,
use timespec="microseconds" (or "milliseconds" if you prefer) so last_used_at
preserves ordering across bursty accesses; locate the
dt.datetime.now(dt.UTC).isoformat(...) calls (the now variable assignments that
populate last_used_at) and replace timespec="seconds" with
timespec="microseconds" in both places.
- Around line 15-16: The current _query_key function only hashes the query text,
which can cause reuse of stale embeddings across model/version/dimension
changes; update the key strategy to include model identity (e.g., model
name/version and embedding dimension) when generating cache keys—either change
_query_key to accept model_id/model_name and dim and incorporate them into the
SHA256 input, or persist model/version/dim alongside cached vectors and validate
those fields on lookup before returning a cached embedding; ensure all places
that call _query_key (and other cache key builders indicated around lines 22-33
and 47-53) are updated to pass the model identity/dim so keys are namespaced per
embedding space.

In `@src/vouch/index_db.py`:
- Around line 295-299: The SQL and NumPy search paths currently score all rows
for the requested kinds and ignore embedding metadata; restrict both to only
rows matching the active embedding metadata (model, model_version, dim). Update
the SQL query that uses "FROM embedding_index ... vec_distance_cosine" to add
WHERE conditions for model = ? AND model_version = ? AND dim = ? and pass the
active model/model_version/dim as parameters (in addition to the existing
placeholders and score/limit), and in the NumPy path (the code that computes q @
vec) filter the retrieved rows by their stored model/model_version/dim before
computing dot products (or drop/stale-purge rows that don't match). Ensure you
reference the embedding_index rows' model/model_version/dim fields in both code
paths so mixed embedding spaces cannot be compared.

In `@tests/embeddings/test_storage.py`:
- Around line 7-8: Top-level "import numpy as np" in the tests/embeddings module
causes test collection to fail when the optional embeddings extras aren't
installed; replace the direct import with a guarded import using
pytest.importorskip("numpy") at module import time (e.g., call
pytest.importorskip("numpy") before any use of numpy) so the whole module is
skipped cleanly when numpy is absent and avoids collection errors.

---

Nitpick comments:
In `@src/vouch/embeddings/cache.py`:
- Around line 27-33: The INSERT/REPLACE SQL for the query_embedding_cache
currently adds +1 to hit_count on writes, causing query_cache_stats()["hits"] to
conflate store operations with cache hits; change the upsert SQL in
src/vouch/embeddings/cache.py (the statement referencing "INSERT OR REPLACE INTO
query_embedding_cache" and the query_embedding_cache table) to preserve existing
hit_count on write (e.g., COALESCE((SELECT hit_count FROM query_embedding_cache
WHERE query_hash=?), 0) without +1) or initialize to 0 for new rows, and ensure
only the lookup path (the code that currently increments hit_count on cache
lookups — referenced by the lookup/update block around the other occurrences at
lines noted 57-58 and 79-83) performs the +1 increment; update the logic used by
query_cache_stats() to rely on the lookup-only increment so "hits" reflects
served-from-cache only.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6e96ac29-9924-4992-a9fe-b2c5725e2bfa

📥 Commits

Reviewing files that changed from the base of the PR and between 13e709d and 1d20dc7.

📒 Files selected for processing (4)
  • src/vouch/embeddings/cache.py
  • src/vouch/index_db.py
  • tests/embeddings/conftest.py
  • tests/embeddings/test_storage.py

Comment thread src/vouch/index_db.py
Comment on lines +253 to +254
if id(conn) in _sqlite_vec_loaded:
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Can Python reuse object IDs after an object is garbage collected, and is id() stable only for an object's lifetime?

💡 Result:

Yes. Python’s id() is stable only while the object is alive (reachable). If an object is garbage-collected, Python may reuse the same id() value for a later, different object with a non-overlapping lifetime [1][2]. What Python guarantees: - id(object) returns an integer that is “unique and constant for this object during its lifetime” [1][2]. - “Two objects with non-overlapping lifetimes may have the same id() value” [1][2]. About reuse after garbage collection: - Python objects become garbage-collected when they are unreachable; implementations are allowed to collect them at different times, and you should not depend on immediate finalization [3]. - When the original object is collected and its memory is available, a later object may be allocated at the same address (CPython implementation detail), causing id() to repeat [1][2]. So, id() is not a lifetime-unique identifier across time in a program; it’s only stable for the lifetime of the specific object instance [1][2]. If you need a stable ID beyond that, use something like a UUID stored on the object rather than id().

Citations:


🏁 Script executed:

cat -n src/vouch/index_db.py | sed -n '245,280p'

Repository: vouchdev/vouch

Length of output: 1258


Remove the sqlite-vec load state cache or switch to a stable cache key.

Python's id() can be reused after garbage collection; when a connection is garbage-collected and a new one is allocated at the same memory address, the code will incorrectly skip loader execution for the new connection. Replace _sqlite_vec_loaded with either a WeakSet to track connections by object identity, or remove the cache entirely if performance impact is acceptable (loader can be called multiple times safely).

Lines affected
  • Line 248: _sqlite_vec_loaded: set[int] = set()
  • Lines 253–254: if id(conn) in _sqlite_vec_loaded: return True
  • Lines 270–271: _sqlite_vec_loaded.add(id(conn))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 253 - 254, The current id()-based cache
(_sqlite_vec_loaded) is unsafe; replace it with a WeakSet or remove it: change
the declaration of _sqlite_vec_loaded from set[int] to weakref.WeakSet(), import
weakref, and update checks from "if id(conn) in _sqlite_vec_loaded" to "if conn
in _sqlite_vec_loaded" and additions from "_sqlite_vec_loaded.add(id(conn))" to
"_sqlite_vec_loaded.add(conn)"; alternatively, remove all mentions of
_sqlite_vec_loaded (lines checking/adding) so the loader code in the function
that accepts conn always runs (loader is idempotent), ensuring behavior is
correct without relying on object id reuse.

Comment thread src/vouch/index_db.py Outdated
@dripsmvcp dripsmvcp force-pushed the feat/semantic-search-phase-2-storage branch from 1d20dc7 to 540b9ab Compare May 20, 2026 13:11
dripsmvcp added 6 commits May 20, 2026 22:14
Per CodeRabbit on PR vouchdev#37: decoding sha256 chunks as raw float32 (via
struct.unpack("<f", ...)) produced NaN/Inf for ~1 in 2^7 chunks
because most 4-byte patterns aren't finite IEEE754 floats. Norm
overflow then made unit normalization unreliable and broke cosine
similarity asserts in downstream phases (the dedup test in Phase 6
already had to work around this with a custom _HashEmbedder).

Decode each chunk as an unsigned 32-bit integer, scale to [-1.0, 1.0],
accumulate in float64 for exact norm, and cast to float32 on return.
Deterministic, finite, well-behaved.
@dripsmvcp dripsmvcp force-pushed the feat/semantic-search-phase-2-storage branch from 540b9ab to 3722ee4 Compare May 20, 2026 13:16

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (6)
src/vouch/index_db.py (3)

248-254: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid caching sqlite-vec load state by id(conn).

id() values can be reused after GC, so a new connection may be incorrectly treated as already loaded. Use a weakref.WeakSet[sqlite3.Connection] or drop this cache and always attempt load.

Also applies to: 270-271

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 248 - 254, The current cache
_sqlite_vec_loaded uses id(conn) which can be recycled; update _load_sqlite_vec
to stop using id(conn) keys and instead track connections with a
weakref.WeakSet[sqlite3.Connection] (or remove the cache entirely and always
attempt to load), replacing the int set _sqlite_vec_loaded with a WeakSet and
checking membership by connection object; update any other uses around the same
module (the other occurrences of _sqlite_vec_loaded) to use the new WeakSet to
avoid false positives when connection objects are GC'd and ids get reused.

295-300: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t filter on SELECT alias score in WHERE.

Using score alias in the same SELECT is non-portable and can trigger OperationalError, silently downgrading to Python brute-force. Rewrite with a subquery/CTE (or inline the expression in WHERE) so DB-side filtering is reliable.

Also applies to: 303-304

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 295 - 300, The SQL uses the SELECT alias
"score" in the WHERE clause (in the query built around the embedding_index
selection near the code that passes (q.tobytes(), *kinds, min_score, limit)),
which is non-portable; change the query in the function/method that builds this
statement (and the similar one around lines 303-304) to either use a
subquery/CTE or inline the expression 1.0 - vec_distance_cosine(vec, ?) directly
in the WHERE (e.g., wrap the SELECT as a subquery/CTE that computes score and
then filter on score in the outer WHERE), keeping the same parameter order
(q.tobytes(), *kinds, min_score, limit) so DB-side filtering works reliably.

295-299: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Constrain search rows to active embedding metadata (model/version/dim).

Current search scans by kind only. Mixed embeddings (post-upgrade/partial reindex) can produce shape errors or cross-space ranking noise. Filter both SQL and fallback paths to active metadata, or purge stale rows when meta changes.

Also applies to: 305-307, 311-314

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 295 - 299, The SQL query against
embedding_index (the multi-line string building SELECT with vec_distance_cosine
and placeholders) currently filters only by kind; update that query to also
filter by the active embedding metadata fields (model, version, dim) by adding
AND clauses (e.g., AND model = ? AND version = ? AND dim = ?) and bind those
values alongside the existing params, and apply the same metadata filtering to
the other two search query sites noted (around lines 305-307 and 311-314) as
well as the fallback in-memory or Python-path that currently uses kind-only
filtering (ensure the fallback filters rows by model/version/dim before
computing score), so mixed-shape/stale rows are excluded or purged when metadata
changes.
tests/embeddings/test_storage.py (1)

7-8: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard numpy import to avoid collection-time CI failures when optional deps are absent.

Use pytest.importorskip("numpy") before importing numpy (or ensure CI always installs embeddings extras). Current hard import can fail the entire suite during collection.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/embeddings/test_storage.py` around lines 7 - 8, The test file currently
does a hard import of numpy which can cause collection-time failures; change the
import to guard with pytest.importorskip by calling pytest.importorskip("numpy")
before attempting to import numpy (or replace the direct import with
pytest.importorskip followed by import numpy as np) so tests are skipped when
the optional dependency is missing; update the top of
tests/embeddings/test_storage.py around the numpy import to use
pytest.importorskip to prevent CI collection failures.
src/vouch/embeddings/cache.py (2)

15-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Namespace the cache key by embedding identity, not query text alone.

Hashing only query allows stale vectors to be reused after model/version/dim changes. Include embedder identity in the key (or store/check model metadata on lookup) to prevent cross-model cache poisoning.

Also applies to: 46-53

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The cache key currently
hashes only the query in _query_key, which allows reuse across different
embedder models/versions; change _query_key to take an embedder identity (e.g.,
model name/version or embedder.id) and include that identity in the hashed input
so keys are namespaced by embedder, and update every cache lookup/storage site
to pass that embedder identity; alternatively, persist embedder/model metadata
with each cached entry and validate it on lookup — but ensure functions that
compute keys (start with _query_key and any other key-generating helpers
referenced around lines 46-53) are updated to use the embedder identity.

24-24: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use sub-second precision for last_used_at to preserve true LRU ordering.

timespec="seconds" can produce ties during bursty access and evict arbitrarily.

Also applies to: 48-48

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` at line 24, The timestamp assigned to the
variable now (used for last_used_at) uses timespec="seconds", which causes ties
under bursty access; update the isoformat call to include sub-second precision
(e.g., timespec="microseconds" or "milliseconds") so last_used_at preserves true
LRU ordering, and apply the same change to the other occurrence that sets now in
this module.
🧹 Nitpick comments (3)
tests/embeddings/test_integration.py (1)

38-53: ⚡ Quick win

Add normalization assertions for ST-MiniLM and FastEmbed integration checks.

These tests currently validate only shape/dtype; adding unit-norm assertions here will catch adapter regressions that impact cosine-based ranking behavior.

Proposed test hardening
 `@pytest.mark.integration`
 def test_st_minilm_loads_and_encodes() -> None:
     from vouch.embeddings.st_minilm import STMinilmEmbedder
     e = STMinilmEmbedder()
     vec = e.encode("hello world")
+    assert isinstance(vec, np.ndarray)
     assert vec.shape == (384,)
     assert vec.dtype == np.float32
+    assert abs(float(np.linalg.norm(vec)) - 1.0) < 1e-3
@@
 `@pytest.mark.integration`
 def test_fastembed_bge_loads_and_encodes() -> None:
     pytest.importorskip("fastembed")
     from vouch.embeddings.fastembed_bge import FastembedBgeEmbedder
     e = FastembedBgeEmbedder()
     vec = e.encode("hello world")
+    assert isinstance(vec, np.ndarray)
     assert vec.shape == (384,)
     assert vec.dtype == np.float32
+    assert abs(float(np.linalg.norm(vec)) - 1.0) < 1e-3
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/embeddings/test_integration.py` around lines 38 - 53, Update the two
tests to also assert that returned embeddings are unit-normalized: after calling
STMinilmEmbedder.encode in test_st_minilm_loads_and_encodes and
FastembedBgeEmbedder.encode in test_fastembed_bge_loads_and_encodes compute the
L2 norm (e.g., via np.linalg.norm(vec)) and assert it is approximately 1.0
within a small tolerance (use pytest.approx or a small abs tolerance) in
addition to the existing shape and dtype checks so regressions that break
cosine-based ranking are caught.
src/vouch/bundle.py (1)

211-227: ⚡ Quick win

Consider a more explicit check for the content file.

The current logic skips validation for any file in sources/ that doesn't end with /meta.yaml. While this is correct given the documented two-file structure (meta.yaml and content), an explicit check for /content would be clearer and more defensive:

if subdir == "sources" and path.endswith("/content"):
    return

This makes the intent explicit—skip validation only for the opaque content file—and prevents accidental validation skips if the file structure evolves.

🔍 Proposed refinement
     subdir = path.split("/")[0]
     # Source artifacts have two file kinds:
     #   sources/<sha>/meta.yaml  -- the Source pydantic model (validate)
     #   sources/<sha>/content    -- the raw source bytes (skip validation)
     # The opaque content file isn't a pydantic model, so model_validate
     # on raw bytes raises spuriously.
-    if subdir == "sources" and not path.endswith("/meta.yaml"):
+    if subdir == "sources" and path.endswith("/content"):
         return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/bundle.py` around lines 211 - 227, The current guard in
_validate_content uses subdir == "sources" and not path.endswith("/meta.yaml")
which skips validation for any non-meta file; change it to explicitly skip only
the opaque content file by checking if subdir == "sources" and
path.endswith("/content") then return, leaving meta.yaml and any other future
files to be validated by VALIDATORS; update the conditional in _validate_content
and keep the subsequent validator lookup/try-except unchanged.
src/vouch/embeddings/cache.py (1)

27-31: ⚡ Quick win

Clarify or split hit_count semantics in cache stats.

hit_count is incremented on insert and on lookup-hit, so query_cache_stats()["hits"] mixes write and read traffic. Either document this explicitly or track separate counters (write_count, serve_hits).

Also applies to: 76-83

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` around lines 27 - 31, The current
query_embedding_cache table and its use of the hit_count column (updated by the
INSERT OR REPLACE SQL shown and again at the lookup-hit code around lines 76-83)
mixes writes and read-serves, so update the schema and stats to separate those
semantics: add two columns (e.g., write_count and serve_hits) to the
query_embedding_cache table, change the INSERT/REPLACE logic that currently
increments hit_count to only increment write_count, change the lookup-path that
currently increments hit_count on cache hits to increment serve_hits instead,
and update the query_cache_stats() function to return both write_count and
serve_hits (or rename fields) so metrics clearly distinguish writes vs serves.
Ensure all SQL statements and any code referencing hit_count are updated
accordingly and migration/initialization handles existing rows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vouch/index_db.py`:
- Around line 311-314: The fallback scoring uses raw dot product (score =
float(q @ vec)) which diverges from the SQL cosine distance path; update the
loop that calls _blob_to_vec to normalize the vector (and ensure q is normalized
or normalize a copy of q) before computing score so score = float((q_normalized)
@ (vec_normalized)) and then compare to min_score; refer to the variables q, vec
and the helper _blob_to_vec when making the change.

---

Duplicate comments:
In `@src/vouch/embeddings/cache.py`:
- Around line 15-16: The cache key currently hashes only the query in
_query_key, which allows reuse across different embedder models/versions; change
_query_key to take an embedder identity (e.g., model name/version or
embedder.id) and include that identity in the hashed input so keys are
namespaced by embedder, and update every cache lookup/storage site to pass that
embedder identity; alternatively, persist embedder/model metadata with each
cached entry and validate it on lookup — but ensure functions that compute keys
(start with _query_key and any other key-generating helpers referenced around
lines 46-53) are updated to use the embedder identity.
- Line 24: The timestamp assigned to the variable now (used for last_used_at)
uses timespec="seconds", which causes ties under bursty access; update the
isoformat call to include sub-second precision (e.g., timespec="microseconds" or
"milliseconds") so last_used_at preserves true LRU ordering, and apply the same
change to the other occurrence that sets now in this module.

In `@src/vouch/index_db.py`:
- Around line 248-254: The current cache _sqlite_vec_loaded uses id(conn) which
can be recycled; update _load_sqlite_vec to stop using id(conn) keys and instead
track connections with a weakref.WeakSet[sqlite3.Connection] (or remove the
cache entirely and always attempt to load), replacing the int set
_sqlite_vec_loaded with a WeakSet and checking membership by connection object;
update any other uses around the same module (the other occurrences of
_sqlite_vec_loaded) to use the new WeakSet to avoid false positives when
connection objects are GC'd and ids get reused.
- Around line 295-300: The SQL uses the SELECT alias "score" in the WHERE clause
(in the query built around the embedding_index selection near the code that
passes (q.tobytes(), *kinds, min_score, limit)), which is non-portable; change
the query in the function/method that builds this statement (and the similar one
around lines 303-304) to either use a subquery/CTE or inline the expression 1.0
- vec_distance_cosine(vec, ?) directly in the WHERE (e.g., wrap the SELECT as a
subquery/CTE that computes score and then filter on score in the outer WHERE),
keeping the same parameter order (q.tobytes(), *kinds, min_score, limit) so
DB-side filtering works reliably.
- Around line 295-299: The SQL query against embedding_index (the multi-line
string building SELECT with vec_distance_cosine and placeholders) currently
filters only by kind; update that query to also filter by the active embedding
metadata fields (model, version, dim) by adding AND clauses (e.g., AND model = ?
AND version = ? AND dim = ?) and bind those values alongside the existing
params, and apply the same metadata filtering to the other two search query
sites noted (around lines 305-307 and 311-314) as well as the fallback in-memory
or Python-path that currently uses kind-only filtering (ensure the fallback
filters rows by model/version/dim before computing score), so mixed-shape/stale
rows are excluded or purged when metadata changes.

In `@tests/embeddings/test_storage.py`:
- Around line 7-8: The test file currently does a hard import of numpy which can
cause collection-time failures; change the import to guard with
pytest.importorskip by calling pytest.importorskip("numpy") before attempting to
import numpy (or replace the direct import with pytest.importorskip followed by
import numpy as np) so tests are skipped when the optional dependency is
missing; update the top of tests/embeddings/test_storage.py around the numpy
import to use pytest.importorskip to prevent CI collection failures.

---

Nitpick comments:
In `@src/vouch/bundle.py`:
- Around line 211-227: The current guard in _validate_content uses subdir ==
"sources" and not path.endswith("/meta.yaml") which skips validation for any
non-meta file; change it to explicitly skip only the opaque content file by
checking if subdir == "sources" and path.endswith("/content") then return,
leaving meta.yaml and any other future files to be validated by VALIDATORS;
update the conditional in _validate_content and keep the subsequent validator
lookup/try-except unchanged.

In `@src/vouch/embeddings/cache.py`:
- Around line 27-31: The current query_embedding_cache table and its use of the
hit_count column (updated by the INSERT OR REPLACE SQL shown and again at the
lookup-hit code around lines 76-83) mixes writes and read-serves, so update the
schema and stats to separate those semantics: add two columns (e.g., write_count
and serve_hits) to the query_embedding_cache table, change the INSERT/REPLACE
logic that currently increments hit_count to only increment write_count, change
the lookup-path that currently increments hit_count on cache hits to increment
serve_hits instead, and update the query_cache_stats() function to return both
write_count and serve_hits (or rename fields) so metrics clearly distinguish
writes vs serves. Ensure all SQL statements and any code referencing hit_count
are updated accordingly and migration/initialization handles existing rows.

In `@tests/embeddings/test_integration.py`:
- Around line 38-53: Update the two tests to also assert that returned
embeddings are unit-normalized: after calling STMinilmEmbedder.encode in
test_st_minilm_loads_and_encodes and FastembedBgeEmbedder.encode in
test_fastembed_bge_loads_and_encodes compute the L2 norm (e.g., via
np.linalg.norm(vec)) and assert it is approximately 1.0 within a small tolerance
(use pytest.approx or a small abs tolerance) in addition to the existing shape
and dtype checks so regressions that break cosine-based ranking are caught.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6b136ef8-f17a-4d0b-89a9-e56392159cbe

📥 Commits

Reviewing files that changed from the base of the PR and between 1d20dc7 and 540b9ab.

📒 Files selected for processing (17)
  • docs/superpowers/plans/2026-05-20-semantic-search.md
  • docs/superpowers/specs/2026-05-20-semantic-search-design.md
  • pyproject.toml
  • src/vouch/bundle.py
  • src/vouch/embeddings/__init__.py
  • src/vouch/embeddings/base.py
  • src/vouch/embeddings/cache.py
  • src/vouch/embeddings/fastembed_bge.py
  • src/vouch/embeddings/st_minilm.py
  • src/vouch/embeddings/st_mpnet.py
  • src/vouch/index_db.py
  • tests/embeddings/__init__.py
  • tests/embeddings/_fakes.py
  • tests/embeddings/conftest.py
  • tests/embeddings/test_core.py
  • tests/embeddings/test_integration.py
  • tests/embeddings/test_storage.py
✅ Files skipped from review due to trivial changes (2)
  • tests/embeddings/init.py
  • tests/embeddings/conftest.py

Comment thread src/vouch/index_db.py
Comment on lines +311 to +314
for kind, id_, blob, dim in rows:
vec = _blob_to_vec(blob, dim)
score = float(q @ vec)
if score >= min_score:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fallback scoring is dot-product, not cosine similarity.

SQL path uses cosine distance, but fallback uses raw q @ vec; results diverge if stored vectors are not unit-normalized. Normalize vec (or enforce normalized storage) before scoring to keep backend parity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 311 - 314, The fallback scoring uses raw
dot product (score = float(q @ vec)) which diverges from the SQL cosine distance
path; update the loop that calls _blob_to_vec to normalize the vector (and
ensure q is normalized or normalize a copy of q) before computing score so score
= float((q_normalized) @ (vec_normalized)) and then compare to min_score; refer
to the variables q, vec and the helper _blob_to_vec when making the change.

…embedding

Per CodeRabbit Critical on PR vouchdev#39/40/43: SQLite does not allow a
column alias defined in the SELECT list to be referenced in the
WHERE clause of the same query. The old form

  SELECT ... 1.0 - vec_distance_cosine(vec, ?) AS score ...
  WHERE ... AND score >= ?

raised `no such column: score` at runtime; the catch-all
`except sqlite3.OperationalError` silently swallowed it and made
every call fall through to the NumPy brute-force path, which works
but defeats the whole point of installing sqlite-vec.

Wrap the projection in a subquery so the WHERE clause sees the alias.

No behavior change for callers that have already been using the
fallback (NumPy still produces correct results); the ANN code path
now actually runs when sqlite-vec is loaded.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/vouch/index_db.py (1)

89-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

reset() does not clear embedding/query-cache tables.

This leaves stale semantic-search state after a reset/rebuild cycle.

Proposed fix
 def reset(kb_dir: Path) -> None:
     """Drop everything; the rebuild caller re-populates."""
     with open_db(kb_dir) as conn:
         conn.executescript(
-            "DELETE FROM claims_fts; DELETE FROM pages_fts; DELETE FROM entities_fts;"
+            "DELETE FROM claims_fts; "
+            "DELETE FROM pages_fts; "
+            "DELETE FROM entities_fts; "
+            "DELETE FROM embedding_index; "
+            "DELETE FROM query_embedding_cache; "
+            "DELETE FROM embedding_dupes; "
+            "DELETE FROM index_meta WHERE key LIKE 'embedding_%';"
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 89 - 94, The reset(kb_dir) function
currently only clears FTS tables (claims_fts, pages_fts, entities_fts) via
open_db(...).executescript; update that executescript to also delete rows from
the semantic-search/cache tables used by this project (e.g. embedding_cache and
query_cache or whatever concrete table names store embeddings and cached
queries) so embeddings and query-cache state are cleared on reset; modify the
conn.executescript call in reset to include "DELETE FROM embedding_cache; DELETE
FROM query_cache;" (and add any other embedding/query-cache table names present)
so all stale semantic-search state is removed.
♻️ Duplicate comments (5)
src/vouch/embeddings/cache.py (2)

24-24: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use sub-second precision for last_used_at to preserve LRU behavior.

Second-level timestamps make eviction order unstable during bursts.

Proposed fix
-    now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds")
+    now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds")
...
-    now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds")
+    now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds")

Also applies to: 48-48

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` at line 24, The timestamp for last_used_at is
truncated to seconds, breaking LRU ordering; replace
dt.datetime.now(dt.UTC).isoformat(timespec="seconds") with a sub-second
precision timestamp (e.g., timespec="microseconds" or omit timespec to keep full
precision) wherever it’s used (the expression assigning now / last_used_at in
src/vouch/embeddings/cache.py), and update both occurrences so eviction ordering
remains stable under bursts.

15-16: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Namespace cache keys by embedding identity, not just query text.

Using only query in the key can return stale/mismatched vectors when model/version/dimension changes.

Also applies to: 22-23, 47-47

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The cache key currently
derived only from the query text in _query_key can return stale/mismatched
vectors when the embedding model, model version, or embedding dimension changes;
update _query_key to incorporate a namespace that includes the embedding model
identity (name/version or model id) and embedding dimension (e.g.,
"embedding:{model}:{dim}:{hash}") so keys are unique per embedding
configuration, compute the query hash as before (hashlib.sha256 of the query
bytes) and concatenate the model/dim values into the returned string; apply the
same pattern to any other key-generating helpers used for document/vector
caching so all cache keys are namespaced by embedding identity.
src/vouch/index_db.py (3)

248-254: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid caching sqlite-vec load state by id(conn).

id() is only stable for object lifetime; reuse can incorrectly short-circuit loading on a different connection.

Proposed fix
-_sqlite_vec_loaded: set[int] = set()
-
-
 def _load_sqlite_vec(conn: sqlite3.Connection) -> bool:
     """Best-effort load of the sqlite-vec extension."""
-    if id(conn) in _sqlite_vec_loaded:
-        return True
     try:
         conn.enable_load_extension(True)
@@
-    _sqlite_vec_loaded.add(id(conn))
     return True

Also applies to: 270-271

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 248 - 254, The code currently caches
sqlite-vec load state using id(conn) which can collide if object ids are reused;
change _sqlite_vec_loaded from set[int] to a weakref.WeakSet of
sqlite3.Connection objects (import weakref) and update _load_sqlite_vec to check
membership with "if conn in _sqlite_vec_loaded" and add the actual connection
object with "_sqlite_vec_loaded.add(conn)" after a successful load; apply the
same replacement for the other occurrence referenced (the logic around lines
270-271) so you track live connection objects rather than their numeric ids.

315-317: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fallback scoring is dot product, not cosine similarity.

SQL backend uses cosine distance, but fallback uses raw q @ vec; rankings diverge unless stored vectors are guaranteed unit-normalized.

Proposed fix
     scored: list[tuple[str, str, str, float]] = []
     for kind, id_, blob, dim in rows:
         vec = _blob_to_vec(blob, dim)
-        score = float(q @ vec)
+        vnorm = float(np.linalg.norm(vec))
+        if vnorm > 0:
+            vec = vec / vnorm
+        score = float(q @ vec)
         if score >= min_score:
             scored.append((kind, id_, "", score))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 315 - 317, The fallback scorer in the
loop over rows uses raw dot product (q @ vec) which mismatches the SQL
cosine-distance backend; update the scoring in the loop that iterates "for kind,
id_, blob, dim in rows" (and uses _blob_to_vec to get vec) to compute cosine
similarity instead of raw dot: compute the norms of q and vec and divide the dot
product by (||q|| * ||vec||) so score = (q @ vec) / (norm(q) * norm(vec));
ensure you handle zero-length vectors safely (skip or treat score as 0) and use
the same numeric types/array library as q/_blob_to_vec (e.g., numpy.linalg.norm
if using numpy) so rankings match the SQL backend.

298-303: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Constrain embedding search to active (model, model_version, dim).

Current queries score all rows of requested kinds; mixed embedding spaces can skew results and cause dimension errors in fallback.

Also applies to: 309-313

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 298 - 303, The embedding-search SQL
currently returns rows across all embedding spaces; modify the inner SELECT in
the query built around the string starting "SELECT kind, id, 1.0 -
vec_distance_cosine(vec, ?) AS score" to add AND clauses constraining model = ?,
model_version = ?, and dim = ? (i.e., limit to the active (model, model_version,
dim) tuple), and bind those three values immediately after the query's vector
parameter; apply the same change to the second analogous query at the later
block (lines ~309-313) so both queries filter by the active embedding space and
avoid cross-space scoring and dim mismatches.
🧹 Nitpick comments (2)
src/vouch/embeddings/cache.py (1)

27-32: ⚡ Quick win

Clarify hit_count semantics (currently writes + reads).

hit_count is incremented on insert and lookup; if this is meant to represent cache hits, initialize to 0 on insert and increment only on successful lookup (or rename the metric).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/embeddings/cache.py` around lines 27 - 32, The current INSERT OR
REPLACE SQL for query_embedding_cache increments hit_count on both inserts and
lookups; change it to initialize hit_count to 0 on new inserts (remove the "+ 1"
from the VALUES expression so it uses COALESCE(..., 0) as-is) and move the
increment logic to the cache lookup path where hits are detected (update
hit_count with an UPDATE or incrementing statement when a query_hash is found).
Update the SQL in the INSERT statement and add/modify the lookup code that reads
from query_embedding_cache to perform an explicit hit_count increment on
successful cache hits.
src/vouch/index_db.py (1)

54-59: Consider indexing query_embedding_cache(last_used_at) if cache cap increases.

Eviction is ORDER BY last_used_at; an index will keep this predictable as max_entries grows.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/index_db.py` around lines 54 - 59, The eviction query sorts by
last_used_at on the query_embedding_cache table, so add a dedicated index on
query_embedding_cache(last_used_at) to keep ORDER BY efficient as max_entries
grows; update the DB schema/migration or the initialization routine that creates
the table to also create the index (use CREATE INDEX IF NOT EXISTS ...)
referencing the table name query_embedding_cache and column last_used_at so
eviction queries and range scans remain performant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/vouch/index_db.py`:
- Around line 89-94: The reset(kb_dir) function currently only clears FTS tables
(claims_fts, pages_fts, entities_fts) via open_db(...).executescript; update
that executescript to also delete rows from the semantic-search/cache tables
used by this project (e.g. embedding_cache and query_cache or whatever concrete
table names store embeddings and cached queries) so embeddings and query-cache
state are cleared on reset; modify the conn.executescript call in reset to
include "DELETE FROM embedding_cache; DELETE FROM query_cache;" (and add any
other embedding/query-cache table names present) so all stale semantic-search
state is removed.

---

Duplicate comments:
In `@src/vouch/embeddings/cache.py`:
- Line 24: The timestamp for last_used_at is truncated to seconds, breaking LRU
ordering; replace dt.datetime.now(dt.UTC).isoformat(timespec="seconds") with a
sub-second precision timestamp (e.g., timespec="microseconds" or omit timespec
to keep full precision) wherever it’s used (the expression assigning now /
last_used_at in src/vouch/embeddings/cache.py), and update both occurrences so
eviction ordering remains stable under bursts.
- Around line 15-16: The cache key currently derived only from the query text in
_query_key can return stale/mismatched vectors when the embedding model, model
version, or embedding dimension changes; update _query_key to incorporate a
namespace that includes the embedding model identity (name/version or model id)
and embedding dimension (e.g., "embedding:{model}:{dim}:{hash}") so keys are
unique per embedding configuration, compute the query hash as before
(hashlib.sha256 of the query bytes) and concatenate the model/dim values into
the returned string; apply the same pattern to any other key-generating helpers
used for document/vector caching so all cache keys are namespaced by embedding
identity.

In `@src/vouch/index_db.py`:
- Around line 248-254: The code currently caches sqlite-vec load state using
id(conn) which can collide if object ids are reused; change _sqlite_vec_loaded
from set[int] to a weakref.WeakSet of sqlite3.Connection objects (import
weakref) and update _load_sqlite_vec to check membership with "if conn in
_sqlite_vec_loaded" and add the actual connection object with
"_sqlite_vec_loaded.add(conn)" after a successful load; apply the same
replacement for the other occurrence referenced (the logic around lines 270-271)
so you track live connection objects rather than their numeric ids.
- Around line 315-317: The fallback scorer in the loop over rows uses raw dot
product (q @ vec) which mismatches the SQL cosine-distance backend; update the
scoring in the loop that iterates "for kind, id_, blob, dim in rows" (and uses
_blob_to_vec to get vec) to compute cosine similarity instead of raw dot:
compute the norms of q and vec and divide the dot product by (||q|| * ||vec||)
so score = (q @ vec) / (norm(q) * norm(vec)); ensure you handle zero-length
vectors safely (skip or treat score as 0) and use the same numeric types/array
library as q/_blob_to_vec (e.g., numpy.linalg.norm if using numpy) so rankings
match the SQL backend.
- Around line 298-303: The embedding-search SQL currently returns rows across
all embedding spaces; modify the inner SELECT in the query built around the
string starting "SELECT kind, id, 1.0 - vec_distance_cosine(vec, ?) AS score" to
add AND clauses constraining model = ?, model_version = ?, and dim = ? (i.e.,
limit to the active (model, model_version, dim) tuple), and bind those three
values immediately after the query's vector parameter; apply the same change to
the second analogous query at the later block (lines ~309-313) so both queries
filter by the active embedding space and avoid cross-space scoring and dim
mismatches.

---

Nitpick comments:
In `@src/vouch/embeddings/cache.py`:
- Around line 27-32: The current INSERT OR REPLACE SQL for query_embedding_cache
increments hit_count on both inserts and lookups; change it to initialize
hit_count to 0 on new inserts (remove the "+ 1" from the VALUES expression so it
uses COALESCE(..., 0) as-is) and move the increment logic to the cache lookup
path where hits are detected (update hit_count with an UPDATE or incrementing
statement when a query_hash is found). Update the SQL in the INSERT statement
and add/modify the lookup code that reads from query_embedding_cache to perform
an explicit hit_count increment on successful cache hits.

In `@src/vouch/index_db.py`:
- Around line 54-59: The eviction query sorts by last_used_at on the
query_embedding_cache table, so add a dedicated index on
query_embedding_cache(last_used_at) to keep ORDER BY efficient as max_entries
grows; update the DB schema/migration or the initialization routine that creates
the table to also create the index (use CREATE INDEX IF NOT EXISTS ...)
referencing the table name query_embedding_cache and column last_used_at so
eviction queries and range scans remain performant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 72e51305-3f13-4b9a-b72e-8627de2db56f

📥 Commits

Reviewing files that changed from the base of the PR and between 540b9ab and d51474b.

📒 Files selected for processing (4)
  • src/vouch/embeddings/cache.py
  • src/vouch/index_db.py
  • tests/embeddings/_fakes.py
  • tests/embeddings/test_storage.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants