feat(embeddings): storage layer — state.db schema + vector search + query cache#38
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an embedding-based semantic search subsystem: design spec and packaging, a pluggable Embedder registry with three adapters, SQLite-backed query caching, extended index DB embedding tables and search APIs (sqlite_vec or Python fallback), and unit/integration tests. ChangesEmbedding-Based Semantic Search
Sequence Diagram(s)sequenceDiagram
participant Client
participant IndexDB
participant SQLite
participant sqlite_vec
Client->>IndexDB: search_embedding(query_vec, kinds, limit)
IndexDB->>SQLite: attempt load sqlite_vec extension
alt sqlite_vec available
IndexDB->>sqlite_vec: SQL cosine scoring(query_vec)
sqlite_vec-->>IndexDB: top-k rows
IndexDB-->>Client: results with scores
else fallback
IndexDB->>SQLite: fetch embedding_index candidate rows
IndexDB->>IndexDB: compute cosine scores in Python
IndexDB-->>Client: sorted results
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
tests/embeddings/test_storage.py (1)
167-172: ⚡ Quick winAssert eviction correctness, not just cache size.
This currently validates only capacity, so a non-LRU implementation could still pass. Add key-level checks to prove oldest entries are evicted.
Proposed test tightening
def test_query_cache_lru_eviction(kb_dir: Path) -> None: - from vouch.embeddings.cache import cache_query_vec, query_cache_size + from vouch.embeddings.cache import cache_query_vec, lookup_query_vec, query_cache_size v = np.ones(4, dtype=np.float32) for i in range(5): cache_query_vec(kb_dir, query=f"q{i}", vec=v, max_entries=3) assert query_cache_size(kb_dir) <= 3 + assert lookup_query_vec(kb_dir, query="q0") is None + assert lookup_query_vec(kb_dir, query="q1") is None + assert lookup_query_vec(kb_dir, query="q2") is not None + assert lookup_query_vec(kb_dir, query="q3") is not None + assert lookup_query_vec(kb_dir, query="q4") is not Nonesrc/vouch/index_db.py (1)
40-50: 🏗️ Heavy liftThis sqlite-vec path is interim brute-force SQL; vec0 indexed search is already planned.
embedding_indexis currently a plain table usingvec_distance_cosine(...) ORDER BY, which is the manual KNN pattern—equivalent to O(n) full scan. However, the approved semantic search design spec (section 4) already calls for migrating to sqlite-vecvec0virtual tables (CREATE VIRTUAL TABLE <kind>_vecs USING vec0(embedding float[768])), which enable indexed ANN. This is tracked as part of the broader embedding integration milestone; the current code is a transitional fallback until that refactor lands.Also applies to: 294-300
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 40 - 50, The embedding_index table currently uses a brute-force KNN pattern (vec_distance_cosine ORDER BY) causing O(n) scans; replace this interim table with sqlite-vec indexed ANN when ready by adding vec0 virtual tables per kind (e.g., CREATE VIRTUAL TABLE <kind>_vecs USING vec0(embedding float[dim])) and migrate inserts/queries to write/read from those virtual tables instead of embedding_index; update any code paths that reference embedding_index and vec_distance_cosine (search logic and insert/update routines) to use the new <kind>_vecs table and vec0 query API, while keeping embedding_index only as a transitional fallback until full migration is complete.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/superpowers/specs/2026-05-20-semantic-search-design.md`:
- Around line 33-47: The fenced code block that lists "src/vouch/embeddings/"
lacks a language tag which triggers markdownlint MD040; update the opening fence
from ``` to ```text (or another appropriate language) for the block containing
"src/vouch/embeddings/" so the linter recognizes it and CI noise is avoided.
Locate the triple-backtick block that begins immediately before the
"src/vouch/embeddings/" listing and add the language identifier to the opening
fence.
In `@src/vouch/embeddings/__init__.py`:
- Around line 20-33: Replace the bare try/except/pass import blocks with
contextlib.suppress(ImportError): blocks to satisfy Ruff SIM105; import
contextlib (or from contextlib import suppress) at the top of the module and
wrap each import of st_mpnet, st_minilm, and fastembed_bge inside a
suppress(ImportError) context manager so the imports are quietly ignored but the
pattern no longer uses empty except clauses.
In `@src/vouch/embeddings/cache.py`:
- Around line 15-16: The cache key currently uses only the query hash in
_query_key, which can return embeddings from a different
model/version/dimension; update key construction to incorporate the embedder
identity by including model, model_version (or version), and dim (embedding
dimensionality) alongside the query hash (e.g., include these values in the
string hashed by _query_key or create a new _cache_key(query, model,
model_version, dim)). Alternatively, persist model/model_version/dim as separate
columns when storing embeddings and add explicit checks in the lookup code (the
retrieval path around lines 19-21) to reject/ignore cached rows whose
model/version/dim do not match the current embedder before returning results.
Ensure you reference and update the functions that compute and use the cache key
(_query_key and the lookup/store code) so mismatched shapes are never returned.
- Line 24: The code sets last_used_at using
dt.datetime.now(dt.UTC).isoformat(timespec="seconds"), which loses sub-second
precision and breaks LRU eviction ordering; update all occurrences (e.g., the
assignment to now and any places referenced by last_used_at) to include
sub-second precision—either use isoformat(timespec="microseconds") or store an
integer epoch with microsecond resolution
(int(datetime.now(tz=dt.UTC).timestamp() * 1_000_000)) consistently across the
functions that set last_used_at so eviction order remains deterministic under
bursty access.
In `@src/vouch/index_db.py`:
- Around line 40-47: search_embedding() is currently scoring all rows for
requested kinds and ignores stored embedding metadata, which risks
mismatched-dimension errors and polluted rankings; update both search paths in
search_embedding() to JOIN or WHERE against the active embedding metadata
(model, model_version, dim) stored for the DB and add WHERE clauses like model =
?, model_version = ?, dim = ? so q @ vec only runs on matching-dim vectors. Also
update set_embedding_meta() to either delete/purge stale embedding_index rows
whose model/model_version/dim do not match the newly set metadata or mark them
invalid, so subsequent searches remain constrained to current metadata;
reference the embedding_index table and the functions search_embedding() and
set_embedding_meta() when making these changes.
In `@tests/embeddings/_fakes.py`:
- Around line 32-42: The mock vector generator unpacks raw SHA256 bytes as
float32 (struct.unpack("<f", ...)) which can produce extremely large/NaN
intermediate values; change the numeric mapping to a bounded, stable range
instead: for each 4-byte chunk (the loop using seed, h, j and writing into
out[i]), convert the 4 bytes to an unsigned 32-bit integer (e.g., via
int.from_bytes(..., "little")), then scale that integer into a stable float
range like [-1.0, 1.0] or [0.0, 1.0] before assigning into out[i] (cast to
np.float32 if needed); keep the existing normalization step and loop structure
(references: seed, h, j, out, i, self.dim, np.linalg.norm) but replace the
struct.unpack("<f", ...) usage with the bounded scaling to avoid overflow
warnings during norm computation.
---
Nitpick comments:
In `@src/vouch/index_db.py`:
- Around line 40-50: The embedding_index table currently uses a brute-force KNN
pattern (vec_distance_cosine ORDER BY) causing O(n) scans; replace this interim
table with sqlite-vec indexed ANN when ready by adding vec0 virtual tables per
kind (e.g., CREATE VIRTUAL TABLE <kind>_vecs USING vec0(embedding float[dim]))
and migrate inserts/queries to write/read from those virtual tables instead of
embedding_index; update any code paths that reference embedding_index and
vec_distance_cosine (search logic and insert/update routines) to use the new
<kind>_vecs table and vec0 query API, while keeping embedding_index only as a
transitional fallback until full migration is complete.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 81d92f16-988c-4fbf-a818-b1ffdd4f15a4
📒 Files selected for processing (15)
docs/superpowers/plans/2026-05-20-semantic-search.mddocs/superpowers/specs/2026-05-20-semantic-search-design.mdpyproject.tomlsrc/vouch/embeddings/__init__.pysrc/vouch/embeddings/base.pysrc/vouch/embeddings/cache.pysrc/vouch/embeddings/fastembed_bge.pysrc/vouch/embeddings/st_minilm.pysrc/vouch/embeddings/st_mpnet.pysrc/vouch/index_db.pytests/embeddings/__init__.pytests/embeddings/_fakes.pytests/embeddings/test_core.pytests/embeddings/test_integration.pytests/embeddings/test_storage.py
| ``` | ||
| src/vouch/embeddings/ | ||
| __init__.py # public API: encode, search, register | ||
| base.py # Embedder ABC + adapter registry | ||
| st_mpnet.py # default impl (sentence-transformers all-mpnet-base-v2) | ||
| st_minilm.py # alternative impl | ||
| fastembed_bge.py # alternative impl (no-torch path via fastembed) | ||
| cache.py # query embedding LRU + content-hash skip cache | ||
| rerank.py # cross-encoder reranker (ms-marco-MiniLM-L6-v2) | ||
| hyde.py # Hypothetical Document Embedding query expansion | ||
| dedup.py # cosine-threshold duplicate detection at ingest | ||
| fusion.py # RRF, weighted-sum, normalized-cosine fusion strategies | ||
| eval.py # recall@k / MRR / nDCG harness | ||
| migration.py # model-identity check + backfill orchestration | ||
| ``` |
There was a problem hiding this comment.
Add a language tag to the fenced code block.
Line 33 triggers markdownlint MD040 (fenced-code-language). This can cause doc-lint CI noise/failures depending on pipeline strictness.
Proposed fix
-```
+```text
src/vouch/embeddings/
__init__.py # public API: encode, search, register
base.py # Embedder ABC + adapter registry
st_mpnet.py # default impl (sentence-transformers all-mpnet-base-v2)
st_minilm.py # alternative impl
fastembed_bge.py # alternative impl (no-torch path via fastembed)
cache.py # query embedding LRU + content-hash skip cache
rerank.py # cross-encoder reranker (ms-marco-MiniLM-L6-v2)
hyde.py # Hypothetical Document Embedding query expansion
dedup.py # cosine-threshold duplicate detection at ingest
fusion.py # RRF, weighted-sum, normalized-cosine fusion strategies
eval.py # recall@k / MRR / nDCG harness
migration.py # model-identity check + backfill orchestration</details>
<details>
<summary>🧰 Tools</summary>
<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>
[warning] 33-33: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
</details>
</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @docs/superpowers/specs/2026-05-20-semantic-search-design.md around lines 33
- 47, The fenced code block that lists "src/vouch/embeddings/" lacks a language
tag which triggers markdownlint MD040; update the opening fence from ``` to
"src/vouch/embeddings/" so the linter recognizes it and CI noise is avoided.
Locate the triple-backtick block that begins immediately before the
"src/vouch/embeddings/" listing and add the language identifier to the opening
fence.
| def _query_key(query: str) -> str: | ||
| return hashlib.sha256(query.encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Namespace cache entries by model/version/dimension.
Right now the cache key is only the query hash. If the configured embedder changes model, version, or dimensionality, the same query will reuse a stale vector; that can either fail downstream on shape mismatches or silently rank against the wrong embedding space. Include (model, model_version, dim) in the key, or persist those columns and reject mismatches on lookup.
Also applies to: 19-21
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The cache key currently
uses only the query hash in _query_key, which can return embeddings from a
different model/version/dimension; update key construction to incorporate the
embedder identity by including model, model_version (or version), and dim
(embedding dimensionality) alongside the query hash (e.g., include these values
in the string hashed by _query_key or create a new _cache_key(query, model,
model_version, dim)). Alternatively, persist model/model_version/dim as separate
columns when storing embeddings and add explicit checks in the lookup code (the
retrieval path around lines 19-21) to reject/ignore cached rows whose
model/version/dim do not match the current embedder before returning results.
Ensure you reference and update the functions that compute and use the cache key
(_query_key and the lookup/store code) so mismatched shapes are never returned.
| ) -> None: | ||
| h = _query_key(query) | ||
| blob = np.asarray(vec, dtype=np.float32).tobytes() | ||
| now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds") |
There was a problem hiding this comment.
Use sub-second precision for last_used_at.
timespec="seconds" gives every access in the same second the same sort key, so eviction order becomes arbitrary under normal bursty usage. Switching to microseconds or an integer epoch timestamp preserves the LRU contract much more reliably.
Suggested change
- now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds")
+ now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds")
...
- now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds")
+ now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds")Also applies to: 37-43, 48-59
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/vouch/embeddings/cache.py` at line 24, The code sets last_used_at using
dt.datetime.now(dt.UTC).isoformat(timespec="seconds"), which loses sub-second
precision and breaks LRU eviction ordering; update all occurrences (e.g., the
assignment to now and any places referenced by last_used_at) to include
sub-second precision—either use isoformat(timespec="microseconds") or store an
integer epoch with microsecond resolution
(int(datetime.now(tz=dt.UTC).timestamp() * 1_000_000)) consistently across the
functions that set last_used_at so eviction order remains deterministic under
bursty access.
| CREATE TABLE IF NOT EXISTS embedding_index ( | ||
| kind TEXT NOT NULL, | ||
| id TEXT NOT NULL, | ||
| vec BLOB NOT NULL, | ||
| content_hash TEXT NOT NULL, | ||
| model TEXT NOT NULL, | ||
| model_version TEXT NOT NULL, | ||
| dim INTEGER NOT NULL, |
There was a problem hiding this comment.
Filter searches to the active embedding metadata.
You persist model, model_version, and dim, but search_embedding() ignores all three and scores every row of the requested kinds. After a model upgrade or partial reindex, mixed rows will either blow up on q @ vec when dims differ or pollute ranking when dims happen to match. Constrain both search paths to the active metadata, or purge stale rows when set_embedding_meta() changes it.
Also applies to: 228-245, 290-313
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/vouch/index_db.py` around lines 40 - 47, search_embedding() is currently
scoring all rows for requested kinds and ignores stored embedding metadata,
which risks mismatched-dimension errors and polluted rankings; update both
search paths in search_embedding() to JOIN or WHERE against the active embedding
metadata (model, model_version, dim) stored for the DB and add WHERE clauses
like model = ?, model_version = ?, dim = ? so q @ vec only runs on matching-dim
vectors. Also update set_embedding_meta() to either delete/purge stale
embedding_index rows whose model/model_version/dim do not match the newly set
metadata or mark them invalid, so subsequent searches remain constrained to
current metadata; reference the embedding_index table and the functions
search_embedding() and set_embedding_meta() when making these changes.
5177248 to
13e709d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pyproject.toml (2)
21-22:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winRemove duplicate
clickdependency declaration.Both lines declare
clickwith different version constraints (>=8.4.0,<9vs>=8.1,<9). This duplication can cause build-tool errors or unpredictable dependency resolution.🔧 Proposed fix
- "click>=8.4.0,<9", "click>=8.1,<9",Keep the less restrictive constraint (
>=8.1) unless there's a specific requirement for 8.4.0+.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` around lines 21 - 22, Remove the duplicate click dependency entry in pyproject.toml: keep a single "click" requirement and remove the other line (choose the less restrictive constraint ">=8.1,<9" unless you have a hard requirement for ">=8.4.0,<9"); update the dependency list so only one "click" entry remains to avoid conflicting constraints.
37-37:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRelax
sentence-transformersversion constraint to allow v4+ and v5+.The constraint
>=2.7,<4blocks v4.x and v5.x (currently at 5.5.0). The codebase uses only the basic API (SentenceTransformer()initialization and.encode()with standard parameters), which is stable across v4 and v5. The v5.0 breaking changes (sparse encoding support, newencode_query/encode_documentmethods) do not affect this usage. Update the constraint to at least>=2.7,<6to allow users to benefit from newer releases, after confirming compatibility with v4.x through testing.Also applies to: 48-48
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` at line 37, Update the sentence-transformers version constraint in pyproject.toml from "sentence-transformers>=2.7,<4" to "sentence-transformers>=2.7,<6" (and update the other occurrence on the file, referenced in the comment) so v4/v5 releases are allowed; keep the minimum at 2.7 and run quick compatibility checks against SentenceTransformer() initialization and .encode() usage to confirm no changes are needed.
🧹 Nitpick comments (3)
src/vouch/index_db.py (2)
61-67: 💤 Low valueConsider adding a primary key or unique constraint to
embedding_dupes.The
embedding_dupestable lacks a primary key, allowing duplicate rows for the same(kind, id, near_id)pair. If this is intentional for audit logging, this is fine. Otherwise, consider addingPRIMARY KEY (kind, id, near_id)to prevent redundant records.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 61 - 67, The embedding_dupes table currently has no primary key, which allows duplicate (kind,id,near_id) rows; update the CREATE TABLE for embedding_dupes to enforce uniqueness by adding a PRIMARY KEY (kind, id, near_id) or a UNIQUE(kind, id, near_id) constraint so duplicates are prevented; if audit logging of duplicates is intended instead, add an explicit autoincrement id column (e.g., dup_id) as PRIMARY KEY and keep (kind, id, near_id) indexed or made UNIQUE as appropriate.
248-271: 💤 Low valueConnection-id caching can give false positives after connection reuse.
_sqlite_vec_loadedtracks connections byid(conn). If a connection is closed and Python reuses that object id for a new connection, the cache will incorrectly report the extension as loaded. This is unlikely with the short-livedopen_dbcontext manager pattern but could surface under different usage patterns.Consider clearing the id from
_sqlite_vec_loadedwhen the connection closes, or using a connection attribute instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 248 - 271, The current cache `_sqlite_vec_loaded` uses id(conn) which can be reused and give false positives; change the cache to track the actual connection object (e.g., use weakref.WeakSet or set an attribute on the connection) and update `_load_sqlite_vec` to check that connection-specific marker instead of id(conn); specifically replace uses of `_sqlite_vec_loaded` and `id(conn)` with a WeakSet of sqlite3.Connection objects or `setattr(conn, "_sqlite_vec_loaded", True)`, test presence with `getattr(conn, "_sqlite_vec_loaded", False)`, and set it after successful `sqlite_vec.load(conn)` while leaving the existing extension enable/disable and exception handling around `conn.enable_load_extension` and `sqlite_vec.load` intact.tests/embeddings/test_storage.py (1)
85-92: ⚡ Quick winClarify vector normalization expectations.
The test normalizes
nearexplicitly (line 92) but leavesqueryandfaras-is. Whilequeryandfarhappen to be unit vectors already, the inconsistency may confuse readers about whether the search API requires pre-normalized vectors.♻️ Proposed fix: normalize all vectors explicitly
def test_search_embedding_returns_topk(kb_dir: Path) -> None: from vouch.embeddings.base import content_hash as ch query = np.zeros(8, dtype=np.float32) query[0] = 1.0 + query /= float(np.linalg.norm(query)) near = np.zeros(8, dtype=np.float32) near[0] = 0.99 near[1] = 0.05 far = np.zeros(8, dtype=np.float32) far[7] = 1.0 + far /= float(np.linalg.norm(far)) near /= float(np.linalg.norm(near))Alternatively, add a comment clarifying that
queryandfarare already unit vectors:# query and far are already unit vectors (magnitude=1.0); near requires normalization🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/embeddings/test_storage.py` around lines 85 - 92, The test inconsistently normalizes vectors: only near is normalized while query and far are left implicit; update the test to explicitly normalize query, near, and far (e.g., divide each by float(np.linalg.norm(...))) so all three vectors are clearly unit vectors before use, referencing the variables query, near, and far and replacing the single near normalization with normalization calls for all three vectors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/embeddings/test_storage.py`:
- Line 7: The test module currently does a hard import of numpy ("import numpy
as np") which causes CI failures when numpy isn't installed; replace that
top-level import with a runtime skip using pytest.importorskip('numpy') (or wrap
in try/except and call pytest.skip) so tests are skipped when numpy is
unavailable, or alternatively add a module-level pytest marker like pytestmark =
pytest.mark.requires_embeddings and ensure CI installs the embeddings extra;
target the import in tests/embeddings/test_storage.py (the current "import numpy
as np" line) and adjust it to use pytest.importorskip('numpy') or add the marker
as described.
---
Outside diff comments:
In `@pyproject.toml`:
- Around line 21-22: Remove the duplicate click dependency entry in
pyproject.toml: keep a single "click" requirement and remove the other line
(choose the less restrictive constraint ">=8.1,<9" unless you have a hard
requirement for ">=8.4.0,<9"); update the dependency list so only one "click"
entry remains to avoid conflicting constraints.
- Line 37: Update the sentence-transformers version constraint in pyproject.toml
from "sentence-transformers>=2.7,<4" to "sentence-transformers>=2.7,<6" (and
update the other occurrence on the file, referenced in the comment) so v4/v5
releases are allowed; keep the minimum at 2.7 and run quick compatibility checks
against SentenceTransformer() initialization and .encode() usage to confirm no
changes are needed.
---
Nitpick comments:
In `@src/vouch/index_db.py`:
- Around line 61-67: The embedding_dupes table currently has no primary key,
which allows duplicate (kind,id,near_id) rows; update the CREATE TABLE for
embedding_dupes to enforce uniqueness by adding a PRIMARY KEY (kind, id,
near_id) or a UNIQUE(kind, id, near_id) constraint so duplicates are prevented;
if audit logging of duplicates is intended instead, add an explicit
autoincrement id column (e.g., dup_id) as PRIMARY KEY and keep (kind, id,
near_id) indexed or made UNIQUE as appropriate.
- Around line 248-271: The current cache `_sqlite_vec_loaded` uses id(conn)
which can be reused and give false positives; change the cache to track the
actual connection object (e.g., use weakref.WeakSet or set an attribute on the
connection) and update `_load_sqlite_vec` to check that connection-specific
marker instead of id(conn); specifically replace uses of `_sqlite_vec_loaded`
and `id(conn)` with a WeakSet of sqlite3.Connection objects or `setattr(conn,
"_sqlite_vec_loaded", True)`, test presence with `getattr(conn,
"_sqlite_vec_loaded", False)`, and set it after successful
`sqlite_vec.load(conn)` while leaving the existing extension enable/disable and
exception handling around `conn.enable_load_extension` and `sqlite_vec.load`
intact.
In `@tests/embeddings/test_storage.py`:
- Around line 85-92: The test inconsistently normalizes vectors: only near is
normalized while query and far are left implicit; update the test to explicitly
normalize query, near, and far (e.g., divide each by float(np.linalg.norm(...)))
so all three vectors are clearly unit vectors before use, referencing the
variables query, near, and far and replacing the single near normalization with
normalization calls for all three vectors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 537d08ea-48fd-4476-ab92-cd562b11073e
📒 Files selected for processing (5)
pyproject.tomlsrc/vouch/embeddings/__init__.pysrc/vouch/embeddings/cache.pysrc/vouch/index_db.pytests/embeddings/test_storage.py
|
|
||
| from pathlib import Path | ||
|
|
||
| import numpy as np |
There was a problem hiding this comment.
Critical: numpy import causes test collection failures in CI.
All pipeline failures show ModuleNotFoundError: No module named 'numpy'. Since the PR introduces optional dependency groups and embeddings are optional, numpy is likely not installed in the base CI environment.
🔧 Suggested fixes (choose one approach)
Option 1: Skip tests when numpy unavailable (recommended if embeddings remain optional)
+pytest.importorskip("numpy", reason="embeddings tests require numpy")
+
from pathlib import Path
import numpy as npOption 2: Update CI workflow to install optional embeddings dependencies
Ensure CI installs the embeddings extra when running this test suite:
pip install -e ".[embeddings,test]"Option 3: Mark tests requiring optional dependencies
Add a pytest marker at module or test level:
pytestmark = pytest.mark.requires_embeddingsand configure CI to install embeddings dependencies for tests with that marker.
🧰 Tools
🪛 GitHub Actions: ci / 1_test (py3.13).txt
[error] 7-7: Pytest test collection failed due to ImportError: ModuleNotFoundError: No module named 'numpy'.
🪛 GitHub Actions: ci / 3_test (py3.11).txt
[error] 7-7: Test collection failed (ImportError). Module 'numpy' is missing: ModuleNotFoundError: No module named 'numpy'.
🪛 GitHub Actions: ci / test (py3.11)
[error] 7-7: Pytest import error during test collection: ModuleNotFoundError: No module named 'numpy'.
🪛 GitHub Actions: ci / test (py3.12)
[error] 7-7: pytest collection failed due to ImportError: ModuleNotFoundError: No module named 'numpy'.
🪛 GitHub Actions: ci / test (py3.13)
[error] 7-7: pytest collection failed due to ImportError: ModuleNotFoundError: No module named 'numpy'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/embeddings/test_storage.py` at line 7, The test module currently does a
hard import of numpy ("import numpy as np") which causes CI failures when numpy
isn't installed; replace that top-level import with a runtime skip using
pytest.importorskip('numpy') (or wrap in try/except and call pytest.skip) so
tests are skipped when numpy is unavailable, or alternatively add a module-level
pytest marker like pytestmark = pytest.mark.requires_embeddings and ensure CI
installs the embeddings extra; target the import in
tests/embeddings/test_storage.py (the current "import numpy as np" line) and
adjust it to use pytest.importorskip('numpy') or add the marker as described.
|
Schema looks good. Storing model identity per row + raw float32 bytes is exactly right (way better than the JSON approach in #36). The NumPy fallback when sqlite-vec isn't loadable is a nice touch. One thing I want to flag before merging: In Smaller stuff:
Tests are solid. Ship it once the SQLite version thing is confirmed. |
13e709d to
1d20dc7
Compare
Approved design for feat/semantic-search. Embedding (sentence- transformers all-mpnet-base-v2) becomes the primary search backend with FTS5 as fallback. Synchronous-at-write indexing across all six artifact types (claim, page, source, entity, relation, evidence). Maximally functional scope (~3000 LOC): pluggable model adapter registry, sqlite-vec ANN with NumPy fallback, cross-encoder rerank, HyDE query expansion, ingest-time duplicate detection, model-identity migration, recall/MRR/nDCG eval harness, and full CLI/MCP/JSONL parity. The writing-plans step will turn the rollout order in section 16 into concrete phased tasks.
32-task TDD plan for the semantic-search feature: foundation, storage, write hooks across all six artifact types, semantic-primary search integration in MCP/JSONL/CLI, RRF fusion + hybrid, cross-encoder rerank, HyDE expansion, ingest-time duplicate detection, model-identity migration, recall/MRR/nDCG scorer, end-to-end integration test, and user docs. Each task: failing test, minimal implementation, passing test, commit. MockEmbedder test double keeps the unit suite fast; the real model is exercised only under @pytest.mark.integration. The evaluation module is named scorer.py rather than eval.py to avoid shadowing the Python builtin and to keep static analysers quiet; the user-facing CLI subcommand remains `vouch eval embedding` (the Click group is registered under the name "eval", with the Python identifier eval_group).
CI ran ruff which flagged SIM105 on the three try/except ImportError guard blocks in src/vouch/embeddings/__init__.py. Replace each with `contextlib.suppress(ImportError)` -- same semantics, satisfies ruff. Also add mypy overrides for numpy / sqlite_vec / sentence_transformers / fastembed so the type check passes in the base [dev] CI install where those optional extras aren't present. The embedding code paths that import these libraries are only reached when the extras are installed at runtime; for the static type check the missing stubs are noise.
CI runs `pip install -e '.[dev]'` which deliberately excludes the
optional `[embeddings]` extras. tests/embeddings/test_*.py modules
import numpy at top level, so pytest collection fails with
ModuleNotFoundError before any test can be deselected.
Add tests/embeddings/conftest.py with `pytest.importorskip("numpy")`
so the entire embeddings test directory skips gracefully when numpy
is absent. Once `pip install vouch[embeddings]` (or numpy itself) is
present, the skip is a no-op and the tests run normally.
`_validate_content` in bundle.py uses the path's first directory component to look up a Pydantic validator. For sources, both `sources/<sha>/meta.yaml` (the Source model) and `sources/<sha>/content` (raw opaque bytes) hit the same "sources" key, so the validator was being run on the raw content bytes and failing with "1 validation error for Source". Add an early return for any non-meta.yaml path under `sources/` so opaque content bytes are not Pydantic-validated. Only the Source metadata file is checked, which matches the original intent of the PR vouchdev#13 validation. Unblocks three pre-existing test_bundle.py failures inherited from upstream main.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/vouch/index_db.py (1)
89-94:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
reset()does not match its “Drop everything” contract.Line 90 says “Drop everything”, but only FTS tables are cleared. Embedding tables (
embedding_index,query_embedding_cache,embedding_dupes) remain populated, which leaves stale semantic state after reset.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 89 - 94, The reset() function currently only clears FTS tables; update it (in src/vouch/index_db.py inside reset and using open_db) to also remove all embedding-related state by executing DELETE (or DROP/CREATE) for embedding_index, query_embedding_cache, and embedding_dupes so no stale semantic state remains; ensure the same connection.executescript call includes those table names (and any other non-FTS tables that must be cleared) so the function truly “drops everything.”
♻️ Duplicate comments (4)
src/vouch/embeddings/cache.py (2)
24-24:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse sub-second precision for
last_used_atto preserve LRU ordering.
timespec="seconds"causes ties during bursty access, making eviction order non-deterministic.Also applies to: 48-48
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` at line 24, The current timestamp generation uses dt.datetime.now(dt.UTC).isoformat(timespec="seconds") which drops sub-second precision and can break deterministic LRU eviction; update both occurrences (the now assignment used to set last_used_at) to include sub-second precision, e.g., use timespec="microseconds" (or "milliseconds" if you prefer) so last_used_at preserves ordering across bursty accesses; locate the dt.datetime.now(dt.UTC).isoformat(...) calls (the now variable assignments that populate last_used_at) and replace timespec="seconds" with timespec="microseconds" in both places.
15-16:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftNamespace cache keys by embedding space, not query text alone.
Using only
queryhash allows stale vectors to be reused across model/version/dim changes. That can silently corrupt ranking or break on shape mismatch downstream. Include model identity in the key (or persist model/version/dim in cache rows and validate on lookup).Also applies to: 22-33, 47-53
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The current _query_key function only hashes the query text, which can cause reuse of stale embeddings across model/version/dimension changes; update the key strategy to include model identity (e.g., model name/version and embedding dimension) when generating cache keys—either change _query_key to accept model_id/model_name and dim and incorporate them into the SHA256 input, or persist model/version/dim alongside cached vectors and validate those fields on lookup before returning a cached embedding; ensure all places that call _query_key (and other cache key builders indicated around lines 22-33 and 47-53) are updated to pass the model identity/dim so keys are namespaced per embedding space.tests/embeddings/test_storage.py (1)
7-8:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard optional dependency import to avoid test-collection failures.
Top-level
import numpy as npwill fail collection in environments without embeddings extras. Usepytest.importorskip("numpy")(or equivalent marker strategy) at module import time so the suite skips cleanly when optional deps are absent.#!/usr/bin/env bash set -euo pipefail # Verify whether numpy is declared in runtime/test extras and whether CI installs those extras. rg -n "numpy|embeddings|optional-dependencies|extras_require" pyproject.toml setup.py rg -n "pip install|\\.[^\"']*embeddings|pytest" .github/workflows🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/embeddings/test_storage.py` around lines 7 - 8, Top-level "import numpy as np" in the tests/embeddings module causes test collection to fail when the optional embeddings extras aren't installed; replace the direct import with a guarded import using pytest.importorskip("numpy") at module import time (e.g., call pytest.importorskip("numpy") before any use of numpy) so the whole module is skipped cleanly when numpy is absent and avoids collection errors.src/vouch/index_db.py (1)
295-299:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftConstrain embedding search to active model/version/dim metadata.
Both SQL and NumPy paths score all rows of selected
kinds, ignoring stored embedding metadata. Mixed embedding spaces can degrade ranking and can throw runtime errors on dimension mismatch inq @ vec. Filter search rows to active(model, model_version, dim)(or purge stale rows when meta changes).Also applies to: 306-314
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 295 - 299, The SQL and NumPy search paths currently score all rows for the requested kinds and ignore embedding metadata; restrict both to only rows matching the active embedding metadata (model, model_version, dim). Update the SQL query that uses "FROM embedding_index ... vec_distance_cosine" to add WHERE conditions for model = ? AND model_version = ? AND dim = ? and pass the active model/model_version/dim as parameters (in addition to the existing placeholders and score/limit), and in the NumPy path (the code that computes q @ vec) filter the retrieved rows by their stored model/model_version/dim before computing dot products (or drop/stale-purge rows that don't match). Ensure you reference the embedding_index rows' model/model_version/dim fields in both code paths so mixed embedding spaces cannot be compared.
🧹 Nitpick comments (1)
src/vouch/embeddings/cache.py (1)
27-33: ⚡ Quick winClarify/adjust
hit_countsemantics (insert vs cache-hit).
hit_countis incremented both on write and on lookup-hit, soquery_cache_stats()["hits"]currently mixes “encoded+stored” with “served-from-cache”. If this metric is intended to represent cache effectiveness, increment only on lookup hits (or split into two counters).Also applies to: 57-58, 79-83
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` around lines 27 - 33, The INSERT/REPLACE SQL for the query_embedding_cache currently adds +1 to hit_count on writes, causing query_cache_stats()["hits"] to conflate store operations with cache hits; change the upsert SQL in src/vouch/embeddings/cache.py (the statement referencing "INSERT OR REPLACE INTO query_embedding_cache" and the query_embedding_cache table) to preserve existing hit_count on write (e.g., COALESCE((SELECT hit_count FROM query_embedding_cache WHERE query_hash=?), 0) without +1) or initialize to 0 for new rows, and ensure only the lookup path (the code that currently increments hit_count on cache lookups — referenced by the lookup/update block around the other occurrences at lines noted 57-58 and 79-83) performs the +1 increment; update the logic used by query_cache_stats() to rely on the lookup-only increment so "hits" reflects served-from-cache only.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/vouch/index_db.py`:
- Around line 253-254: The current id()-based cache (_sqlite_vec_loaded) is
unsafe; replace it with a WeakSet or remove it: change the declaration of
_sqlite_vec_loaded from set[int] to weakref.WeakSet(), import weakref, and
update checks from "if id(conn) in _sqlite_vec_loaded" to "if conn in
_sqlite_vec_loaded" and additions from "_sqlite_vec_loaded.add(id(conn))" to
"_sqlite_vec_loaded.add(conn)"; alternatively, remove all mentions of
_sqlite_vec_loaded (lines checking/adding) so the loader code in the function
that accepts conn always runs (loader is idempotent), ensuring behavior is
correct without relying on object id reuse.
- Around line 295-300: The SQL uses the SELECT alias "score" in the WHERE clause
inside the query built in src/vouch/index_db.py (the call that passes
(q.tobytes(), *kinds, min_score, limit) and calls vec_distance_cosine(vec, ?)),
which is non-standard and can fail in some SQLite configs and triggers the slow
Python fallback; fix it by rewriting the query to compute the score in a
subquery/CTE or by inlining the cosine expression in the WHERE clause (e.g.,
SELECT ... FROM (SELECT ..., 1.0 - vec_distance_cosine(vec, ?) AS score FROM
embedding_index WHERE kind IN (...) ) AS sub WHERE score >= ? ORDER BY score
DESC LIMIT ?), keeping the same parameter ordering for q, kinds, min_score,
limit and still using vec_distance_cosine(vec, ?) so the DB does the filtering
instead of Python post-filtering.
---
Outside diff comments:
In `@src/vouch/index_db.py`:
- Around line 89-94: The reset() function currently only clears FTS tables;
update it (in src/vouch/index_db.py inside reset and using open_db) to also
remove all embedding-related state by executing DELETE (or DROP/CREATE) for
embedding_index, query_embedding_cache, and embedding_dupes so no stale semantic
state remains; ensure the same connection.executescript call includes those
table names (and any other non-FTS tables that must be cleared) so the function
truly “drops everything.”
---
Duplicate comments:
In `@src/vouch/embeddings/cache.py`:
- Line 24: The current timestamp generation uses
dt.datetime.now(dt.UTC).isoformat(timespec="seconds") which drops sub-second
precision and can break deterministic LRU eviction; update both occurrences (the
now assignment used to set last_used_at) to include sub-second precision, e.g.,
use timespec="microseconds" (or "milliseconds" if you prefer) so last_used_at
preserves ordering across bursty accesses; locate the
dt.datetime.now(dt.UTC).isoformat(...) calls (the now variable assignments that
populate last_used_at) and replace timespec="seconds" with
timespec="microseconds" in both places.
- Around line 15-16: The current _query_key function only hashes the query text,
which can cause reuse of stale embeddings across model/version/dimension
changes; update the key strategy to include model identity (e.g., model
name/version and embedding dimension) when generating cache keys—either change
_query_key to accept model_id/model_name and dim and incorporate them into the
SHA256 input, or persist model/version/dim alongside cached vectors and validate
those fields on lookup before returning a cached embedding; ensure all places
that call _query_key (and other cache key builders indicated around lines 22-33
and 47-53) are updated to pass the model identity/dim so keys are namespaced per
embedding space.
In `@src/vouch/index_db.py`:
- Around line 295-299: The SQL and NumPy search paths currently score all rows
for the requested kinds and ignore embedding metadata; restrict both to only
rows matching the active embedding metadata (model, model_version, dim). Update
the SQL query that uses "FROM embedding_index ... vec_distance_cosine" to add
WHERE conditions for model = ? AND model_version = ? AND dim = ? and pass the
active model/model_version/dim as parameters (in addition to the existing
placeholders and score/limit), and in the NumPy path (the code that computes q @
vec) filter the retrieved rows by their stored model/model_version/dim before
computing dot products (or drop/stale-purge rows that don't match). Ensure you
reference the embedding_index rows' model/model_version/dim fields in both code
paths so mixed embedding spaces cannot be compared.
In `@tests/embeddings/test_storage.py`:
- Around line 7-8: Top-level "import numpy as np" in the tests/embeddings module
causes test collection to fail when the optional embeddings extras aren't
installed; replace the direct import with a guarded import using
pytest.importorskip("numpy") at module import time (e.g., call
pytest.importorskip("numpy") before any use of numpy) so the whole module is
skipped cleanly when numpy is absent and avoids collection errors.
---
Nitpick comments:
In `@src/vouch/embeddings/cache.py`:
- Around line 27-33: The INSERT/REPLACE SQL for the query_embedding_cache
currently adds +1 to hit_count on writes, causing query_cache_stats()["hits"] to
conflate store operations with cache hits; change the upsert SQL in
src/vouch/embeddings/cache.py (the statement referencing "INSERT OR REPLACE INTO
query_embedding_cache" and the query_embedding_cache table) to preserve existing
hit_count on write (e.g., COALESCE((SELECT hit_count FROM query_embedding_cache
WHERE query_hash=?), 0) without +1) or initialize to 0 for new rows, and ensure
only the lookup path (the code that currently increments hit_count on cache
lookups — referenced by the lookup/update block around the other occurrences at
lines noted 57-58 and 79-83) performs the +1 increment; update the logic used by
query_cache_stats() to rely on the lookup-only increment so "hits" reflects
served-from-cache only.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 6e96ac29-9924-4992-a9fe-b2c5725e2bfa
📒 Files selected for processing (4)
src/vouch/embeddings/cache.pysrc/vouch/index_db.pytests/embeddings/conftest.pytests/embeddings/test_storage.py
| if id(conn) in _sqlite_vec_loaded: | ||
| return True |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Can Python reuse object IDs after an object is garbage collected, and is id() stable only for an object's lifetime?
💡 Result:
Yes. Python’s id() is stable only while the object is alive (reachable). If an object is garbage-collected, Python may reuse the same id() value for a later, different object with a non-overlapping lifetime [1][2]. What Python guarantees: - id(object) returns an integer that is “unique and constant for this object during its lifetime” [1][2]. - “Two objects with non-overlapping lifetimes may have the same id() value” [1][2]. About reuse after garbage collection: - Python objects become garbage-collected when they are unreachable; implementations are allowed to collect them at different times, and you should not depend on immediate finalization [3]. - When the original object is collected and its memory is available, a later object may be allocated at the same address (CPython implementation detail), causing id() to repeat [1][2]. So, id() is not a lifetime-unique identifier across time in a program; it’s only stable for the lifetime of the specific object instance [1][2]. If you need a stable ID beyond that, use something like a UUID stored on the object rather than id().
Citations:
- 1: https://docs.python.org/3/library/functions.html?highlight=hash
- 2: https://docs.python.org/3/library/functions.html
- 3: http://docs.python.org/reference/datamodel.html
🏁 Script executed:
cat -n src/vouch/index_db.py | sed -n '245,280p'Repository: vouchdev/vouch
Length of output: 1258
Remove the sqlite-vec load state cache or switch to a stable cache key.
Python's id() can be reused after garbage collection; when a connection is garbage-collected and a new one is allocated at the same memory address, the code will incorrectly skip loader execution for the new connection. Replace _sqlite_vec_loaded with either a WeakSet to track connections by object identity, or remove the cache entirely if performance impact is acceptable (loader can be called multiple times safely).
Lines affected
- Line 248:
_sqlite_vec_loaded: set[int] = set() - Lines 253–254:
if id(conn) in _sqlite_vec_loaded: return True - Lines 270–271:
_sqlite_vec_loaded.add(id(conn))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/vouch/index_db.py` around lines 253 - 254, The current id()-based cache
(_sqlite_vec_loaded) is unsafe; replace it with a WeakSet or remove it: change
the declaration of _sqlite_vec_loaded from set[int] to weakref.WeakSet(), import
weakref, and update checks from "if id(conn) in _sqlite_vec_loaded" to "if conn
in _sqlite_vec_loaded" and additions from "_sqlite_vec_loaded.add(id(conn))" to
"_sqlite_vec_loaded.add(conn)"; alternatively, remove all mentions of
_sqlite_vec_loaded (lines checking/adding) so the loader code in the function
that accepts conn always runs (loader is idempotent), ensuring behavior is
correct without relying on object id reuse.
1d20dc7 to
540b9ab
Compare
Per CodeRabbit on PR vouchdev#37: decoding sha256 chunks as raw float32 (via struct.unpack("<f", ...)) produced NaN/Inf for ~1 in 2^7 chunks because most 4-byte patterns aren't finite IEEE754 floats. Norm overflow then made unit normalization unreliable and broke cosine similarity asserts in downstream phases (the dedup test in Phase 6 already had to work around this with a custom _HashEmbedder). Decode each chunk as an unsigned 32-bit integer, scale to [-1.0, 1.0], accumulate in float64 for exact norm, and cast to float32 on return. Deterministic, finite, well-behaved.
540b9ab to
3722ee4
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (6)
src/vouch/index_db.py (3)
248-254:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid caching sqlite-vec load state by
id(conn).
id()values can be reused after GC, so a new connection may be incorrectly treated as already loaded. Use aweakref.WeakSet[sqlite3.Connection]or drop this cache and always attempt load.Also applies to: 270-271
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 248 - 254, The current cache _sqlite_vec_loaded uses id(conn) which can be recycled; update _load_sqlite_vec to stop using id(conn) keys and instead track connections with a weakref.WeakSet[sqlite3.Connection] (or remove the cache entirely and always attempt to load), replacing the int set _sqlite_vec_loaded with a WeakSet and checking membership by connection object; update any other uses around the same module (the other occurrences of _sqlite_vec_loaded) to use the new WeakSet to avoid false positives when connection objects are GC'd and ids get reused.
295-300:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t filter on SELECT alias
scoreinWHERE.Using
scorealias in the same SELECT is non-portable and can triggerOperationalError, silently downgrading to Python brute-force. Rewrite with a subquery/CTE (or inline the expression inWHERE) so DB-side filtering is reliable.Also applies to: 303-304
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 295 - 300, The SQL uses the SELECT alias "score" in the WHERE clause (in the query built around the embedding_index selection near the code that passes (q.tobytes(), *kinds, min_score, limit)), which is non-portable; change the query in the function/method that builds this statement (and the similar one around lines 303-304) to either use a subquery/CTE or inline the expression 1.0 - vec_distance_cosine(vec, ?) directly in the WHERE (e.g., wrap the SELECT as a subquery/CTE that computes score and then filter on score in the outer WHERE), keeping the same parameter order (q.tobytes(), *kinds, min_score, limit) so DB-side filtering works reliably.
295-299:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftConstrain search rows to active embedding metadata (model/version/dim).
Current search scans by
kindonly. Mixed embeddings (post-upgrade/partial reindex) can produce shape errors or cross-space ranking noise. Filter both SQL and fallback paths to active metadata, or purge stale rows when meta changes.Also applies to: 305-307, 311-314
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 295 - 299, The SQL query against embedding_index (the multi-line string building SELECT with vec_distance_cosine and placeholders) currently filters only by kind; update that query to also filter by the active embedding metadata fields (model, version, dim) by adding AND clauses (e.g., AND model = ? AND version = ? AND dim = ?) and bind those values alongside the existing params, and apply the same metadata filtering to the other two search query sites noted (around lines 305-307 and 311-314) as well as the fallback in-memory or Python-path that currently uses kind-only filtering (ensure the fallback filters rows by model/version/dim before computing score), so mixed-shape/stale rows are excluded or purged when metadata changes.tests/embeddings/test_storage.py (1)
7-8:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard numpy import to avoid collection-time CI failures when optional deps are absent.
Use
pytest.importorskip("numpy")before importingnumpy(or ensure CI always installs embeddings extras). Current hard import can fail the entire suite during collection.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/embeddings/test_storage.py` around lines 7 - 8, The test file currently does a hard import of numpy which can cause collection-time failures; change the import to guard with pytest.importorskip by calling pytest.importorskip("numpy") before attempting to import numpy (or replace the direct import with pytest.importorskip followed by import numpy as np) so tests are skipped when the optional dependency is missing; update the top of tests/embeddings/test_storage.py around the numpy import to use pytest.importorskip to prevent CI collection failures.src/vouch/embeddings/cache.py (2)
15-16:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNamespace the cache key by embedding identity, not query text alone.
Hashing only
queryallows stale vectors to be reused after model/version/dim changes. Include embedder identity in the key (or store/check model metadata on lookup) to prevent cross-model cache poisoning.Also applies to: 46-53
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The cache key currently hashes only the query in _query_key, which allows reuse across different embedder models/versions; change _query_key to take an embedder identity (e.g., model name/version or embedder.id) and include that identity in the hashed input so keys are namespaced by embedder, and update every cache lookup/storage site to pass that embedder identity; alternatively, persist embedder/model metadata with each cached entry and validate it on lookup — but ensure functions that compute keys (start with _query_key and any other key-generating helpers referenced around lines 46-53) are updated to use the embedder identity.
24-24:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse sub-second precision for
last_used_atto preserve true LRU ordering.
timespec="seconds"can produce ties during bursty access and evict arbitrarily.Also applies to: 48-48
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` at line 24, The timestamp assigned to the variable now (used for last_used_at) uses timespec="seconds", which causes ties under bursty access; update the isoformat call to include sub-second precision (e.g., timespec="microseconds" or "milliseconds") so last_used_at preserves true LRU ordering, and apply the same change to the other occurrence that sets now in this module.
🧹 Nitpick comments (3)
tests/embeddings/test_integration.py (1)
38-53: ⚡ Quick winAdd normalization assertions for ST-MiniLM and FastEmbed integration checks.
These tests currently validate only shape/dtype; adding unit-norm assertions here will catch adapter regressions that impact cosine-based ranking behavior.
Proposed test hardening
`@pytest.mark.integration` def test_st_minilm_loads_and_encodes() -> None: from vouch.embeddings.st_minilm import STMinilmEmbedder e = STMinilmEmbedder() vec = e.encode("hello world") + assert isinstance(vec, np.ndarray) assert vec.shape == (384,) assert vec.dtype == np.float32 + assert abs(float(np.linalg.norm(vec)) - 1.0) < 1e-3 @@ `@pytest.mark.integration` def test_fastembed_bge_loads_and_encodes() -> None: pytest.importorskip("fastembed") from vouch.embeddings.fastembed_bge import FastembedBgeEmbedder e = FastembedBgeEmbedder() vec = e.encode("hello world") + assert isinstance(vec, np.ndarray) assert vec.shape == (384,) assert vec.dtype == np.float32 + assert abs(float(np.linalg.norm(vec)) - 1.0) < 1e-3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/embeddings/test_integration.py` around lines 38 - 53, Update the two tests to also assert that returned embeddings are unit-normalized: after calling STMinilmEmbedder.encode in test_st_minilm_loads_and_encodes and FastembedBgeEmbedder.encode in test_fastembed_bge_loads_and_encodes compute the L2 norm (e.g., via np.linalg.norm(vec)) and assert it is approximately 1.0 within a small tolerance (use pytest.approx or a small abs tolerance) in addition to the existing shape and dtype checks so regressions that break cosine-based ranking are caught.src/vouch/bundle.py (1)
211-227: ⚡ Quick winConsider a more explicit check for the content file.
The current logic skips validation for any file in
sources/that doesn't end with/meta.yaml. While this is correct given the documented two-file structure (meta.yaml and content), an explicit check for/contentwould be clearer and more defensive:if subdir == "sources" and path.endswith("/content"): returnThis makes the intent explicit—skip validation only for the opaque content file—and prevents accidental validation skips if the file structure evolves.
🔍 Proposed refinement
subdir = path.split("/")[0] # Source artifacts have two file kinds: # sources/<sha>/meta.yaml -- the Source pydantic model (validate) # sources/<sha>/content -- the raw source bytes (skip validation) # The opaque content file isn't a pydantic model, so model_validate # on raw bytes raises spuriously. - if subdir == "sources" and not path.endswith("/meta.yaml"): + if subdir == "sources" and path.endswith("/content"): return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/bundle.py` around lines 211 - 227, The current guard in _validate_content uses subdir == "sources" and not path.endswith("/meta.yaml") which skips validation for any non-meta file; change it to explicitly skip only the opaque content file by checking if subdir == "sources" and path.endswith("/content") then return, leaving meta.yaml and any other future files to be validated by VALIDATORS; update the conditional in _validate_content and keep the subsequent validator lookup/try-except unchanged.src/vouch/embeddings/cache.py (1)
27-31: ⚡ Quick winClarify or split
hit_countsemantics in cache stats.
hit_countis incremented on insert and on lookup-hit, soquery_cache_stats()["hits"]mixes write and read traffic. Either document this explicitly or track separate counters (write_count,serve_hits).Also applies to: 76-83
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` around lines 27 - 31, The current query_embedding_cache table and its use of the hit_count column (updated by the INSERT OR REPLACE SQL shown and again at the lookup-hit code around lines 76-83) mixes writes and read-serves, so update the schema and stats to separate those semantics: add two columns (e.g., write_count and serve_hits) to the query_embedding_cache table, change the INSERT/REPLACE logic that currently increments hit_count to only increment write_count, change the lookup-path that currently increments hit_count on cache hits to increment serve_hits instead, and update the query_cache_stats() function to return both write_count and serve_hits (or rename fields) so metrics clearly distinguish writes vs serves. Ensure all SQL statements and any code referencing hit_count are updated accordingly and migration/initialization handles existing rows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/vouch/index_db.py`:
- Around line 311-314: The fallback scoring uses raw dot product (score =
float(q @ vec)) which diverges from the SQL cosine distance path; update the
loop that calls _blob_to_vec to normalize the vector (and ensure q is normalized
or normalize a copy of q) before computing score so score = float((q_normalized)
@ (vec_normalized)) and then compare to min_score; refer to the variables q, vec
and the helper _blob_to_vec when making the change.
---
Duplicate comments:
In `@src/vouch/embeddings/cache.py`:
- Around line 15-16: The cache key currently hashes only the query in
_query_key, which allows reuse across different embedder models/versions; change
_query_key to take an embedder identity (e.g., model name/version or
embedder.id) and include that identity in the hashed input so keys are
namespaced by embedder, and update every cache lookup/storage site to pass that
embedder identity; alternatively, persist embedder/model metadata with each
cached entry and validate it on lookup — but ensure functions that compute keys
(start with _query_key and any other key-generating helpers referenced around
lines 46-53) are updated to use the embedder identity.
- Line 24: The timestamp assigned to the variable now (used for last_used_at)
uses timespec="seconds", which causes ties under bursty access; update the
isoformat call to include sub-second precision (e.g., timespec="microseconds" or
"milliseconds") so last_used_at preserves true LRU ordering, and apply the same
change to the other occurrence that sets now in this module.
In `@src/vouch/index_db.py`:
- Around line 248-254: The current cache _sqlite_vec_loaded uses id(conn) which
can be recycled; update _load_sqlite_vec to stop using id(conn) keys and instead
track connections with a weakref.WeakSet[sqlite3.Connection] (or remove the
cache entirely and always attempt to load), replacing the int set
_sqlite_vec_loaded with a WeakSet and checking membership by connection object;
update any other uses around the same module (the other occurrences of
_sqlite_vec_loaded) to use the new WeakSet to avoid false positives when
connection objects are GC'd and ids get reused.
- Around line 295-300: The SQL uses the SELECT alias "score" in the WHERE clause
(in the query built around the embedding_index selection near the code that
passes (q.tobytes(), *kinds, min_score, limit)), which is non-portable; change
the query in the function/method that builds this statement (and the similar one
around lines 303-304) to either use a subquery/CTE or inline the expression 1.0
- vec_distance_cosine(vec, ?) directly in the WHERE (e.g., wrap the SELECT as a
subquery/CTE that computes score and then filter on score in the outer WHERE),
keeping the same parameter order (q.tobytes(), *kinds, min_score, limit) so
DB-side filtering works reliably.
- Around line 295-299: The SQL query against embedding_index (the multi-line
string building SELECT with vec_distance_cosine and placeholders) currently
filters only by kind; update that query to also filter by the active embedding
metadata fields (model, version, dim) by adding AND clauses (e.g., AND model = ?
AND version = ? AND dim = ?) and bind those values alongside the existing
params, and apply the same metadata filtering to the other two search query
sites noted (around lines 305-307 and 311-314) as well as the fallback in-memory
or Python-path that currently uses kind-only filtering (ensure the fallback
filters rows by model/version/dim before computing score), so mixed-shape/stale
rows are excluded or purged when metadata changes.
In `@tests/embeddings/test_storage.py`:
- Around line 7-8: The test file currently does a hard import of numpy which can
cause collection-time failures; change the import to guard with
pytest.importorskip by calling pytest.importorskip("numpy") before attempting to
import numpy (or replace the direct import with pytest.importorskip followed by
import numpy as np) so tests are skipped when the optional dependency is
missing; update the top of tests/embeddings/test_storage.py around the numpy
import to use pytest.importorskip to prevent CI collection failures.
---
Nitpick comments:
In `@src/vouch/bundle.py`:
- Around line 211-227: The current guard in _validate_content uses subdir ==
"sources" and not path.endswith("/meta.yaml") which skips validation for any
non-meta file; change it to explicitly skip only the opaque content file by
checking if subdir == "sources" and path.endswith("/content") then return,
leaving meta.yaml and any other future files to be validated by VALIDATORS;
update the conditional in _validate_content and keep the subsequent validator
lookup/try-except unchanged.
In `@src/vouch/embeddings/cache.py`:
- Around line 27-31: The current query_embedding_cache table and its use of the
hit_count column (updated by the INSERT OR REPLACE SQL shown and again at the
lookup-hit code around lines 76-83) mixes writes and read-serves, so update the
schema and stats to separate those semantics: add two columns (e.g., write_count
and serve_hits) to the query_embedding_cache table, change the INSERT/REPLACE
logic that currently increments hit_count to only increment write_count, change
the lookup-path that currently increments hit_count on cache hits to increment
serve_hits instead, and update the query_cache_stats() function to return both
write_count and serve_hits (or rename fields) so metrics clearly distinguish
writes vs serves. Ensure all SQL statements and any code referencing hit_count
are updated accordingly and migration/initialization handles existing rows.
In `@tests/embeddings/test_integration.py`:
- Around line 38-53: Update the two tests to also assert that returned
embeddings are unit-normalized: after calling STMinilmEmbedder.encode in
test_st_minilm_loads_and_encodes and FastembedBgeEmbedder.encode in
test_fastembed_bge_loads_and_encodes compute the L2 norm (e.g., via
np.linalg.norm(vec)) and assert it is approximately 1.0 within a small tolerance
(use pytest.approx or a small abs tolerance) in addition to the existing shape
and dtype checks so regressions that break cosine-based ranking are caught.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 6b136ef8-f17a-4d0b-89a9-e56392159cbe
📒 Files selected for processing (17)
docs/superpowers/plans/2026-05-20-semantic-search.mddocs/superpowers/specs/2026-05-20-semantic-search-design.mdpyproject.tomlsrc/vouch/bundle.pysrc/vouch/embeddings/__init__.pysrc/vouch/embeddings/base.pysrc/vouch/embeddings/cache.pysrc/vouch/embeddings/fastembed_bge.pysrc/vouch/embeddings/st_minilm.pysrc/vouch/embeddings/st_mpnet.pysrc/vouch/index_db.pytests/embeddings/__init__.pytests/embeddings/_fakes.pytests/embeddings/conftest.pytests/embeddings/test_core.pytests/embeddings/test_integration.pytests/embeddings/test_storage.py
✅ Files skipped from review due to trivial changes (2)
- tests/embeddings/init.py
- tests/embeddings/conftest.py
| for kind, id_, blob, dim in rows: | ||
| vec = _blob_to_vec(blob, dim) | ||
| score = float(q @ vec) | ||
| if score >= min_score: |
There was a problem hiding this comment.
Fallback scoring is dot-product, not cosine similarity.
SQL path uses cosine distance, but fallback uses raw q @ vec; results diverge if stored vectors are not unit-normalized. Normalize vec (or enforce normalized storage) before scoring to keep backend parity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/vouch/index_db.py` around lines 311 - 314, The fallback scoring uses raw
dot product (score = float(q @ vec)) which diverges from the SQL cosine distance
path; update the loop that calls _blob_to_vec to normalize the vector (and
ensure q is normalized or normalize a copy of q) before computing score so score
= float((q_normalized) @ (vec_normalized)) and then compare to min_score; refer
to the variables q, vec and the helper _blob_to_vec when making the change.
…embedding Per CodeRabbit Critical on PR vouchdev#39/40/43: SQLite does not allow a column alias defined in the SELECT list to be referenced in the WHERE clause of the same query. The old form SELECT ... 1.0 - vec_distance_cosine(vec, ?) AS score ... WHERE ... AND score >= ? raised `no such column: score` at runtime; the catch-all `except sqlite3.OperationalError` silently swallowed it and made every call fall through to the NumPy brute-force path, which works but defeats the whole point of installing sqlite-vec. Wrap the projection in a subquery so the WHERE clause sees the alias. No behavior change for callers that have already been using the fallback (NumPy still produces correct results); the ANN code path now actually runs when sqlite-vec is loaded.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/vouch/index_db.py (1)
89-94:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
reset()does not clear embedding/query-cache tables.This leaves stale semantic-search state after a reset/rebuild cycle.
Proposed fix
def reset(kb_dir: Path) -> None: """Drop everything; the rebuild caller re-populates.""" with open_db(kb_dir) as conn: conn.executescript( - "DELETE FROM claims_fts; DELETE FROM pages_fts; DELETE FROM entities_fts;" + "DELETE FROM claims_fts; " + "DELETE FROM pages_fts; " + "DELETE FROM entities_fts; " + "DELETE FROM embedding_index; " + "DELETE FROM query_embedding_cache; " + "DELETE FROM embedding_dupes; " + "DELETE FROM index_meta WHERE key LIKE 'embedding_%';" )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 89 - 94, The reset(kb_dir) function currently only clears FTS tables (claims_fts, pages_fts, entities_fts) via open_db(...).executescript; update that executescript to also delete rows from the semantic-search/cache tables used by this project (e.g. embedding_cache and query_cache or whatever concrete table names store embeddings and cached queries) so embeddings and query-cache state are cleared on reset; modify the conn.executescript call in reset to include "DELETE FROM embedding_cache; DELETE FROM query_cache;" (and add any other embedding/query-cache table names present) so all stale semantic-search state is removed.
♻️ Duplicate comments (5)
src/vouch/embeddings/cache.py (2)
24-24:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse sub-second precision for
last_used_atto preserve LRU behavior.Second-level timestamps make eviction order unstable during bursts.
Proposed fix
- now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds") + now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds") ... - now = dt.datetime.now(dt.UTC).isoformat(timespec="seconds") + now = dt.datetime.now(dt.UTC).isoformat(timespec="microseconds")Also applies to: 48-48
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` at line 24, The timestamp for last_used_at is truncated to seconds, breaking LRU ordering; replace dt.datetime.now(dt.UTC).isoformat(timespec="seconds") with a sub-second precision timestamp (e.g., timespec="microseconds" or omit timespec to keep full precision) wherever it’s used (the expression assigning now / last_used_at in src/vouch/embeddings/cache.py), and update both occurrences so eviction ordering remains stable under bursts.
15-16:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftNamespace cache keys by embedding identity, not just query text.
Using only
queryin the key can return stale/mismatched vectors when model/version/dimension changes.Also applies to: 22-23, 47-47
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` around lines 15 - 16, The cache key currently derived only from the query text in _query_key can return stale/mismatched vectors when the embedding model, model version, or embedding dimension changes; update _query_key to incorporate a namespace that includes the embedding model identity (name/version or model id) and embedding dimension (e.g., "embedding:{model}:{dim}:{hash}") so keys are unique per embedding configuration, compute the query hash as before (hashlib.sha256 of the query bytes) and concatenate the model/dim values into the returned string; apply the same pattern to any other key-generating helpers used for document/vector caching so all cache keys are namespaced by embedding identity.src/vouch/index_db.py (3)
248-254:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid caching sqlite-vec load state by
id(conn).
id()is only stable for object lifetime; reuse can incorrectly short-circuit loading on a different connection.Proposed fix
-_sqlite_vec_loaded: set[int] = set() - - def _load_sqlite_vec(conn: sqlite3.Connection) -> bool: """Best-effort load of the sqlite-vec extension.""" - if id(conn) in _sqlite_vec_loaded: - return True try: conn.enable_load_extension(True) @@ - _sqlite_vec_loaded.add(id(conn)) return TrueAlso applies to: 270-271
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 248 - 254, The code currently caches sqlite-vec load state using id(conn) which can collide if object ids are reused; change _sqlite_vec_loaded from set[int] to a weakref.WeakSet of sqlite3.Connection objects (import weakref) and update _load_sqlite_vec to check membership with "if conn in _sqlite_vec_loaded" and add the actual connection object with "_sqlite_vec_loaded.add(conn)" after a successful load; apply the same replacement for the other occurrence referenced (the logic around lines 270-271) so you track live connection objects rather than their numeric ids.
315-317:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFallback scoring is dot product, not cosine similarity.
SQL backend uses cosine distance, but fallback uses raw
q @ vec; rankings diverge unless stored vectors are guaranteed unit-normalized.Proposed fix
scored: list[tuple[str, str, str, float]] = [] for kind, id_, blob, dim in rows: vec = _blob_to_vec(blob, dim) - score = float(q @ vec) + vnorm = float(np.linalg.norm(vec)) + if vnorm > 0: + vec = vec / vnorm + score = float(q @ vec) if score >= min_score: scored.append((kind, id_, "", score))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 315 - 317, The fallback scorer in the loop over rows uses raw dot product (q @ vec) which mismatches the SQL cosine-distance backend; update the scoring in the loop that iterates "for kind, id_, blob, dim in rows" (and uses _blob_to_vec to get vec) to compute cosine similarity instead of raw dot: compute the norms of q and vec and divide the dot product by (||q|| * ||vec||) so score = (q @ vec) / (norm(q) * norm(vec)); ensure you handle zero-length vectors safely (skip or treat score as 0) and use the same numeric types/array library as q/_blob_to_vec (e.g., numpy.linalg.norm if using numpy) so rankings match the SQL backend.
298-303:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftConstrain embedding search to active
(model, model_version, dim).Current queries score all rows of requested kinds; mixed embedding spaces can skew results and cause dimension errors in fallback.
Also applies to: 309-313
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 298 - 303, The embedding-search SQL currently returns rows across all embedding spaces; modify the inner SELECT in the query built around the string starting "SELECT kind, id, 1.0 - vec_distance_cosine(vec, ?) AS score" to add AND clauses constraining model = ?, model_version = ?, and dim = ? (i.e., limit to the active (model, model_version, dim) tuple), and bind those three values immediately after the query's vector parameter; apply the same change to the second analogous query at the later block (lines ~309-313) so both queries filter by the active embedding space and avoid cross-space scoring and dim mismatches.
🧹 Nitpick comments (2)
src/vouch/embeddings/cache.py (1)
27-32: ⚡ Quick winClarify
hit_countsemantics (currently writes + reads).
hit_countis incremented on insert and lookup; if this is meant to represent cache hits, initialize to0on insert and increment only on successful lookup (or rename the metric).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/embeddings/cache.py` around lines 27 - 32, The current INSERT OR REPLACE SQL for query_embedding_cache increments hit_count on both inserts and lookups; change it to initialize hit_count to 0 on new inserts (remove the "+ 1" from the VALUES expression so it uses COALESCE(..., 0) as-is) and move the increment logic to the cache lookup path where hits are detected (update hit_count with an UPDATE or incrementing statement when a query_hash is found). Update the SQL in the INSERT statement and add/modify the lookup code that reads from query_embedding_cache to perform an explicit hit_count increment on successful cache hits.src/vouch/index_db.py (1)
54-59: Consider indexingquery_embedding_cache(last_used_at)if cache cap increases.Eviction is
ORDER BY last_used_at; an index will keep this predictable asmax_entriesgrows.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/index_db.py` around lines 54 - 59, The eviction query sorts by last_used_at on the query_embedding_cache table, so add a dedicated index on query_embedding_cache(last_used_at) to keep ORDER BY efficient as max_entries grows; update the DB schema/migration or the initialization routine that creates the table to also create the index (use CREATE INDEX IF NOT EXISTS ...) referencing the table name query_embedding_cache and column last_used_at so eviction queries and range scans remain performant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/vouch/index_db.py`:
- Around line 89-94: The reset(kb_dir) function currently only clears FTS tables
(claims_fts, pages_fts, entities_fts) via open_db(...).executescript; update
that executescript to also delete rows from the semantic-search/cache tables
used by this project (e.g. embedding_cache and query_cache or whatever concrete
table names store embeddings and cached queries) so embeddings and query-cache
state are cleared on reset; modify the conn.executescript call in reset to
include "DELETE FROM embedding_cache; DELETE FROM query_cache;" (and add any
other embedding/query-cache table names present) so all stale semantic-search
state is removed.
---
Duplicate comments:
In `@src/vouch/embeddings/cache.py`:
- Line 24: The timestamp for last_used_at is truncated to seconds, breaking LRU
ordering; replace dt.datetime.now(dt.UTC).isoformat(timespec="seconds") with a
sub-second precision timestamp (e.g., timespec="microseconds" or omit timespec
to keep full precision) wherever it’s used (the expression assigning now /
last_used_at in src/vouch/embeddings/cache.py), and update both occurrences so
eviction ordering remains stable under bursts.
- Around line 15-16: The cache key currently derived only from the query text in
_query_key can return stale/mismatched vectors when the embedding model, model
version, or embedding dimension changes; update _query_key to incorporate a
namespace that includes the embedding model identity (name/version or model id)
and embedding dimension (e.g., "embedding:{model}:{dim}:{hash}") so keys are
unique per embedding configuration, compute the query hash as before
(hashlib.sha256 of the query bytes) and concatenate the model/dim values into
the returned string; apply the same pattern to any other key-generating helpers
used for document/vector caching so all cache keys are namespaced by embedding
identity.
In `@src/vouch/index_db.py`:
- Around line 248-254: The code currently caches sqlite-vec load state using
id(conn) which can collide if object ids are reused; change _sqlite_vec_loaded
from set[int] to a weakref.WeakSet of sqlite3.Connection objects (import
weakref) and update _load_sqlite_vec to check membership with "if conn in
_sqlite_vec_loaded" and add the actual connection object with
"_sqlite_vec_loaded.add(conn)" after a successful load; apply the same
replacement for the other occurrence referenced (the logic around lines 270-271)
so you track live connection objects rather than their numeric ids.
- Around line 315-317: The fallback scorer in the loop over rows uses raw dot
product (q @ vec) which mismatches the SQL cosine-distance backend; update the
scoring in the loop that iterates "for kind, id_, blob, dim in rows" (and uses
_blob_to_vec to get vec) to compute cosine similarity instead of raw dot:
compute the norms of q and vec and divide the dot product by (||q|| * ||vec||)
so score = (q @ vec) / (norm(q) * norm(vec)); ensure you handle zero-length
vectors safely (skip or treat score as 0) and use the same numeric types/array
library as q/_blob_to_vec (e.g., numpy.linalg.norm if using numpy) so rankings
match the SQL backend.
- Around line 298-303: The embedding-search SQL currently returns rows across
all embedding spaces; modify the inner SELECT in the query built around the
string starting "SELECT kind, id, 1.0 - vec_distance_cosine(vec, ?) AS score" to
add AND clauses constraining model = ?, model_version = ?, and dim = ? (i.e.,
limit to the active (model, model_version, dim) tuple), and bind those three
values immediately after the query's vector parameter; apply the same change to
the second analogous query at the later block (lines ~309-313) so both queries
filter by the active embedding space and avoid cross-space scoring and dim
mismatches.
---
Nitpick comments:
In `@src/vouch/embeddings/cache.py`:
- Around line 27-32: The current INSERT OR REPLACE SQL for query_embedding_cache
increments hit_count on both inserts and lookups; change it to initialize
hit_count to 0 on new inserts (remove the "+ 1" from the VALUES expression so it
uses COALESCE(..., 0) as-is) and move the increment logic to the cache lookup
path where hits are detected (update hit_count with an UPDATE or incrementing
statement when a query_hash is found). Update the SQL in the INSERT statement
and add/modify the lookup code that reads from query_embedding_cache to perform
an explicit hit_count increment on successful cache hits.
In `@src/vouch/index_db.py`:
- Around line 54-59: The eviction query sorts by last_used_at on the
query_embedding_cache table, so add a dedicated index on
query_embedding_cache(last_used_at) to keep ORDER BY efficient as max_entries
grows; update the DB schema/migration or the initialization routine that creates
the table to also create the index (use CREATE INDEX IF NOT EXISTS ...)
referencing the table name query_embedding_cache and column last_used_at so
eviction queries and range scans remain performant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 72e51305-3f13-4b9a-b72e-8627de2db56f
📒 Files selected for processing (4)
src/vouch/embeddings/cache.pysrc/vouch/index_db.pytests/embeddings/_fakes.pytests/embeddings/test_storage.py
…orage feat(embeddings): storage layer — state.db schema + vector search + query cache
Summary
Phase 2 of the semantic search feature: the storage layer. Adds three new tables to
.vouch/state.db(embedding_index,query_embedding_cache,embedding_dupes), the put/get/search helpers invouch.index_db, and a query-embedding LRU cache. Still no user-visible behavior change — nothing inKBStore.put_*orkb.searchreaches this code yet. Phase 3 wires the write hooks; Phase 4 wires the read path.Tracks
docs/superpowers/specs/2026-05-20-semantic-search-design.md§4 (Storage layout) and Tasks 7-10 ofdocs/superpowers/plans/2026-05-20-semantic-search.md.What's in this PR
5 commits, one per task plus a small style cleanup:
2d8ffa5state.db+put_embedding/get_embedding/set_embedding_meta/get_embedding_metahelperscc2a7e0search_embedding(kb_dir, query_vec, kinds, limit, min_score)— pure NumPy brute-force cosine path3480507sqlite-vecANN path layered on top, with idempotent_load_sqlite_vec(conn)extension loader and graceful NumPy fallback when the extension is unavailable35b7549vouch.embeddings.cache—cache_query_vec/lookup_query_vec/query_cache_size/query_cache_clear/query_cache_stats; bounded LRU eviction bylast_used_at5177248style:cleanup for ruff E402 / SIM105 / E702 / I001 violations introduced by the verbatim plan codesrc/vouch/index_db.pySCHEMA(3 new tables, 1 index); appended ~140 LOC of embedding helpers (put/get/search/meta + sqlite-vec loader)src/vouch/embeddings/cache.py(new)tests/embeddings/test_storage.py(new)Design notes
One ANN path with a NumPy escape hatch.
_load_sqlite_vec(conn)does best-effort extension loading and is idempotent — repeat calls on the same connection return cached state. If the extension is present,search_embeddingusesvec_distance_cosinedirectly in SQL with ORDER BY + LIMIT pushdown; if not (or if it raisesOperationalError), the function falls back to loading all candidate rows and computing cosine in NumPy. The result shape is identical:list[tuple[kind, id, snippet, score]]to matchindex_db.search(FTS5) for drop-in compatibility.Content-hash idempotency.
put_embeddingusesINSERT OR REPLACEkeyed on(kind, id)— re-encoding the same artifact text is a no-op at the storage layer (Phase 3 will add the skip-if-unchanged decision based oncontent_hashbefore even calling encode).Model identity is first-class.
set_embedding_meta/get_embedding_metapersist the model name, version, and dim into the existingindex_metakey/value table. Phase 6's migration module reads these to detect a model swap and trigger a backfill warning.Query cache bounded by
last_used_at.cache_query_vec(..., max_entries=1024)runs an explicit eviction sweep after each insert. Cheap and predictable; no background task or LRU dict-in-memory.Test Plan
.venv/bin/pytest tests/embeddings -v→ 20 passed, 4 deselected (5 schema + 3 brute-force + 2 sqlite-vec + 2 cache = 12 new; 8 from Phase 1 still pass).venv/bin/python -m ruff check src/vouch/index_db.py src/vouch/embeddings/cache.py tests/embeddings/test_storage.py→ clean.venv/bin/pytest --ignore=tests/test_sessions.py -q→ no regression in pre-existing tests_load_sqlite_vectest asserts only loader idempotency; the actual ANN cosine vs NumPy parity test passes under both paths via the samesearch_embeddingentry point)Summary by CodeRabbit
New Features
Documentation
Chores
Tests
Bug Fixes