18 commits
23f504f
docs(spec): semantic search as primary retrieval backend
dripsmvcp May 20, 2026
2206cb3
docs(plan): semantic search implementation plan
dripsmvcp May 20, 2026
b3795e5
feat(embeddings): add optional-deps extras and pytest markers
dripsmvcp May 20, 2026
8659b75
feat(embeddings): create package skeleton
dripsmvcp May 20, 2026
5a00683
feat(embeddings): Embedder ABC, registry, content_hash, MockEmbedder
dripsmvcp May 20, 2026
3536e0f
feat(embeddings): sentence-transformers all-mpnet-base-v2 default ada…
dripsmvcp May 20, 2026
9fb6f0c
feat(embeddings): sentence-transformers MiniLM-L6 alternative adapter
dripsmvcp May 20, 2026
84ab0ac
feat(embeddings): fastembed BGE alternative (no-torch) adapter
dripsmvcp May 20, 2026
02ed5d1
fix(ci): satisfy ruff SIM105 + add mypy overrides for optional deps
dripsmvcp May 20, 2026
dbe52ce
fix(ci): skip embeddings test suite when numpy isn't installed
dripsmvcp May 20, 2026
2e3ef76
fix(bundle): skip Pydantic validation for opaque source content files
dripsmvcp May 20, 2026
7a7951f
fix(embeddings): MockEmbedder uses uint32 scaling to avoid NaN/Inf
dripsmvcp May 20, 2026
2abce16
feat(embeddings): state.db schema for embedding storage + put/get hel…
dripsmvcp May 20, 2026
d36b2e8
feat(embeddings): NumPy brute-force cosine search over embedding_index
dripsmvcp May 20, 2026
b769dab
feat(embeddings): sqlite-vec ANN path with NumPy fallback
dripsmvcp May 20, 2026
c6b0e8b
feat(embeddings): query embedding LRU cache
dripsmvcp May 20, 2026
3722ee4
style: fix ruff E402/SIM105/E702/I001 violations from Phase 2 storage…
dripsmvcp May 20, 2026
d51474b
fix(embeddings): use outer-query for WHERE on cosine alias in search_…
dripsmvcp May 20, 2026
3,769 changes: 3,769 additions & 0 deletions docs/superpowers/plans/2026-05-20-semantic-search.md

Large diffs are not rendered by default.

258 changes: 258 additions & 0 deletions docs/superpowers/specs/2026-05-20-semantic-search-design.md
@@ -0,0 +1,258 @@
# Semantic Search Design — vouch

**Date:** 2026-05-20
**Status:** Approved (design)
**Branch:** `feat/semantic-search`
**Tracks issue:** (to be filed)

## 1. Goal

Add embedding-based semantic retrieval as the **primary** search backend in vouch's MCP, JSONL, and CLI search surfaces, with FTS5 as the deterministic fallback. Cover every artifact type (claims, sources, pages, entities, relations, evidence) with reranking, query expansion, duplicate detection, eval harness, and pluggable model adapters.

The existing search layer is FTS5 + substring (`src/vouch/index_db.py`, `src/vouch/storage.py:search_substring`). The current code even anticipates this addition — `index_db.py:8-9`:

> *"Vector search can be layered later as a second `backend` in the ContextItem response."*

This spec realizes that.

## 2. Integration shape

Decisions taken during brainstorming (all confirmed with the user):

| Axis | Choice |
|---|---|
| Integration mode | **Embedding as primary, FTS5 as fallback** (options considered: opt-in / hybrid-default / primary) |
| Compute timing | **Synchronous at write** (mirrors FTS5 today) |
| Default model | **`sentence-transformers/all-mpnet-base-v2`** — 768-dim, ~420MB, best quality at its tier |
| Vector store | **`sqlite-vec`** ANN, with **NumPy brute-force** fallback if the extension is unavailable |
| Default behavior | Behavior change: `kb.search` returns embedding hits first, FTS5 only if embedding returns none. Documented in changelog. |
| Scope | Maximally functional — all artifact types, reranking, HyDE, dedup, eval harness, multiple model adapters |
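
As a rough illustration of the brute-force fallback named in the table above, the sketch below scores a query against every stored vector and keeps the best hits. It is written in pure Python so that it runs without the optional extras; `cosine`, `brute_force_search`, and the in-memory `corpus` dict are hypothetical stand-ins, not the functions that live in `index_db.py`.

```python
from __future__ import annotations

import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(
    query: Sequence[float],
    corpus: dict[str, Sequence[float]],  # hypothetical: artifact id -> vector
    top_k: int = 10,
    min_score: float = 0.0,
) -> list[tuple[str, float]]:
    """Score every vector, drop hits below min_score, return the top_k best."""
    scored = [(artifact_id, cosine(query, vec)) for artifact_id, vec in corpus.items()]
    hits = [hit for hit in scored if hit[1] >= min_score]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:top_k]
```

The scan is linear in the number of stored vectors, which is why it is positioned as a fallback rather than the default path.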

## 3. New package layout

```
src/vouch/embeddings/
  __init__.py                # public API: encode, search, register
  base.py                    # Embedder ABC + adapter registry
  st_mpnet.py                # default impl (sentence-transformers all-mpnet-base-v2)
  st_minilm.py               # alternative impl
  fastembed_bge.py           # alternative impl (no-torch path via fastembed)
  cache.py                   # query embedding LRU + content-hash skip cache
  rerank.py                  # cross-encoder reranker (ms-marco-MiniLM-L6-v2)
  hyde.py                    # Hypothetical Document Embedding query expansion
  dedup.py                   # cosine-threshold duplicate detection at ingest
  fusion.py                  # RRF, weighted-sum, normalized-cosine fusion strategies
  eval.py                    # recall@k / MRR / nDCG harness
  migration.py               # model-identity check + backfill orchestration
```
Comment on lines +33 to +47


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the fenced code block.

Line 33 triggers markdownlint MD040 (fenced-code-language). This can cause doc-lint CI noise/failures depending on pipeline strictness.

Proposed fix
-```
+```text
 src/vouch/embeddings/
   __init__.py                # public API: encode, search, register
   base.py                    # Embedder ABC + adapter registry
   st_mpnet.py                # default impl (sentence-transformers all-mpnet-base-v2)
   st_minilm.py               # alternative impl
   fastembed_bge.py           # alternative impl (no-torch path via fastembed)
   cache.py                   # query embedding LRU + content-hash skip cache
   rerank.py                  # cross-encoder reranker (ms-marco-MiniLM-L6-v2)
   hyde.py                    # Hypothetical Document Embedding query expansion
   dedup.py                   # cosine-threshold duplicate detection at ingest
   fusion.py                  # RRF, weighted-sum, normalized-cosine fusion strategies
   eval.py                    # recall@k / MRR / nDCG harness
   migration.py               # model-identity check + backfill orchestration
</details>

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>

[warning] 33-33: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/superpowers/specs/2026-05-20-semantic-search-design.md around lines 33 - 47,
the fenced code block that lists "src/vouch/embeddings/" lacks a language tag,
which triggers markdownlint MD040; update the opening fence from ``` to ```text
so the linter recognizes it and CI noise is avoided. Locate the triple-backtick
block that begins immediately before the "src/vouch/embeddings/" listing and add
the language identifier to the opening fence.


## 4. Storage layout (extends `.vouch/state.db`)

```sql
-- per-artifact ANN tables (sqlite-vec vec0 virtual tables)
CREATE VIRTUAL TABLE claim_vecs USING vec0(embedding float[768]);
CREATE VIRTUAL TABLE page_vecs USING vec0(embedding float[768]);
CREATE VIRTUAL TABLE source_vecs USING vec0(embedding float[768]);
CREATE VIRTUAL TABLE entity_vecs USING vec0(embedding float[768]);
CREATE VIRTUAL TABLE relation_vecs USING vec0(embedding float[768]);
CREATE VIRTUAL TABLE evidence_vecs USING vec0(embedding float[768]);

-- mapping vec0 rowid <-> artifact id (+ content hash for skip-if-unchanged)
CREATE TABLE embedding_index (
  kind          TEXT NOT NULL,
  id            TEXT NOT NULL,
  rowid         INTEGER NOT NULL,
  content_hash  TEXT NOT NULL,
  model         TEXT NOT NULL,
  model_version TEXT NOT NULL,
  dim           INTEGER NOT NULL,
  created_at    TEXT NOT NULL,
  PRIMARY KEY (kind, id)
);

-- model identity for mismatch detection
-- stored as rows in existing index_meta table:
-- ('embedding_model', 'sentence-transformers/all-mpnet-base-v2')
-- ('embedding_dim', '768')
-- ('embedding_lib', 'sentence-transformers')
-- ('embedding_lib_version', '<resolved>')

-- query embedding cache (content-addressed)
CREATE TABLE query_embedding_cache (
  query_hash   TEXT PRIMARY KEY,
  vec          BLOB NOT NULL,
  hit_count    INTEGER NOT NULL DEFAULT 1,
  last_used_at TEXT NOT NULL
);

-- duplicate detection ledger (audit trail for ingest-time near-dupes)
CREATE TABLE embedding_dupes (
  kind        TEXT NOT NULL,
  id          TEXT NOT NULL,
  near_id     TEXT NOT NULL,
  cosine      REAL NOT NULL,
  detected_at TEXT NOT NULL
);
```

`state.db` is derived. Losing it is non-fatal — `vouch reindex --embeddings --backfill` regenerates from disk. Same invariant as FTS5 today.

## 5. Touched modules and estimated LOC

| Module | Change | Est. LOC |
|---|---|---|
| `embeddings/base.py` | `Embedder` ABC, `register`, content hashing, batched encode | 120 |
| `embeddings/st_mpnet.py` | Default adapter | 80 |
| `embeddings/st_minilm.py` | Alternative adapter | 60 |
| `embeddings/fastembed_bge.py` | Alternative adapter (no-torch) | 80 |
| `embeddings/cache.py` | LRU query cache + persistent backing | 100 |
| `embeddings/rerank.py` | Cross-encoder reranker | 110 |
| `embeddings/hyde.py` | Template HyDE + optional LLM hook | 80 |
| `embeddings/dedup.py` | Ingest-time duplicate detection | 90 |
| `embeddings/fusion.py` | RRF, weighted, normalized-cosine fusion | 100 |
| `embeddings/eval.py` | recall@k / MRR / nDCG eval runner | 150 |
| `embeddings/migration.py` | Model-identity check + backfill orchestration | 110 |
| `index_db.py` | Vector tables, search fns, hybrid path, schema migration | 250 |
| `storage.py` | Hook all 6 `put_*` + `update_*` paths | 80 |
| `server.py` | Extended `kb_search` + 4 new MCP tools | 180 |
| `jsonl_server.py` | Parity for all new MCP tools | 160 |
| `cli.py` | Flags on existing commands + new commands | 250 |
| `context.py` | Semantic-default + `--explain` breakdown | 80 |
| `lifecycle.py` | Re-embed on `update_claim` / `update_page` | 60 |
| `pyproject.toml` | Three optional-deps extras | 15 |
| **Tests** | 10 files, ~900 LOC | 900 |
| **Total** | | **~3055 lines** |

## 6. Default behavior

| Call | Behavior |
|---|---|
| `kb.search(query)` | Embedding primary; FTS5 only if embedding returns zero hits. |
| `kb.search(query, backend="hybrid")` | RRF fusion of embedding + FTS5 result lists. |
| `kb.search(query, backend="hybrid", rerank=True)` | Hybrid + cross-encoder rerank of top-50. |
| `kb.search(query, hyde=True)` | Query expanded via HyDE template before encoding. |
| `kb.search(query, backend="fts5")` | Force lexical-only (precision mode). |
| `build_context_pack(task)` | Semantic-default; `--explain` returns per-result score breakdown. |

Every flag exposed identically across CLI / MCP / JSONL.
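
The backend routing in the table above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `index_db.py` or `fusion.py`: the `search` function, its callable parameters, the `"embedding"` default backend name, and the constant `k = 60` (a commonly used RRF default that this spec does not itself fix) are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def search(
    query: str,
    embedding_search: Callable[[str], list[str]],  # hypothetical backend callables
    fts5_search: Callable[[str], list[str]],
    backend: str = "embedding",
) -> list[str]:
    if backend == "fts5":
        return fts5_search(query)  # lexical-only precision mode
    if backend == "hybrid":
        fused = rrf_fuse([embedding_search(query), fts5_search(query)])
        return [doc_id for doc_id, _ in fused]
    # Default: embedding first, FTS5 only when embedding returns zero hits.
    hits = embedding_search(query)
    return hits if hits else fts5_search(query)
```

Because RRF uses only ranks, it needs no score normalization between the cosine and FTS5 scales, which is one plausible reason to prefer it as the hybrid default.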

## 7. Write path

For every `put_*` and `update_*` in `KBStore`:

1. Compute `content_hash = sha256(text)`. If `embedding_index` has the same `(kind, id, content_hash)`, **skip encode** (idempotent re-ingest is free).
2. Otherwise: `Embedder.encode(text)` synchronously; persist to `<kind>_vecs` and `embedding_index`.
3. Run `dedup.check()` — cosine vs top-1 nearest neighbor. If ≥ `dedup_threshold` (default 0.95), log to `embedding_dupes` and emit `embedding.duplicate_detected` audit event. Ingest still proceeds.
4. Invalidate `query_embedding_cache` entries that may reference this artifact. Invalidation is coarse: entries are dropped by a `last_used_at` age cutoff, with full LRU eviction once the cache reaches its cap.
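
Steps 1 and 2 of the write path could look something like the sketch below. It uses an in-memory SQLite database with a trimmed-down `embedding_index` (only three of the real columns) and a plain dict in place of the `<kind>_vecs` tables. `text_digest`, `make_db`, and `embed_on_write` are hypothetical names, and the real `content_hash` helper in `embeddings/base.py` may differ.

```python
from __future__ import annotations

import hashlib
import sqlite3
from typing import Callable


def text_digest(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 text (stand-in for content_hash)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # Assumption: a reduced schema, enough to show the skip-if-unchanged check.
    db.execute(
        "CREATE TABLE embedding_index ("
        " kind TEXT NOT NULL, id TEXT NOT NULL, content_hash TEXT NOT NULL,"
        " PRIMARY KEY (kind, id))"
    )
    return db


def embed_on_write(
    db: sqlite3.Connection,
    vectors: dict[tuple[str, str], list[float]],  # stand-in for <kind>_vecs
    kind: str,
    artifact_id: str,
    text: str,
    encode: Callable[[str], list[float]],
) -> bool:
    """Return True if the text was encoded, False if the encode was skipped."""
    digest = text_digest(text)
    row = db.execute(
        "SELECT content_hash FROM embedding_index WHERE kind = ? AND id = ?",
        (kind, artifact_id),
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # unchanged content: idempotent re-ingest costs no encode
    vectors[(kind, artifact_id)] = encode(text)
    db.execute(
        "INSERT OR REPLACE INTO embedding_index (kind, id, content_hash)"
        " VALUES (?, ?, ?)",
        (kind, artifact_id, digest),
    )
    return True
```

Calling `embed_on_write` twice with the same text triggers one encode; changing the text triggers a second.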

## 8. Migration / backfill

On `KBStore.__init__`:

- Read `index_meta.embedding_model`. If absent (legacy KB) or mismatched with the current adapter:
  - Emit `embedding.model_mismatch` audit event.
  - `kb.search` still works via FTS5 fallback path; embedding results carry `embedding_stale: true` until reindex.
  - Maintainer-visible warning surfaces in `vouch doctor` and `vouch embeddings stats`.
- `vouch reindex --embeddings --backfill` does a single-pass re-encode of all artifacts under the current adapter, updates `index_meta`, drops `embedding_stale` tagging.
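
A minimal sketch of the model-identity check described above, assuming `index_meta` is a simple key/value table (the spec lists its rows as pairs but does not show the schema). The function name `embedding_status` and its return strings are hypothetical.

```python
from __future__ import annotations

import sqlite3

CURRENT_MODEL = "sentence-transformers/all-mpnet-base-v2"


def embedding_status(db: sqlite3.Connection, current_model: str = CURRENT_MODEL) -> str:
    """Classify the stored model identity as 'absent', 'mismatch', or 'ok'."""
    # Assumption: index_meta has columns (key, value).
    row = db.execute(
        "SELECT value FROM index_meta WHERE key = ?", ("embedding_model",)
    ).fetchone()
    if row is None:
        return "absent"  # legacy KB: no embeddings were ever written
    if row[0] != current_model:
        return "mismatch"  # stored vectors came from a different adapter
    return "ok"
```

In either non-`ok` case the caller would warn and mark embedding results stale rather than raise, so that search keeps working through the FTS5 path until a backfill runs.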

## 9. CLI surface

```bash
# search
vouch search "query"
vouch search "query" --semantic --top-k 20 --min-score 0.4
vouch search "query" --hybrid --rerank --hyde --explain
vouch search "query" --backend fts5 # force lexical

# reindex
vouch reindex --embeddings [--model NAME] [--backfill] [--force]

# eval
vouch eval embedding --queries eval/queries.jsonl --metric recall@10,mrr,ndcg

# dedup
vouch dedup --threshold 0.95 --dry-run

# stats
vouch embeddings stats # model identity, vector counts, cache hit rate
```

## 10. MCP / JSONL parity

Every new flag/command exposed identically as a tool:

- `kb.search` gains: `backend`, `top_k`, `min_score`, `rerank`, `hyde`, `explain`
- New tools: `kb.reindex_embeddings`, `kb.eval_embeddings`, `kb.dedup_scan`, `kb.embeddings_stats`

JSONL handlers mirror the MCP tools 1:1 (same method names, same param names).

## 11. Dependencies

```toml
# pyproject.toml
[project.optional-dependencies]
embeddings = ["sentence-transformers>=2.7", "numpy>=1.26", "sqlite-vec>=0.1"]
embeddings-fast = ["fastembed>=0.3", "onnxruntime>=1.18", "sqlite-vec>=0.1"]
rerank = ["sentence-transformers>=2.7"] # shared base with embeddings
```

Base install stays lean. CI matrix exercises all three install modes (none / `[embeddings]` / `[embeddings-fast]`).

## 12. Default values (tunable via config)

| Knob | Default | Source |
|---|---|---|
| Model cache location | `~/.cache/vouch/models/` | env `VOUCH_MODEL_CACHE` override |
| Embedding dimension | 768 (matches mpnet) | derived from model |
| Dedup threshold | 0.95 cosine | `config.yaml` |
| Rerank top-K | 50 | CLI flag |
| HyDE template | template-only (no LLM) | CLI flag enables LLM hook |
| Query cache size | 1024 LRU entries | `config.yaml` |
| Backend ordering | `["embedding", "fts5"]` | `config.yaml` |
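
To make the query cache knob concrete, here is one possible shape for an in-memory LRU keyed by a hash of the query text. The class name `QueryEmbeddingCache`, its constructor arguments, and the `hits` / `misses` counters are hypothetical; the persistent `query_embedding_cache` table is not modeled here.

```python
from __future__ import annotations

import hashlib
from collections import OrderedDict
from typing import Callable


class QueryEmbeddingCache:
    """Content-addressed LRU cache in front of a query encoder."""

    def __init__(self, encode: Callable[[str], list[float]], max_entries: int = 1024):
        self._encode = encode
        self._max_entries = max_entries
        self._entries: OrderedDict[str, list[float]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> list[float]:
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        vec = self._encode(query)
        self._entries[key] = vec
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return vec
```

The hit and miss counters are the kind of data a command such as `vouch embeddings stats` could use to report a cache hit rate.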

## 13. Test plan (~900 LOC across 10 files)

| File | Covers |
|---|---|
| `tests/test_embeddings_core.py` | Embedder ABC, registry, content-hash skip, batched encode, lazy load |
| `tests/test_embeddings_storage.py` | vec0 + sqlite-vec round trip + NumPy fallback parity |
| `tests/test_embeddings_search.py` | Semantic primary, FTS5 fallback, lexical-disjoint regression |
| `tests/test_embeddings_fusion.py` | RRF, weighted, normalized fusion strategies correctness |
| `tests/test_embeddings_rerank.py` | Cross-encoder rerank changes top-K order on a known pair |
| `tests/test_embeddings_hyde.py` | HyDE expansion improves recall on terse queries |
| `tests/test_embeddings_dedup.py` | Threshold ledger + audit event |
| `tests/test_embeddings_migration.py` | Model-version mismatch + backfill flow |
| `tests/test_embeddings_eval.py` | recall@k / MRR / nDCG correctness on synthetic ground truth |
| `tests/test_embeddings_cli.py` | CLI flag routing |
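
For the eval tests in particular, the three metrics can be checked against small hand-computable cases. The sketch below assumes binary relevance (an artifact is either relevant or not), which the spec does not state; the function names are hypothetical and need not match `embeddings/eval.py`.

```python
from __future__ import annotations

import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

A synthetic ground truth for these functions is just a list of `(ranked_ids, relevant_ids)` pairs, one per query.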

## 14. Acceptance criteria

- [ ] `vouch search --semantic "how do we authenticate users"` returns a claim that says *"login flow uses session cookies signed by the API"* in a KB with no lexical overlap.
- [ ] `pip install vouch` (no extras) still works and uses FTS5/substring without errors.
- [ ] `pip install vouch[embeddings]` enables the full embedding stack with no other code changes required.
- [ ] `vouch reindex --embeddings --backfill` is idempotent; running twice yields the same `embedding_index` row count.
- [ ] All 10 test files pass; `ruff` and `mypy` clean.
- [ ] CI matrix runs `(base, [embeddings], [embeddings-fast])` and all three modes pass.
- [ ] Model-identity mismatch (for example, after deleting `state.db` or changing `embedding_model` in `index_meta`) produces a clear warning, NOT a crash.

## 15. Out of scope (genuinely orthogonal — separate spec)

- Multi-language / cross-lingual models (deferred to a follow-up; current scope is English).
- Distributed embedding compute (everything stays in-process).
- Online learning / fine-tuning hooks (consumers can bring their own adapter via the registry).
- Replacement of FTS5 (FTS5 stays as fallback / precision-mode forever).

## 16. Rollout order (suggested for implementation plan)

The `writing-plans` step will turn this into concrete tasks. Suggested phases:

1. **Foundation** — `embeddings/base.py`, default adapter, `pyproject.toml` extras
2. **Storage** — `index_db.py` vec tables, NumPy fallback, schema migration
3. **Write path** — `storage.py` hook on `put_claim` (first artifact type); extend to remaining 5
4. **Read path** — `index_db.search_embedding`, integrate into `kb.search` (MCP + JSONL + CLI)
5. **Fusion + hybrid** — `embeddings/fusion.py`, hybrid backend
6. **Rerank, HyDE, dedup, eval, migration** — independent capability slices
7. **Context pack + explain** — `context.py` updates
8. **CLI + JSONL parity sweep** — `vouch search/reindex/eval/dedup/embeddings`

Each phase ends with passing tests for its slice — no big-bang merge.
41 changes: 40 additions & 1 deletion pyproject.toml
@@ -33,6 +33,20 @@ dev = [
    "mypy>=1.10",
    "types-pyyaml",
]
embeddings = [
    "sentence-transformers>=2.7,<4",
    "numpy>=1.26,<3",
    "sqlite-vec>=0.1,<1",
]
embeddings-fast = [
    "fastembed>=0.3,<1",
    "onnxruntime>=1.18,<2",
    "numpy>=1.26,<3",
    "sqlite-vec>=0.1,<1",
]
rerank = [
    "sentence-transformers>=2.7,<4",
]

[project.scripts]
vouch = "vouch.cli:cli"
@@ -57,4 +71,29 @@ select = ["E", "F", "I", "B", "UP", "SIM", "RUF"]

[tool.pytest.ini_options]
testpaths = ["tests"]
-addopts = "-q"
+addopts = "-q -m 'not integration'"
markers = [
    "integration: tests that load the real embedding model (slow, network on first run)",
]

# numpy is an optional runtime dependency (pulled in by the [embeddings] or
# [embeddings-fast] extras); the base CI install only has [dev], so mypy can't
# resolve `import numpy as np` in the embedding-stack modules. Silence the
# missing-stub errors -- the embeddings code paths are only reached when the
# extras are installed, and the test suite for those paths runs in a separate
# job with the extras present.
[[tool.mypy.overrides]]
module = ["numpy", "numpy.*"]
ignore_missing_imports = true

[[tool.mypy.overrides]]
module = ["sqlite_vec", "sqlite_vec.*"]
ignore_missing_imports = true

[[tool.mypy.overrides]]
module = ["sentence_transformers", "sentence_transformers.*"]
ignore_missing_imports = true

[[tool.mypy.overrides]]
module = ["fastembed", "fastembed.*"]
ignore_missing_imports = true
7 changes: 7 additions & 0 deletions src/vouch/bundle.py
@@ -210,6 +210,13 @@ class ImportCheckResult:

def _validate_content(path: str, data: bytes, issues: list[str]) -> None:
    subdir = path.split("/")[0]
    # Source artifacts have two file kinds:
    #   sources/<sha>/meta.yaml  -- the Source pydantic model (validate)
    #   sources/<sha>/content    -- the raw source bytes (skip validation)
    # The opaque content file isn't a pydantic model, so model_validate
    # on raw bytes raises spuriously.
    if subdir == "sources" and not path.endswith("/meta.yaml"):
        return
    validator = VALIDATORS.get(subdir)
    if validator is None:
        return
39 changes: 39 additions & 0 deletions src/vouch/embeddings/__init__.py
@@ -0,0 +1,39 @@
"""Embedding-based semantic retrieval for vouch.

Pluggable model adapters via the `register` / `get_embedder` registry in
`base`. Default adapter is `st_mpnet` (sentence-transformers
all-mpnet-base-v2) when installed via `pip install vouch[embeddings]`.

The base install of vouch has no hard dependency on this package -- the
modules are only imported when an embedding code path executes.
"""

import contextlib

from .base import (
    DEFAULT_MODEL_NAME,
    Embedder,
    content_hash,
    get_embedder,
    register,
)

# Auto-register the default adapter if sentence-transformers is installed.
# Each adapter import is best-effort -- failure means that adapter's optional
# dependency isn't installed, which is fine (a different adapter may be).
with contextlib.suppress(ImportError):
    from . import st_mpnet  # noqa: F401

with contextlib.suppress(ImportError):
    from . import st_minilm  # noqa: F401

with contextlib.suppress(ImportError):
    from . import fastembed_bge  # noqa: F401

__all__ = [
    "DEFAULT_MODEL_NAME",
    "Embedder",
    "content_hash",
    "get_embedder",
    "register",
]
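
The `base.py` diff is not shown on this page, so the following is only a guess at how a `register` / `get_embedder` registry of this kind might be shaped. Every signature here is an assumption, and `ToyHashEmbedder` is a hypothetical adapter invented for the example (it is not the repository's `MockEmbedder`), using uint32 scaling so that no component can become NaN or Inf.

```python
from __future__ import annotations

import hashlib
import struct
from abc import ABC, abstractmethod


class Embedder(ABC):
    """Assumed adapter interface: a fixed dimension plus a batched encode."""

    dim: int

    @abstractmethod
    def encode(self, texts: list[str]) -> list[list[float]]: ...


_REGISTRY: dict[str, type[Embedder]] = {}


def register(name: str):
    """Class decorator that records an adapter under a lookup name."""
    def decorator(cls: type[Embedder]) -> type[Embedder]:
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_embedder(name: str) -> Embedder:
    try:
        adapter_cls = _REGISTRY[name]
    except KeyError:
        known = sorted(_REGISTRY)
        raise KeyError(f"no embedder registered as {name!r}; known: {known}") from None
    return adapter_cls()


@register("toy-hash")
class ToyHashEmbedder(Embedder):
    """Deterministic 8-dim vectors derived from SHA-256, for tests only."""

    dim = 8

    def encode(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
            words = struct.unpack("<8I", digest)  # eight little-endian uint32 values
            # Scale each uint32 into [-1.0, 1.0]; integer inputs keep values finite.
            vectors.append([w / 0xFFFFFFFF * 2.0 - 1.0 for w in words])
        return vectors
```

With a registry like this, importing an adapter module is enough to make it available, which matches the best-effort imports in `__init__.py`.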