feat(search): add embedding-based semantic retrieval as a third backend #35

@plind-junior

Motivation

kb.search (and build_context_pack, which fans out from it) currently has two backends:

  • fts5 — BM25 lexical match via src/vouch/index_db.py
  • substring — fallback scanner in KBStore.search_substring

Both are token-overlap-based. An agent searching for "how do we authenticate users?" gets nothing if the claim text reads "login flow uses session cookies signed by the API" — the concepts overlap, the tokens don't. This is the most common retrieval miss in agent workflows that compose claims into context packs.
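
To make the miss concrete, here is a minimal sketch that compares the word tokens of the query and the claim quoted above. The `tokens` helper is hypothetical and only approximates what a lexical index matches on; it is not the tokenizer in src/vouch/index_db.py.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens: a rough stand-in for lexical matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "how do we authenticate users?"
claim = "login flow uses session cookies signed by the API"

# Intersection of the two token sets: what a token-overlap scorer can use.
shared = tokens(query) & tokens(claim)
print(shared)  # set()
```

With an empty intersection, neither BM25 nor a substring scan has anything to score, even though the claim answers the question.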

Adding an embedding backend would close that gap while leaving the existing FTS5/substring chain intact as a precision-mode complement.

Proposed approach

Introduce a third backend slot, parallel to FTS5/substring:

  1. src/vouch/embeddings.py (new) — Lazy-loaded local model (sentence-transformers/all-MiniLM-L6-v2 or fastembed's BAAI/bge-small-en-v1.5). One encode(texts: list[str]) -> np.ndarray entrypoint, batched.
  2. src/vouch/index_db.py — Add an embeddings(kind, id, vec BLOB, dim INT) table. Two implementation options:
    • MVP: pure NumPy cosine over rows loaded into memory. Simple, no extra deps, fine for KBs under ~10k claims.
    • Scale-up: sqlite-vec extension for ANN. Defer until someone hits the NumPy ceiling.
  3. Indexing hooks: KBStore.put_claim / put_source / put_page compute and store the embedding on write. Backfill via vouch index --rebuild.
  4. Search dispatch: kb.search accepts a backend argument, one of "fts5" | "substring" | "embedding" | "hybrid". Hybrid = reciprocal rank fusion of FTS5 + embedding results. Default stays fts5 so existing callers don't shift.
  5. Optional dep — Ship under pip install vouch[embeddings] so the base install stays lean. CI matrix runs both with and without the extra.
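
The MVP option in step 2 could look roughly like the sketch below: vectors stored as float32 BLOBs in SQLite, then loaded into memory and ranked by cosine similarity with NumPy. The column list comes from the proposal; everything else is an assumption made for illustration, including the function names (`open_index`, `put_embedding`, `search_cosine`), the `(kind, id)` primary key, the float32 encoding, and the upsert on write. The toy 3-dimensional vectors stand in for real model output.

```python
import sqlite3

import numpy as np


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Columns follow the proposal; the primary key is an assumption.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        " kind TEXT NOT NULL, id TEXT NOT NULL,"
        " vec BLOB NOT NULL, dim INT NOT NULL,"
        " PRIMARY KEY (kind, id))"
    )
    return conn


def put_embedding(conn: sqlite3.Connection, kind: str, id_: str, vec: np.ndarray) -> None:
    v = np.asarray(vec, dtype=np.float32)
    # Upsert so that re-running a rebuild does not duplicate rows.
    conn.execute(
        "INSERT OR REPLACE INTO embeddings (kind, id, vec, dim) VALUES (?, ?, ?, ?)",
        (kind, id_, v.tobytes(), int(v.shape[0])),
    )


def search_cosine(
    conn: sqlite3.Connection, query_vec: np.ndarray, k: int = 5
) -> list[tuple[str, str, float]]:
    rows = conn.execute("SELECT kind, id, vec FROM embeddings").fetchall()
    if not rows:
        return []
    # Assumes every row was written by the same model, so all dims agree.
    mat = np.stack([np.frombuffer(r[2], dtype=np.float32) for r in rows])
    q = np.asarray(query_vec, dtype=np.float32)
    # Cosine similarity; the floor on the denominator guards zero vectors.
    denom = np.linalg.norm(mat, axis=1) * np.linalg.norm(q)
    scores = (mat @ q) / np.maximum(denom, 1e-12)
    order = np.argsort(-scores)[:k]
    return [(rows[i][0], rows[i][1], float(scores[i])) for i in order]


conn = open_index()
put_embedding(conn, "claim", "c1", np.array([1.0, 0.0, 0.0]))
put_embedding(conn, "claim", "c2", np.array([0.0, 1.0, 0.0]))
hits = search_cosine(conn, np.array([0.9, 0.1, 0.0]), k=1)
print(hits[0][:2])  # ('claim', 'c1')
```

Every query reads the full table, which is the cost the proposal accepts for KBs under ~10k claims.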

Scope

In scope (this issue):

  • embeddings.py module
  • embeddings table + reads/writes in index_db.py
  • put_claim / put_source / put_page indexing hooks
  • backend="embedding" and backend="hybrid" paths in MCP + JSONL search handlers
  • vouch index --rebuild regenerates embeddings
  • Regression test: a claim that's semantically related but lexically disjoint from a query is retrievable
  • Optional dep wiring in pyproject.toml

Out of scope (follow-ups):

  • Multi-model support / pluggable backends — single hardcoded model for now
  • ANN index (sqlite-vec, FAISS, etc.) — NumPy brute force is enough until proven otherwise
  • Embedding cache invalidation on claim update — start with insert-only; address when update_claim lands
  • Cross-lingual or domain-finetuned models

Open questions

  1. Default backend for build_context_pack — leave as FTS5, or default to hybrid once embeddings are available? Argument for hybrid: that's where agents live. Argument for FTS5: it's a behavior change for everyone who pulls.
  2. Model identity in state.db — should the embeddings table record the model name + version so a mismatch on next read triggers a re-index? Yes, almost certainly — saves a footgun later.
  3. Where does the model cache live? ~/.cache/vouch/models/ vs .vouch/models/. The former is cross-project (good for multiple KBs), the latter keeps the KB self-contained (matches the "files are source of truth" principle).
  4. Embedding-as-citation — if two claims have ≥0.95 cosine similarity at ingest time, do we surface a "possible duplicate" warning at put_claim? Defer, but worth noting it's cheap to add later.
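
For question 2, one possible shape of the mismatch check is sketched below. It is an illustration only: it records the model identity in a separate key/value table instead of on each embeddings row, and the table name `embedding_meta` and the functions `needs_reindex` and `record_model` are hypothetical. A version string could be stored as one more key in the same way.

```python
import sqlite3

META_DDL = (
    "CREATE TABLE IF NOT EXISTS embedding_meta ("
    " key TEXT PRIMARY KEY, value TEXT NOT NULL)"
)


def needs_reindex(conn: sqlite3.Connection, model: str, dim: int) -> bool:
    """True when no identity is recorded or it differs from the active model."""
    conn.execute(META_DDL)
    stored = dict(conn.execute("SELECT key, value FROM embedding_meta"))
    return stored != {"model": model, "dim": str(dim)}


def record_model(conn: sqlite3.Connection, model: str, dim: int) -> None:
    """Call after a successful rebuild, so a failed rebuild is retried."""
    conn.execute(META_DDL)
    conn.executemany(
        "INSERT OR REPLACE INTO embedding_meta (key, value) VALUES (?, ?)",
        [("model", model), ("dim", str(dim))],
    )


conn = sqlite3.connect(":memory:")
model = "sentence-transformers/all-MiniLM-L6-v2"  # 384-dimensional output

first_run = needs_reindex(conn, model, 384)    # nothing recorded yet
record_model(conn, model, 384)
same_model = needs_reindex(conn, model, 384)   # identity matches
swapped = needs_reindex(conn, "BAAI/bge-small-en-v1.5", 384)
print(first_run, same_model, swapped)  # True False True
```

Both candidate models emit 384-dimensional vectors, so a dimension check alone would not catch a swap between them; the name comparison is what triggers the re-index.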

Acceptance criteria

  • vouch search --semantic "how do we authenticate users" returns a claim containing "login flow uses session cookies signed by the API" in a KB where the query and the claim text share no tokens.
  • kb.search over MCP and JSONL both accept the new backend values and route through the shared handler.
  • pip install vouch (no extras) still works and uses FTS5/substring; pip install vouch[embeddings] enables the new backend with no other code changes required from callers.
  • vouch index --rebuild regenerates the embeddings table from disk; running twice is idempotent.
  • Regression test in tests/test_embeddings.py proves the lexical-disjoint case is now retrievable.
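
The routing and optional-dependency criteria above can be sketched as one shared dispatch function. This is an illustration under stated assumptions, not the handler in the codebase: `search`, `rrf`, and the `backends` mapping are hypothetical names, each backend is modeled as a callable returning ranked ids, and `k=60` is the constant conventionally used for reciprocal rank fusion.

```python
from collections import defaultdict
from typing import Callable

# A backend takes (query, limit) and returns ids ranked best first.
Backend = Callable[[str, int], list[str]]
VALID_BACKENDS = ("fts5", "substring", "embedding", "hybrid")


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def search(
    query: str, backends: dict[str, Backend], backend: str = "fts5", limit: int = 10
) -> list[str]:
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    needed = ("fts5", "embedding") if backend == "hybrid" else (backend,)
    missing = [name for name in needed if name not in backends]
    if missing:
        # Expected when the optional extra is not installed.
        raise RuntimeError(f"backend(s) unavailable: {missing}")
    if backend == "hybrid":
        return rrf([backends[name](query, limit) for name in needed])[:limit]
    return backends[backend](query, limit)


# Stub backends with fixed rankings, standing in for real indexes.
fts = lambda q, n: ["a", "b", "c"][:n]
emb = lambda q, n: ["c", "a", "d"][:n]
available = {"fts5": fts, "embedding": emb}

print(search("q", available))                    # ['a', 'b', 'c']
print(search("q", available, backend="hybrid"))  # ['a', 'c', 'b', 'd']
```

Keeping fts5 as the default argument means callers that never pass backend see the same results with or without the extra installed.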

Labels: enhancement
