
feat: add LocalEmbedder with GPU support, configurable batch size, and CodeRankEmbed opt-in #22

Open

mareurs wants to merge 23 commits into cocoindex-io:main from mareurs:local-embedder-coderank

Conversation

@mareurs mareurs commented Feb 25, 2026

Overview

This PR adds a LocalEmbedder class with proper CocoIndex protocol support, GPU auto-detection, asymmetric query/document retrieval, a configurable COCOINDEX_CODE_BATCH_SIZE env var, and documents nomic-ai/CodeRankEmbed as a recommended opt-in for users with a GPU.

The default model remains sbert/sentence-transformers/all-MiniLM-L6-v2, so there are no breaking changes for existing users.

Changes

LocalEmbedder class (src/cocoindex_code/embedder.py)

  • Implements CocoIndex's VectorSchemaProvider protocol
  • Lazy model loading with thread-safe double-checked locking
  • Full pickle support (__getstate__/__setstate__) for CocoIndex multiprocessing
  • embed() for indexing (no prompt), embed_query() for queries with prompt_name="query" — enables asymmetric retrieval for models like nomic-ai/CodeRankEmbed
  • Explicit device selection (COCOINDEX_CODE_DEVICE) and trust_remote_code support

Configurable batch size (COCOINDEX_CODE_BATCH_SIZE)

  • New env var, default 16, documented in README config table
  • Validated at startup: non-integer, zero, or negative values raise a descriptive ValueError
  • Owned by the Config dataclass alongside all other env vars
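The startup validation described above can be sketched like this; the env var name matches the PR, but the helper function is hypothetical:

```python
# Illustrative sketch of COCOINDEX_CODE_BATCH_SIZE validation: non-integer,
# zero, or negative values raise a descriptive ValueError at startup.
def read_batch_size(env: dict[str, str], default: int = 16) -> int:
    raw = env.get("COCOINDEX_CODE_BATCH_SIZE")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"COCOINDEX_CODE_BATCH_SIZE must be an integer, got {raw!r}"
        ) from None
    if value <= 0:
        raise ValueError(
            f"COCOINDEX_CODE_BATCH_SIZE must be positive, got {value}"
        )
    return value
```

Taking the environment as a dict (rather than reading os.environ directly) keeps the validation trivially unit-testable.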

Config refactor (src/cocoindex_code/config.py)

  • Module-level config singleton at the bottom of config.py — all modules import it directly, eliminating the duplicate Config.from_env() call that was in shared.py
  • COCOINDEX_CODE_DEVICE: auto-detects CUDA, falls back to CPU, overridable via env var
  • COCOINDEX_CODE_TRUST_REMOTE_CODE: opt-in for models with custom remote code (CodeRankEmbed is auto-whitelisted)

GPU-optimised model (opt-in, README only)

nomic-ai/CodeRankEmbed (137M params, 768-dim, 8192-token context, code-specific) is documented in a new section with a complete claude mcp add example. Users who switch models will need to re-index (noted in the README).

Chunk sizes updated (src/cocoindex_code/indexer.py)

  • CHUNK_SIZE: 1000 → 4000
  • MIN_CHUNK_SIZE: 300 → 500
  • CHUNK_OVERLAP: 200 → 400

Backwards Compatibility

  • Default model unchanged (sbert/sentence-transformers/all-MiniLM-L6-v2) — existing indexes continue to work
  • COCOINDEX_CODE_BATCH_SIZE defaults to 16, a conservative value safe for all model sizes
  • No changes to MCP tool interface or public API

Testing

32 unit tests covering: device detection, trust_remote_code config, batch size validation, LocalEmbedder init/pickle/memo-key, and embed_query prompt forwarding. All pre-commit checks pass (ruff, mypy strict, pytest).

Generated with Claude Code

mareurs and others added 23 commits February 25, 2026 00:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements a custom LocalEmbedder class wrapping sentence_transformers.SentenceTransformer
with explicit device= and trust_remote_code= args, required for Jina GPU models that
the built-in SentenceTransformerEmbedder does not support. Includes thread-safe lazy
loading, pickle-safe __getstate__/__setstate__, and CocoIndex memo key support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace `import typing` + `typing.TYPE_CHECKING` with direct `from typing import TYPE_CHECKING`
- Add `# type: ignore[assignment]` on model.encode() call with a per-file mypy override to suppress warn_unused_ignores (SentenceTransformer stubs not available)
- Add two missing TestLocalEmbedderMemoKey tests: trust_remote_code and normalize_embeddings variants

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace jina-embeddings-v2-base-code (broken with transformers 5.x) with
nomic-ai/CodeRankEmbed: same 137M params and 8192-token context, but
state-of-the-art code retrieval (outperforms jina by ~10 MRR points) and
fully compatible with transformers 5.x via its own custom NomicBERT code.

Changes:
- Switch default model to sbert/nomic-ai/CodeRankEmbed
- Add query_prompt_name + embed_query() to LocalEmbedder for asymmetric
  retrieval (CodeRankEmbed uses prompt_name="query" for queries, no prompt
  for indexed code chunks)
- Auto-enable trust_remote_code for known-compatible models (CodeRankEmbed)
- Use embed_query() in query_codebase() instead of embed()
- Reduce max_batch_size 64→16 (prevents OOM with 8192-token attention)
- Add einops dependency (required by CodeRankEmbed custom code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- embed_query: use explicit prompt_name= kwarg instead of **kwargs so
  mypy can type-check against SentenceTransformer.encode overloads
- query_codebase: remove stale type: ignore (mypy narrows union type
  via hasattr, so union-attr error doesn't exist in that branch)
- test: modernize isinstance tuple to str | bool (ruff UP038)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- shared.py: log _trust (effective value) instead of config.trust_remote_code
  so the log correctly shows True for CodeRankEmbed regardless of env var
- tests: add memo key test for query_prompt_name dimension
- tests: add embed_query tests asserting prompt_name is forwarded/omitted
  correctly (regression guard on asymmetric retrieval behaviour)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ible defaults

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-L6-v2

- Reverts _DEFAULT_MODEL from CodeRankEmbed back to all-MiniLM-L6-v2
- Adds batch_size: int field to Config dataclass (default 16, env var COCOINDEX_CODE_BATCH_SIZE)
- Adds module-level config singleton for import by shared.py and embedder.py
- TDD: tests written first (3 failing), then implementation made them pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…in shared.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…DE_BATCH_SIZE)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…reexport

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… as GPU opt-in

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>