
feat: add LocalEmbedder with GPU support, configurable batch size, and CodeRankEmbed opt-in #22

Open

mareurs wants to merge 23 commits into cocoindex-io:main from mareurs:local-embedder-coderank

Conversation

@mareurs mareurs commented Feb 25, 2026

Overview

This PR adds a LocalEmbedder class with proper CocoIndex protocol support, GPU auto-detection, asymmetric query/document retrieval, a configurable COCOINDEX_CODE_BATCH_SIZE env var, and documents nomic-ai/CodeRankEmbed as a recommended opt-in for users with a GPU.

The default model remains sbert/sentence-transformers/all-MiniLM-L6-v2, so there are no breaking changes for existing users.

Changes

LocalEmbedder class (src/cocoindex_code/embedder.py)

  • Implements CocoIndex's VectorSchemaProvider protocol
  • Lazy model loading with thread-safe double-checked locking
  • Full pickle support (__getstate__/__setstate__) for CocoIndex multiprocessing
  • embed() for indexing (no prompt), embed_query() for queries with prompt_name="query" — enables asymmetric retrieval for models like nomic-ai/CodeRankEmbed
  • Explicit device selection (COCOINDEX_CODE_DEVICE) and trust_remote_code support

Configurable batch size (COCOINDEX_CODE_BATCH_SIZE)

  • New env var, default 16, documented in README config table
  • Validated at startup: non-integer, zero, or negative values raise a descriptive ValueError
  • Owned by the Config dataclass alongside all other env vars
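The startup validation described above can be sketched like this; the env var name matches the PR, but the helper function is hypothetical:

```python
# Illustrative sketch of COCOINDEX_CODE_BATCH_SIZE validation: non-integer,
# zero, or negative values raise a descriptive ValueError at startup.
def read_batch_size(env: dict[str, str], default: int = 16) -> int:
    raw = env.get("COCOINDEX_CODE_BATCH_SIZE")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"COCOINDEX_CODE_BATCH_SIZE must be an integer, got {raw!r}"
        ) from None
    if value <= 0:
        raise ValueError(
            f"COCOINDEX_CODE_BATCH_SIZE must be positive, got {value}"
        )
    return value
```

Taking the environment as a dict (rather than reading os.environ directly) keeps the validation trivially unit-testable.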

Config refactor (src/cocoindex_code/config.py)

  • Module-level config singleton at the bottom of config.py — all modules import it directly, eliminating the duplicate Config.from_env() call that was in shared.py
  • COCOINDEX_CODE_DEVICE: auto-detects CUDA, falls back to CPU, overridable via env var
  • COCOINDEX_CODE_TRUST_REMOTE_CODE: opt-in for models with custom remote code (CodeRankEmbed is auto-whitelisted)

GPU-optimised model (opt-in, README only)

nomic-ai/CodeRankEmbed (137M params, 768-dim, 8192-token context, code-specific) is documented in a new section with a complete claude mcp add example. Users who switch models will need to re-index (noted in the README).

Chunk sizes updated (src/cocoindex_code/indexer.py)

  • CHUNK_SIZE: 1000 → 4000
  • MIN_CHUNK_SIZE: 300 → 500
  • CHUNK_OVERLAP: 200 → 400

Backwards Compatibility

  • Default model unchanged (sbert/sentence-transformers/all-MiniLM-L6-v2) — existing indexes continue to work
  • COCOINDEX_CODE_BATCH_SIZE defaults to 16, a conservative value safe for all model sizes
  • No changes to MCP tool interface or public API

Testing

32 unit tests covering: device detection, trust_remote_code config, batch size validation, LocalEmbedder init/pickle/memo-key, and embed_query prompt forwarding. All pre-commit checks pass (ruff, mypy strict, pytest).

Generated with Claude Code

mareurs and others added 23 commits February 25, 2026 00:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements a custom LocalEmbedder class wrapping sentence_transformers.SentenceTransformer
with explicit device= and trust_remote_code= args, required for Jina GPU models that
the built-in SentenceTransformerEmbedder does not support. Includes thread-safe lazy
loading, pickle-safe __getstate__/__setstate__, and CocoIndex memo key support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace `import typing` + `typing.TYPE_CHECKING` with direct `from typing import TYPE_CHECKING`
- Add `# type: ignore[assignment]` on model.encode() call with a per-file mypy override to suppress warn_unused_ignores (SentenceTransformer stubs not available)
- Add two missing TestLocalEmbedderMemoKey tests: trust_remote_code and normalize_embeddings variants

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace jina-embeddings-v2-base-code (broken with transformers 5.x) with
nomic-ai/CodeRankEmbed: same 137M params and 8192-token context, but
state-of-the-art code retrieval (outperforms jina by ~10 MRR points) and
fully compatible with transformers 5.x via its own custom NomicBERT code.

Changes:
- Switch default model to sbert/nomic-ai/CodeRankEmbed
- Add query_prompt_name + embed_query() to LocalEmbedder for asymmetric
  retrieval (CodeRankEmbed uses prompt_name="query" for queries, no prompt
  for indexed code chunks)
- Auto-enable trust_remote_code for known-compatible models (CodeRankEmbed)
- Use embed_query() in query_codebase() instead of embed()
- Reduce max_batch_size 64→16 (prevents OOM with 8192-token attention)
- Add einops dependency (required by CodeRankEmbed custom code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- embed_query: use explicit prompt_name= kwarg instead of **kwargs so
  mypy can type-check against SentenceTransformer.encode overloads
- query_codebase: remove stale type: ignore (mypy narrows union type
  via hasattr, so union-attr error doesn't exist in that branch)
- test: modernize isinstance tuple to str | bool (ruff UP038)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- shared.py: log _trust (effective value) instead of config.trust_remote_code
  so the log correctly shows True for CodeRankEmbed regardless of env var
- tests: add memo key test for query_prompt_name dimension
- tests: add embed_query tests asserting prompt_name is forwarded/omitted
  correctly (regression guard on asymmetric retrieval behaviour)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ible defaults

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-L6-v2

- Reverts _DEFAULT_MODEL from CodeRankEmbed back to all-MiniLM-L6-v2
- Adds batch_size: int field to Config dataclass (default 16, env var COCOINDEX_CODE_BATCH_SIZE)
- Adds module-level config singleton for import by shared.py and embedder.py
- TDD: tests written first (3 failing), then implementation made them pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…in shared.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…DE_BATCH_SIZE)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…reexport

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… as GPU opt-in

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>