feat: add LocalEmbedder with GPU support, configurable batch size, and CodeRankEmbed opt-in #22
Open
mareurs wants to merge 23 commits into cocoindex-io:main from …
Conversation
…lation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements a custom LocalEmbedder class wrapping sentence_transformers.SentenceTransformer with explicit device= and trust_remote_code= args, required for Jina GPU models that the built-in SentenceTransformerEmbedder does not support. Includes thread-safe lazy loading, pickle-safe __getstate__/__setstate__, and CocoIndex memo key support. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
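A minimal sketch of the pattern this commit describes: thread-safe lazy loading plus pickle-safe state for multiprocessing. Class layout and attribute names are illustrative, not the PR's exact code.

```python
import threading


class LocalEmbedder:
    """Illustrative sketch: lazy, lock-guarded model load; pickle drops the model."""

    def __init__(self, model_name: str, device: str = "cpu", trust_remote_code: bool = False):
        self.model_name = model_name
        self.device = device
        self.trust_remote_code = trust_remote_code
        self._model = None
        self._lock = threading.Lock()

    def _get_model(self):
        # Double-checked locking: load the SentenceTransformer at most once
        # per process, even when multiple threads embed concurrently.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    from sentence_transformers import SentenceTransformer
                    self._model = SentenceTransformer(
                        self.model_name,
                        device=self.device,
                        trust_remote_code=self.trust_remote_code,
                    )
        return self._model

    def __getstate__(self):
        # Drop the unpicklable model and lock; worker processes reload lazily.
        state = self.__dict__.copy()
        state["_model"] = None
        state["_lock"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()
```

Because `__getstate__` nulls out the model, a pickled embedder travels to CocoIndex worker processes cheaply and each worker loads its own copy on first use.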
- Replace `import typing` + `typing.TYPE_CHECKING` with a direct `from typing import TYPE_CHECKING`
- Add `# type: ignore[assignment]` on the `model.encode()` call, with a per-file mypy override to suppress `warn_unused_ignores` (SentenceTransformer stubs not available)
- Add two missing `TestLocalEmbedderMemoKey` tests: `trust_remote_code` and `normalize_embeddings` variants

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
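A small illustration of the import style this commit adopts: importing `TYPE_CHECKING` directly, so heavy dependencies are only imported for type checkers, never at runtime. The `encode` helper is a made-up example, not code from the PR.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by mypy/pyright; sentence_transformers is never
    # imported at runtime just for the annotation below.
    from sentence_transformers import SentenceTransformer


def encode(model: "SentenceTransformer", texts: list[str]) -> list:
    # Quoted annotation keeps this callable even when the import is skipped.
    return model.encode(texts)
```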
Replace jina-embeddings-v2-base-code (broken with transformers 5.x) with nomic-ai/CodeRankEmbed: same 137M params and 8192-token context, but state-of-the-art code retrieval (outperforms jina by ~10 MRR points) and fully compatible with transformers 5.x via its own custom NomicBERT code.

Changes:
- Switch default model to `sbert/nomic-ai/CodeRankEmbed`
- Add `query_prompt_name` + `embed_query()` to LocalEmbedder for asymmetric retrieval (CodeRankEmbed uses `prompt_name="query"` for queries, no prompt for indexed code chunks)
- Auto-enable `trust_remote_code` for known-compatible models (CodeRankEmbed)
- Use `embed_query()` in `query_codebase()` instead of `embed()`
- Reduce `max_batch_size` 64 → 16 (prevents OOM with 8192-token attention)
- Add `einops` dependency (required by CodeRankEmbed custom code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `embed_query`: use an explicit `prompt_name=` kwarg instead of `**kwargs` so mypy can type-check against the `SentenceTransformer.encode` overloads
- `query_codebase`: remove a stale `type: ignore` (mypy narrows the union type via `hasattr`, so the union-attr error doesn't exist in that branch)
- test: modernize the `isinstance` tuple to `str | bool` (ruff UP038)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- shared.py: log `_trust` (the effective value) instead of `config.trust_remote_code`, so the log correctly shows `True` for CodeRankEmbed regardless of the env var
- tests: add a memo-key test for the `query_prompt_name` dimension
- tests: add `embed_query` tests asserting `prompt_name` is forwarded/omitted correctly (regression guard on the asymmetric-retrieval behaviour)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ible defaults Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-L6-v2

- Reverts `_DEFAULT_MODEL` from CodeRankEmbed back to all-MiniLM-L6-v2
- Adds a `batch_size: int` field to the `Config` dataclass (default 16, env var `COCOINDEX_CODE_BATCH_SIZE`)
- Adds a module-level `config` singleton for import by shared.py and embedder.py
- TDD: tests written first (3 failing), then the implementation made them pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
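A hedged sketch of the `Config` shape these commits describe: a `batch_size` field validated from `COCOINDEX_CODE_BATCH_SIZE`, device auto-detection overridable via `COCOINDEX_CODE_DEVICE`, and a module-level singleton. Field names and validation details are assumptions beyond what the PR states.

```python
import os
from dataclasses import dataclass


def _resolve_device() -> str:
    # Explicit override wins; otherwise prefer CUDA when torch reports it.
    override = os.environ.get("COCOINDEX_CODE_DEVICE")
    if override:
        return override
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"


@dataclass
class Config:
    batch_size: int = 16
    device: str = "cpu"

    @classmethod
    def from_env(cls) -> "Config":
        raw = os.environ.get("COCOINDEX_CODE_BATCH_SIZE", "16")
        try:
            batch_size = int(raw)
        except ValueError:
            raise ValueError(
                f"COCOINDEX_CODE_BATCH_SIZE must be an integer, got {raw!r}"
            ) from None
        if batch_size < 1:
            raise ValueError("COCOINDEX_CODE_BATCH_SIZE must be >= 1")
        return cls(batch_size=batch_size, device=_resolve_device())


# Module-level singleton, as the commit describes: importing modules
# (shared.py, embedder.py) share one parsed config instead of each
# calling Config.from_env() themselves.
config = Config.from_env()
```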
…in shared.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…DE_BATCH_SIZE) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…reexport Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… as GPU opt-in Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Overview
This PR adds a `LocalEmbedder` class with proper CocoIndex protocol support, GPU auto-detection, asymmetric query/document retrieval, and a configurable `COCOINDEX_CODE_BATCH_SIZE` env var, and documents `nomic-ai/CodeRankEmbed` as a recommended opt-in for users with a GPU.

The default model remains `sbert/sentence-transformers/all-MiniLM-L6-v2`: no breaking changes for existing users.

Changes

`LocalEmbedder` class (`src/cocoindex_code/embedder.py`)

- Implements the `VectorSchemaProvider` protocol
- Pickle-safe (`__getstate__`/`__setstate__`) for CocoIndex multiprocessing
- `embed()` for indexing (no prompt), `embed_query()` for queries with `prompt_name="query"`: enables asymmetric retrieval for models like `nomic-ai/CodeRankEmbed`
- Device selection (`COCOINDEX_CODE_DEVICE`) and `trust_remote_code` support

Configurable batch size (`COCOINDEX_CODE_BATCH_SIZE`)

- Defaults to `16`, documented in the README config table
- Invalid values raise a `ValueError`
- Lives on the `Config` dataclass alongside all other env vars

`Config` refactor (`src/cocoindex_code/config.py`)

- Module-level `config` singleton at the bottom of `config.py`: all modules import it directly, eliminating the duplicate `Config.from_env()` call that was in `shared.py`
- `COCOINDEX_CODE_DEVICE`: auto-detects CUDA, falls back to CPU, overridable via env var
- `COCOINDEX_CODE_TRUST_REMOTE_CODE`: opt-in for models with custom remote code (CodeRankEmbed is auto-whitelisted)

GPU-optimised model (opt-in, README only)

`nomic-ai/CodeRankEmbed` (137M params, 768-dim, 8192-token context, code-specific) is documented in a new README section with a complete `claude mcp add` example. Users who switch models will need to re-index (noted in the README).

Chunk sizes updated (`src/cocoindex_code/indexer.py`)

- `CHUNK_SIZE`: 1000 → 4000
- `MIN_CHUNK_SIZE`: 300 → 500
- `CHUNK_OVERLAP`: 200 → 400

Backwards Compatibility

- Default model unchanged (`sbert/sentence-transformers/all-MiniLM-L6-v2`): existing indexes continue to work
- `COCOINDEX_CODE_BATCH_SIZE` defaults to `16`, a conservative value safe for all model sizes

Testing
32 unit tests covering: device detection, `trust_remote_code` config, batch size validation, `LocalEmbedder` init/pickle/memo-key, and `embed_query` prompt forwarding. All pre-commit checks pass (ruff, mypy strict, pytest).

Generated with Claude Code