Bubble bullshit #205
Conversation
Invoke garbage collection after processing each file in index_repo to prevent memory accumulation. Also update FakeClient in tests to support the on_disk_payload parameter in create_collection.
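The per-file collection described above can be sketched as follows. This is a minimal illustration of the pattern, not the repository's actual index_repo: the file loop and chunking are stand-ins.

```python
import gc

def index_repo(files):
    """Hypothetical sketch: process each file, then force a collection
    so large per-file buffers are reclaimed before the next iteration."""
    for path in files:
        # stand-in for real chunking/embedding work that allocates heavily
        chunks = [f"{path}:{i}" for i in range(3)]
        del chunks
        gc.collect()  # release per-file allocations promptly
    return len(files)
```

Calling gc.collect() inside a hot loop trades a little CPU for a flatter memory profile, which matters when many indexer processes share a host.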
Introduces a new embedding_service with Dockerfile, FastAPI server, and requirements for serving ONNX-based embeddings as a shared service. Updates docker-compose.yml to include the embedding service and its cache volume. Modifies reset.py to always include the embedding service in container management and updates help text. Enhances qdrant.py to support both local and remote embedding providers, with concurrency control and HTTP fallback. Adds an empty __init__.py for embedding_service package initialization.
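The embedding service's request handling might look roughly like the sketch below. The JSON contract ({"texts": [...]} in, {"embeddings": [...]} out), the handler name, and the stub model are all assumptions; the real service wraps a fastembed ONNX model behind FastAPI, which is elided here to keep the example self-contained.

```python
def fake_embed_model(texts):
    """Stand-in for fastembed's TextEmbedding; returns fixed-size vectors."""
    return [[float(len(t)), 0.0] for t in texts]

def handle_embed(payload, model=fake_embed_model):
    """Hypothetical request handler for a POST /embed endpoint."""
    texts = payload.get("texts", [])
    if not isinstance(texts, list):
        raise ValueError("texts must be a list")
    return {"embeddings": model(texts), "count": len(texts)}
```

Keeping the handler a pure function of the payload (with the model injected) makes it testable without starting the HTTP server.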
🤖 Augment PR Summary: Introduces a shared embedding microservice and an optional remote-embedding mode to reduce per-indexer ONNX memory usage.
Updated docker-compose.yml to support 2 replicas of the embedding service with increased memory limits and improved healthcheck. Added remote embedding provider detection and stub in embedder.py to avoid loading ONNX models when using a remote service, saving memory. Requirements updated to include requests. Concurrency environment variable parsing hardened in main.py and qdrant.py.
Adds detailed ONNX and embedding model configuration options to .env.example and docker-compose.yml, including threading and batching controls. Embedding service now supports ONNX CPU optimizations and improved batching for efficiency. The reset CLI command starts multiple embedding service replicas and waits for readiness, improving reliability for indexing workflows.
Introduces the _wait_for_embedding function to check if the embedding service is available before proceeding. This improves reliability by ensuring dependent services are ready before use.
Updated Dockerfile to allow embedding model selection via build argument, defaulting to nomic-ai/nomic-embed-text-v1.5 for improved performance. Modified reset.py to support overriding the default tokenizer URL using the TOKENIZER_URL environment variable.
auggie review
scripts/embedder.py (outdated)
```python
_ARCTIC_V2_REGISTER_LOCK = threading.Lock()

# Remote embedding provider detection
_EMBEDDING_PROVIDER = os.environ.get("EMBEDDING_PROVIDER", "local").strip().lower()
```
_EMBEDDING_PROVIDER is read once at import time, so changing EMBEDDING_PROVIDER later (e.g., via tests/CLI env loading) won’t switch between local/remote and may unexpectedly load ONNX in-process. Consider ensuring get_embedding_model() re-evaluates the env/config so remote mode can’t become stale.
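One way to address this, sketched under the assumption that the loader can be injected, is to resolve the provider on every call instead of at import:

```python
import os

def _current_provider():
    """Re-read the env each call so tests/CLI can flip providers at runtime."""
    return os.environ.get("EMBEDDING_PROVIDER", "local").strip().lower()

def get_embedding_model(loader=lambda: "onnx-model"):
    """Hypothetical shape: only invoke the (expensive) local loader when
    the provider resolved *now* is "local"."""
    if _current_provider() == "remote":
        return None  # remote stub; batches are routed over HTTP instead
    return loader()
```

Because nothing is captured at module import, setting EMBEDDING_PROVIDER=remote after import can no longer accidentally pull ONNX weights into the process.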
scripts/ingest/qdrant.py (outdated)
```python
_EMBED_SEMAPHORE = threading.Semaphore(_EMBED_MAX_CONCURRENT)

# Remote embedding service configuration
_EMBEDDING_PROVIDER = os.environ.get("EMBEDDING_PROVIDER", "local").strip().lower()
```
_EMBEDDING_PROVIDER/_EMBEDDING_SERVICE_URL(_S) are captured at module import, so runtime env changes (common in tests) won’t affect embed_batch() routing or target URL. This can lead to embedding calls going to the wrong provider/endpoint despite updated env vars.
```dockerfile
# Default: nomic (faster + better quality than BGE)
# Override at build time: --build-arg EMBEDDING_MODEL=BAAI/bge-base-en-v1.5
ARG EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
RUN python -c "from fastembed import TextEmbedding; TextEmbedding('${EMBEDDING_MODEL}')"
```
The build-time TextEmbedding('${EMBEDDING_MODEL}') pre-download will use the default cache location, but runtime config points HF_HOME/FASTEMBED_CACHE_PATH at /tmp/..., so this layer may not actually speed startup and can just increase image size. Worth verifying cache paths are consistent between build and runtime.
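One way to make the layers consistent, assuming the runtime cache paths stay at the /tmp locations shown in the compose file, is to export the same paths before the pre-download so the build-time cache is the one the container actually reads:

```dockerfile
# Sketch: point the build-time pre-download at the same cache dirs the
# container uses at runtime (paths mirror the compose file's values).
ARG EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
ENV HF_HOME=/tmp/huggingface \
    FASTEMBED_CACHE_PATH=/tmp/fastembed
RUN python -c "from fastembed import TextEmbedding; TextEmbedding('${EMBEDDING_MODEL}')"
```

If a volume is later mounted over /tmp/huggingface, it will shadow the baked-in layer, so the ENV alignment only helps when the runtime cache path is not volume-mounted (or the volume is pre-populated).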
docker-compose.yml (outdated)
```yaml
- MKL_NUM_THREADS=${ONNX_THREADS:-4}
# Cache paths
- HF_HOME=/tmp/huggingface
- FASTEMBED_CACHE_PATH=/tmp/fastembed
```
For the embedding service, FASTEMBED_CACHE_PATH is set to /tmp/fastembed but only /tmp/huggingface is volume-backed, so FastEmbed/ONNX downloads may not persist and could be re-fetched after container recreation. If persistence is intended, the cache path and mounted volume likely need to align.
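If persistence is intended, the compose file could back both cache paths with named volumes, roughly as below (service and volume names here are illustrative, not the project's actual ones):

```yaml
# Sketch: give each cache path its own named volume so neither FastEmbed
# nor HuggingFace downloads are lost when the container is recreated.
services:
  embedding:
    environment:
      - HF_HOME=/tmp/huggingface
      - FASTEMBED_CACHE_PATH=/tmp/fastembed
    volumes:
      - hf_cache:/tmp/huggingface
      - fastembed_cache:/tmp/fastembed

volumes:
  hf_cache:
  fastembed_cache:
```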
embedding_service/main.py (outdated)
```python
# ONNX runtime optimizations
ONNX_THREADS = int(os.environ.get("ONNX_THREADS", "0") or 0)  # 0 = auto (1 per physical core)
ONNX_DISABLE_SPINNING = os.environ.get("ONNX_DISABLE_SPINNING", "0").strip().lower() in {"1", "true", "yes"}
```
Align embedding cache paths in Docker and compose files, default to quantized model for lower memory, and add aggressive GC and memory reporting in embedding_service. Refactor embedding provider and service URL logic for runtime flexibility and improved testability. Add remote embedding support and load balancing in hybrid/embed.py and ingest/qdrant.py. Update Neo4j status check to use HTTP API. Ensure remote embedding dimension probing in mcp_memory_server. These changes improve memory efficiency, remote deployment, and testability.
No description provided.
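The client-side load balancing mentioned for hybrid/embed.py can be sketched as a thread-safe round-robin over replica URLs. The class name and URL handling are illustrative assumptions, not the repository's actual code.

```python
import threading

class RoundRobinEmbedder:
    """Sketch: rotate embedding requests across service replicas so two
    replicas with memory limits share the load roughly evenly."""

    def __init__(self, urls):
        if not urls:
            raise ValueError("at least one service URL is required")
        self._urls = list(urls)
        self._lock = threading.Lock()
        self._i = 0

    def next_url(self):
        # Lock protects the counter when embed calls come from many threads.
        with self._lock:
            url = self._urls[self._i % len(self._urls)]
            self._i += 1
        return url
```

A shared counter behind a lock is the simplest correct choice here; anything fancier (health-aware weighting, retries on a different replica) can be layered on top of next_url.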