Conversation

Collaborator

@m1rl0k m1rl0k commented Jan 26, 2026

No description provided.

Invoke garbage collection after processing each file in index_repo to prevent memory accumulation. Also update FakeClient in tests to support on_disk_payload parameter in create_collection.
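The per-file GC pass described above can be sketched as follows (a minimal illustration; the real `index_repo` signature and processing step in this repo are assumed, not shown in the PR):

```python
import gc


def index_repo(paths, embed_and_store):
    """Index files one at a time, forcing a GC pass after each.

    Hypothetical sketch of the pattern described in the commit message;
    `embed_and_store` stands in for the real per-file processing.
    """
    for path in paths:
        embed_and_store(path)
        # Large intermediates (token arrays, ONNX outputs) can keep memory
        # pinned between files; an explicit collect releases them promptly.
        gc.collect()
```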
Introduces a new embedding_service with Dockerfile, FastAPI server, and requirements for serving ONNX-based embeddings as a shared service. Updates docker-compose.yml to include the embedding service and its cache volume. Modifies reset.py to always include the embedding service in container management and updates help text. Enhances qdrant.py to support both local and remote embedding providers, with concurrency control and HTTP fallback. Adds an empty __init__.py for embedding_service package initialization.

augmentcode bot commented Jan 26, 2026

🤖 Augment PR Summary

Summary: Introduces a shared embedding microservice and an optional remote-embedding mode to reduce per-indexer ONNX memory usage.

Changes:

  • Adds embedding_service/ (FastAPI + FastEmbed) exposing /embed and /health, with semaphore-based concurrency and internal batching.
  • Updates the ingest pipeline to route embed_batch() through either local ONNX or an HTTP embedding service via EMBEDDING_PROVIDER.
  • Extends Docker Compose to run/scale the embedding service and introduces an embedding_cache volume.
  • Adds Kubernetes and Helm manifests for the embedding service (Deployment/Service/HPA + probes) and wires new env vars (EMBEDDING_SERVICE_URL, INDEX_WORKERS).
  • Updates ctx reset to start the embedding service early, wait for readiness, and scale replicas.
  • Adds remote warm-start behavior and documents new embedding/ONNX tuning env vars in .env.example.

Technical Notes: Remote mode uses requests and supports client-side round-robin via EMBEDDING_SERVICE_URLS; local mode throttles ONNX inference with EMBED_MAX_CONCURRENT.



@augmentcode augmentcode bot left a comment

Review completed. 4 suggestions posted.


Updated docker-compose.yml to support 2 replicas of the embedding service with increased memory limits and improved healthcheck. Added remote embedding provider detection and stub in embedder.py to avoid loading ONNX models when using a remote service, saving memory. Requirements updated to include requests. Concurrency environment variable parsing hardened in main.py and qdrant.py.
Adds detailed ONNX and embedding model configuration options to .env.example and docker-compose.yml, including threading and batching controls. Embedding service now supports ONNX CPU optimizations and improved batching for efficiency. The reset CLI command starts multiple embedding service replicas and waits for readiness, improving reliability for indexing workflows.
Introduces the _wait_for_embedding function to check if the embedding service is available before proceeding. This improves reliability by ensuring dependent services are ready before use.
Updated Dockerfile to allow embedding model selection via build argument, defaulting to nomic-ai/nomic-embed-text-v1.5 for improved performance. Modified reset.py to support overriding the default tokenizer URL using the TOKENIZER_URL environment variable.
Collaborator Author

m1rl0k commented Jan 26, 2026

auggie review


@augmentcode augmentcode bot left a comment


Review completed. 5 suggestions posted.


_ARCTIC_V2_REGISTER_LOCK = threading.Lock()

# Remote embedding provider detection
_EMBEDDING_PROVIDER = os.environ.get("EMBEDDING_PROVIDER", "local").strip().lower()


_EMBEDDING_PROVIDER is read once at import time, so changing EMBEDDING_PROVIDER later (e.g., via tests/CLI env loading) won’t switch between local/remote and may unexpectedly load ONNX in-process. Consider ensuring get_embedding_model() re-evaluates the env/config so remote mode can’t become stale.


_EMBED_SEMAPHORE = threading.Semaphore(_EMBED_MAX_CONCURRENT)

# Remote embedding service configuration
_EMBEDDING_PROVIDER = os.environ.get("EMBEDDING_PROVIDER", "local").strip().lower()


_EMBEDDING_PROVIDER/_EMBEDDING_SERVICE_URL(_S) are captured at module import, so runtime env changes (common in tests) won’t affect embed_batch() routing or target URL. This can lead to embedding calls going to the wrong provider/endpoint despite updated env vars.


# Default: nomic (faster + better quality than BGE)
# Override at build time: --build-arg EMBEDDING_MODEL=BAAI/bge-base-en-v1.5
ARG EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
RUN python -c "from fastembed import TextEmbedding; TextEmbedding('${EMBEDDING_MODEL}')"


The build-time TextEmbedding('${EMBEDDING_MODEL}') pre-download will use the default cache location, but runtime config points HF_HOME/FASTEMBED_CACHE_PATH at /tmp/..., so this layer may not actually speed startup and can just increase image size. Worth verifying cache paths are consistent between build and runtime.
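One way to make the pre-download layer effective, per the comment above, is to set the same cache env vars at build time that the runtime uses (a sketch; the `/tmp/huggingface` and `/tmp/fastembed` paths are taken from the compose env in this PR and should be verified against the actual runtime config):

```dockerfile
# Align build-time cache paths with the runtime HF_HOME/FASTEMBED_CACHE_PATH
# settings so the pre-downloaded model is actually found at startup.
ARG EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
ENV HF_HOME=/tmp/huggingface \
    FASTEMBED_CACHE_PATH=/tmp/fastembed
RUN python -c "from fastembed import TextEmbedding; TextEmbedding('${EMBEDDING_MODEL}')"
```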


- MKL_NUM_THREADS=${ONNX_THREADS:-4}
# Cache paths
- HF_HOME=/tmp/huggingface
- FASTEMBED_CACHE_PATH=/tmp/fastembed


For the embedding service, FASTEMBED_CACHE_PATH is set to /tmp/fastembed but only /tmp/huggingface is volume-backed, so FastEmbed/ONNX downloads may not persist and could be re-fetched after container recreation. If persistence is intended, the cache path and mounted volume likely need to align.



# ONNX runtime optimizations
ONNX_THREADS = int(os.environ.get("ONNX_THREADS", "0") or 0) # 0 = auto (1 per physical core)
ONNX_DISABLE_SPINNING = os.environ.get("ONNX_DISABLE_SPINNING", "0").strip().lower() in {"1", "true", "yes"}

ONNX_DISABLE_SPINNING is parsed and reported in /health, but it doesn’t appear to be applied to FastEmbed/onnxruntime anywhere (only threads is passed into TextEmbedding). If operators rely on this env var, the service may not actually change spinning behavior.


Align embedding cache paths in Docker and compose files, default to quantized model for lower memory, and add aggressive GC and memory reporting in embedding_service. Refactor embedding provider and service URL logic for runtime flexibility and improved testability. Add remote embedding support and load balancing in hybrid/embed.py and ingest/qdrant.py. Update Neo4j status check to use HTTP API. Ensure remote embedding dimension probing in mcp_memory_server. These changes improve memory efficiency, remote deployment, and testability.
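The remote dimension probing mentioned above reduces to embedding one string and measuring the result (a sketch; the actual helper in mcp_memory_server and its probe text are assumptions):

```python
def probe_embedding_dim(embed_batch):
    """Determine the vector size served by the (possibly remote) embedding
    provider by embedding a single probe string and measuring its length."""
    vectors = embed_batch(["dimension probe"])
    return len(vectors[0])
```

This avoids hard-coding the model's dimension on the client, which matters once the model is selectable at build time.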
@m1rl0k m1rl0k merged commit 3853760 into test Jan 26, 2026
1 check passed