Implement batch embedding API calls for annotation embeddings #1148
Add `embed_texts_batch()` to the embedder hierarchy so that `calculate_embeddings_for_annotation_batch` can send multiple texts in a single HTTP request instead of one request per annotation.

- `BaseEmbedder.embed_texts_batch()`: sequential fallback for embedders that don't override it
- `MicroserviceEmbedder.embed_texts_batch()`: uses the `/embeddings/batch` endpoint; extracted `_get_service_config()` to DRY up config logic
- `TestEmbedder` / `TestMultimodalEmbedder`: local batch for tests
- `_batch_embed_text_annotations()`: groups text-only annotations into sub-batches of `EMBEDDING_API_BATCH_SIZE` (50), calls `embed_texts_batch()`, and stores results per annotation
- `calculate_embeddings_for_annotation_batch`: partitions annotations into text-only (batch path) vs multimodal (individual path)
- `EMBEDDING_API_BATCH_SIZE` constant (50) for API-level batching
- Unit tests covering success, partial failure, total failure, sub-batching, and empty-text skipping

https://claude.ai/code/session_01CtSL4LRo6R61gALUJe3TCm
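As a rough sketch of how the two layers described above fit together: only the names `BaseEmbedder`, `embed_texts_batch`, `embed_text`, `supports_text`, and `EMBEDDING_API_BATCH_SIZE` come from the PR; the embedder internals and the `batch_in_chunks` helper are stubbed for illustration.

```python
from typing import Optional

EMBEDDING_API_BATCH_SIZE = 50  # sub-batch size, per the PR

class BaseEmbedder:
    """Base class: sequential fallback keeps non-overriding embedders working."""
    supports_text = True

    def embed_text(self, text: str) -> Optional[list]:
        # Stub: a real embedder calls a model or microservice here.
        return [float(len(text))]

    def embed_texts_batch(self, texts: list) -> Optional[list]:
        if not self.supports_text:
            return None
        # Default: one embed_text() call per item; subclasses override
        # this with a true batch endpoint.
        return [self.embed_text(t) for t in texts]

def batch_in_chunks(texts, embedder, chunk_size=EMBEDDING_API_BATCH_SIZE):
    """Group texts into sub-batches of chunk_size before each API call."""
    vectors = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        result = embedder.embed_texts_batch(chunk)
        vectors.extend(result if result is not None else [None] * len(chunk))
    return vectors
```

A subclass that talks to a batch HTTP endpoint only has to override `embed_texts_batch`; callers that sub-batch through the helper are unaffected.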
Code Review - PR 1148 Batch Embedding

Overall this is a well-structured improvement. The architecture (base class fallback to subclass batch override to helper partition function) is clean, and error handling at each level (batch exception, None return, per-item None vector, DB error) is thorough. A few issues worth addressing before merge.

**Bug: vector count mismatch is undetected.** In `sent_transformer_microservice.py` `embed_texts_batch` and `embeddings_task.py` `_batch_embed_text_annotations`, if the microservice returns a different number of vectors than texts sent, `zip(chunk, vectors)` silently truncates. Annotations past the truncation point are neither succeeded nor failed. Should validate `len(vectors) == len(chunk)` and fail the whole chunk explicitly if not.

**Magic number: hardcoded 100 in `MicroserviceEmbedder.embed_texts_batch`.** The value 100 at lines 174-176 should be a named constant (e.g. `MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE`) in `opencontractserver/constants/document_processing.py` per the no-magic-numbers rule. `EMBEDDING_API_BATCH_SIZE=50` is a sub-batching concept and should not be reused for this cap. Also, silent truncation here means the last N annotations in an oversized chunk get neither a vector nor an error: an invisible failure. Raising `ValueError` or sub-batching would be safer.

**Deferred import should be module-level.** The import of `EMBEDDING_API_BATCH_SIZE` inside `calculate_embeddings_for_annotation_batch` (`embeddings_task.py` lines 396-399) has no apparent justification. Move it to module level.

**`_get_service_config` exposes `api_key` that no caller uses.** Both call sites discard `api_key` with `_`. Either remove it from the return tuple or use it.

**Hardcoded timeout.** `timeout=60` at `sent_transformer_microservice.py` line 192 (and in `_embed_text_impl`) should be a named constant (e.g. `EMBEDDER_REQUEST_TIMEOUT_SECONDS`).

**NaN handling fails the whole batch rather than individual items.** Lines 198-204: when NaN values are detected, the entire batch returns None. Since `_batch_embed_text_annotations` already handles per-item None correctly, it is straightforward to use a `nan_mask` and return None only for the affected indices. A resilience improvement, not a blocking bug.

**Test coverage gaps.** Three gaps worth filling: (1) `calculate_embeddings_for_annotation_batch` with the new batch path has no test; a Django `TestCase` with `TestEmbedder` and real `Annotation` objects would catch integration issues. (2) `MicroserviceEmbedder.embed_texts_batch` has no test; a mock of `requests.post` covering 200/4xx/5xx/exception would be straightforward. (3) Once the vector count mismatch validation is added, a test for `len(vectors) != len(texts)` via a custom embedder would be easy.

**Minor: duplicate `embed_texts_batch` on test embedders.** Both `TestEmbedder` and `TestMultimodalEmbedder` override `embed_texts_batch` with identical list comprehensions over `embed_text`, which is exactly what `BaseEmbedder.embed_texts_batch` already does. Removing the overrides would eliminate the duplication.
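The two fixes requested above, explicit count validation and a per-item NaN mask, could look roughly like this. The helper name and shape are hypothetical, not the PR's actual code; `math.isnan` stands in for the numpy NaN check the real embedder presumably uses.

```python
import math

def validate_and_mask(chunk_texts, vectors):
    """Illustrative helper: fail the whole chunk loudly on a count
    mismatch, but turn NaN-containing vectors into per-item None
    instead of failing the entire batch."""
    if vectors is None or len(vectors) != len(chunk_texts):
        raise ValueError(
            f"expected {len(chunk_texts)} vectors, got "
            f"{'none' if vectors is None else len(vectors)}"
        )
    # Per-item NaN mask: only the affected indices become None.
    return [
        None if any(math.isnan(x) for x in vec) else vec
        for vec in vectors
    ]
```

Because the downstream helper already treats a per-item None as "this annotation failed", the mask slots in without changing the caller's error handling.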
Codecov Report: ❌ Patch coverage is …
… missing tests

- Extract `MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE`, `EMBEDDER_SINGLE_REQUEST_TIMEOUT_SECONDS`, `EMBEDDER_BATCH_REQUEST_TIMEOUT_SECONDS` constants (no magic numbers)
- Raise `ValueError` on oversized batch instead of silently truncating
- Add `len(vectors) != len(texts)` validation in both `MicroserviceEmbedder` and `_batch_embed_text_annotations` to detect count mismatches
- Handle NaN values per item instead of failing the entire batch
- Remove unused `api_key` from the `_get_service_config` return tuple
- Move the `EMBEDDING_API_BATCH_SIZE` import to module level
- Remove duplicate `embed_texts_batch` overrides from test embedders
- Add `TestMicroserviceEmbedderBatch` (8 tests) and a vector count mismatch test
PR Review: Batch Embedding API Calls

Overall this is a well-structured, thoughtful refactor. The architecture is solid: the sequential fallback in `BaseEmbedder.embed_texts_batch()` keeps existing embedders working while subclasses opt into true batching.

Bugs / Correctness

1. The `enumerate` index is captured but never used:

```python
items.append((idx, annot, text))  # idx captured
...
for (_, annot, _), vector in zip(chunk, vectors):  # discarded
```

The index can be dropped from the tuple entirely.

2. Annotations with image content but a non-multimodal embedder silently fall into the text-only path:

```python
if can_embed_images and has_images:
    multimodal_annots.append(annot)
else:
    text_only_annots.append(annot)  # image-bearing annotation embedded without image
```

When the embedder does not support images, an image-bearing annotation is embedded from its text alone with no signal that the image content was dropped; a debug log would at least make this visible.

3. The base-class guard

```python
if not self.supports_text:
    return None
```

is bypassed by the `MicroserviceEmbedder.embed_texts_batch()` override, which should apply the same check.

Design Concern

4. Retry semantics changed for embedding failures. The task is decorated with `autoretry_for=(Exception,)`, but batch failures are now caught internally and recorded per annotation, so transient errors no longer propagate to trigger a Celery retry.

Minor Issues

5.

```python
embedder._settings = MicroserviceEmbedder.Settings(...)
```

Accessing the private `_settings` attribute in the test helper couples the tests to an implementation detail; a brief explanatory comment would help.

6. In `embed_texts_batch`:

```python
merged_kwargs = {**self.get_component_settings(), **direct_kwargs}
service_url, headers = self._get_service_config(merged_kwargs)
```

This works correctly but is an asymmetry in the API contract. A brief comment in `_get_service_config` documenting the caller's merge responsibility would help.

7. No end-to-end test covering the `calculate_embeddings_for_annotation_batch` batch path. The tests thoroughly cover the helper and embedder methods in isolation.

Nitpicks

…

Good work overall: the batch path delivers a clear performance win and the error handling is thorough. The above issues are relatively minor; items 1 and 2 are the most worth addressing before merge.
…g-annotations-FijtH
…ed image content

- Add `supports_text` check to `MicroserviceEmbedder.embed_texts_batch()` to match the base-class guard that was being bypassed by the override
- Remove unused `idx` variable from `_batch_embed_text_annotations()`; the `enumerate` index was collected in 3-tuples but never referenced
- Add `logger.debug` when image annotations silently fall into the text-only path because the embedder does not support images
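The partition-plus-debug-log fix from this commit can be sketched as follows. Annotation objects are stubbed as `(id, has_image)` tuples; the function name and tuple shape are illustrative, not the PR's actual signatures.

```python
import logging

logger = logging.getLogger(__name__)

def partition_annotations(annots, can_embed_images):
    """Split annotations into text-only vs multimodal, logging when an
    image-bearing annotation is demoted to the text-only path because
    the embedder lacks image support."""
    text_only, multimodal = [], []
    for annot_id, has_image in annots:
        if can_embed_images and has_image:
            multimodal.append(annot_id)
        else:
            if has_image:
                logger.debug(
                    "Annotation %s has image content but embedder lacks "
                    "image support; embedding text only", annot_id,
                )
            text_only.append(annot_id)
    return text_only, multimodal
```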
PR Review: Batch Embedding API Calls - see full review in next comments
PR Review: Batch Embedding API Calls

Overall this is a solid, well-structured refactor with good error handling and test coverage. A few items worth addressing before merge (continued in next comment).
Bugs / Correctness

1. `ValueError` from `embed_texts_batch` silently treated as annotation failure. In `_batch_embed_text_annotations`, the `except Exception` block catches all exceptions, including `ValueError`. `MicroserviceEmbedder.embed_texts_batch()` raises `ValueError` when `len(texts) > MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE`. If someone calls the helper with `api_batch_size > 100`, the programming error is swallowed and annotations are silently marked failed rather than surfacing the misconfiguration. Since `EMBEDDING_API_BATCH_SIZE` (50) < `MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE` (100), this cannot happen today, but it is a latent footgun. Consider re-raising `ValueError` (it is a caller contract violation) or adding a guard at the top of the helper.

2. PR description mentions test embedder updates not in the diff. The PR description says: "Updated test embedders: Both TestEmbedder and TestMultimodalEmbedder now implement embed_texts_batch() for deterministic testing." Neither `TestEmbedder` nor `TestMultimodalEmbedder` appears in the diff. Either the description is inaccurate or the changes were accidentally dropped.
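The re-raise suggested in item 1 could be structured like this. The helper and callback names are hypothetical; the point is the ordering of the `except` clauses.

```python
def embed_chunk_guarded(chunk, embed_fn, record_failure):
    """Illustrative helper: re-raise ValueError (a caller contract
    violation) so it surfaces loudly, while other exceptions are
    recorded as per-annotation failures."""
    try:
        return embed_fn(chunk)
    except ValueError:
        raise  # programming error: fail fast, do not swallow
    except Exception as exc:
        for item in chunk:
            record_failure(item, exc)  # runtime failure: record and move on
        return None
```

Because `except ValueError` precedes `except Exception`, contract violations escape to the caller while ordinary embedding failures keep the existing per-annotation bookkeeping.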
Missing Test Coverage

3. No integration test for `calculate_embeddings_for_annotation_batch` with the new batch code path. The existing tests cover `_batch_embed_text_annotations`, `BaseEmbedder.embed_texts_batch`, and `MicroserviceEmbedder.embed_texts_batch` in isolation, but there is no end-to-end test for `calculate_embeddings_for_annotation_batch` when `embedder_path` is provided. A Django `TestCase` integration test would verify: the text-only vs multimodal partition logic, that `annotation_map` lookups handle missing IDs correctly, and that the result dict is properly aggregated across both code paths.
Minor Issues
Positives
…y test helper

- Re-raise `ValueError` in `_batch_embed_text_annotations` so programming errors (e.g., batch size misconfiguration) surface loudly instead of being silently recorded as annotation failures
- Add a retry-semantics note to the `calculate_embeddings_for_annotation_batch` docstring explaining that partial failures don't trigger Celery retries
- Document the `EMBEDDING_API_BATCH_SIZE <= MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE` constraint in constants
- Add an explanatory comment for the `_make_embedder` test helper accessing the private `_settings` attribute
PR Review: Batch Embedding API Calls - see inline for details
PR Review Part 1 of 3 - Bugs and Correctness Issues

1. `BaseEmbedder.embed_texts_batch` handles the empty list implicitly: the base class returns an empty list implicitly while `MicroserviceEmbedder` has an explicit early return. Consider making the base class equally explicit.
2. `ValueError` from a contract violation triggers a Celery retry: exceeding `MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE` is a programming error, not a transient failure, and retrying burns through `max_retries`. Consider catching `ValueError` at the task level and failing fast without retry.
3. Response parsing assumes an `embeddings` key on HTTP 200: a malformed 200 body raises a `KeyError` caught by `except Exception` with an unhelpful message. A dedicated `KeyError` catch showing the actual response would help debugging.
4. Subtle inconsistency between `_get_service_config` callers: in `_embed_text_impl` the call uses `all_kwargs`, which already has component settings merged; in `embed_texts_batch` the merge is explicit. Both are equivalent, but the implicit dependency in `_embed_text_impl` is fragile; a short comment would help.
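The dedicated malformed-response check from item 3 could be as small as this. The function name is hypothetical; the `embeddings` key is the one the review describes.

```python
def parse_batch_response(payload: dict) -> list:
    """Illustrative check: a 200 body without an 'embeddings' key
    produces a clear error naming the keys actually received, instead
    of a bare KeyError swallowed by a broad except."""
    if "embeddings" not in payload:
        raise ValueError(
            f"malformed embedding response, keys={sorted(payload)}"
        )
    return payload["embeddings"]
```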
PR Review Part 2 of 3 - Performance and Code Quality

Performance:

5. No retry/backoff on transient HTTP errors in `embed_texts_batch`: a transient 503 drops up to 50 annotations at once. Task-level `autoretry_for` covers the whole task, but partial failures are swallowed internally. Consider whether within-method retry, such as a `requests` retry adapter, makes sense for the batch endpoint.

Code quality:

6. The `EMBEDDING_API_BATCH_SIZE` constraint is a comment, not an assertion: the comment says "Must be less than or equal to MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE." A module-level `assert` would catch misconfiguration at startup, before any tasks run.
7. `bind=True` tasks called directly in tests skip the Celery machinery: the integration tests call `calculate_embeddings_for_annotation_batch` as a plain function, so `self` (the Celery task instance) is not injected. Paths using `self.retry()` or `self.update_state()` are untested. Fine, since those paths are not in the tested branches, but worth a brief comment.
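A generic retry-with-backoff wrapper in the spirit of item 5 might look like this. It is a stdlib-only stand-in; a production version would more likely mount urllib3's `Retry` on a `requests` `HTTPAdapter`, and the exception types here are placeholders for the transient errors the batch call can raise.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5,
                 retriable=(TimeoutError, ConnectionError)):
    """Call fn(), retrying transient failures with exponential backoff.
    Non-retriable exceptions propagate immediately; the last retriable
    failure is re-raised once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Wrapping only the HTTP call keeps the retry scope narrow, so one flaky 503 no longer fails a 50-annotation chunk.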
PR Review Part 3 of 3 - Nits and What Is Good

Nits: `_batch_embed_text_annotations` mutating `result` in place is clearly documented in the docstring, which is good. `BaseEmbedder.embed_texts_batch` returning None when `supports_text` is False vs an empty list for empty input is a meaningful semantic distinction, and callers handle both correctly.

What is good: The partitioning logic for text-only vs multimodal is clean, and the fallback to per-annotation processing for multimodal is correct. Constants are extracted to `document_processing.py` rather than hardcoded. The `_get_service_config` refactor cleanly eliminates duplication in `MicroserviceEmbedder`. Per-item failure isolation, where individual None vectors do not fail the whole batch, is well handled. Test coverage is comprehensive, covering sequential fallback, empty inputs, sub-batching, partial failures, count mismatches, DB errors, HTTP error codes, NaN handling, and the dual-strategy path. CHANGELOG updated.

Overall the implementation is solid and the approach is sound. The items above are mostly robustness improvements rather than blockers.
… test coverage

- Add explicit empty-list early return to `BaseEmbedder.embed_texts_batch`
- Catch `ValueError` at the task level to fail fast without burning Celery retries
- Add a dedicated check for malformed 200 responses missing the 'embeddings' key
- Add a module-level assert enforcing `EMBEDDING_API_BATCH_SIZE <= MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE`
- Document `_get_service_config` caller merge responsibility
- Add 13 new tests: single-text embed (success/1D/2D, malformed, NaN, 4xx, 5xx, exception), `_get_service_config` (api_key, no key, override, fallback), ValueError fail-fast, empty service URL, 3D response squeeze
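The module-level assert from this commit could look like the sketch below. The constant values (50 and 100) come from the review thread; the module layout is assumed.

```python
# opencontractserver/constants/document_processing.py (layout assumed)
EMBEDDING_API_BATCH_SIZE = 50
MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE = 100

# Import-time check: a misconfigured batch size fails at startup,
# before any Celery task runs against the microservice.
assert EMBEDDING_API_BATCH_SIZE <= MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE, (
    "EMBEDDING_API_BATCH_SIZE must not exceed "
    "MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE"
)
```

One caveat: `assert` is stripped when Python runs with `-O`, which is why a later commit in this thread swaps it for an explicit `ValueError` raise.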
PR Review: Batch Embedding API Calls

Overall this is a solid, well-structured change with good error handling and comprehensive tests. A few issues worth addressing before merge:

Bug: failed-count calculation on …
… for transient errors, add shape guard

- Fix the failed-count calculation on the `ValueError` path by snapshotting `result["skipped"]` before calling `_batch_embed_text_annotations` and subtracting only the delta (empty-text skips), not partition-loop skips.
- Restore retry semantics for transient HTTP errors: `requests.Timeout`, `requests.ConnectionError`, and `EmbeddingServerError` (5xx) now re-raise from `_batch_embed_text_annotations` and `embed_texts_batch` so the task-level Celery `autoretry_for=(Exception,)` decorator can fire. Non-retriable errors (malformed response, NaN, count mismatch) remain permanent per-annotation failures.
- Add a 3D squeeze shape guard in `MicroserviceEmbedder.embed_texts_batch`, `BaseMultimodalMicroserviceEmbedder.embed_texts_batch`, and `embed_images_batch`: check `shape[1] == 1` before squeezing.
- Strengthen test assertion: `test_batch_path_valueerror_fails_fast` now asserts `assertEqual(result["failed"], 3)` instead of `assertGreater`.
- Replace the module-level assert with a `ValueError` raise in `document_processing.py` constants.
- Add tests for transient error propagation (Timeout, ConnectionError, EmbeddingServerError) in both `_batch_embed_text_annotations` and `MicroserviceEmbedder.embed_texts_batch`.
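The 3D squeeze shape guard can be illustrated with plain nested lists standing in for the numpy arrays the real embedder handles (the function name is hypothetical). A response of shape `(n, 1, d)` squeezes safely to `(n, d)`; any other middle dimension is rejected rather than silently mangled.

```python
def squeeze_batch(arr):
    """Squeeze a 3D (n, 1, d) nested list to 2D (n, d); return None if
    the middle dimension is not 1, mirroring the shape[1] == 1 guard."""
    if arr and isinstance(arr[0][0], list):          # 3D: n x m x d
        if any(len(row) != 1 for row in arr):        # guard: m must be 1
            return None
        return [row[0] for row in arr]               # squeeze to n x d
    return arr                                       # already 2D
```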
PR Review: Batch Embedding API Calls

Overall this is a well-structured refactor with good documentation, clear separation of concerns, and solid test coverage. A few issues worth addressing before merge:

Bug: …
CI fixes:

- Update `test_embed_texts_batch_bad_status_code` and `test_embed_images_batch_bad_status_code` to expect `EmbeddingServerError` instead of a None return, matching the intentional behavior change where batch methods raise on 5xx for Celery task-level retry
- Add tests for the 3D-array bad-shape guard in both `embed_texts_batch` and `embed_images_batch` (`shape[1] != 1` returns None)

Review comment fixes:

- Fix the `ValueError` handler double-counting failures by snapshotting succeeded and failed counts before the batch call, not just skipped
- Add the unprocessed `multimodal_annots` count to failures on the `ValueError` early return, since those annotations were never attempted
- Document `add_embedding()` idempotency (upsert via `store_embedding`) to confirm Celery retry safety
- Document the intentional asymmetry between single-text (returns None) and batch (raises `EmbeddingServerError`) error handling in the `_embed_text_impl` docstring
- Explain the import-time validation rationale in `document_processing.py` constants (fail fast on deployment misconfiguration)
- Clarify the `FailingBatchEmbedder` docstring: uses the built-in `ConnectionError` (not `requests.exceptions.ConnectionError`) for non-retriable exception simulation
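The snapshot-based double-counting fix can be sketched as below. The function and counter names are hypothetical; the idea matches the commit: record the counters before the batch call, and on `ValueError` only add to `failed` the annotations the helper never accounted for.

```python
def run_batch_with_snapshot(result, annots, batch_fn):
    """Run batch_fn over annots; if it raises ValueError partway through,
    count as failed only the annotations it did not already mark as
    succeeded, skipped, or failed."""
    before = {k: result[k] for k in ("succeeded", "skipped", "failed")}
    try:
        batch_fn(result, annots)
    except ValueError:
        accounted = sum(result[k] - before[k] for k in before)
        result["failed"] += len(annots) - accounted
    return result
```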
PR Review: Batch Embedding API Calls

Overall this is a well-designed change with clear rationale, good error-handling layering, and solid test coverage. A few issues worth addressing before merge:

Issues

1. …
Summary

Refactored the annotation embedding pipeline to use batch API calls for significantly improved throughput. When an explicit embedder is provided, text-only annotations are now grouped and embedded via a new `embed_texts_batch()` method, reducing HTTP requests from one per annotation to one per batch.

Key Changes

- Added `embed_texts_batch()` method to `BaseEmbedder`: default implementation calls `embed_text()` sequentially for backward compatibility. Subclasses can override for true batch API endpoints.
- Implemented batch endpoint in `MicroserviceEmbedder`: added `embed_texts_batch()` that calls the `/embeddings/batch` endpoint on the microservice, with proper error handling for NaN values and HTTP status codes.
- Refactored the `calculate_embeddings_for_annotation_batch()` task: text-only annotations now flow through the new `_batch_embed_text_annotations()` helper.
- Added `_batch_embed_text_annotations()` helper function: handles sub-batching of annotations, skips empty text, manages per-annotation and per-batch error cases, and tracks success/failure/skip counts.
- Added batch size constant: `EMBEDDING_API_BATCH_SIZE` in `document_processing.py` controls the sub-batch size for API calls (separate from the task-level `EMBEDDING_BATCH_SIZE`).
- Updated test embedders: both `TestEmbedder` and `TestMultimodalEmbedder` now implement `embed_texts_batch()` for deterministic testing.
- Comprehensive test suite: added 260 lines of tests covering sequential fallback, empty lists, individual failures, sub-batching, batch API failures, partial failures, and database errors.

Implementation Details

- … `content_modalities` and embedder capabilities (`is_multimodal` + `supports_images`)

https://claude.ai/code/session_01CtSL4LRo6R61gALUJe3TCm