
Implement batch embedding API calls for annotation embeddings#1148

Open
JSv4 wants to merge 13 commits into main from
claude/batch-embedding-annotations-FijtH

Conversation

@JSv4
Collaborator

@JSv4 JSv4 commented Mar 23, 2026

Summary

Refactored the annotation embedding pipeline to use batch API calls for significantly improved throughput. When an explicit embedder is provided, text-only annotations are now grouped and embedded via a new embed_texts_batch() method, reducing HTTP requests from one-per-annotation to one-per-batch.

Key Changes

  • Added embed_texts_batch() method to BaseEmbedder: Default implementation calls embed_text() sequentially for backward compatibility. Subclasses can override for true batch API endpoints.

  • Implemented batch endpoint in MicroserviceEmbedder: Added embed_texts_batch() that calls /embeddings/batch endpoint on the microservice, with proper error handling for NaN values and HTTP status codes.

  • Refactored calculate_embeddings_for_annotation_batch() task:

    • Partitions annotations into text-only vs multimodal based on content modalities and embedder capabilities
    • Batch-embeds text-only annotations using _batch_embed_text_annotations() helper
    • Processes multimodal annotations individually (require image extraction and special handling)
    • Maintains dual embedding strategy when no explicit embedder is provided
  • Added _batch_embed_text_annotations() helper function: Handles sub-batching of annotations, skips empty text, manages per-annotation and per-batch error cases, and tracks success/failure/skip counts.

  • Added batch size constant: EMBEDDING_API_BATCH_SIZE in document_processing.py controls the sub-batch size for API calls (separate from task-level EMBEDDING_BATCH_SIZE).

  • Updated test embedders: Both TestEmbedder and TestMultimodalEmbedder now implement embed_texts_batch() for deterministic testing.

  • Comprehensive test suite: Added 260 lines of tests covering sequential fallback, empty lists, individual failures, sub-batching, batch API failures, partial failures, and database errors.

Implementation Details

  • Text-only annotations are identified by checking content_modalities and embedder capabilities (is_multimodal + supports_images)
  • Batch failures (exception or None return) mark all annotations in that chunk as failed
  • Individual vector failures (None in results list) are handled per-annotation
  • Empty/whitespace-only text is skipped before batching
  • Error messages include annotation IDs and specific failure reasons for debugging
  • Backward compatible: existing code without explicit embedder continues using dual embedding strategy
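The sequential fallback described above can be sketched as follows. Class and method names follow the PR description, but the bodies are an illustrative reconstruction, not the actual diff; ToyEmbedder is a hypothetical subclass for demonstration.

```python
from typing import Optional


class BaseEmbedder:
    """Minimal sketch of the embedder base class assumed by this PR."""

    supports_text: bool = True

    def embed_text(self, text: str, **kwargs) -> Optional[list[float]]:
        raise NotImplementedError

    def embed_texts_batch(
        self, texts: list[str], **kwargs
    ) -> Optional[list[Optional[list[float]]]]:
        # Default implementation: sequential fallback so existing
        # subclasses keep working without overriding anything.
        if not self.supports_text:
            return None
        if not texts:
            return []
        return [self.embed_text(t, **kwargs) for t in texts]


class ToyEmbedder(BaseEmbedder):
    def embed_text(self, text: str, **kwargs) -> Optional[list[float]]:
        # Deterministic stand-in for a real model call.
        return [float(len(text))]
```

A subclass with a true batch endpoint (like MicroserviceEmbedder) would override embed_texts_batch to issue one HTTP request per sub-batch instead of one per text.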

https://claude.ai/code/session_01CtSL4LRo6R61gALUJe3TCm

Add embed_texts_batch() to the embedder hierarchy so that
calculate_embeddings_for_annotation_batch can send multiple texts
in a single HTTP request instead of one request per annotation.

- BaseEmbedder.embed_texts_batch(): sequential fallback for
  embedders that don't override it
- MicroserviceEmbedder.embed_texts_batch(): uses /embeddings/batch
  endpoint; extracted _get_service_config() to DRY up config logic
- TestEmbedder / TestMultimodalEmbedder: local batch for tests
- _batch_embed_text_annotations(): groups text-only annotations
  into sub-batches of EMBEDDING_API_BATCH_SIZE (50), calls
  embed_texts_batch(), and stores results per annotation
- calculate_embeddings_for_annotation_batch: partitions annotations
  into text-only (batch path) vs multimodal (individual path)
- EMBEDDING_API_BATCH_SIZE constant (50) for API-level batching
- Unit tests covering success, partial failure, total failure,
  sub-batching, and empty-text skipping

https://claude.ai/code/session_01CtSL4LRo6R61gALUJe3TCm
@claude

claude bot commented Mar 23, 2026

Code Review - PR 1148 Batch Embedding

Overall this is a well-structured improvement. The architecture (a base-class sequential fallback, a subclass override for true batch endpoints, and a helper that partitions annotations) is clean, and error handling at each level (batch exception, None return, per-item None vector, DB error) is thorough. A few issues are worth addressing before merge.

Bug: Vector count mismatch is undetected

In sent_transformer_microservice.py embed_texts_batch and embeddings_task.py _batch_embed_text_annotations, if the microservice returns a different number of vectors than texts sent, zip(chunk, vectors) silently truncates. Annotations past the truncation point are neither succeeded nor failed. Should validate len(vectors) == len(chunk) and fail the whole chunk explicitly if not.
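One way to make the mismatch explicit; a sketch only, with a hypothetical helper name, not the project's actual code:

```python
def pair_vectors(chunk, vectors):
    """Pair annotations with returned vectors, failing loudly on mismatch.

    Returns (pairs, error): pairs is a list of (annotation, vector) tuples,
    and error is a message when the whole chunk must be marked failed.
    """
    if vectors is None:
        return [], "batch embedding returned None"
    if len(vectors) != len(chunk):
        # zip() would silently truncate here; fail the chunk instead.
        return [], f"expected {len(chunk)} vectors, got {len(vectors)}"
    return list(zip(chunk, vectors)), None
```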

Magic number: hardcoded 100 in MicroserviceEmbedder.embed_texts_batch

The value 100 at lines 174-176 should be a named constant (e.g. MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE) in opencontractserver/constants/document_processing.py per the no-magic-numbers rule. EMBEDDING_API_BATCH_SIZE=50 is a sub-batching concept and should not be reused for this cap. Also, silent truncation here means the last N annotations in an oversized chunk get neither a vector nor an error -- invisible failure. Raising ValueError or sub-batching would be safer.
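A guard along the lines the review suggests might look like this; the constant name and value come from the review comment, and check_batch_size is a hypothetical illustration:

```python
# Hypothetical constant; the review suggests placing it in
# opencontractserver/constants/document_processing.py.
MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE = 100


def check_batch_size(texts: list) -> None:
    # Raise instead of silently truncating to the first 100 items,
    # which would leave the tail with neither a vector nor an error.
    if len(texts) > MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE:
        raise ValueError(
            f"Batch of {len(texts)} texts exceeds the microservice cap "
            f"of {MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE}; sub-batch first."
        )
```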

Deferred import should be module-level

The import of EMBEDDING_API_BATCH_SIZE inside calculate_embeddings_for_annotation_batch (embeddings_task.py lines 396-399) has no apparent justification. Move it to module level.

_get_service_config exposes api_key that no caller uses

Both call sites discard api_key with _. Either remove it from the return tuple or use it.

Hardcoded timeout

timeout=60 at sent_transformer_microservice.py line 192 (and in _embed_text_impl) should be a named constant (e.g. EMBEDDER_REQUEST_TIMEOUT_SECONDS).

NaN handling fails the whole batch rather than individual items

Lines 198-204: when NaN values are detected the entire batch returns None. Since _batch_embed_text_annotations handles per-item None correctly already, it is straightforward to use a nan_mask and return None only for the affected indices. Resilience improvement, not a blocking bug.
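A pure-Python sketch of the per-item masking idea (the real code presumably works on a numpy array; mask_nan_vectors is a hypothetical name):

```python
import math


def mask_nan_vectors(vectors):
    """Replace any NaN-containing vector with None, keep the rest.

    Only the affected items are dropped; the remainder of the batch
    survives, which the per-item None handling downstream already covers.
    """
    masked = []
    for vec in vectors:
        if any(isinstance(v, float) and math.isnan(v) for v in vec):
            masked.append(None)
        else:
            masked.append(vec)
    return masked
```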

Test coverage gaps

Three gaps worth filling: (1) calculate_embeddings_for_annotation_batch with the new batch path has no test -- a Django TestCase with TestEmbedder and real Annotation objects would catch integration issues. (2) MicroserviceEmbedder.embed_texts_batch has no test -- a mock of requests.post covering 200/4xx/5xx/exception would be straightforward. (3) Once the vector count mismatch validation is added, a test for len(vectors) != len(texts) via a custom embedder would be easy.

Minor: duplicate embed_texts_batch on test embedders

Both TestEmbedder and TestMultimodalEmbedder override embed_texts_batch with identical list comprehensions over embed_text -- exactly what BaseEmbedder.embed_texts_batch already does. Removing the overrides would eliminate the duplication.

@codecov

codecov bot commented Mar 23, 2026

… missing tests

- Extract MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE, EMBEDDER_SINGLE_REQUEST_TIMEOUT_SECONDS,
  EMBEDDER_BATCH_REQUEST_TIMEOUT_SECONDS constants (no magic numbers)
- Raise ValueError on oversized batch instead of silent truncation
- Add len(vectors) != len(texts) validation in both MicroserviceEmbedder and
  _batch_embed_text_annotations to detect count mismatches
- Handle NaN values per-item instead of failing entire batch
- Remove unused api_key from _get_service_config return tuple
- Move EMBEDDING_API_BATCH_SIZE import to module level
- Remove duplicate embed_texts_batch overrides from test embedders
- Add TestMicroserviceEmbedderBatch (8 tests) and vector count mismatch test
@claude

claude bot commented Mar 23, 2026

PR Review: Batch Embedding API Calls

Overall this is a well-structured, thoughtful refactor. The architecture is solid — the sequential fallback in BaseEmbedder, the partitioning logic, and the error-handling granularity (per-item NaN, vector count mismatch, whole-batch failure) are all good design choices. Tests are comprehensive. A few issues and suggestions:


Bugs / Correctness

1. idx variable is collected but never used (embeddings_task.py)

In _batch_embed_text_annotations, items are stored as (idx, annot, text) tuples but idx is always unpacked as _:

items.append((idx, annot, text))   # idx captured
...
for (_, annot, _), vector in zip(chunk, vectors):  # discarded

The enumerate is dead weight. Either use idx for something (e.g., a debug log showing original position) or remove it and just store (annot, text) tuples.

2. Annotations with image content but a non-multimodal embedder silently fall into text-only path (embeddings_task.py, line ~442-448)

if can_embed_images and has_images:
    multimodal_annots.append(annot)
else:
    text_only_annots.append(annot)   # image-bearing annotation embedded without image

When can_embed_images is False but an annotation has ContentModality.IMAGE, it is silently batch-embedded using only its text. This may be intentional (best-effort), but it deserves at least a logger.debug call noting that image content is being dropped for that annotation. Otherwise this is a silent data-quality issue that is hard to diagnose.
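A sketch of the partition step with the suggested debug log; the has_images callable is a hypothetical stand-in for the real content_modalities check:

```python
import logging

logger = logging.getLogger(__name__)


def partition(annotations, can_embed_images, has_images):
    """Split annotations into text-only and multimodal paths."""
    text_only, multimodal = [], []
    for annot in annotations:
        if can_embed_images and has_images(annot):
            multimodal.append(annot)
        else:
            if has_images(annot):
                # Image content is dropped for this annotation; log it
                # so the data-quality gap is diagnosable later.
                logger.debug(
                    "Annotation %s has image content but embedder is "
                    "text-only; embedding text only.", annot
                )
            text_only.append(annot)
    return text_only, multimodal
```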

3. MicroserviceEmbedder.embed_texts_batch bypasses supports_text check (sent_transformer_microservice.py)

The base embed_texts_batch guards against non-text embedders:

if not self.supports_text:
    return None

The MicroserviceEmbedder override skips this guard entirely. If MicroserviceEmbedder is always text-capable this is fine, but it should either call super() or include the guard explicitly to be robust against future subclasses.


Design Concern

4. Retry semantics changed for embedding failures

The task is decorated with autoretry_for=(Exception,), but _batch_embed_text_annotations catches all exceptions and records them in result["errors"] without re-raising. This means embedding failures will never trigger a Celery retry — the task silently succeeds (from Celery's perspective) with partial failures. The old per-annotation path had the same behavior within the loop, so this is not a regression, but it's worth documenting explicitly in the task docstring since the autoretry_for decorator implies retriability.
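The split between recorded failures and Celery-visible failures can be sketched like this; TransientEmbeddingError and embed_chunk are hypothetical names illustrating the pattern, not the project's code:

```python
class TransientEmbeddingError(Exception):
    """Stand-in for e.g. requests.Timeout or a 5xx server error."""


def embed_chunk(embed_fn, chunk, result):
    """Embed one sub-batch, splitting transient from permanent failures.

    Transient errors re-raise so a task-level autoretry_for decorator
    can fire; everything else becomes a recorded permanent failure that
    Celery never sees.
    """
    try:
        return embed_fn(chunk)
    except TransientEmbeddingError:
        raise  # propagate so the task-level Celery retry fires
    except Exception as exc:
        result["errors"].append(str(exc))
        result["failed"] += len(chunk)
        return None
```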


Minor Issues

5. _make_embedder test helper accesses private attribute (test_batch_embedding.py, line ~839)

embedder._settings = MicroserviceEmbedder.Settings(...)

Accessing _settings directly is fragile. If the base class changes its settings attribute name or initialization pattern, the test breaks silently. Consider using the public settings injection path if one exists, or at minimum adding a comment explaining why the private attribute is used.

6. _get_service_config inconsistency: kwargs merging

In _embed_text_impl, _get_service_config is called with all_kwargs (already merged by the base class before invoking the _impl method). In embed_texts_batch, the merging is done explicitly:

merged_kwargs = {**self.get_component_settings(), **direct_kwargs}
service_url, headers = self._get_service_config(merged_kwargs)

This works correctly but is an asymmetry in the API contract. A brief comment in _get_service_config noting that callers are responsible for pre-merging kwargs would prevent future confusion.

7. No test covering the calculate_embeddings_for_annotation_batch task directly with batch path

The tests thoroughly cover _batch_embed_text_annotations and MicroserviceEmbedder.embed_texts_batch in isolation, but there's no integration test exercising the full Celery task with a real (or mocked) embedder going through the embedder_path branch. Edge cases like embedder_path set but embedder instantiation failing, or the annotation partition producing zero text-only and zero multimodal annotations (all not-found), are only covered by the helper-level tests. A task-level test (even with unittest.mock for the DB calls) would close this gap.


Nitpicks

  • The EMBEDDING_API_BATCH_SIZE = 50 constant is correctly kept below MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE = 100 — good. Consider adding a comment asserting this relationship (e.g., # Must be <= MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE) to make the constraint explicit for future maintainers.
  • The items: list[tuple[int, Annotation, str]] type annotation should be list[tuple[Annotation, str]] once the unused idx is removed.

Good work overall — the batch path delivers a clear performance win and the error handling is thorough. The above issues are relatively minor; items 1 and 2 are the most worth addressing before merge.

JSv4 added 2 commits March 23, 2026 02:13
…ed image content

- Add supports_text check to MicroserviceEmbedder.embed_texts_batch() to
  match the base class guard that was being bypassed by the override
- Remove unused idx variable from _batch_embed_text_annotations(); the
  enumerate index was collected in 3-tuples but never referenced
- Add logger.debug when image annotations silently fall into the text-only
  path because the embedder does not support images
@claude

claude bot commented Mar 23, 2026

placeholder to be replaced

@claude

claude bot commented Mar 23, 2026

PR Review: Batch Embedding API Calls - see full review in next comments

@claude

claude bot commented Mar 23, 2026

PR Review: Batch Embedding API Calls

Overall this is a solid, well-structured refactor with good error handling and test coverage. A few items worth addressing before merge (continued in next comment).

@claude

claude bot commented Mar 23, 2026

Bugs / Correctness

1. ValueError from embed_texts_batch silently treated as annotation failure

In _batch_embed_text_annotations, the except Exception block catches all exceptions including ValueError. MicroserviceEmbedder.embed_texts_batch() raises ValueError when len(texts) > MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE. If someone calls the helper with api_batch_size > 100, the programming error is swallowed and annotations are silently marked failed rather than surfacing the misconfiguration. Since EMBEDDING_API_BATCH_SIZE (50) < MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE (100) this cannot happen today, but it is a latent footgun. Consider re-raising ValueError (it is a caller contract violation) or adding a guard at the top of the helper.

2. PR description mentions test embedder updates not in the diff

The PR description says: Updated test embedders: Both TestEmbedder and TestMultimodalEmbedder now implement embed_texts_batch() for deterministic testing. Neither TestEmbedder nor TestMultimodalEmbedder appear in the diff. Either the description is inaccurate or the changes were accidentally dropped.

@claude

claude bot commented Mar 23, 2026

Missing Test Coverage

3. No integration test for calculate_embeddings_for_annotation_batch with the new batch code path

The existing tests cover _batch_embed_text_annotations, BaseEmbedder.embed_texts_batch, and MicroserviceEmbedder.embed_texts_batch in isolation, but there is no end-to-end test for calculate_embeddings_for_annotation_batch when embedder_path is provided. A Django TestCase integration test would verify: the text-only vs multimodal partition logic, that annotation_map lookups handle missing IDs correctly, and that the result dict is properly aggregated across both code paths.

@claude

claude bot commented Mar 23, 2026

Minor Issues

  1. Transient 5xx errors silently fail without retry: In MicroserviceEmbedder.embed_texts_batch, 5xx responses return None. The code comment says this may be a transient error but does not retry. The Celery task autoretry_for catches exceptions, not None returns. Single-text embedding has the same behavior so this is not a regression, but batching amplifies the impact (50 annotations fail instead of 1). Worth noting as a known limitation in a comment.

  2. logger.info for per-annotation skip may be too verbose: In _batch_embed_text_annotations, the per-annotation skip log at INFO level will flood logs for corpuses with many structural (empty-text) annotations. logger.debug is more appropriate for routine skips.

@claude

claude bot commented Mar 23, 2026

Positives

  • _get_service_config() is a clean DRY improvement over the duplicated config block
  • Text-only/multimodal partition logic is correct; annotations without image content correctly route to the batch path even with a multimodal embedder
  • All new constants are documented with purpose and clear relationship to each other
  • Per-item failure handling (partial NaN vectors, count mismatches) is robust and maintains progress on partial failures rather than failing the whole task
  • BaseEmbedder sequential fallback preserves full backward compatibility for all existing embedders without any code changes
  • Test embedder hierarchy (FailingBatchEmbedder, PartialFailBatchEmbedder, NullBatchEmbedder, MismatchCountEmbedder) covers all meaningful failure modes comprehensively
  • CHANGELOG entry is detailed and accurate

JSv4 added 3 commits March 23, 2026 08:23
…y test helper

- Re-raise ValueError in _batch_embed_text_annotations so programming
  errors (e.g., batch size misconfiguration) surface loudly instead of
  being silently recorded as annotation failures
- Add retry semantics note to calculate_embeddings_for_annotation_batch
  docstring explaining that partial failures don't trigger Celery retries
- Document EMBEDDING_API_BATCH_SIZE <= MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE
  constraint in constants
- Add explanatory comment for _make_embedder test helper accessing private
  _settings attribute
@claude

claude bot commented Mar 24, 2026

PR Review: Batch Embedding API Calls - see inline for details

@claude

claude bot commented Mar 24, 2026

PR Review Part 1 of 3 - Bugs and Correctness Issues

  1. BaseEmbedder.embed_texts_batch handles the empty list implicitly: the base class returns an empty list implicitly, while MicroserviceEmbedder has an explicit early return. Consider making the base class equally explicit.

  2. ValueError from a contract violation triggers a Celery retry: exceeding MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE is a programming error, not a transient failure, so retrying burns through max_retries. Consider catching ValueError at the task level and failing fast without a retry.

  3. Response parsing assumes an embeddings key on HTTP 200: a malformed 200 body raises KeyError, caught by except Exception with an unhelpful message. A dedicated KeyError catch showing the actual response would help debugging.

  4. Subtle inconsistency between _get_service_config callers: in _embed_text_impl the call uses all_kwargs, which already has component settings merged; in embed_texts_batch the merge is explicit. Both are equivalent, but the implicit dependency in _embed_text_impl is fragile; a short comment would help.

@claude

claude bot commented Mar 24, 2026

PR Review Part 2 of 3 - Performance and Code Quality

Performance:

  5. No retry/backoff on transient HTTP errors in embed_texts_batch: a transient 503 drops up to 50 annotations at once. The task-level autoretry_for covers the whole task, but partial failures are swallowed internally. Consider whether a within-method retry, such as a requests retry adapter, makes sense for the batch endpoint.

Code quality:

  6. The EMBEDDING_API_BATCH_SIZE constraint is a comment, not an assertion: the comment says it must be less than or equal to MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE. A module-level assert would catch misconfiguration at startup, before any tasks run.

  7. bind=True tasks called directly in tests skip the Celery machinery: the integration tests call calculate_embeddings_for_annotation_batch as a plain function, so self (the Celery task instance) is not injected. Paths using self.retry() or self.update_state() are untested. This is fine since those paths are not in the tested branches, but worth a brief comment.

@claude

claude bot commented Mar 24, 2026

PR Review Part 3 of 3 - Nits and What Is Good

Nits:

  • _batch_embed_text_annotations mutating result in place is clearly documented in the docstring, which is good.
  • BaseEmbedder.embed_texts_batch returning None when supports_text is False vs. an empty list for empty input is a meaningful semantic distinction, and callers handle both correctly.

What is good:

  • The partitioning logic for text-only vs. multimodal is clean, and the fallback to per-annotation processing for multimodal is correct.
  • Constants are extracted to document_processing.py rather than hardcoded.
  • The _get_service_config refactor cleanly eliminates duplication in MicroserviceEmbedder.
  • Per-item failure isolation, where individual None vectors do not fail the whole batch, is well handled.
  • Test coverage is comprehensive, covering sequential fallback, empty inputs, sub-batching, partial failures, count mismatches, DB errors, HTTP error codes, NaN handling, and the dual-strategy path.
  • The CHANGELOG is updated.

Overall the implementation is solid and the approach is sound. The items above are mostly robustness improvements rather than blockers.

JSv4 added 2 commits March 23, 2026 22:40
… test coverage

- Add explicit empty-list early return to BaseEmbedder.embed_texts_batch
- Catch ValueError at task level to fail fast without burning Celery retries
- Add dedicated check for malformed 200 responses missing 'embeddings' key
- Add module-level assert enforcing EMBEDDING_API_BATCH_SIZE <= MAX_BATCH_SIZE
- Document _get_service_config caller merge responsibility
- Add 13 new tests: single-text embed (success/1D/2D, malformed, NaN, 4xx,
  5xx, exception), _get_service_config (api_key, no key, override, fallback),
  ValueError fail-fast, empty service URL, 3D response squeeze
@claude

claude bot commented Mar 24, 2026

PR Review: Batch Embedding API Calls

Overall this is a solid, well-structured change with good error handling and comprehensive tests. A few issues worth addressing before merge:


Bug: Failed-count calculation on ValueError path

In embeddings_task.py, when a ValueError is caught after calling _batch_embed_text_annotations, the failed count is computed as:

result["failed"] += len(text_only_annots) - (
    result["succeeded"] + result["skipped"]
)

result["skipped"] already includes missing-annotation skips (from the partition loop above), not just empty-text skips. So if e.g. 2 annotations were missing (skipped) and 3 are in text_only_annots and ValueError fires immediately, this computes 3 - (0 + 2) = 1 failed instead of 3. The formula needs to track only the within-batch-function skips separately, or snapshot result["skipped"] before calling _batch_embed_text_annotations and subtract the delta:

skipped_before = result["skipped"]
try:
    _batch_embed_text_annotations(...)
except ValueError as e:
    ...
    result["failed"] += len(text_only_annots) - (
        result["succeeded"] + (result["skipped"] - skipped_before)
    )

Retry semantics regression

The original per-annotation loop was decorated with autoretry_for=(Exception,), meaning transient failures (network timeouts, 503s) triggered Celery retries. In the new batch path, all exceptions from embed_texts_batch are caught internally and turned into permanent failures — no retry is possible.

This means a 60-second timeout or a momentary service blip permanently fails potentially 50 annotations per chunk. Consider either:

  • Re-raising retriable exceptions (e.g., requests.exceptions.Timeout, requests.exceptions.ConnectionError) from _batch_embed_text_annotations so the task-level retry fires, or
  • Adding explicit Celery retry logic in the batch path for transient HTTP errors (the microservice already distinguishes 4xx vs 5xx responses).

At minimum this behavior change should be explicitly called out in the changelog entry.


3D squeeze assumption

In MicroserviceEmbedder.embed_texts_batch:

if embeddings_array.ndim == 3:
    embeddings_array = embeddings_array.squeeze(axis=1)

This assumes the shape is (batch, 1, dim). If the service returns (batch, dim, 1) instead, the squeeze silently produces the wrong shape and the length check against len(texts) will pass (it only checks the outer dimension). A shape guard would make this more defensive:

if embeddings_array.ndim == 3:
    if embeddings_array.shape[1] != 1:
        logger.error(f"Unexpected 3D shape {embeddings_array.shape}")
        return None
    embeddings_array = embeddings_array.squeeze(axis=1)

Weak assertion in test_batch_path_valueerror_fails_fast

self.assertGreater(result["failed"], 0)

This should be self.assertEqual(result["failed"], 3) to match the 3 annotations created. The weak assertion would pass even if only 1 failed.


Minor observations (non-blocking)

  • EMBEDDING_API_BATCH_SIZE = 50 with EMBEDDING_BATCH_SIZE = 100: A task processes up to 100 annotations but sub-batches at 50. This means text-only tasks always make at least 2 HTTP calls. Worth noting in the constant comment whether the intent is to always sub-batch or whether 100 would be fine. (The existing comment explains API vs task sizing, which is good.)

  • Module-level assert: The assert EMBEDDING_API_BATCH_SIZE <= MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE will be silently skipped if Python is run with -O. Prefer if ... raise ValueError(...) for invariants that need to hold in production.

  • embed_texts_batch in BaseEmbedder logs a warning and returns None if supports_text is False, but _batch_embed_text_annotations only routes annotations there when the embedder is confirmed to support text. The guard is fine as a defensive layer, but the log message could be confusing since it should never trigger in normal operation.


The test coverage is excellent — the variety of failure modes tested (partial vectors, count mismatch, NaN per-item, exception, None return, contract violations, dual strategy path) is thorough. The _get_service_config DRY refactor is clean. The backward-compatible default implementation in BaseEmbedder is the right approach for the API.

JSv4 added 2 commits March 23, 2026 23:43
… for transient errors, add shape guard

- Fix failed-count calculation on ValueError path by snapshotting
  result["skipped"] before calling _batch_embed_text_annotations and
  subtracting only the delta (empty-text skips), not partition-loop skips.

- Restore retry semantics for transient HTTP errors: requests.Timeout,
  requests.ConnectionError, and EmbeddingServerError (5xx) now re-raise
  from _batch_embed_text_annotations and embed_texts_batch so the
  task-level Celery autoretry_for=(Exception,) decorator can fire.
  Non-retriable errors (malformed response, NaN, count mismatch) remain
  as permanent per-annotation failures.

- Add 3D squeeze shape guard in MicroserviceEmbedder.embed_texts_batch,
  BaseMultimodalMicroserviceEmbedder.embed_texts_batch, and
  embed_images_batch: check shape[1]==1 before squeezing.

- Strengthen test assertion: test_batch_path_valueerror_fails_fast now
  asserts assertEqual(result["failed"], 3) instead of assertGreater.

- Replace module-level assert with ValueError raise in
  document_processing.py constants.

- Add tests for transient error propagation (Timeout, ConnectionError,
  EmbeddingServerError) in both _batch_embed_text_annotations and
  MicroserviceEmbedder.embed_texts_batch.
@claude

claude bot commented Mar 24, 2026

PR Review: Batch Embedding API Calls

Overall this is a well-structured refactor with good documentation, clear separation of concerns, and solid test coverage. A few issues worth addressing before merge:


Bug: ValueError handler overcounts failures

In embeddings_task.py, the failure count calculation after catching ValueError from _batch_embed_text_annotations is incorrect when some annotations already failed before the ValueError was raised (e.g., a previous sub-batch had None vectors):

# Current (buggy when result["failed"] > 0 from prior sub-batches)
batch_skipped = result["skipped"] - skipped_before_batch
result["failed"] += len(text_only_annots) - (
    result["succeeded"] + batch_skipped
)

If sub-batch 1 produces 1 None vector (recording result["failed"] += 1) and sub-batch 2 raises ValueError, the formula adds len(text_only_annots) - succeeded - skipped to result["failed"], but this already-counted failure gets double-counted. The formula should also subtract the failures already recorded:

failures_before_valueerror = result["failed"] - failures_before_batch  # track failures_before_batch like skipped_before_batch
result["failed"] += len(text_only_annots) - (
    result["succeeded"] + batch_skipped + (result["failed"] - failures_before_batch)
)

Issue: multimodal_annots not marked failed on ValueError early return

When a ValueError is caught and the function returns result early, the multimodal_annots list has not yet been processed. Their count is never added to result["failed"], so succeeded + failed + skipped will not equal the total for batches containing mixed text-only and multimodal annotations. This is an edge case but will produce confusing log output.
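Both counting problems reduce to one invariant: succeeded + failed + skipped must equal the batch total. A sketch of a reconciliation step on the early-return path (fail_unaccounted is a hypothetical helper, not the project's code):

```python
def fail_unaccounted(result, total):
    """Mark every annotation not yet counted as failed.

    On an early return, anything not recorded as succeeded, failed, or
    skipped (e.g. multimodal annotations after a ValueError, or items
    past an already-counted sub-batch failure) was never attempted.
    Counting the remainder once avoids both under- and double-counting.
    """
    accounted = result["succeeded"] + result["failed"] + result["skipped"]
    result["failed"] += total - accounted
```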


Issue: Whole-task retry may re-embed already-succeeded annotations

When _batch_embed_text_annotations re-raises a transient error (Timeout, ConnectionError, EmbeddingServerError), it propagates through calculate_embeddings_for_annotation_batch and triggers the autoretry_for=(Exception,) Celery decorator — retrying the entire task. Sub-batches that already successfully stored embeddings will be re-processed. This is safe only if add_embedding is idempotent (upserts). Worth a comment confirming that assumption, or consider catching transient errors at the task level and short-circuiting already-processed annotation IDs.


Informational: Asymmetric retry semantics between batch and single-text paths

embed_texts_batch in MicroserviceEmbedder now raises EmbeddingServerError on 5xx (allowing Celery retry), but _embed_text_impl (used by the multimodal per-annotation path) still returns None on 5xx. This means multimodal annotations silently fail on transient server errors without retrying. Intentional or oversight? If intentional, a comment explaining the asymmetry would help future maintainers.


Minor: Constants file with runtime side effects

# opencontractserver/constants/document_processing.py
if EMBEDDING_API_BATCH_SIZE > MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE:
    raise ValueError(...)

Raising at import time from a constants file is unconventional and will cause unexpected ImportError-style failures in unrelated tests that import this module. Consider moving this validation into a __init_subclass__ hook, a Django AppConfig.ready() check, or a dedicated validate_settings() function called at startup.
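A sketch of the dedicated-function alternative; the constant values are taken from the thread, and validate_settings is a hypothetical name:

```python
EMBEDDING_API_BATCH_SIZE = 50              # values assumed from the thread
MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE = 100


def validate_settings() -> None:
    """Startup check, e.g. called from AppConfig.ready().

    Keeping validation out of the constants module means importing the
    constants never raises; misconfiguration still fails fast at boot.
    """
    if EMBEDDING_API_BATCH_SIZE > MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE:
        raise ValueError(
            "EMBEDDING_API_BATCH_SIZE must be <= "
            "MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE"
        )
```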


Minor: FailingBatchEmbedder raises built-in ConnectionError, not requests.exceptions.ConnectionError

The test embedder raises ConnectionError("Service unavailable") (Python built-in OSError subclass), which falls into the generic except Exception handler in _batch_embed_text_annotations — correctly treated as non-retriable. However, the naming suggests a network failure, which in production would be requests.exceptions.ConnectionError (retriable). The docstring or name should clarify this distinction to avoid future confusion (e.g., NonRetriableExceptionEmbedder).


Minor: No tests for embed_images_batch changes in multimodal_microservice.py

The 3D shape check (shape[1] != 1 → return None) and the EmbeddingServerError raise-on-5xx changes added to embed_images_batch are not covered by the new test file. The parallel changes in embed_texts_batch are well tested — the image batch path deserves the same.


Positives worth noting

  • The partitioning of text-only vs. multimodal annotations is clean and the logic is easy to follow
  • _batch_embed_text_annotations is well-extracted with a clear contract documented in the docstring
  • _get_service_config refactor eliminates real duplication cleanly
  • The EMBEDDING_API_BATCH_SIZE vs EMBEDDING_BATCH_SIZE distinction is well-documented
  • Test coverage is comprehensive — the MicroserviceEmbedder unit tests in particular cover a good range of edge cases (3D squeeze, NaN per-item, count mismatch, missing key)
  • Retry propagation semantics are clearly documented in both the helper and task docstrings

JSv4 added 2 commits March 24, 2026 02:11
CI fixes:
- Update test_embed_texts_batch_bad_status_code and
  test_embed_images_batch_bad_status_code to expect EmbeddingServerError
  instead of None return, matching the intentional behavior change where
  batch methods raise on 5xx for Celery task-level retry
- Add tests for 3D array bad shape guard in both embed_texts_batch and
  embed_images_batch (shape[1] != 1 returns None)

Review comment fixes:
- Fix ValueError handler double-counting failures by snapshotting
  succeeded and failed counts before batch call, not just skipped
- Add unprocessed multimodal_annots count to failures on ValueError
  early return, since those annotations were never attempted
- Document add_embedding() idempotency (upsert via store_embedding)
  to confirm Celery retry safety
- Document intentional asymmetry between single-text (returns None)
  and batch (raises EmbeddingServerError) error handling in
  _embed_text_impl docstring
- Explain import-time validation rationale in document_processing.py
  constants (fail-fast on deployment misconfiguration)
- Clarify FailingBatchEmbedder docstring: uses built-in ConnectionError
  (not requests.exceptions.ConnectionError) for non-retriable exception
  simulation
@claude

claude bot commented Mar 24, 2026

PR Review: Batch Embedding API Calls

Overall this is a well-designed change with clear rationale, good error-handling layering, and solid test coverage. A few issues worth addressing before merge:


Issues

1. EmbeddingServerError belongs in a shared location (medium)

EmbeddingServerError is defined in multimodal_microservice.py but is now imported into both sent_transformer_microservice.py and embeddings_task.py. This creates a coupling where the task-infrastructure layer depends on a concrete embedder implementation module. If a third embedder is added, it must import its own exception type from an unrelated embedder file.

The requests import in embeddings_task.py is a symptom of the same issue — the task is now aware of HTTP-level exception types that belong in the embedder layer.

Suggested fix: Move EmbeddingServerError to opencontractserver/pipeline/base/exceptions.py (or a new opencontractserver/pipeline/embedders/exceptions.py) and import from there. Both microservice embedders and the task would then depend on the shared base.
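The shared module could be as small as this (a hypothetical sketch of the proposed opencontractserver/pipeline/embedders/exceptions.py, not an existing file):

```python
# Hypothetical shared exceptions module, so that the task layer and all
# embedder implementations depend on one neutral location rather than on
# multimodal_microservice.py.

class EmbeddingServerError(Exception):
    """Transient embedding-service failure (e.g. HTTP 5xx).

    Raised by batch embed calls so the Celery task layer can retry
    without importing any concrete embedder module.
    """
```

Both microservice embedders and embeddings_task.py would then import EmbeddingServerError from this module, and a future third embedder gets it for free.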

2. Missing integration test for the multimodal partition path (medium)

TestCalculateEmbeddingsForAnnotationBatch tests the text-only batch path and the dual-embedding path, but there is no test covering the case where can_embed_images=True and annotations are actually partitioned into text_only_annots + multimodal_annots. The multimodal annotations go through _create_embedding_for_annotation() individually — this path is untested at the task level.

At minimum, a test where one annotation has ContentModality.IMAGE in its content_modalities and the embedder is multimodal (is_multimodal=True, supports_images=True) would cover this branch.
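The partition logic under test can be sketched in isolation (illustrative stand-ins for the annotation objects and embedder capability flag; the real code operates on Django models):

```python
TEXT, IMAGE = "TEXT", "IMAGE"


def partition(annotations, can_embed_images: bool):
    # Mirrors the split described in the PR: annotations carrying an image
    # modality go to the per-annotation multimodal path only when the
    # embedder supports images; everything else is batch-embedded as text.
    text_only, multimodal = [], []
    for annot in annotations:
        modalities = annot.get("content_modalities") or [TEXT]
        if can_embed_images and IMAGE in modalities:
            multimodal.append(annot)
        else:
            text_only.append(annot)
    return text_only, multimodal


annots = [
    {"id": 1, "content_modalities": [TEXT]},
    {"id": 2, "content_modalities": [TEXT, IMAGE]},
    {"id": 3, "content_modalities": None},  # exercises the None fallback
]
text_only, multimodal = partition(annots, can_embed_images=True)
```

A task-level test would assert that annotation 2 reaches _create_embedding_for_annotation() while 1 and 3 go through the batch helper.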

3. Import-time constant validation is unusual for Django (minor)

# opencontractserver/constants/document_processing.py
if EMBEDDING_API_BATCH_SIZE > MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE:
    raise ValueError(...)

The comment explains the intent clearly, but raising at module import time is atypical and can produce confusing tracebacks during Django startup, management commands, migrations, and test collection. Django's system check framework (AppConfig.ready() + @register()) is the conventional place for this kind of deployment invariant check and produces cleaner, actionable output. Not a blocker, but worth considering.


Observations / Non-blocking Feedback

Error asymmetry is well-documented but subtle. The decision to have _embed_text_impl return None on 5xx while embed_texts_batch raises EmbeddingServerError is reasonable (single-annotation tasks preserve remaining work; batch failures affect the whole chunk so retry is better). The docstring on _embed_text_impl in multimodal_microservice.py explaining this asymmetry is a good addition — consider adding an equivalent note to sent_transformer_microservice.py's _embed_text_impl since the same asymmetry applies there.

FailingBatchEmbedder uses the built-in ConnectionError, not requests.exceptions.ConnectionError. The comment explains this is intentional, so the exception lands in the catch-all except Exception handler rather than the retry-triggering transient-error handler. The naming is still confusing, since requests.exceptions.ConnectionError exists; raising a clearly non-network exception such as RuntimeError or ValueError, with a comment, would make the intent less ambiguous for future test maintainers.

_make_mock_annotation's content_modalities default is always truthy. The production code uses annot.content_modalities or [ContentModality.TEXT.value] to handle None, but the test helper always returns a non-empty list, so the None fallback is never exercised in these tests. Minor gap.

embed_texts_batch in BaseEmbedder catches all exceptions from embed_text and returns None per item. This means if a subclass's embed_text raises requests.exceptions.Timeout, the base fallback will silently swallow it instead of propagating for retry. The base fallback is intended as a compatibility shim for non-HTTP embedders, so this may be acceptable — but it's worth noting that the retry semantics only work correctly when a subclass overrides embed_texts_batch directly (as the microservice embedders do).
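The swallowing behavior described above can be sketched as follows (a hypothetical simplification of the BaseEmbedder fallback, with embed_text passed in as a callable for illustration):

```python
from typing import Callable, Optional


def embed_texts_batch_fallback(
    embed_text: Callable[[str], Optional[list]],
    texts: list,
) -> list:
    """Sequential compatibility shim: any per-item exception becomes None,
    so even a transient Timeout from embed_text is swallowed rather than
    propagated for Celery retry."""
    results = []
    for text in texts:
        try:
            results.append(embed_text(text))
        except Exception:
            results.append(None)
    return results
```

This is fine for non-HTTP embedders, but it illustrates why the retry semantics only hold when subclasses override embed_texts_batch and let transient errors propagate.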


What's Good

  • The partition logic (text-only vs multimodal) is clean and the content_modalities check is explicit.
  • The _batch_embed_text_annotations helper function has a well-defined contract and handles all the interesting failure modes: None return, count mismatch, per-item None, store failure, and transient HTTP errors.
  • The EMBEDDING_API_BATCH_SIZE / MICROSERVICE_EMBEDDER_MAX_BATCH_SIZE separation and the explanation in the constants file is a nice touch.
  • _get_service_config() extraction is a clean DRY improvement.
  • The shape[1] != 1 guard on 3D array squeeze is a good defensive addition.
  • Test coverage is thorough for the happy path and most failure modes.

