USE 337 - application level batching for embeddings #37
Purpose and background context
The timdex-embeddings app needs to support local, Fargate ECS (CPU), and EC2 (GPU) compute environments. Each has its own knobs and dials to turn when it comes to performance and resources. One area where we will run into issues, if we're not careful, is memory consumption.
The model `opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte`, currently our only supported model, has a "batch size" configuration. This controls how many records are processed at once to create embeddings, which can affect performance, but it does not help much with memory management. This is partly because a fully materialized list of records must be passed to the model, which effectively pulls them all into memory.

Take the example of a job that wants to create 10k embeddings. We have an iterator of those input records that TDA will pull in a memory-safe fashion, but we cannot materialize them all into memory. Even if we could, passing them all to the model would likely blow up memory, even though the "batch size" is 2-5 records.
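To make the failure mode concrete, here is a toy illustration (the generator below simply stands in for the memory-safe iterator TDA provides; it is not the real data or model):

```python
# Toy illustration only: the generator stands in for TDA's memory-safe iterator.
records = (f"record-{i}" for i in range(10_000))  # lazy, memory-safe

# Materializing the iterator pulls every record into memory at once,
# regardless of how small the model's internal batch size is.
materialized = list(records)
print(f"{len(materialized)} records held in memory before any inference batching")
```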
What we need is a higher-level batching layer, at our application layer, that manages the iterator of input records, passes memory-safe batches to the model, writes the results, and is then fully done with those records.
This PR introduces batching at the application layer, ensuring that even large jobs are completed successfully.
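As a rough sketch of what that layer looks like, the helper below chunks any iterator into small lists without ever materializing the whole input; the `batched` name and the record generator are illustrative, not the exact helper in this codebase:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of at most `batch_size` items, pulling lazily from `items`."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Only one application-level batch is ever materialized at a time.
for batch in batched((f"record-{i}" for i in range(10)), batch_size=3):
    print(batch)
```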
The method `create_embeddings()` requires an iterator of `EmbeddingInput` objects to embed. By sending batches of these to the ML model, then writing out the results, we are completely done with them before moving on to the next batch, thereby managing memory.
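A simplified sketch of that flow follows; `EmbeddingInput` matches the input type named above, but `fake_encode`, `fake_write`, and `create_embeddings_batched` are hypothetical stand-ins for the real model call, result writing, and `create_embeddings()` loop:

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class EmbeddingInput:
    record_id: str
    text: str

def fake_encode(batch: list[EmbeddingInput]) -> list[dict]:
    """Stand-in for the real model call; only one batch is ever passed in."""
    return [{"record_id": item.record_id, "embedding": [0.0, 0.0]} for item in batch]

def fake_write(results: list[dict]) -> None:
    """Stand-in for persisting results (e.g. to a file or index)."""
    print(f"wrote {len(results)} embeddings")

def create_embeddings_batched(inputs: Iterable[EmbeddingInput], batch_size: int) -> None:
    batch: list[EmbeddingInput] = []
    for item in inputs:
        batch.append(item)
        if len(batch) >= batch_size:
            fake_write(fake_encode(batch))
            batch = []  # fully done with these records before pulling more
    if batch:
        fake_write(fake_encode(batch))

create_embeddings_batched(
    (EmbeddingInput(record_id=str(i), text=f"text {i}") for i in range(10)),
    batch_size=3,
)
```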
How can a reviewer manually see the effects of these changes?
1. Run `make install` for updated dependencies.
2. Set Dev1 AWS credentials in the terminal.
3. If not done already, download the model:
4. Create embeddings with a small batch size:
Some logging analysis:
`opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte` does not benefit from large batches. Even the internal model batching of 4 may get bumped down to 1-2 at some point, but the application batching of 3 ensures that memory consumption stays low.

Local inference is slow enough that it's tricky to test larger runs and confirm low memory consumption, but we can confirm that GPU runs succeed at 1k, 5k, even 10k records, which formerly exceeded memory.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: memory management is safe locally, on Fargate ECS (CPU), and on EC2 (GPU)
What are the relevant tickets?
Code review