@ghukill ghukill commented Jan 20, 2026

Purpose and background context

The timdex-embeddings app needs to support local, Fargate ECS (cpu), and EC2 (gpu) compute environments. Each has its own knobs and dials to turn when it comes to performance and resources. One area where we will run into trouble, if we're not careful, is memory consumption.

The model opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte, currently our only supported model, has a "batch size" configuration. This configures how many records are processed at once to create embeddings, which can affect performance but does little for memory management. This is partially because a fully materialized list of records must be passed to the model, which effectively pulls them all into memory.

Take the example of a job that wants to create 10k embeddings. We have an iterator of those input records that TDA will pull in a memory-safe fashion, but we cannot materialize them all into memory. Even if we could, passing them all to the model for embedding would likely blow up memory, even though the model's "batch size" is 2-5 records.

What we need is a higher-level batching layer, at our application layer, that manages the iterator of input records, passes memory-safe batches to the model, writes the results, and is then fully done with those records.

This PR introduces batching at the application layer, ensuring that even large jobs are completed successfully.

The method create_embeddings() requires an iterator of EmbeddingInput objects to embed. By sending batches of these to the ML model and then writing out the results, we are completely done with each batch before moving on to the next, thereby keeping memory bounded.
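
For illustration, here is a minimal, self-contained sketch of that pattern. The encode_batch and write_batch functions are hypothetical stand-ins for the real model call and TIMDEX dataset writer; only the iterator-plus-batched() flow mirrors the actual create_embeddings() approach.

from itertools import batched  # Python 3.12+


def encode_batch(texts: list[str]) -> list[dict]:
    """Stand-in for the ML model call; any internal model batching happens in here."""
    return [{"text": text, "embedding": [0.0]} for text in texts]


def write_batch(embeddings: list[dict]) -> None:
    """Stand-in for writing a completed batch of embeddings to the TIMDEX dataset."""
    print(f"wrote {len(embeddings)} embeddings")


def create_embeddings(embedding_inputs, batch_size: int = 3) -> None:
    for embedding_inputs_batch in batched(embedding_inputs, batch_size):
        # Only batch_size records are materialized at once; after the write, the
        # batch falls out of scope and memory stays bounded for the whole job.
        write_batch(encode_batch(list(embedding_inputs_batch)))


create_embeddings((f"record {i}" for i in range(10)), batch_size=3)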

How can a reviewer manually see the effects of these changes?

1- Run make install for updated dependencies.

2- Set Dev1 AWS credentials in terminal.

3- If not done already, download model:

embeddings --verbose download-model \
--model-uri opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte \
--model-path /tmp/te-model

4- Create embeddings with a small batch size:

embeddings --verbose create-embeddings \
--model-uri opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte \
--model-path /tmp/te-model \
--strategy full_record \
--dataset-location=s3://timdex-extract-dev-222053980223/dataset \
--run-id=60b76094-5412-4f4b-8bde-24dc3753005a \
--record-limit=10 \
--batch-size 3

Some logging analysis:

INFO embeddings.models.os_neural_sparse_doc_v3_gte.create_embeddings() line 199: Num workers: 1, application batch size: 3, model batch size: 4, device: cpu, pool: None
  • We are sending batches of 3 to the model, and the model is batching records in groups of 4
  • As noted in earlier discussion, testing has confirmed that our model opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte does not benefit from large batches. Even that internal model batching of 4 may get bumped down to 1-2 at some point. But the application batching of 3 ensures that memory consumption stays low (see the sketch after this log analysis).
DEBUG embeddings.models.os_neural_sparse_doc_v3_gte.create_embeddings() line 231: Embeddings batch 1: 3 records, elapsed: 2.36s, records/sec: 1.27
DEBUG embeddings.models.os_neural_sparse_doc_v3_gte.create_embeddings() line 231: Embeddings batch 2: 3 records, elapsed: 0.69s, records/sec: 4.35
DEBUG embeddings.models.os_neural_sparse_doc_v3_gte.create_embeddings() line 231: Embeddings batch 3: 3 records, elapsed: 1.45s, records/sec: 2.07
DEBUG timdex_dataset_api.dataset.read_batches_iter() line 498: read_batches_iter batch 1, yielded: 10 @ 0 records/second, total yielded: 10
DEBUG timdex_dataset_api.dataset.read_batches_iter() line 503: read_batches_iter() elapsed: 20.74s
DEBUG embeddings.models.os_neural_sparse_doc_v3_gte.create_embeddings() line 231: Embeddings batch 4: 1 records, elapsed: 0.20s, records/sec: 5.03
INFO embeddings.models.os_neural_sparse_doc_v3_gte.create_embeddings() line 248: Inference elapsed: 20.944523875019513s
DEBUG timdex_dataset_api.embeddings.create_embedding_batches() line 322: Yielding batch 1 for dataset writing.
INFO timdex_dataset_api.embeddings.log_write_statistics() line 337: Dataset write complete - elapsed: 22.6s, total files: 1, total rows: 10, total size: 82816
INFO embeddings.cli.create_embeddings() line 294: Embeddings written to TIMDEX dataset.
INFO embeddings.cli.create_embeddings() line 296: Embeddings creation complete.
  • 4 batches to complete 10 records (3, 3, 3, 1)
  • Indication that TDA is performing reads when needed, between some embedding batches
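
To make the "application batch size: 3, model batch size: 4" distinction above concrete, here is a small illustrative sketch; the encode function is a stand-in for the model, not its real API.

from itertools import batched

MODEL_BATCH_SIZE = 4  # the model's internal "batch size" knob


def encode(texts: list[str]) -> list[list[float]]:
    """Stand-in model call: processes its input in MODEL_BATCH_SIZE chunks."""
    embeddings: list[list[float]] = []
    for chunk in batched(texts, MODEL_BATCH_SIZE):
        embeddings.extend([0.0] for _ in chunk)  # fake "embeddings"
    return embeddings


# Application batching: each call hands the model only 3 records, so the model's
# internal batching of 4 collapses to a single pass per call, and peak memory is
# bounded by the application batch, not the whole job.
for app_batch in batched((f"record {i}" for i in range(10)), 3):
    encode(list(app_batch))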

Local inference is slow enough that it's tricky to test larger runs and confirm that memory consumption stays low, but I can confirm that GPU runs are successful at 1k, 5k, and even 10k records, which formerly exceeded available memory.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: memory management is safe locally, on Fargate ECS (cpu), and on EC2 (gpu)

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/USE-337

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Comment on lines +68 to +71
for embedding_inputs_batch in batched(embedding_inputs, batch_size):
    logger.debug(f"Processing batch of {len(embedding_inputs_batch)} inputs")
    for embedding_input in embedding_inputs_batch:
        yield self.create_embedding(embedding_input)
This update to the testing model fixture shows very simply and directly how batching is applied.

Why these changes are being introduced:

There are two levels at which "batching" may come into play:

1. Many ML models have internal batching.  You can provide 100 inputs in an
array, but the model might only create embeddings for 10 at a time.  However,
sometimes you still pay the full memory cost of the original 100.

2. Our `create_embeddings()` wrapper method can send batches to the ML
embedding model.  Because the input `embedding_inputs` is an iterator, this
keeps memory pressure low even for large numbers of records to embed.

We need to keep memory pressure low as we move into supporting local, Fargate,
and GPU contexts for embedding creation.

How this addresses that need:

We have introduced batching at the `create_embeddings()` method layer,
leveraging the records iterator that is used as input.  In doing so, we ensure
that the ML model only sees small(ish) batches to create embeddings for.
To reiterate, this is distinct from the ML model itself, which may have some
internal batching.  For example, we may send a batch of 100 to the model, but
it might still create embeddings via internal batching of 2-5 records.

Side effects of this change:
* Confirmed low memory usage locally, on Fargate ECS, and in GPU-backed EC2
contexts for runs up to 10k records.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-337
@ghukill ghukill force-pushed the USE-337-micro-batch-embedding-creation branch from b57c988 to 6bdfea4 on January 20, 2026 16:11
@ghukill ghukill requested a review from a team January 20, 2026 16:11
@ghukill ghukill marked this pull request as ready for review January 20, 2026 16:12

@ehanson8 ehanson8 left a comment

Works as advertised, code looks good to me, approved!

@ghukill ghukill merged commit 0d22b53 into main Jan 20, 2026
4 checks passed