USE 337 - application level batching for embeddings #37
Purpose and background context
The timdex-embeddings app needs to support local, Fargate ECS (CPU), and EC2 (GPU) compute environments. Each has its own knobs and dials to turn when it comes to performance and resources. One area where we will run into issues, if we're not careful, is memory consumption.
The model `opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte`, currently our only supported model, has a "batch size" configuration. This controls how many records are processed at once to create embeddings, which can affect performance, but it does not help much with memory management. This is partly because a fully materialized list of records must be passed to the model, which effectively pulls them all into memory.

Take the example of a job that wants to create 10k embeddings. We have an iterator of those input records that TDA will pull in a memory-safe fashion, but we cannot materialize them all into memory. Even if we could, passing them all to the model would likely blow up memory, even though the "batch size" is 2-5 records.
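To make the failure mode concrete, here is a toy illustration (the generator below simply stands in for the memory-safe iterator TDA provides; it is not the real data or model):

```python
# Toy illustration only: the generator stands in for TDA's memory-safe iterator.
records = (f"record-{i}" for i in range(10_000))  # lazy, memory-safe

# Materializing the iterator pulls every record into memory at once,
# regardless of how small the model's internal batch size is.
materialized = list(records)
print(f"{len(materialized)} records held in memory before any inference batching")
```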
What we need is a higher-level batching layer, at our application layer, that manages the iterator of input records, passes memory-safe batches to the model, writes the results, and is then fully done with those records.
This PR introduces batching at the application layer, ensuring that even large jobs are completed successfully.
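As a rough sketch of what that layer looks like, the helper below chunks any iterator into small lists without ever materializing the whole input; the `batched` name and the record generator are illustrative, not the exact helper in this codebase:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of at most `batch_size` items, pulling lazily from `items`."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Only one application-level batch is ever materialized at a time.
for batch in batched((f"record-{i}" for i in range(10)), batch_size=3):
    print(batch)
```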
The method `create_embeddings()` requires an iterator of `EmbeddingInput` objects to embed. By sending batches of these to the ML model, then writing out the results, we are completely done with them before moving on to the next batch, thereby managing memory.
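A simplified sketch of that flow follows; `EmbeddingInput` matches the input type named above, but `fake_encode`, `fake_write`, and `create_embeddings_batched` are hypothetical stand-ins for the real model call, result writing, and `create_embeddings()` loop:

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class EmbeddingInput:
    record_id: str
    text: str

def fake_encode(batch: list[EmbeddingInput]) -> list[dict]:
    """Stand-in for the real model call; only one batch is ever passed in."""
    return [{"record_id": item.record_id, "embedding": [0.0, 0.0]} for item in batch]

def fake_write(results: list[dict]) -> None:
    """Stand-in for persisting results (e.g. to a file or index)."""
    print(f"wrote {len(results)} embeddings")

def create_embeddings_batched(inputs: Iterable[EmbeddingInput], batch_size: int) -> None:
    batch: list[EmbeddingInput] = []
    for item in inputs:
        batch.append(item)
        if len(batch) >= batch_size:
            fake_write(fake_encode(batch))
            batch = []  # fully done with these records before pulling more
    if batch:
        fake_write(fake_encode(batch))

create_embeddings_batched(
    (EmbeddingInput(record_id=str(i), text=f"text {i}") for i in range(10)),
    batch_size=3,
)
```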
How can a reviewer manually see the effects of these changes?
1. Run `make install` for updated dependencies.
2. Set Dev1 AWS credentials in the terminal.
3. If not done already, download the model:
4. Create embeddings with a small batch size:
Some logging analysis:
`opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte` does not benefit from large batches. Even the internal model batching of 4 may get bumped down to 1-2 at some point, but the application batching of 3 ensures that memory consumption stays low.

Local inference is slow enough that it's tricky to test larger runs and confirm low memory consumption, but we can confirm that GPU runs succeed at 1k, 5k, even 10k records, which formerly exceeded memory.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: memory management is safe locally, on Fargate ECS (CPU), and on EC2 (GPU)
What are the relevant tickets?
Code review