Refine indexing pressure accounting in semantic bulk inference filter #129320
base: main
Conversation
Previously, we conservatively estimated that inference results would double the document _source size. This could lead to unnecessary circuit breaker exceptions, even when the node had sufficient memory to handle the operation. This PR replaces the rough estimation with the actual size of the _source after the update. Since inference calls use a batch size of 1 MB, we rely on the real circuit breaker to ensure that results fit in memory before applying indexing pressure accounting. Additionally, this PR introduces a new counter in InferenceStats to track the number of rejections caused by indexing pressure from inference results.
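The snippet below is a minimal, self-contained sketch of the accounting described above; it is not the actual ShardBulkInferenceActionFilter code. The IndexingPressureHandle type and the method names are hypothetical stand-ins, with only the increment(operations, bytes) shape taken from the diff discussed later in this thread.

// Minimal sketch of the accounting change described above (hypothetical types,
// not the real filter or IndexingPressure API).
final class InferencePressureSketch {

    interface IndexingPressureHandle {
        // operations and bytes added to the coordinating indexing pressure
        void increment(int operations, long bytes);
    }

    static void accountForInferenceResults(IndexingPressureHandle pressure,
                                           long originalSourceBytes,
                                           long modifiedSourceBytes) {
        // Previous estimate: assume inference results double the _source size,
        // i.e. reserve a full extra copy of the original source up front.
        // pressure.increment(0, originalSourceBytes);

        // Refined accounting: reserve only the bytes the _source actually grew by
        // once the inference results were inserted.
        pressure.increment(0, modifiedSourceBytes - originalSourceBytes);
    }
}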
Pinging @elastic/search-eng (Team:SearchOrg)
Pinging @elastic/search-relevance (Team:Search - Relevance)
Hi @jimczi, I've created a changelog YAML for you.
Before merging this PR, I would like to examine the assumption that doubling the document source size for indexing pressure purposes is incorrect. This was done during development of #125517 because the original source is pooled and the memory for it is released only after bulk request handling is complete. We accounted for this by adding indexing memory pressure for the additional copy of source generated to insert embeddings into. What new information do we have now that allows us to change this approach?
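To make the lifecycle concrete, here is a small, purely illustrative sketch of the peak-memory reasoning behind the doubling estimate; the class and parameter names are hypothetical and do not correspond to real code.

// Hypothetical illustration of the peak-memory reasoning above; not real code.
final class PooledSourcePeakMemorySketch {

    static long peakBytesDuringBulk(long pooledOriginalSourceBytes, long sourceWithEmbeddingsBytes) {
        // The original source buffer is pooled and is only released once the whole
        // bulk request has been handled. Replacing the index request's source with
        // the copy that contains the embeddings does NOT free the pooled buffer,
        // so both copies are live at the same time. Hence #125517 reserved roughly
        // one extra full copy of the source ("doubling") as indexing pressure.
        return pooledOriginalSourceBytes + sourceWithEmbeddingsBytes;
    }
}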
try {
-    coordinatingIndexingPressure.increment(0, modifiedSourceSize - originalSource.ramBytesUsed());
+    coordinatingIndexingPressure.increment(1, modifiedSourceSize - originalSource.ramBytesUsed());
Using this approach, we are accounting only for the size of the embeddings (+ field data structures around them) in source. We are not accounting for the extra copy of source that we generate to add the embeddings to. Per our discussions when working on #125517, the original source (i.e. the source without embeddings) is pooled, so overwriting it in the index request does not free that memory.
Thanks for the reminder. I confused this with our earlier discussion about accounting for the input text sent to inference and mistakenly thought this was part of our solution. I'll revert this change and just add the counter and the new message so we can better track the issue.
Maybe we should add a comment about it; it's certainly a non-obvious detail.
most def, on it.
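For reference, the comment could read along these lines (hypothetical wording, placed next to the indexing pressure accounting in the filter):

// NOTE: the original (pre-inference) _source is pooled by the bulk action and is
// only released once the whole bulk request has been handled. Overwriting the
// index request's source with the embedding-enriched copy therefore does NOT
// free that memory, so indexing pressure accounting must also cover the extra,
// still-live copy rather than just the added inference fields.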
I pushed c85718d to copy the source in place when possible. This way we don't reserve the extra bytes during the inference calls, since we don't use them yet and the request might take some time to finish.
WDYT?
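As a rough illustration of the in-place idea (a hypothetical helper, not the change in c85718d): if the original source buffer has spare capacity, the enriched source can be written into the same array, so no second full copy has to be reserved while the inference calls are in flight.

// Hypothetical sketch of reusing the original source array when it has room for
// the inference fields; an illustration of the idea, not the code in c85718d.
final class InPlaceSourceSketch {

    /** Returns the buffer holding the enriched source, reusing {@code original} when possible. */
    static byte[] writeEnrichedSource(byte[] original, int originalLength, byte[] inferenceFields) {
        int required = originalLength + inferenceFields.length;
        if (required <= original.length) {
            // Enough spare capacity: append the inference fields in place, so only
            // the extra bytes need to be accounted for as indexing pressure.
            System.arraycopy(inferenceFields, 0, original, originalLength, inferenceFields.length);
            return original;
        }
        // Otherwise fall back to a fresh copy, which does reserve a second buffer.
        byte[] enriched = new byte[required];
        System.arraycopy(original, 0, enriched, 0, originalLength);
        System.arraycopy(inferenceFields, 0, enriched, originalLength, inferenceFields.length);
        return enriched;
    }
}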
In #125517, we estimated that inference results would double the document _source size, since the original source is pooled by the bulk action and is not freed when overwritten.
This PR reduces the memory needed to perform the update by reusing the original source array when possible. This way we only need to account for the extra inference fields, which reduces the overall indexing pressure.
Additionally, this PR introduces a new counter in InferenceStats to track the number of rejections caused by indexing pressure from inference results.
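A minimal sketch of what such a rejection counter could look like (field and method names are hypothetical and may not match the real InferenceStats class):

// Hypothetical sketch of an indexing-pressure rejection counter; names are
// illustrative and may differ from the real InferenceStats implementation.
import java.util.concurrent.atomic.LongAdder;

final class InferenceStatsSketch {

    private final LongAdder indexingPressureRejections = new LongAdder();

    // Called when a bulk item is rejected because the inference results exceed
    // the indexing-pressure budget.
    void incrementIndexingPressureRejections() {
        indexingPressureRejections.increment();
    }

    long indexingPressureRejections() {
        return indexingPressureRejections.sum();
    }
}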