Fix client processing time calculation in ForceMerge Runner #470
Conversation
Signed-off-by: saimedhi <saimedhi@amazon.com>
@VijayanB Please take a look.
LGTM
@IanHoang, please take a look. Thank you.
@VijayanB were you able to verify that an exception is raised when force merge encounters a timeout with this PR's changes?
@IanHoang Sorry, I missed your comment. Yes, it is reproducible in vector search for large segments.
@@ -698,6 +698,7 @@ async def __call__(self, opensearch, params):
        tasks = await opensearch.tasks.list(params={"actions": "indices:admin/forcemerge"})
        if len(tasks["nodes"]) == 0:
            # empty nodes response indicates no tasks
+           request_context_holder.on_client_request_end()
This should be called just prior to the preceding `pass` statement; otherwise there will be a mismatch between the service time and the client processing time when a timeout exception is caught.
Hey @gkamat, I've updated some code in my previous PR to ensure that request start is always recorded after the client request start, and that request end is always recorded before the client request end. I'm sorry if I didn't address your concern properly. I believe that to get accurate metrics, request_context_holder.on_client_request_end() should be called after the task is completed. What do you think, @IanHoang and @VijayanB? I would love to hear your thoughts.
I believe the force merge process will continue in the background if the client connection is lost before completion. IMO, if we move the request end above `pass`, we give the false impression that the force merge has completed even though it is still being processed, and that most of the time was spent on the client side rather than the server. In short, I believe we should keep it at line 701.
@VijayanB, the reported force merge time is going to be inaccurate in any case, since the resolution will be the poll period, which is 10 seconds. It is just a matter of the degree of error, which should be weighed against the following.
The issue is how the client processing time is computed for multi-part operations. In this case there is a truncated operation caused by the exception; that should be handled on its own, with the corresponding service time and associated client overhead computed.
Since there is a subsequent loop, both of these variables should be accumulators. Each time through the loop, the poll period should be added to the service time. Each call to the list operation will be a contributor as well; furthermore, the client overhead of that call is another metric that needs to be updated. Finally, the client response overhead of the truncated force merge call should be carried forward and passed back to the caller after updating the request-context data structure appropriately.
The root cause of this is how multi-part requests are handled, as was indicated to @saimedhi earlier. Otherwise, the concept of what the client overhead is for a particular call becomes ambiguous.
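The accumulator approach described above can be sketched numerically. The helper and its inputs are hypothetical; the real runner would feed in measured per-call values rather than the literals used here.

```python
def accumulate_poll_metrics(list_calls, poll_period):
    """Tally service time and client overhead across the polling loop.

    list_calls: one (service_time, client_overhead) pair per tasks.list
    call made while waiting for the force merge to finish. This is a
    hypothetical helper, not part of the benchmark code.
    """
    service_time = 0.0
    client_overhead = 0.0
    for list_service_time, list_overhead in list_calls:
        service_time += poll_period        # each iteration waits one poll period
        service_time += list_service_time  # the tasks.list call contributes too
        client_overhead += list_overhead   # and its client overhead accumulates
    return service_time, client_overhead
```

For example, two polling iterations with a 10-second poll period, 0.1 s of list-call service time each, and 0.02 s / 0.03 s of list-call client overhead would accumulate to roughly 20.2 s of service time and 0.05 s of client overhead.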
Since this is holding the release back and the submitted fix will resolve the immediate problem, we can merge this in for now and address the larger issue subsequently.
@gkamat Polling is not part of force merge; it is a mechanism to check whether the force merge has completed. I don't think this is a multi-part operation; it is a single async operation. I agree we are not going to be accurate, but I would be cautious, and I am more comfortable reporting a value slightly higher than the actual one than a value lower than the real force merge duration.
@VijayanB yes, polling is not part of force merge, but the issue being referred to is not so much the inaccuracy in the duration reported as the accuracy of the client overhead computation. This is somewhat similar to, say, how scroll queries are handled, where the same issues are at play and need to be fixed as well.
Description
Fix the client processing time calculation in the ForceMerge Runner. This rectifies an error encountered when polling is used: the force merge times out during the initial call without raising an exception and subsequently completes successfully, but previously request_context_holder.on_client_request_end() was not called appropriately in that scenario. For further context, please refer to the discussion here.
Issues Resolved
#450 (comment)
Testing
[Describe how this change was tested]
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.