
Fix client processing time calculation in ForceMerge Runner #470

Merged — 1 commit merged into opensearch-project:main on Mar 7, 2024

Conversation

saimedhi
Contributor

Description

Fix the client processing time calculation in the ForceMerge runner. When polling is used, the initial force merge call can time out without raising an exception while the merge completes in the background; in that scenario, request_context_holder.on_client_request_end() was not being recorded correctly. For further context, please refer to the linked discussion.
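The scenario described above can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual runner code: `TimingHolder` stands in for OSB's `request_context_holder`, and the fake client simulates an initial call that returns after a server-side timeout while the merge keeps running, completing after two polls.

```python
import asyncio
import time

class TimingHolder:
    """Stand-in for request_context_holder (illustrative only)."""
    def on_client_request_start(self):
        self.start = time.monotonic()
    def on_client_request_end(self):
        self.end = time.monotonic()

class FakeTasks:
    def __init__(self):
        self.polls = 0
    async def list(self, params=None):
        self.polls += 1
        # Report an in-flight force-merge task for the first two polls, then none.
        return {"nodes": {} if self.polls > 2 else {"n1": {}}}

class FakeClient:
    def __init__(self):
        self.tasks = FakeTasks()
    async def forcemerge(self):
        # Pretend the call returned after a server-side timeout while the
        # merge keeps running in the background (the scenario in this PR).
        return None

async def force_merge_with_polling(client, holder, poll_period=0.01):
    holder.on_client_request_start()
    await client.forcemerge()
    while True:
        tasks = await client.tasks.list(
            params={"actions": "indices:admin/forcemerge"})
        if len(tasks["nodes"]) == 0:
            # Empty nodes response: no force-merge tasks remain, so the
            # client processing-time clock stops here (the placement this
            # PR adds).
            holder.on_client_request_end()
            break
        await asyncio.sleep(poll_period)
    return client.tasks.polls

polls = asyncio.run(force_merge_with_polling(FakeClient(), TimingHolder()))
print(polls)  # → 3
```

The poll period and fake task responses are assumptions; only the control flow mirrors the change under review.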

Issues Resolved

#450 (comment)

Testing

[Describe how this change was tested]


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: saimedhi <saimedhi@amazon.com>
@saimedhi
Contributor Author

@VijayanB Please take a look.

Member

@VijayanB VijayanB left a comment

LGTM

@saimedhi
Contributor Author

@IanHoang, please take a look. Thank you.

@IanHoang
Collaborator

@VijayanB were you able to verify that an exception is raised when force merge encounters a timeout with this PR's changes?

@VijayanB
Member

@IanHoang Sorry, I missed your comment. Yes, it is reproducible in vector search for large segments.

@@ -698,6 +698,7 @@ async def __call__(self, opensearch, params):
                 tasks = await opensearch.tasks.list(params={"actions": "indices:admin/forcemerge"})
                 if len(tasks["nodes"]) == 0:
                     # empty nodes response indicates no tasks
+                    request_context_holder.on_client_request_end()
Copy link
Collaborator

@gkamat gkamat Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be called just prior to the preceding pass statement, else there will be a mismatch between service time and the client processing time when a timeout exception is caught.
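A hedged reading of the suggestion above, as a self-contained sketch: the runner (in code not shown in the hunk) catches a timeout on the initial call and falls through, roughly like this, and the suggestion is to stop the clock just before that `pass`. The exception type, helper names, and structure are assumptions, not the actual source.

```python
import asyncio
import time

class ConnectionTimeout(Exception):
    """Stand-in for the client's timeout exception (illustrative)."""

class Holder:
    """Stand-in for request_context_holder (illustrative)."""
    def on_client_request_start(self):
        self.start = time.monotonic()
    def on_client_request_end(self):
        self.end = time.monotonic()

async def fake_forcemerge():
    await asyncio.sleep(0.01)
    raise ConnectionTimeout  # server keeps merging; the client gives up

async def run(holder):
    holder.on_client_request_start()
    try:
        await fake_forcemerge()
        holder.on_client_request_end()
    except ConnectionTimeout:
        # Suggested placement: stop the clock here, just prior to `pass`,
        # so the caught timeout does not inflate client processing time
        # relative to the recorded service time.
        holder.on_client_request_end()
        pass
    # (the polling loop for task completion would follow here)

h = Holder()
asyncio.run(run(h))
print(h.end >= h.start)  # → True
```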

Contributor Author

@saimedhi saimedhi Mar 7, 2024

Hey @gkamat, in my previous PR I updated the code to ensure that request start is always recorded after client request start, and request end always before client request end. I'm sorry if I didn't address your concern properly. I believe that for accurate metrics, request_context_holder.on_client_request_end() should be called only after the task has completed. What do you think, @IanHoang and @VijayanB? I'd love to hear your thoughts.

Member

I believe the force merge process will continue in the background if the client connection is lost before completion. IMO, if we move the request end above `pass`, we give the false impression that the force merge has completed even though it is still being processed, and that most of the time was spent on the client side instead of the server. In short, I believe we should keep it at line 701.

Collaborator

@gkamat gkamat Mar 7, 2024

@VijayanB, the time reported for the force merge is going to be inaccurate in any case, since the resolution will be the poll period, which is 10 sec. It is just a question of the degree of error, which should be weighed against the following.

The issue is how the client processing time is computed for multi-part operations. In this case there is a truncated operation caused by the exception. That should be handled on its own and the corresponding service time and associated client overhead computed.

Since there is a subsequent loop, both of these variables should be accumulators. Each time through the loop, the poll period should be added to the service time. Each call to the list operation will be a contributor as well; furthermore, the client overhead of that call is another metric that needs to be updated. Finally, the client response overhead of the truncated force merge call should be carried forward and passed back to the caller after updating the request-context data structure appropriately.

The root cause of this is how multi-part requests are handled, as was indicated to @saimedhi earlier. Otherwise the concept of what the client overhead is for a particular call becomes ambiguous.
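The accumulator approach described above might be sketched like this. Everything here is illustrative: the class, method names, and simulated delays are assumptions standing in for the request-context data structure, not code from the project.

```python
import time

class MultiPartTimer:
    """Accumulates service time and client overhead across the truncated
    force-merge call and each subsequent poll (illustrative only)."""
    def __init__(self):
        self.service_time = 0.0
        self.client_overhead = 0.0
        self._mark = None

    def start(self):
        self._mark = time.monotonic()

    def end_service(self):
        # Time spent waiting on the server is added to the service time.
        self.service_time += time.monotonic() - self._mark
        self._mark = time.monotonic()

    def end_client(self):
        # Time from the server reply until the caller resumes counts as
        # client overhead.
        self.client_overhead += time.monotonic() - self._mark

timer = MultiPartTimer()
for _ in range(3):          # the truncated call plus two polls
    timer.start()
    time.sleep(0.01)        # stand-in for awaiting the server
    timer.end_service()
    timer.end_client()      # client-side bookkeeping is ~instant here
print(timer.service_time >= 0.03)  # → True
```

Each sub-request contributes to both accumulators, so the totals reported to the caller reflect the whole multi-part operation rather than only its final leg.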

Collaborator

Since this is holding the release back and the submitted fix will resolve the immediate problem, we can merge this in for now and address the larger issue subsequently.

Member

@VijayanB VijayanB Mar 7, 2024

@gkamat Polling is not part of force merge; it is a mechanism to check whether the force merge has completed. I don't think this is a multi-part operation; it is a single async operation. I agree we are not going to be accurate, but I would be cautious, and I am more comfortable reporting slightly higher than the actual value than something less than the real force merge duration.

Collaborator

@gkamat gkamat Mar 7, 2024

@VijayanB yes, polling is not part of force merge, but the issue being referred to is not so much the inaccuracy in the duration reported as the accuracy of the client overhead computation. This is somewhat similar to how scroll queries are handled, where the same issues are at play and need to be fixed as well.

@gkamat gkamat merged commit 042fd5e into opensearch-project:main Mar 7, 2024
8 checks passed