
Fix client processing time calculation in ForceMerge Runner #470

Merged — 1 commit merged into opensearch-project:main on Mar 7, 2024

Conversation

saimedhi
Contributor

Description

Fix the client processing time calculation in the ForceMerge runner. When polling is used, the initial force merge call can time out without raising an exception while the merge completes in the background; in that scenario, request_context_holder.on_client_request_end() was not being recorded correctly. For further context, please refer to the linked discussion.
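The scenario described above can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual runner code: `TimingHolder` stands in for OSB's `request_context_holder`, and the fake client simulates an initial call that returns after a server-side timeout while the merge keeps running, completing after two polls.

```python
import asyncio
import time

class TimingHolder:
    """Stand-in for request_context_holder (illustrative only)."""
    def on_client_request_start(self):
        self.start = time.monotonic()
    def on_client_request_end(self):
        self.end = time.monotonic()

class FakeTasks:
    def __init__(self):
        self.polls = 0
    async def list(self, params=None):
        self.polls += 1
        # Report an in-flight force-merge task for the first two polls, then none.
        return {"nodes": {} if self.polls > 2 else {"n1": {}}}

class FakeClient:
    def __init__(self):
        self.tasks = FakeTasks()
    async def forcemerge(self):
        # Pretend the call returned after a server-side timeout while the
        # merge keeps running in the background (the scenario in this PR).
        return None

async def force_merge_with_polling(client, holder, poll_period=0.01):
    holder.on_client_request_start()
    await client.forcemerge()
    while True:
        tasks = await client.tasks.list(
            params={"actions": "indices:admin/forcemerge"})
        if len(tasks["nodes"]) == 0:
            # Empty nodes response: no force-merge tasks remain, so the
            # client processing-time clock stops here (the placement this
            # PR adds).
            holder.on_client_request_end()
            break
        await asyncio.sleep(poll_period)
    return client.tasks.polls

polls = asyncio.run(force_merge_with_polling(FakeClient(), TimingHolder()))
print(polls)  # → 3
```

The poll period and fake task responses are assumptions; only the control flow mirrors the change under review.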

Issues Resolved

#450 (comment)

Testing

[Describe how this change was tested]


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: saimedhi <saimedhi@amazon.com>
@saimedhi
Contributor Author

@VijayanB Please take a look.

Member

@VijayanB VijayanB left a comment

LGTM

@saimedhi
Contributor Author

@IanHoang, please take a look. Thank you.

@IanHoang
Collaborator

@VijayanB were you able to verify that an exception is raised when force merge encounters a timeout with this PR's changes?

@VijayanB
Member

@IanHoang Sorry, I missed your comment. Yes, it is reproducible in vector search for large segments.

@@ -698,6 +698,7 @@ async def __call__(self, opensearch, params):
                 tasks = await opensearch.tasks.list(params={"actions": "indices:admin/forcemerge"})
                 if len(tasks["nodes"]) == 0:
                     # empty nodes response indicates no tasks
+                    request_context_holder.on_client_request_end()
Copy link
Collaborator

@gkamat gkamat Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be called just prior to the preceding pass statement, else there will be a mismatch between service time and the client processing time when a timeout exception is caught.
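A hedged reading of the suggestion above, as a self-contained sketch: the runner (in code not shown in the hunk) catches a timeout on the initial call and falls through, roughly like this, and the suggestion is to stop the clock just before that `pass`. The exception type, helper names, and structure are assumptions, not the actual source.

```python
import asyncio
import time

class ConnectionTimeout(Exception):
    """Stand-in for the client's timeout exception (illustrative)."""

class Holder:
    """Stand-in for request_context_holder (illustrative)."""
    def on_client_request_start(self):
        self.start = time.monotonic()
    def on_client_request_end(self):
        self.end = time.monotonic()

async def fake_forcemerge():
    await asyncio.sleep(0.01)
    raise ConnectionTimeout  # server keeps merging; the client gives up

async def run(holder):
    holder.on_client_request_start()
    try:
        await fake_forcemerge()
        holder.on_client_request_end()
    except ConnectionTimeout:
        # Suggested placement: stop the clock here, just prior to `pass`,
        # so the caught timeout does not inflate client processing time
        # relative to the recorded service time.
        holder.on_client_request_end()
        pass
    # (the polling loop for task completion would follow here)

h = Holder()
asyncio.run(run(h))
print(h.end >= h.start)  # → True
```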

Contributor Author

@saimedhi saimedhi Mar 7, 2024

Hey @gkamat, in my previous PR I updated the code to ensure that request start is always recorded after client request start, and request end always before client request end. I'm sorry if I didn't address your concern properly. I believe that for accurate metrics, request_context_holder.on_client_request_end() should be called only after the task has completed. What do you think, @IanHoang and @VijayanB? I'd love to hear your thoughts.

Member

I believe the force merge process will continue in the background if the client connection is lost before completion. IMO, if we move the request end above `pass`, we give the false impression that the force merge has completed even though it is still being processed, and that most of the time was spent on the client side instead of the server. In short, I believe we should keep it at line 701.

Collaborator

@gkamat gkamat Mar 7, 2024

@VijayanB, the time reported for the force merge is going to be inaccurate in any case, since the resolution will be the poll period, which is 10 sec. It is just a question of the degree of error, which should be weighed against the following.

The issue is how the client processing time is computed for multi-part operations. In this case there is a truncated operation caused by the exception. That should be handled on its own and the corresponding service time and associated client overhead computed.

Since there is a subsequent loop, both of these variables should be accumulators. Each time through the loop, the poll period should be added to the service time. Each call to the list operation will be a contributor as well; furthermore, the client overhead of that call is another metric that needs to be updated. Finally, the client response overhead of the truncated force merge call should be carried forward and passed back to the caller after updating the request-context data structure appropriately.

The root cause of this is how multi-part requests are handled, as was indicated to @saimedhi earlier. Otherwise the concept of what the client overhead is for a particular call becomes ambiguous.
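The accumulator approach described above might be sketched like this. Everything here is illustrative: the class, method names, and simulated delays are assumptions standing in for the request-context data structure, not code from the project.

```python
import time

class MultiPartTimer:
    """Accumulates service time and client overhead across the truncated
    force-merge call and each subsequent poll (illustrative only)."""
    def __init__(self):
        self.service_time = 0.0
        self.client_overhead = 0.0
        self._mark = None

    def start(self):
        self._mark = time.monotonic()

    def end_service(self):
        # Time spent waiting on the server is added to the service time.
        self.service_time += time.monotonic() - self._mark
        self._mark = time.monotonic()

    def end_client(self):
        # Time from the server reply until the caller resumes counts as
        # client overhead.
        self.client_overhead += time.monotonic() - self._mark

timer = MultiPartTimer()
for _ in range(3):          # the truncated call plus two polls
    timer.start()
    time.sleep(0.01)        # stand-in for awaiting the server
    timer.end_service()
    timer.end_client()      # client-side bookkeeping is ~instant here
print(timer.service_time >= 0.03)  # → True
```

Each sub-request contributes to both accumulators, so the totals reported to the caller reflect the whole multi-part operation rather than only its final leg.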

Collaborator

Since this is holding the release back and the submitted fix will resolve the immediate problem, we can merge this in for now and address the larger issue subsequently.

Member

@VijayanB VijayanB Mar 7, 2024

@gkamat Polling is not part of force merge; it is a mechanism to check whether the force merge has completed. I don't think this is a multi-part operation; it is a single async operation. I agree we are not going to be accurate, but I would be cautious, and I am more comfortable reporting slightly higher than the actual value than something less than the real force merge duration.

Collaborator

@gkamat gkamat Mar 7, 2024

@VijayanB yes, polling is not part of force merge, but the issue being referred to is not so much the inaccuracy in the duration reported as the accuracy of the client overhead computation. This is somewhat similar to how scroll queries are handled, where the same issues are at play and need to be fixed as well.

@gkamat gkamat merged commit 042fd5e into opensearch-project:main Mar 7, 2024
8 checks passed