Adjust Test_wait_interrupted_user_apc test timeout to handle deviation due to lowres timers. #116066
Pull Request Overview
This PR adjusts the wait timeout test to better accommodate timing deviations observed on Windows lanes using low-res timers. Key changes include lowering the minimum expected wait time to 1500 ms, adding a local variable for elapsed milliseconds, and enhancing the log output in case of a timeout error.
/azp run runtime-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
Tagging subscribers to this area: @mangod9
63bea4a to 0d86cac
/azp run runtime-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-coreclr jitstress
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-coreclr r2r
Azure Pipelines successfully started running 1 pipeline(s).
I have run the following additional test suites: /azp run runtime-coreclr outerloop. The failures seem unrelated to this test, since it only runs on Windows and the failures are on non-Windows platforms. Are there any more test suites that need to run before we can merge this PR and re-enable the test, @mangod9, @jkotas?
This sounds like too generous a tolerance. The precision of the low-res timer on Windows is no worse than 20 ms, so the tolerance for waiting less should not be more than that. (The tolerance for waiting more can be large, to handle overloaded machines.) I am wondering whether there is a subtle bug in the wait implementation that causes the error to accumulate: runtime/src/coreclr/vm/threads.cpp Lines 3314 to 3355 in f1bff2a
if (tryNonblockingWaitFirst) and should we update the start time with the value that we have read after the wait instead?
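The bound on early wakeups discussed above can be illustrated with a toy model. This is a simulation sketch, not the CoreCLR code in threads.cpp: the names `lowres_ticks` and `simulate_wait`, the 16 ms tick, and the choice to capture the start timestamp once are all assumptions made for illustration. In this model, a wait that is interrupted every 100 ms and retried with a recalculated timeout wakes early by less than one tick of the coarse clock, which is consistent with a tolerance of roughly the timer precision.

```python
def lowres_ticks(true_ms: int, tick_ms: int) -> int:
    """Coarse clock: the true time rounded down to a multiple of tick_ms."""
    return (true_ms // tick_ms) * tick_ms


def simulate_wait(timeout_ms: int, apc_interval_ms: int, tick_ms: int, start_ms: int) -> int:
    """Return the true elapsed time of a wait that every APC interrupts and retries."""
    now = start_ms
    start_ticks = lowres_ticks(start_ms, tick_ms)  # captured once, never updated
    remaining = timeout_ms
    next_apc = start_ms + apc_interval_ms
    while True:
        if now + remaining <= next_apc:
            now += remaining              # the wait times out before the next APC
            break
        now = next_apc                    # an APC interrupts the wait
        next_apc += apc_interval_ms
        elapsed = lowres_ticks(now, tick_ms) - start_ticks
        if elapsed >= timeout_ms:
            break                         # the coarse clock says the timeout has passed
        remaining = timeout_ms - elapsed  # retry with the recalculated timeout
    return now - start_ms


# Largest early wakeup over every phase of the start time within one 16 ms tick.
worst_shortfall = max(2000 - simulate_wait(2000, 100, 16, s) for s in range(16))
print(worst_shortfall)
```

If a real implementation showed early wakeups much larger than one tick, that would point at error accumulating across retries rather than at timer resolution alone.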
I didn't look too deeply into the CoreCLR wait implementation, but if we don't expect too much deviation, we can tighten the test and potentially reduce the deviation in the CoreCLR wait implementation. The question is what tolerance is acceptable without introducing flakiness when the machines running the test are under high load. Also, the purpose of the test was not to measure the exact difference in wait time, but to make sure custom APCs don't prematurely break waits. From that perspective the current tolerance is enough: the test queues an APC every 100 ms, so if APCs incorrectly break the wait, we will notice it with the current tolerance as well.
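The argument that either threshold catches a prematurely broken wait can be checked with simple arithmetic. The sketch below assumes a hypothetical faulty wait that returns as soon as the first APC arrives; the function names are illustrative and not taken from the test. The 1980 ms value corresponds to a 20 ms tolerance on a 2000 ms wait.

```python
APC_INTERVAL_MS = 100  # the test queues an APC every 100 ms


def elapsed_if_apc_breaks_wait(apc_interval_ms: int) -> int:
    """Hypothetical faulty wait: it returns as soon as the first APC arrives."""
    return apc_interval_ms


def detects_early_break(min_expected_ms: int) -> bool:
    """True if a minimum-elapsed threshold would flag the faulty wait."""
    return elapsed_if_apc_breaks_wait(APC_INTERVAL_MS) < min_expected_ms


# Compare the relaxed threshold with a stricter one.
for min_expected_ms in (1500, 1980):
    print(min_expected_ms, detects_early_break(min_expected_ms))
```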
High load on the machine can make us wait significantly longer, but it should never make us wait significantly shorter. If we wait significantly shorter, it is a bug that should be fixed. I think we should:
OK, I will fix up the wait implementation in this PR as well, and accept any wait greater than 1980 ms (the wait is set to 2000 ms in the test). I will then re-run all the test suites to make sure they still pass. As pointed out, regardless of machine load we should never observe early wakeups, but waits might end up longer. That is fine, since this test is only interested in early wakeups.
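The asymmetric acceptance rule described here (a small tolerance for early wakeups, no upper bound for late ones) might look like the following minimal sketch. The function name `check_wait`, its parameters, the message format, and the inclusive boundary are assumptions for illustration; the actual test is not written this way.

```python
def check_wait(elapsed_ms: float, timeout_ms: float = 2000.0,
               early_tolerance_ms: float = 20.0):
    """Accept any wait that is not early by more than the tolerance; no upper bound."""
    min_expected_ms = timeout_ms - early_tolerance_ms
    ok = elapsed_ms >= min_expected_ms  # inclusive boundary is an assumption
    message = "" if ok else (
        f"wait returned after {elapsed_ms:.0f} ms, "
        f"expected at least {min_expected_ms:.0f} ms"
    )
    return ok, message


print(check_wait(2350.0))  # a late wakeup on a loaded machine is accepted
print(check_wait(1900.0))  # an early wakeup is rejected and the elapsed time is reported
```

Reporting the measured elapsed time in the failure message mirrors the logging this PR adds, so a failing run shows how early the wait returned.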
Looks like outer loop x64 and arm64 Windows lanes hit issues with one of the tests added in #116001.
The test validates that waits are not broken too early by queued APCs by comparing the time it spends waiting with the requested timeout. The test uses a high-res timer to measure, but it appears that CoreCLR uses low-res timers when calculating the wait timeout. The test probably needs to include some error margin to handle the timer resolution differences.
This PR adds logging of the amount of time waited in case of error and increases the accepted deviation to 500 ms, which should be enough to trigger multiple APCs, each triggering a retry of the internal wait with a recalculated timeout.
Fixes #116060