
Adjust Test_wait_interrupted_user_apc test timeout to handle deviation due to lowres timers. #116066

Open
wants to merge 3 commits into main

Conversation

lateralusX
Member

@lateralusX lateralusX commented May 28, 2025

Looks like outer loop x64 and arm64 Windows lanes hit issues with one of the tests added in #116001.

The test validates that waits are not broken too early by queued APCs, by measuring the time spent waiting compared to the requested timeout. The test uses a high-res timer for the measurement, but it appears that CoreCLR uses low-res timers when calculating the wait timeout, so the test probably needs to include some error margin to handle timer resolution differences.

The PR adds logging of the amount of time waited in case of error and increases the accepted deviation to 500 ms, which should still be enough to have multiple APCs trigger retries of the internal wait with a recalculated timeout.
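
A minimal Win32 sketch of the mismatch described above (illustrative only, not the actual managed test): the elapsed time is measured with the high-res performance counter while the runtime honors the timeout against a low-res tick counter, so a strict elapsed >= requested check can come up slightly short even when nothing is wrong.

// Sketch only; not the actual test code.
#include <windows.h>
#include <cstdio>

int main()
{
    const DWORD requestedMs = 2000;

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);                // high-res measurement

    // Alertable wait standing in for the managed WaitHandle wait; the
    // runtime recomputes the remaining timeout from low-res ticks.
    SleepEx(requestedMs, /*bAlertable*/ TRUE);

    QueryPerformanceCounter(&end);
    double elapsedMs = (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;

    // Without an error margin, this can report an "early" wakeup of up to
    // roughly one low-res tick (~15.6 ms) even though the wait was honored.
    printf("requested=%lu ms, elapsed=%.1f ms\n", requestedMs, elapsedMs);
    return 0;
}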

Fixes #116060

@github-actions github-actions bot added the needs-area-label label May 28, 2025
@lateralusX lateralusX marked this pull request as ready for review May 28, 2025 12:57
@Copilot Copilot AI review requested due to automatic review settings May 28, 2025 12:57
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adjusts the wait timeout test to better accommodate timing deviations observed on Windows lanes using low-res timers. Key changes include lowering the minimum expected wait time to 1500 ms, adding a local variable for elapsed milliseconds, and enhancing the log output in case of a timeout error.
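
As an illustration of the logging change summarized here (a sketch only; the actual test code and names differ), the idea is to capture the elapsed time in a local and include it in the failure output so an early wakeup can be diagnosed from the log:

// Sketch only; names are illustrative, not the actual test code.
#include <cstdio>

static void CheckWait(long long elapsedMilliseconds)
{
    const long long minimumExpectedMs = 1500;   // 2000 ms wait minus the 500 ms margin
    if (elapsedMilliseconds < minimumExpectedMs)
    {
        printf("Wait returned too early: waited %lld ms, expected at least %lld ms\n",
               elapsedMilliseconds, minimumExpectedMs);
    }
}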

@lateralusX
Member Author

lateralusX commented May 28, 2025

/azp run runtime-coreclr outerloop

@dotnet dotnet deleted a comment from azure-pipelines bot May 28, 2025
@dotnet dotnet deleted a comment from azure-pipelines bot May 28, 2025
@dotnet dotnet deleted a comment from azure-pipelines bot May 28, 2025

Azure Pipelines successfully started running 1 pipeline(s).

@teo-tsirpanis teo-tsirpanis added the area-System.Threading label and removed the needs-area-label label May 31, 2025
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@lateralusX lateralusX force-pushed the lateralusX/adjust-115178-timeout branch from 63bea4a to 0d86cac Compare June 3, 2025 16:54
@lateralusX
Member Author

/azp run runtime-coreclr outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@lateralusX
Member Author

/azp run runtime-coreclr jitstress


Azure Pipelines successfully started running 1 pipeline(s).

@lateralusX
Member Author

/azp run runtime-coreclr r2r


Azure Pipelines successfully started running 1 pipeline(s).

@lateralusX
Member Author

lateralusX commented Jun 9, 2025

Have run the following additional test suites:

/azp run runtime-coreclr outerloop
/azp run runtime-coreclr jitstress
/azp run runtime-coreclr r2r

Failures seem unrelated to this test since it only runs on Windows and the failures are on non-Windows platforms.

Are there any more test suites that would need to be executed before we can merge this PR and re-enable the test, @mangod9, @jkotas?

@jkotas
Member

jkotas commented Jun 11, 2025

increases the accepted deviation to 500 ms

This sounds like too generous a tolerance. The precision of the low-res timer on Windows is no worse than 20 ms, so the tolerance for waiting less should not be more than that. (The tolerance for waiting more can be large, to handle overloaded machines.)

I am wondering whether there is a subtle bug in the wait implementation that causes the error to accumulate:

retry:
    if (millis != INFINITE)
    {
        dwStart = minipal_lowres_ticks();
    }
    if (tryNonblockingWaitFirst)
    {
        // We have a final wait result from the nonblocking wait above
        tryNonblockingWaitFirst = false;
    }
    else
    {
        ret = DoAppropriateAptStateWait(countHandles, handles, waitAll, millis, mode);
    }
    if (ret == WAIT_IO_COMPLETION)
    {
        _ASSERTE (alertable);
        if (m_State & TS_Interrupted)
        {
            HandleThreadInterrupt();
        }
        // We could be woken by some spurious APC or an EE APC queued to
        // interrupt us. In the latter case the TS_Interrupted bit will be set
        // in the thread state bits. Otherwise we just go back to sleep again.
        if (millis != INFINITE)
        {
            dwEnd = minipal_lowres_ticks();
            if (dwEnd - dwStart >= millis)
            {
                ret = WAIT_TIMEOUT;
                goto WaitCompleted;
            }
            else
            {
                millis -= (DWORD)(dwEnd - dwStart);
            }
        }
        goto retry;
    }
Should the retry label be right in front of if (tryNonblockingWaitFirst), and should we update the start time with the value that we read after the wait instead?
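
A sketch of what that restructuring might look like, reusing the names from the snippet above; this is only an illustration of the question, not the actual change:

// Sketch only: start tick read once, end tick carried forward as the new
// start on each APC retry so tick-rounding error does not accumulate.
if (millis != INFINITE)
{
    dwStart = minipal_lowres_ticks();
}
retry:
    if (tryNonblockingWaitFirst)
    {
        // We have a final wait result from the nonblocking wait above
        tryNonblockingWaitFirst = false;
    }
    else
    {
        ret = DoAppropriateAptStateWait(countHandles, handles, waitAll, millis, mode);
    }
    if (ret == WAIT_IO_COMPLETION)
    {
        _ASSERTE (alertable);
        if (m_State & TS_Interrupted)
        {
            HandleThreadInterrupt();
        }
        if (millis != INFINITE)
        {
            dwEnd = minipal_lowres_ticks();
            if (dwEnd - dwStart >= millis)
            {
                ret = WAIT_TIMEOUT;
                goto WaitCompleted;
            }
            millis -= (DWORD)(dwEnd - dwStart);
            dwStart = dwEnd;   // reuse the value read after the wait
        }
        goto retry;
    }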

@lateralusX
Member Author

lateralusX commented Jun 16, 2025

I didn't look too deeply into the CoreCLR wait implementation, but if we don't expect too much deviation, then we can harden the test more and potentially reduce the deviation in the CoreCLR wait implementation. The question is what an acceptable tolerance would be without introducing flakiness in case there is high load on the machines running the test.

Also, the purpose of the test was not to measure the exact differences in wait time, but to make sure custom APCs don't prematurely break waits, and from that perspective the current tolerance is enough: the test queues an APC every 100 ms, so if APCs incorrectly break the wait, we will notice it with the current tolerance as well.
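
A rough Win32 sketch of the pattern described above (illustrative only; the real test is managed code): one thread performs a ~2000 ms alertable wait and retries with the remaining time on every APC wakeup, mirroring the runtime's internal loop, while another thread queues an APC at it every ~100 ms. The only property being checked is that the wait is not cut short.

// Illustrative sketch only; the real test is a managed regression test.
#include <windows.h>
#include <cstdio>

static void CALLBACK EmptyApc(ULONG_PTR) { /* spurious wakeup source */ }

static DWORD WINAPI Waiter(LPVOID)
{
    ULONGLONG start = GetTickCount64();
    DWORD remaining = 2000;

    // Raw SleepEx returns on every APC delivery, so retry with the
    // remaining time, mirroring what the runtime's wait loop does.
    for (;;)
    {
        DWORD ret = SleepEx(remaining, /*bAlertable*/ TRUE);
        if (ret != WAIT_IO_COMPLETION)
            break;
        ULONGLONG elapsed = GetTickCount64() - start;
        if (elapsed >= 2000)
            break;
        remaining = (DWORD)(2000 - elapsed);
    }

    printf("waited %llu ms\n", GetTickCount64() - start);
    return 0;
}

int main()
{
    HANDLE waiter = CreateThread(nullptr, 0, Waiter, nullptr, 0, nullptr);
    for (int i = 0; i < 20; i++)
    {
        Sleep(100);
        QueueUserAPC(EmptyApc, waiter, 0);    // one APC every ~100 ms
    }
    WaitForSingleObject(waiter, INFINITE);
    CloseHandle(waiter);
    return 0;
}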

@jkotas
Member

jkotas commented Jun 17, 2025

High load on the machine can make us wait a significantly longer time, but it should never make us wait a significantly shorter time. If we wait a significantly shorter time, it is a bug that should be fixed.

I think we should:

  • Set the tolerance for waiting less than specified to 20 ms (it is more than the 15.6 ms GetTickCount resolution)
  • Improve the precision of the wait time calculation that I pointed out above (something like jkotas@f2efd2f)

@lateralusX
Member Author

lateralusX commented Jun 17, 2025

OK, I will fix up the wait implementation in this PR, accept any wait greater than 1980 ms as acceptable (the wait is set to 2000 ms in the test), and then re-run all the test suites to make sure they still pass. As pointed out, regardless of machine load we should never observe early wakeups; waits might end up longer, but that is fine since this test is only interested in early wakeups.
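
A tiny sketch of that acceptance condition (an assumption of how it might be written, not the actual test code): with a requested 2000 ms wait, only a wakeup earlier than 1980 ms fails; longer waits on loaded machines still pass.

// Sketch of the acceptance check described above (not the actual test).
const long long requestedMs = 2000;
const long long earlyToleranceMs = 20;   // above the ~15.6 ms GetTickCount resolution

bool WaitWasNotBrokenEarly(long long measuredElapsedMs)
{
    // Only early wakeups fail; waiting longer than requested is acceptable.
    return measuredElapsedMs >= requestedMs - earlyToleranceMs;   // i.e. >= 1980 ms
}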

Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure: baseservices/threading/regressions/115178/115178/115178.cmd
4 participants