-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Fix macOS floating point corruption #75440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix macOS floating point corruption #75440
Conversation
There is a race condition between activation signal handling and returning from a hardware exception handler on macOS. It shows up intermittently in the Regression/coreclr/GitHub_16833 test in the CI and I am able to repro it on my local mac once in several thousands of iterations of the test when running with GC stress C. It turned out the issue is caused by the order in which we set parts of the context in the thread when returning from the hardware exception handler. MacOS can only set the floating point and control / integer portions separately. We were setting the control / integer portion first and the floating point portion after that. In the race condition, the signal handling code in the macOS extracts the context that contains the new control registers, but the old floating point ones (which is the state of those in the PAL_DispatchException). The signal handler for the activation injection then gets executed and when it returns later to our managed code, the floating point registers get restored to the wrong values. The fix is to change the context setting to first set the floating point registers and then the control / integer portion of the context. Close dotnet#66568
cc: @mangod9 |
cc: @AntonLapounov |
What prevents the opposite case, i.e. using the old control registers with the new floating point ones? Thanks for catching this! |
@@ -1569,6 +1487,65 @@ CONTEXT_SetThreadContextOnPort( | |||
} | |||
} | |||
|
|||
if (lpContext->ContextFlags & (CONTEXT_CONTROL|CONTEXT_INTEGER) & CONTEXT_AREA_MASK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be useful to add a note that the integer/control registers have to be set last.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/backport to release/7.0 |
Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3039605422 |
There is a race condition between activation signal handling and returning from a hardware exception handler on macOS. It shows up intermittently in the Regression/coreclr/GitHub_16833 test in the CI and I am able to repro it on my local mac once in several thousands of iterations of the test when running with GC stress C.
It turned out the issue is caused by the order in which we set parts of the context in the thread when returning from the hardware exception handler. MacOS can only set the floating point and control / integer portions separately. We were setting the control / integer portion first and the floating point portion after that. In the race condition, the signal handling code in the macOS extracts the context that contains the new control registers, but the old floating point ones (which is the state of those in the
PAL_DispatchException
). The signal handler for the activation injection then gets executed and when it returns later to our managed code, the floating point registers get restored to the wrong values.I have figured this out by adding some instrumentation to the PAL / runtime, logging state of the
PC
andD0
registers to the stresslog in activation handler / hardware exception handler. I could see that the hardware exception handler was getting correctD0
at aPC
and right away the activation handler was getting a corruptedD0
at the samePC
and theD0
matched what the formatted string contained in the failing test instead of the correct value (0xcccccccccccccccc or rarely 0).The fix is to change the context setting to first set the floating point registers and then the control / integer portion of the context.
With the fix, the test was running over the whole weekend without any crashes.
Close #66568