Skip to content

Fix macOS floating point corruption #75440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 12, 2022

Conversation

janvorli
Copy link
Member

There is a race condition between activation signal handling and returning from a hardware exception handler on macOS. It shows up intermittently in the Regression/coreclr/GitHub_16833 test in the CI and I am able to repro it on my local mac once in several thousands of iterations of the test when running with GC stress C.

It turned out the issue is caused by the order in which we set parts of the context in the thread when returning from the hardware exception handler. MacOS can only set the floating point and control / integer portions separately. We were setting the control / integer portion first and the floating point portion after that. In the race condition, the signal handling code in the macOS extracts the context that contains the new control registers, but the old floating point ones (which is the state of those in the PAL_DispatchException). The signal handler for the activation injection then gets executed and when it returns later to our managed code, the floating point registers get restored to the wrong values.
I have figured this out by adding some instrumentation to the PAL / runtime, logging state of the PC and D0 registers to the stresslog in activation handler / hardware exception handler. I could see that the hardware exception handler was getting correct D0 at a PC and right away the activation handler was getting a corrupted D0at the same PC and the D0 matched what the formatted string contained in the failing test instead of the correct value (0xcccccccccccccccc or rarely 0).

The fix is to change the context setting to first set the floating point registers and then the control / integer portion of the context.

With the fix, the test was running over the whole weekend without any crashes.

Close #66568

There is a race condition between activation signal handling and returning
from a hardware exception handler on macOS. It shows up intermittently in
the Regression/coreclr/GitHub_16833 test in the CI and I am able to repro
it on my local mac once in several thousands of iterations of the test when
running with GC stress C.

It turned out the issue is caused by the order in which we set parts of the
context in the thread when returning from the hardware exception handler.
MacOS can only set the floating point and control / integer portions separately.
We were setting the control / integer portion first and the floating point
portion after that. In the race condition, the signal handling code in the
macOS extracts the context that contains the new control registers, but the
old floating point ones (which is the state of those in the PAL_DispatchException).
The signal handler for the activation injection then gets executed and when it
returns later to our managed code, the floating point registers get restored
to the wrong values.

The fix is to change the context setting to first set the floating point
registers and then the control / integer portion of the context.

Close dotnet#66568
@janvorli janvorli added this to the 7.0.0 milestone Sep 12, 2022
@janvorli janvorli requested review from jkotas and VSadov September 12, 2022 08:57
@janvorli janvorli self-assigned this Sep 12, 2022
@janvorli
Copy link
Member Author

cc: @mangod9

@janvorli
Copy link
Member Author

cc: @AntonLapounov

@AntonLapounov
Copy link
Member

In the race condition, the signal handling code in the macOS extracts the context that contains the new control registers, but the old floating point ones

The fix is to change the context setting to first set the floating point registers and then the control / integer portion of the context.

What prevents the opposite case, i.e. using the old control registers with the new floating point ones? Thanks for catching this!

@@ -1569,6 +1487,65 @@ CONTEXT_SetThreadContextOnPort(
}
}

if (lpContext->ContextFlags & (CONTEXT_CONTROL|CONTEXT_INTEGER) & CONTEXT_AREA_MASK)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful to add a note that the integer/control registers have to be set last.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@janvorli janvorli merged commit 96f5037 into dotnet:main Sep 12, 2022
@janvorli
Copy link
Member Author

/backport to release/7.0

@github-actions
Copy link
Contributor

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3039605422

@ghost ghost locked as resolved and limited conversation to collaborators Oct 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure Regressions/coreclr/GitHub_16833/Test16833/Test16833.sh
3 participants