[Mono] Fix c11 ARM64 atomics to issue full memory barrier. #115573
Conversation
Pull Request Overview
This PR fixes an issue with C11 atomic operations on ARM64 by ensuring that a full memory barrier is issued after each atomic operation, aligning the behavior with that of the Mono and CoreCLR JIT compilers. Key changes include:
- Adding C11_MEMORY_ORDER_SEQ_CST calls after atomic operations in src/mono/mono/utils/atomic.h.
- Introducing a functional test in the iOS device test project (using Mono interpreter) to catch memory reordering issues.
- Adding a new test in the ParallelForEachAsyncTests to validate the fix.
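The change pattern can be sketched in portable C11 (a minimal, hypothetical illustration — Mono's `atomic.h` uses its own function and macro names, and `cas_i32_with_full_barrier` below is not one of them):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the pattern this PR applies: run the C11 atomic operation,
 * then issue an explicit sequentially consistent fence, which lowers to a
 * full memory barrier (dmb ish) on ARM64. Names are illustrative, not
 * Mono's actual identifiers. */
static inline int32_t
cas_i32_with_full_barrier (_Atomic int32_t *dest, int32_t exch, int32_t comp)
{
    /* On failure, comp is overwritten with the value observed in *dest,
     * so comp holds the previous value either way. */
    atomic_compare_exchange_strong (dest, &comp, exch);
    atomic_thread_fence (memory_order_seq_cst); /* full barrier */
    return comp;
}
```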
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/tests/FunctionalTests/iOS/Device/ParallelForEachAsync/iOS.Device.ParallelForEachAsync.Test.csproj | New test project targeting iOS/arm64 with Mono interpreter configurations. |
| src/tests/FunctionalTests/iOS/Device/ParallelForEachAsync/Program.cs | Functional test harness executing the ParallelForEachAsync workload. |
| src/mono/mono/utils/atomic.h | Updated atomic operations to include a full memory barrier for ARM64. |
| src/libraries/System.Threading.Tasks.Parallel/tests/ParallelForEachAsyncTests.cs | Added new test validating the atomic ordering fix. |
Kudos to @kotlarmilos for the work on reproduction steps.
@dotnet/jit-contrib could somebody from the JIT team review this change please?
@kunalspathak can you review?
{
    [DllImport("__Internal")]
    public static extern void mono_ios_set_summary (string value);
nit: Since this test does not validate any observed result, but just expects the run not to crash, it might be good to include a line or two explaining what this test is validating.
LGTM. Few things:
That is why a functional test was included as well, since we can control the exact configuration needed to reproduce the issue and prevent future regressions from happening.
Are there extra legs that should be run to make sure this is well tested?
I am curious whether any performance measurements were done to see the impact of adding a full memory barrier, because they are expensive. Of course, correctness is more important than performance, but we could identify just the spots where a full barrier is needed and remove it from other places. Just a thought.
Not related to this PR specifically, since it follows patterns we use all over, but we may want to get some clarity on this from ARM folks. The spec indeed says ACQ/REL, so in theory reordering would be allowed. As I remember, the problems with LL/SC were with effects leaking into the span between LL and SC. It is hard to see how that can happen with a single instruction, but I would not guess all the possibilities of how that works in hardware. Basically, should we be worried about one day needing a full barrier after the LSE instructions as well?
LGTM!
The iOS.Device.ParallelForEachAsync.Test functional test is part of the smoke tests that run on this PR and any subsequent one when it lands. The test added to System.Threading.Tasks will run on rolling builds or when runtime-extra-platforms is manually started.
To my knowledge this has yet to run in the lab. I'll see if we have any numbers and, if they are significant, will follow up after the fact.
/backport to release/9.0
Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/15057610527
In .NET 8 we used GCC atomics that we made fully sequentially consistent, meaning we added full memory barriers on both sides of the atomic instruction on ARM and ARM64. Moving back to just one barrier will still be more performant than .NET 8, but a little less than .NET 9, which on the other hand has correctness issues due to being too weak. Since Mono has been running with these strong guarantees on its internal atomic operations for the last 9 years, used both by internal runtime code and by the `Interlocked` implementation, we would need a full audit, and a switch to an alternative weaker atomic implementation where possible would probably have to be done case by case to not regress again.
Jupp, this was my concern as well. The issue has not been observed when using SWPAL/CASAL/LDADDAL; those instructions give acquire/release semantics and not a full memory barrier. An LDXR/STXR/CMP sequence gives acquire/release semantics too, without a full memory barrier, but our code generation emits an explicit full memory barrier after it, since that is what our interlocked xchg/cas operations expect.

On Mono we currently don't use LSE instructions in our JIT, and with this PR we will add a full barrier after calling the C11 atomics as well, meaning that even if the underlying compiler decides to use LSE instructions for our C11 atomic usage, we will end up with a full memory barrier. Our LLVM code generation is even stronger and in line with our old GCC atomic support: a full barrier before and after atomic operations.

CoreCLR, on the other hand, handles it differently: if LSE instructions are used in CodeGen::genLockedInstructions and CodeGen::genCodeForCmpXchg, we will only end up with acquire/release semantics, without a full memory barrier.
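For illustration, the LSE case being discussed can be seen from a plain C11 fetch-add (a sketch only — the actual lowering depends on the compiler version and `-march` flags; the assembly in the comment is approximate):

```c
#include <stdatomic.h>
#include <stdint.h>

/* When clang targets ARMv8.1 LSE (e.g. -march=armv8-a+lse), a seq_cst
 * fetch-add like this is typically lowered to a single instruction,
 * roughly:
 *
 *     ldaddal w1, w0, [x0]   ; atomic add, acquire+release semantics
 *
 * A single ldaddal/swpal/casal still provides only acquire/release
 * ordering -- there is no trailing full barrier unless one is emitted
 * explicitly. */
int32_t
fetch_add_i32 (_Atomic int32_t *dest, int32_t add)
{
    return atomic_fetch_add (dest, add);
}
```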
Can I get this for .NET 9 somewhere, just to try it out and see if it fixes our issues?
In .NET 9 Mono changed its internal atomic implementation in `atomic.h` to use C11 atomics on a number of platforms and CPU architectures (c48aa81), but it was not until we implemented the I4 `Interlocked.Exchange` intrinsics in the interpreter (12ecfe7) that the issue became more visible. Before .NET 9, Mono used GCC atomics; they are plagued with the same assumption that acquire/release barriers are enough to be sequentially consistent, so Mono added full barriers around those atomic calls, issuing a full memory barrier before and after the atomic instruction. That is an even stronger guarantee than what the Mono and CoreCLR JIT compilers offer when compiling the `Interlocked` functions, so it could be relaxed, at least on ARM64.
The default behavior of C11 atomics should be sequentially consistent, but the implementations of C11 atomics on ARM64 (clang/gcc) settle on half barriers: acquire/release semantics, without issuing a full memory barrier. Mono's atomic implementation is used both within the runtime and from the `Interlocked` functions in icalls, as well as in the interpreter implementation of the `Interlocked` opcodes. The callers of these functions expect a full memory barrier, meaning no loads/stores can move past the atomic operation. For example, C11 `atomic_exchange` on ARM64 compiles down to a load-/store-exclusive loop on iOS and Android platforms when the CPU doesn't support ARMv8 with the LSE instruction set. That sequence doesn't include a full memory barrier after the exchange, breaking the callers' assumption that the function is sequentially consistent, with no reordering of loads/stores across the atomic operation.
Android includes runtime checks to switch to a different set of instructions, `swpal`/`casal`, if the CPU supports ARMv8 + LSE. The memory reordering issues observed in #114262 have not been reproduced when the LSE instructions are used, probably because reordering is less likely with one instruction instead of a 2+ instruction sequence, and loads/stores can't be reordered past the acquire/release barriers, only into them. The guarantee offered by `swpal`/`casal` is still acquire/release semantics without a full barrier, so in theory it could end up with similar reordering issues. The issue has not been reproduced on the iOS simulator, since the compiler uses ARMv8 + LSE instructions for that build. The iOS device build compiles without LSE support, generating the more fragile assembly sequence, making it more likely to hit the issue on any iOS ARM64 device.

Mono JIT code generation already handles this scenario on ARM64: it makes sure exchange and CAS operations end up with a full memory barrier. CoreCLR does the same thing in its lowering of the `Interlocked` functions, which is why the issue won't reproduce when using AOT-compiled code, only when using the interpreter. That said, since the `mono_atomic_*` functions are used within the runtime, the issue might still manifest under other configurations, probably leading to memory corruption and weird crashes.

The fix aligns the code generated for C11 atomics with what is generated by the JIT compilers, ending with a full memory barrier.
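The fixed sequence can be sketched as follows (again, names are illustrative and the assembly in the comment is approximate, not the exact code in `atomic.h`):

```c
#include <stdatomic.h>
#include <stdint.h>

/* With an explicit seq_cst fence after the C11 atomic, the ARM64 LL/SC
 * loop is followed by a full memory barrier, roughly:
 *
 *   .Lretry:
 *       ldaxr  w8, [x0]
 *       stlxr  w9, w1, [x0]
 *       cbnz   w9, .Lretry
 *       dmb    ish            ; full memory barrier after the exchange
 */
int32_t
fixed_exchange_i32 (_Atomic int32_t *dest, int32_t exch)
{
    int32_t old = atomic_exchange (dest, exch);
    atomic_thread_fence (memory_order_seq_cst); /* -> dmb ish on ARM64 */
    return old;
}
```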
This is in line with the code generated by both the Mono and CoreCLR ARM64 JIT compilers for the `Interlocked` functions.
Android ARM32 was also checked, and it appears that C11 atomics generate the expected full memory barrier on this architecture, meaning no additional memory barriers are needed.
The PR adds two new (identical) tests: one in the libraries `ParallelForEachAsyncTests`, but since this issue requires iOS device + ARM64 + interpreter to reproduce, the PR also adds the same test as a functional test, where we can force it to use the interpreter only and run only on Mono. It is also worth noting that before this fix, running `ParallelForEachAsyncTests` on iOS device + ARM64 + interpreter was observed to cause hangs, GC crashes and other unstable runtime behaviour (due to memory reordering). With this fix, `ParallelForEachAsyncTests` passes without any observed issues.

It appears that CI doesn't run tests using iOS device + ARM64 + interpreter, so it won't expose these issues. That is why a functional test was included as well, since we can control the exact configuration needed to reproduce the issue and prevent future regressions.