
[Mono] Fix c11 ARM64 atomics to issue full memory barrier. #115573


Merged
merged 5 commits on May 16, 2025

Conversation

@lateralusX (Member) commented May 14, 2025

In .NET 9, Mono changed its internal atomic implementation in atomic.h to use C11 atomics on a number of platforms and CPU architectures, c48aa81, but it was not until we implemented the I4 Interlocked.Exchange intrinsic in the interpreter, 12ecfe7, that the issue became more visible.

Before .NET 9, Mono used GCC atomics. They are plagued by the same assumption, that acquire/release barriers are enough to be sequentially consistent, so Mono added full barriers around these atomic calls, issuing a full memory barrier before and after each atomic instruction. This is an even stronger guarantee than what the Mono and CoreCLR JIT compilers offer when compiling the Interlocked functions, so it could be relaxed, at least on ARM64.
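For context, a minimal sketch of that pre-.NET 9 shape, using the GCC __atomic/__sync builtins (the helper name is hypothetical, not the actual atomic.h code):

#include <stdint.h>

/* Hypothetical helper illustrating the old pattern: full barriers on
   both sides of the atomic operation, each emitting dmb ish on ARM/ARM64. */
static inline int32_t
gcc_style_exchange_i4 (int32_t *dest, int32_t exch)
{
    int32_t old;
    __sync_synchronize ();  /* full barrier before */
    old = __atomic_exchange_n (dest, exch, __ATOMIC_SEQ_CST);
    __sync_synchronize ();  /* full barrier after */
    return old;
}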

The default memory order of C11 atomics is sequentially consistent, but the implementations of C11 atomics on ARM64 (clang/gcc) settle for half barriers, i.e. acquire/release semantics, and won't issue a full memory barrier. Mono's atomic implementation is used both within the runtime and from the Interlocked functions in icalls, as well as in the interpreter implementation of the Interlocked opcodes. The callers of these functions expect a full memory barrier, meaning no loads/stores can move past the atomic operation. For example, C11 atomic_exchange on ARM64 ends up as the following assembly sequence on iOS and Android when the CPU doesn't support ARMv8 with the LSE instruction set:

ldaxr       // Load exclusive with acquire semantics.
stlxr       // Store exclusive with release semantics.
cbnz        // Retry on failure.

This sequence doesn't include a full memory barrier after the exchange, breaking the callers' assumption that the function is sequentially consistent, with no reordering of loads/stores across the atomic operation.
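To make that expectation concrete, here is a store-buffering style litmus test (a sketch for illustration, not code from the PR): even though atomic_exchange is nominally seq_cst, C11 does not order it against the surrounding relaxed accesses, and the ARM64 ldaxr/stlxr mapping lets the later load complete before the exchange's store becomes visible, so both threads can observe 0; a trailing dmb ish rules that outcome out.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int x, y;  /* both start at 0 */
static int r0, r1;

static void *thread0 (void *arg)
{
    (void) arg;
    atomic_exchange (&x, 1);  /* nominally seq_cst */
    r0 = atomic_load_explicit (&y, memory_order_relaxed);
    return NULL;
}

static void *thread1 (void *arg)
{
    (void) arg;
    atomic_exchange (&y, 1);
    r1 = atomic_load_explicit (&x, memory_order_relaxed);
    return NULL;
}

int main (void)
{
    pthread_t a, b;
    pthread_create (&a, NULL, thread0, NULL);
    pthread_create (&b, NULL, thread1, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    /* Run in a loop on a weakly ordered CPU: r0 == 0 && r1 == 0 is the
       outcome callers assume impossible without a trailing full barrier. */
    printf ("r0=%d r1=%d\n", r0, r1);
    return 0;
}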

Android includes runtime checks to switch to a different set of instructions, swpal/casal, if the CPU supports ARMv8 + LSE. The memory reorder issues observed in #114262 have not been reproduced when the LSE instructions are used, probably because reordering is less likely with one instruction instead of a 2+ instruction sequence, and loads/stores can't be reordered past the full acquire/release barriers, only into them. The guarantee offered by swpal/casal is still acquire/release semantics without a full barrier, so in theory it could end up with similar reorder issues. The issue has not been reproduced on the iOS simulator since the compiler uses ARMv8 + LSE instructions for that build. The iOS device build compiles without LSE support, generating the more fragile assembly sequence, making it more likely to hit the issue on any iOS ARM64 device.
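For illustration, whether the compiler may emit the LSE forms for C11 atomics is visible at compile time through the ACLE feature macro; this sketch of the split described above is an assumption about toolchain behavior, not something the PR adds:

#if defined (__aarch64__)
#  if defined (__ARM_FEATURE_ATOMICS)
     /* Target baseline includes LSE (-march=armv8.1-a or later):
        atomic_exchange can lower to a single swpal. */
#  else
     /* Baseline ARMv8.0: a ldaxr/stlxr/cbnz retry loop, or, with outline
        atomics, a helper that selects swpal at run time - the Android
        runtime-check behavior described above. */
#  endif
#endif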

Mono JIT code generation already handles this scenario on ARM64: it makes sure exchange and CAS operations end with a full memory barrier. CoreCLR does the same thing in its lowering of the Interlocked functions, which is why the issue won't reproduce when using AOT-compiled code, only when using the interpreter. That said, since the mono_atomic_* functions are used within the runtime, the issue might still manifest itself under other configurations, probably leading to memory corruption and weird crashes.

The fix aligns the code generated by C11 atomics with what the JIT compilers generate, ending with a full memory barrier:

ldaxr       // Load exclusive with acquire semantics.
stlxr       // Store exclusive with release semantics.
cbnz        // Retry on failure.
dmb ish     // Full memory barrier.

This is in line with the code generated for the Interlocked functions by both the Mono and CoreCLR ARM64 JIT compilers.
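In C11 terms, the shape of the fix is roughly the following (a minimal sketch with a hypothetical helper name, not the literal atomic.h diff):

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: perform the C11 atomic operation, then issue an
   explicit seq_cst fence, which clang/gcc lower to dmb ish on ARM64. */
static inline int32_t
exchange_i4_full_barrier (_Atomic int32_t *dest, int32_t exch)
{
    int32_t old = atomic_exchange (dest, exch);
    atomic_thread_fence (memory_order_seq_cst);  /* dmb ish on ARM64 */
    return old;
}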

Android ARM32 was also checked, and C11 atomics appear to generate the expected full memory barrier on that architecture, so no additional memory barriers are needed there.

The PR adds two new identical tests. One goes into the libraries' ParallelForEachAsyncTests, but since this issue requires iOS device + ARM64 + interpreter to reproduce, the PR also adds the same test as a functional test, where we can force interpreter-only execution and run it on Mono only. It is also worth noting that before this fix, running ParallelForEachAsyncTests on iOS device + ARM64 + interpreter was observed to cause hangs, GC crashes and other unstable runtime behaviour (due to memory reordering). With the fix, ParallelForEachAsyncTests passes without any observed issues.

It appears that CI doesn't run tests using iOS device + ARM64 + interpreter, so it won't expose these issues. That is why a functional test was included as well, since we can control the exact configuration needed to reproduce the issue and prevent future regressions from happening.

Copilot AI review requested due to automatic review settings, May 14, 2025 18:25
github-actions bot added the needs-area-label label, May 14, 2025
@Copilot (Contributor) left a comment

Pull Request Overview

This PR fixes an issue with C11 atomic operations on ARM64 by ensuring that a full memory barrier is issued after each atomic operation, aligning the behavior with that of the Mono and CoreCLR JIT compilers. Key changes include:

  • Adding C11_MEMORY_ORDER_SEQ_CST calls after atomic operations in src/mono/mono/utils/atomic.h.
  • Introducing a functional test in the iOS device test project (using Mono interpreter) to catch memory reordering issues.
  • Adding a new test in the ParallelForEachAsyncTests to validate the fix.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File descriptions:

  • src/tests/FunctionalTests/iOS/Device/ParallelForEachAsync/iOS.Device.ParallelForEachAsync.Test.csproj: New test project targeting iOS/arm64 with Mono interpreter configurations.
  • src/tests/FunctionalTests/iOS/Device/ParallelForEachAsync/Program.cs: Functional test harness executing the ParallelForEachAsync workload.
  • src/mono/mono/utils/atomic.h: Updated atomic operations to include a full memory barrier for ARM64.
  • src/libraries/System.Threading.Tasks.Parallel/tests/ParallelForEachAsyncTests.cs: Added new test validating the atomic ordering fix.

@lateralusX (Member, Author)

Kudos to @kotlarmilos for the work on reproduction steps.

@vitek-karas (Member)

@dotnet/jit-contrib could somebody from the JIT team review this change please?

@AndyAyersMS (Member)

@kunalspathak can you review?

@kunalspathak kunalspathak requested a review from VSadov May 15, 2025 20:50
@JulieLeeMSFT (Member)

Ping @VSadov for a code review. CC @mangod9. Kunal is reviewing it, but it would be great if you could review it too. Steve needs to get this in for June servicing, due tomorrow.

{
[DllImport("__Internal")]
public static extern void mono_ios_set_summary (string value);

Inline review comment (Member):

nit: Since this test is not validating any observed result, but just hoping that the test doesn't crash, it might be good to include a line or two on what this test is validating.

@kunalspathak (Member) left a comment

LGTM. A few things:

> That is why a functional test was included as well, since we can control the exact configuration needed to reproduce the issue and prevent future regressions from happening.

Are there extra legs that should be run to make sure this is well tested?

I am curious, were any performance measurements done to see the impact of adding a full memory barrier, because they are expensive. Of course, correctness is more important than performance, but we could identify just the spots where a full barrier is needed and remove it from the other places. Just a thought.

@VSadov (Member) commented May 15, 2025

> The memory reorder issues observed in #114262 have not been reproduced when the LSE instructions are used, probably because reordering is less likely with one instruction instead of a 2+ instruction sequence, and loads/stores can't be reordered past the full acquire/release barriers, only into them. The guarantee offered by swpal/casal is still acquire/release semantics without a full barrier, so in theory it could end up with similar reorder issues.

Not related to this PR specifically, since it follows patterns we use all over, but we may want to get some clarity on this from the ARM folks.

The spec indeed says ACQ/REL, so in theory casal can cause all the same troubles as the LL/SC implementation. In practice we do not see that happening, possibly due to an implementation detail of all current implementations (i.e. using cache coherence protocols to implement it, maybe?).

As I remember, the problems with LL/SC were with effects leaking into the span between the LL and the SC. It is hard to see how that can happen with a single instruction, but I would not guess at all the possibilities of how that works in hardware - can casal be microcoded into something like LL/SC?

Basically, should we be worried about one day needing a dmb after LSE atomics?

@VSadov (Member) left a comment
LGTM!

@steveisok (Member)

> Are there extra legs that should be run to make sure this is well tested?

The iOS.Device.ParallelForEachAsync.Test functional test is part of the smoke tests that run on this PR and on any subsequent one once it lands. The test added to System.Threading.Tasks will run on rolling builds or when runtime-extra-platforms is manually started.

> I am curious, were any performance measurements done to see the impact of adding a full memory barrier, because they are expensive. Of course, correctness is more important than performance, but we could identify just the spots where a full barrier is needed and remove it from the other places. Just a thought.

To my knowledge this has yet to run in the lab. I'll see if we have any numbers and, if they are significant, will follow up after the fact.

@steveisok (Member)

/backport to release/9.0


Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/15057610527

@steveisok steveisok merged commit 1307a67 into dotnet:main May 16, 2025
129 of 132 checks passed
@lateralusX (Member, Author)

> I am curious, were any performance measurements done to see the impact of adding a full memory barrier, because they are expensive. Of course, correctness is more important than performance, but we could identify just the spots where a full barrier is needed and remove it from the other places. Just a thought.

In .NET 8 we used GCC atomics that we made fully sequentially consistent, meaning we added full memory barriers on both sides of the atomic instruction on ARM and ARM64, so moving to just one barrier will still be more performant than .NET 8, but a little slower than .NET 9, which on the other hand has correctness issues due to being too weak. Since Mono has been running with these strong guarantees on its internal atomic operations for the last 9 years, used both by internal runtime code and by the Interlocked implementation, we would need a full audit and a switch to an alternative, weaker atomic implementation where possible, probably done case by case to avoid regressing again.

@lateralusX (Member, Author)

> The memory reorder issues observed in #114262 have not been reproduced when the LSE instructions are used, probably because reordering is less likely with one instruction instead of a 2+ instruction sequence, and loads/stores can't be reordered past the full acquire/release barriers, only into them. The guarantee offered by swpal/casal is still acquire/release semantics without a full barrier, so in theory it could end up with similar reorder issues.
>
> Not related to this PR specifically, since it follows patterns we use all over, but we may want to get some clarity on this from the ARM folks.
>
> The spec indeed says ACQ/REL, so in theory casal can cause all the same troubles as the LL/SC implementation. In practice we do not see that happening, possibly due to an implementation detail of all current implementations (i.e. using cache coherence protocols to implement it, maybe?).
>
> As I remember, the problems with LL/SC were with effects leaking into the span between the LL and the SC. It is hard to see how that can happen with a single instruction, but I would not guess at all the possibilities of how that works in hardware - can casal be microcoded into something like LL/SC?
>
> Basically, should we be worried about one day needing a dmb after LSE atomics?

Yup, this was my concern as well. The issue has not been observed when using SWPAL/CASAL/LDADDAL; those instructions give acquire/release semantics and not a full memory barrier. The LDXR/STXR/CMP sequence gives acquire/release semantics too, without a full memory barrier, but our code generation emits an explicit full memory barrier after it, since that is what our interlocked xchg/cas operations expect. On Mono we currently don't use LSE instructions in our JIT, and with this PR we will add a full barrier after calling the C11 atomics as well, meaning that even if the underlying compiler decides to use LSE instructions for our C11 atomic usage, we will end up with a full memory barrier. Our LLVM code generation is even stronger and in line with our old GCC atomics support: a full barrier before and after atomic operations. CoreCLR, on the other hand, handles it differently: if LSE instructions are used in CodeGen::genLockedInstructions and CodeGen::genCodeForCmpXchg, we will only end up with acquire/release semantics without full memory barrier support.

@linnkrb commented May 22, 2025

Can I get this for .NET 9 somewhere, just to try it out and see if it fixes our issues?
