
[Mono] Fix c11 ARM64 atomics to issue full memory barrier. #115573


Merged
merged 5 commits on May 16, 2025

Conversation

@lateralusX (Member) commented May 14, 2025

In .NET 9, Mono changed its internal atomic implementation in atomic.h to use C11 atomics on a number of platforms and CPU architectures, c48aa81, but it was not until we implemented the I4 Interlocked.Exchange intrinsic in the interpreter, 12ecfe7, that the issue became more visible.

Before .NET 9, Mono used GCC atomics. They are plagued by the same assumption, that acquire/release barriers are enough to be sequentially consistent, so Mono added full barriers around these atomic calls, issuing a full memory barrier before and after each atomic instruction. This is an even stronger guarantee than what the Mono and CoreCLR JIT compilers offer when compiling the Interlocked functions, so it could be relaxed, at least on ARM64.
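For context, a minimal sketch of that pre-.NET 9 shape, using the GCC __atomic/__sync builtins (the helper name is hypothetical, not the actual atomic.h code):

#include <stdint.h>

/* Hypothetical helper illustrating the old pattern: full barriers on
   both sides of the atomic operation, each emitting dmb ish on ARM/ARM64. */
static inline int32_t
gcc_style_exchange_i4 (int32_t *dest, int32_t exch)
{
    int32_t old;
    __sync_synchronize ();  /* full barrier before */
    old = __atomic_exchange_n (dest, exch, __ATOMIC_SEQ_CST);
    __sync_synchronize ();  /* full barrier after */
    return old;
}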

The default memory order of C11 atomics is sequentially consistent, but the implementations of C11 atomics on ARM64 (clang/gcc) settle for half barriers, i.e. acquire/release semantics, and won't issue a full memory barrier. Mono's atomic implementation is used both within the runtime and from the Interlocked functions in icalls, as well as in the interpreter implementation of the Interlocked opcodes. The callers of these functions expect a full memory barrier, meaning no loads/stores can move past the atomic operation. For example, C11 atomic_exchange on ARM64 ends up as the following assembly sequence on iOS and Android when the CPU doesn't support ARMv8 with the LSE instruction set:

ldaxr       // Load exclusive with acquire semantics.
stlxr       // Store exclusive with release semantics.
cbnz        // Retry on failure.

This sequence doesn't include a full memory barrier after the exchange, breaking the callers' assumption that the function is sequentially consistent, with no reordering of loads/stores across the atomic operation.
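To make that expectation concrete, here is a store-buffering style litmus test (a sketch for illustration, not code from the PR): even though atomic_exchange is nominally seq_cst, C11 does not order it against the surrounding relaxed accesses, and the ARM64 ldaxr/stlxr mapping lets the later load complete before the exchange's store becomes visible, so both threads can observe 0; a trailing dmb ish rules that outcome out.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int x, y;  /* both start at 0 */
static int r0, r1;

static void *thread0 (void *arg)
{
    (void) arg;
    atomic_exchange (&x, 1);  /* nominally seq_cst */
    r0 = atomic_load_explicit (&y, memory_order_relaxed);
    return NULL;
}

static void *thread1 (void *arg)
{
    (void) arg;
    atomic_exchange (&y, 1);
    r1 = atomic_load_explicit (&x, memory_order_relaxed);
    return NULL;
}

int main (void)
{
    pthread_t a, b;
    pthread_create (&a, NULL, thread0, NULL);
    pthread_create (&b, NULL, thread1, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    /* Run in a loop on a weakly ordered CPU: r0 == 0 && r1 == 0 is the
       outcome callers assume impossible without a trailing full barrier. */
    printf ("r0=%d r1=%d\n", r0, r1);
    return 0;
}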

Android includes runtime checks to switch to a different set of instructions, swpal/casal, if the CPU supports ARMv8 + LSE. The memory reorder issues observed in #114262 have not been reproduced when the LSE instructions are used, probably because reordering is less likely with one instruction instead of a 2+ instruction sequence, and loads/stores can't be reordered past the full acquire/release barriers, only into them. The guarantee offered by swpal/casal is still acquire/release semantics without a full barrier, so in theory it could end up with similar reorder issues. The issue has not been reproduced on the iOS simulator since the compiler uses ARMv8 + LSE instructions for that build. The iOS device build compiles without LSE support, generating the more fragile assembly sequence, making it more likely to hit the issue on any iOS ARM64 device.
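For illustration, whether the compiler may emit the LSE forms for C11 atomics is visible at compile time through the ACLE feature macro; this sketch of the split described above is an assumption about toolchain behavior, not something the PR adds:

#if defined (__aarch64__)
#  if defined (__ARM_FEATURE_ATOMICS)
     /* Target baseline includes LSE (-march=armv8.1-a or later):
        atomic_exchange can lower to a single swpal. */
#  else
     /* Baseline ARMv8.0: a ldaxr/stlxr/cbnz retry loop, or, with outline
        atomics, a helper that selects swpal at run time - the Android
        runtime-check behavior described above. */
#  endif
#endif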

Mono JIT code generation already handles this scenario on ARM64: it makes sure exchange and CAS operations end with a full memory barrier. CoreCLR does the same thing in its lowering of the Interlocked functions, which is why the issue won't reproduce when using AOT-compiled code, only when using the interpreter. That said, since the mono_atomic_* functions are used within the runtime, the issue might still manifest itself under other configurations, probably leading to memory corruption and weird crashes.

The fix aligns the code generated by C11 atomics with what the JIT compilers generate, ending with a full memory barrier:

ldaxr       // Load exclusive with acquire semantics.
stlxr       // Store exclusive with release semantics.
cbnz        // Retry on failure.
dmb ish     // Full memory barrier.

This is in line with the code generated for the Interlocked functions by both the Mono and CoreCLR ARM64 JIT compilers.
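In C11 terms, the shape of the fix is roughly the following (a minimal sketch with a hypothetical helper name, not the literal atomic.h diff):

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: perform the C11 atomic operation, then issue an
   explicit seq_cst fence, which clang/gcc lower to dmb ish on ARM64. */
static inline int32_t
exchange_i4_full_barrier (_Atomic int32_t *dest, int32_t exch)
{
    int32_t old = atomic_exchange (dest, exch);
    atomic_thread_fence (memory_order_seq_cst);  /* dmb ish on ARM64 */
    return old;
}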

Android ARM32 was also checked, and C11 atomics appear to generate the expected full memory barrier on that architecture, so no additional memory barriers are needed there.

The PR adds two new identical tests. One goes into the libraries' ParallelForEachAsyncTests, but since this issue requires iOS device + ARM64 + interpreter to reproduce, the PR also adds the same test as a functional test, where we can force interpreter-only execution and run it on Mono only. It is also worth noting that before this fix, running ParallelForEachAsyncTests on iOS device + ARM64 + interpreter was observed to cause hangs, GC crashes and other unstable runtime behaviour (due to memory reordering). With the fix, ParallelForEachAsyncTests passes without any observed issues.

It appears that CI doesn't run tests using iOS device + ARM64 + interpreter, so it won't expose these issues. That is why a functional test was included as well, since we can control the exact configuration needed to reproduce the issue and prevent future regressions from happening.

Copilot AI review requested due to automatic review settings, May 14, 2025 18:25
github-actions bot added the needs-area-label label, May 14, 2025
@Copilot (Contributor) left a comment

Pull Request Overview

This PR fixes an issue with C11 atomic operations on ARM64 by ensuring that a full memory barrier is issued after each atomic operation, aligning the behavior with that of the Mono and CoreCLR JIT compilers. Key changes include:

  • Adding C11_MEMORY_ORDER_SEQ_CST calls after atomic operations in src/mono/mono/utils/atomic.h.
  • Introducing a functional test in the iOS device test project (using Mono interpreter) to catch memory reordering issues.
  • Adding a new test in the ParallelForEachAsyncTests to validate the fix.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File descriptions:

  • src/tests/FunctionalTests/iOS/Device/ParallelForEachAsync/iOS.Device.ParallelForEachAsync.Test.csproj: New test project targeting iOS/arm64 with Mono interpreter configurations.
  • src/tests/FunctionalTests/iOS/Device/ParallelForEachAsync/Program.cs: Functional test harness executing the ParallelForEachAsync workload.
  • src/mono/mono/utils/atomic.h: Updated atomic operations to include a full memory barrier for ARM64.
  • src/libraries/System.Threading.Tasks.Parallel/tests/ParallelForEachAsyncTests.cs: Added new test validating the atomic ordering fix.

@lateralusX (Member, Author)

Kudos to @kotlarmilos for the work on reproduction steps.

@vitek-karas (Member)

@dotnet/jit-contrib could somebody from the JIT team review this change please?

@AndyAyersMS (Member)

@kunalspathak can you review?

@kunalspathak kunalspathak requested a review from VSadov May 15, 2025 20:50
@JulieLeeMSFT (Member)

Ping @VSadov for a code review. CC @mangod9. Kunal is reviewing it, but it would be great if you could review it too. Steve needs to get this in for June servicing, due tomorrow.

{
[DllImport("__Internal")]
public static extern void mono_ios_set_summary (string value);

Inline review comment (Member):

nit: Since this test is not validating any observed result, but just hoping that the test doesn't crash, it might be good to include a line or two on what this test is validating.

@kunalspathak (Member) left a comment

LGTM. A few things:

> That is why a functional test was included as well, since we can control the exact configuration needed to reproduce the issue and prevent future regressions from happening.

Are there extra legs that should be run to make sure this is well tested?

I am curious, were any performance measurements done to see the impact of adding a full memory barrier, because they are expensive. Of course, correctness is more important than performance, but we could identify just the spots where a full barrier is needed and remove it from the other places. Just a thought.

@VSadov (Member) commented May 15, 2025

> The memory reorder issues observed in #114262 have not been reproduced when the LSE instructions are used, probably because reordering is less likely with one instruction instead of a 2+ instruction sequence, and loads/stores can't be reordered past the full acquire/release barriers, only into them. The guarantee offered by swpal/casal is still acquire/release semantics without a full barrier, so in theory it could end up with similar reorder issues.

Not related to this PR specifically, since it follows patterns we use all over, but we may want to get some clarity on this from the ARM folks.

The spec indeed says ACQ/REL, so in theory casal can cause all the same troubles as the LL/SC implementation. In practice we do not see that happening, possibly due to an implementation detail of all current implementations (i.e. using cache coherence protocols to implement it, maybe?).

As I remember, the problems with LL/SC were with effects leaking into the span between the LL and the SC. It is hard to see how that can happen with a single instruction, but I would not guess at all the possibilities of how that works in hardware - can casal be microcoded into something like LL/SC?

Basically, should we be worried about one day needing a dmb after LSE atomics?

@VSadov (Member) left a comment
LGTM!

@steveisok (Member)

> Are there extra legs that should be run to make sure this is well tested?

The iOS.Device.ParallelForEachAsync.Test functional test is part of the smoke tests that run on this PR and on any subsequent one once it lands. The test added to System.Threading.Tasks will run on rolling builds or when runtime-extra-platforms is manually started.

> I am curious, were any performance measurements done to see the impact of adding a full memory barrier, because they are expensive. Of course, correctness is more important than performance, but we could identify just the spots where a full barrier is needed and remove it from the other places. Just a thought.

To my knowledge this has yet to run in the lab. I'll see if we have any numbers and, if they are significant, will follow up after the fact.

@steveisok (Member)

/backport to release/9.0


Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/15057610527

@steveisok steveisok merged commit 1307a67 into dotnet:main May 16, 2025
129 of 132 checks passed
@lateralusX (Member, Author)

> I am curious, were any performance measurements done to see the impact of adding a full memory barrier, because they are expensive. Of course, correctness is more important than performance, but we could identify just the spots where a full barrier is needed and remove it from the other places. Just a thought.

In .NET 8 we used GCC atomics that we made fully sequentially consistent, meaning we added full memory barriers on both sides of the atomic instruction on ARM and ARM64, so moving to just one barrier will still be more performant than .NET 8, but a little slower than .NET 9, which on the other hand has correctness issues due to being too weak. Since Mono has been running with these strong guarantees on its internal atomic operations for the last 9 years, used both by internal runtime code and by the Interlocked implementation, we would need a full audit and a switch to an alternative, weaker atomic implementation where possible, probably done case by case to avoid regressing again.

@lateralusX (Member, Author)

> The memory reorder issues observed in #114262 have not been reproduced when the LSE instructions are used, probably because reordering is less likely with one instruction instead of a 2+ instruction sequence, and loads/stores can't be reordered past the full acquire/release barriers, only into them. The guarantee offered by swpal/casal is still acquire/release semantics without a full barrier, so in theory it could end up with similar reorder issues.
>
> Not related to this PR specifically, since it follows patterns we use all over, but we may want to get some clarity on this from the ARM folks.
>
> The spec indeed says ACQ/REL, so in theory casal can cause all the same troubles as the LL/SC implementation. In practice we do not see that happening, possibly due to an implementation detail of all current implementations (i.e. using cache coherence protocols to implement it, maybe?).
>
> As I remember, the problems with LL/SC were with effects leaking into the span between the LL and the SC. It is hard to see how that can happen with a single instruction, but I would not guess at all the possibilities of how that works in hardware - can casal be microcoded into something like LL/SC?
>
> Basically, should we be worried about one day needing a dmb after LSE atomics?

Yup, this was my concern as well. The issue has not been observed when using SWPAL/CASAL/LDADDAL; those instructions give acquire/release semantics and not a full memory barrier. The LDXR/STXR/CMP sequence gives acquire/release semantics too, without a full memory barrier, but our code generation emits an explicit full memory barrier after it, since that is what our interlocked xchg/cas operations expect. On Mono we currently don't use LSE instructions in our JIT, and with this PR we will add a full barrier after calling the C11 atomics as well, meaning that even if the underlying compiler decides to use LSE instructions for our C11 atomic usage, we will end up with a full memory barrier. Our LLVM code generation is even stronger and in line with our old GCC atomics support: a full barrier before and after atomic operations. CoreCLR, on the other hand, handles it differently: if LSE instructions are used in CodeGen::genLockedInstructions and CodeGen::genCodeForCmpXchg, we will only end up with acquire/release semantics without full memory barrier support.

@linnkrb commented May 22, 2025

Can I get this for .NET 9 somewhere, just to try it out and see if it fixes our issues?
