Conversation

@jgaskins (Contributor) commented Jan 30, 2024

This solution is the same as the one used in #13050.

The following code is expected to output `1000000`, preceded by the time it took to run:

```
mutex = Mutex.new
numbers = Array(Int32).new(initial_capacity: 1_000_000)
done = Channel(Nil).new
concurrency = 20
iterations = 1_000_000 // concurrency
concurrency.times do
  spawn do
    iterations.times { mutex.synchronize { numbers << 0 } }
  ensure
    done.send nil
  end
end

start = Time.monotonic
concurrency.times { done.receive }
print Time.monotonic - start
print ' '
sleep 100.milliseconds # Wait just a bit longer to be sure the discrepancy isn't due to a *different* race condition
pp numbers.size
```

Before this commit, on an Apple M1 CPU, the array size would be anywhere from 880k to 970k, but I never observed it reach 1M. Here is a sample:

```
$ repeat 20 (CRYSTAL_WORKERS=10 ./mutex_check)
00:00:00.119271625 881352
00:00:00.111249083 936709
00:00:00.102355208 946428
00:00:00.116415166 926724
00:00:00.127152583 899899
00:00:00.097160792 964577
00:00:00.120564958 930859
00:00:00.122803000 917583
00:00:00.093986834 954112
00:00:00.079212333 967772
00:00:00.093168208 953491
00:00:00.102553834 962147
00:00:00.091601625 967304
00:00:00.108157208 954855
00:00:00.080879666 944870
00:00:00.114638042 930429
00:00:00.093617083 956496
00:00:00.112108959 940205
00:00:00.092837875 944993
00:00:00.097882625 916220
```

This indicates that some of the mutex locks were getting through when they should not have been. With this commit, using the exact same parameters (built with `--release -Dpreview_mt` and run with `CRYSTAL_WORKERS=10` to spread out across all 10 cores), these are the results I'm seeing:

```
00:00:00.078898166 1000000
00:00:00.072308084 1000000
00:00:00.047157000 1000000
00:00:00.088043834 1000000
00:00:00.060784625 1000000
00:00:00.067710250 1000000
00:00:00.081070750 1000000
00:00:00.065572208 1000000
00:00:00.065006958 1000000
00:00:00.061041541 1000000
00:00:00.059648291 1000000
00:00:00.078100125 1000000
00:00:00.050676250 1000000
00:00:00.049395875 1000000
00:00:00.069352334 1000000
00:00:00.063897833 1000000
00:00:00.067534333 1000000
00:00:00.070290833 1000000
00:00:00.067361500 1000000
00:00:00.078021833 1000000
```

Note that it's not only correct, but also significantly faster.
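
For illustration, here is a minimal sketch of the idea behind the change, using a hypothetical `ToyLock` rather than the real `Mutex` internals, and assuming Crystal's `Atomic::Ops.fence` intrinsic as the barrier (the actual patch may differ in names and placement):

```
# Sketch only: the store that releases the lock must not be reordered before
# the writes performed inside the critical section, so a full barrier is
# issued right before the relaxed store.
class ToyLock
  @state = Atomic(Int32).new(0)

  def lock : Nil
    while @state.swap(1) != 0 # spin until we observe "unlocked" and claim it
      Fiber.yield
    end
  end

  def unlock : Nil
    Atomic::Ops.fence(:sequentially_consistent, false) # barrier before release
    @state.lazy_set(0)                                 # cheap, relaxed store
  end

  def synchronize
    lock
    begin
      yield
    ensure
      unlock
    end
  end
end
```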

Fixes #13055

@Blacksmoke16 added the kind:bug, topic:stdlib:concurrency, and platform:aarch64 labels on Jan 30, 2024
@straight-shoota (Member)

Are you sure this resolves #13055 entirely and there are no other places that may need barriers?

@beta-ziliani (Member)

👀 @ysbaddaden

@ysbaddaden (Contributor)

@jgaskins

What if you replace the lazy set (`@state.lazy_set(0)`) with an explicit one (`@state.set(0)`)? Do you still need the memory barrier?

Here is for example what the linux kernel source code (v4.4) has to say:

About ARM32:

> A memory barrier is required after we get a lock, and before we release it, because V6 CPUs are assumed to have weakly ordered memory.

I assume this stands for V7 CPUs too.

About ARM64:

> The memory barriers are implicit with the load-acquire and store-release instructions.

We use sequential consistency instead of acquire/release, but that should only impact performance, and seq-cst is stronger than acquire/release anyway.

My understanding is that the atomic is enough as long as we don't break the contract (without a barrier, the CPU may reorder the lazy set before we increment the counter).
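
To make the two release shapes under discussion concrete, here is a rough sketch (hypothetical class and method names; `@state` stands in for the mutex's atomic state):

```
class LockStateExample
  @state = Atomic(Int32).new(1) # 1 = locked, 0 = unlocked

  # Option A (this PR): keep the relaxed store, but fence first so the release
  # cannot be reordered before the critical-section writes.
  def release_with_barrier : Nil
    Atomic::Ops.fence(:sequentially_consistent, false)
    @state.lazy_set(0)
  end

  # Option B (suggested above): a full atomic store, whose ordering already
  # forbids that reordering, at the cost of a heavier store instruction.
  def release_with_atomic_store : Nil
    @state.set(0)
  end
end
```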

@jgaskins (Contributor, Author)

> Are you sure this resolves #13055 entirely and there are no other places that may need barriers?

@straight-shoota I’m sure that it fixes the issues I’ve observed with thread-safety on aarch64 in load tests I’ve performed on my software. I don’t know of a way to prove that it’s fixed in all scenarios.

If you’re referring to the wording in the title of the PR, I can change it to “add memory barriers” as in #13050.

> What if you replace the lazy set (`@state.lazy_set(0)`) with an explicit one (`@state.set(0)`)? Do you still need the memory barrier?

In my tests last night, that did give me the expected values, but was slower. I don’t know how much that matters since correctness > speed (up to a point), but this implementation gave us both.

@ysbaddaden (Contributor)

@jgaskins nice, at least it proves that it's working. The speed improvement with a barrier is weird 🤔

I'd be interested to see the performance impact when using acquire/release semantics on the atomics (without the barrier) instead of sequential consistency 👀

@ysbaddaden (Contributor) commented Jan 30, 2024

We might get better performance by using LSE atomics from ARMv8.1 (e.g. `ldadda`) instead of the legacy LL/SC ones (e.g. `ldaxr` + `stxr`). Disassembling cross-compiled objects, I noticed that LLVM generates LL/SC atomics by default. Apparently we can use `--mattr=+lse` to generate LSE atomics 👀

EDIT: confirmed, by default LLVM will generate LL/SC atomics, but specifying `--mattr=+lse` will use the LSE atomics instead. I'm looking into the fix & performance issue.

@ysbaddaden (Contributor)

I ran the example code from the PR description on a Neoverse-N1 server 🤩 with 16 worker threads.

  • Crystal 1.11.2 is the stock release (no patches)
  • jgaskins patch is this PR (add a memory barrier to mutex)
  • ysbaddaden patch is replacing `#lazy_set` with `#set(0)` and removing the barriers in both `Mutex` and `Crystal::SpinLock`.

With LL/SC atomics (`-Dpreview_mt --release`):

| Crystal 1.11.2 (LL/SC) | jgaskins patch (LL/SC) | ysbaddaden patch (LL/SC) |
| --- | --- | --- |
| 00:00:00.385133632 999870 | 00:00:00.499416919 1000000 | 00:00:00.550988029 1000000 |
| 00:00:00.371160988 999891 | 00:00:00.482909860 1000000 | 00:00:00.490515706 1000000 |
| 00:00:00.452127314 999990 | 00:00:00.343948665 1000000 | 00:00:00.439060716 1000000 |
| 00:00:00.347059963 999991 | 00:00:00.434351488 1000000 | 00:00:00.434563770 1000000 |
| 00:00:00.455184212 999994 | 00:00:00.440151883 1000000 | 00:00:00.452553157 1000000 |
| 00:00:00.484056906 999895 | 00:00:00.526242680 1000000 | 00:00:00.390584986 1000000 |
| 00:00:00.516859382 999990 | 00:00:00.376327340 1000000 | 00:00:00.446540081 1000000 |
| 00:00:00.536798222 999931 | 00:00:00.475414134 1000000 | 00:00:00.468770775 1000000 |
| 00:00:00.451565270 999997 | 00:00:00.426166719 1000000 | 00:00:00.440710567 1000000 |
| 00:00:00.449864220 999828 | 00:00:00.400186963 1000000 | 00:00:00.423722185 1000000 |
| avg: 0.444981010 | avg: 0.440511665 | avg: 0.453800997 |

With LSE atomics (`-Dpreview_mt --release --mattr=+lse`):

| Crystal 1.11.2 (LSE) | jgaskins patch (LSE) | ysbaddaden patch (LSE) |
| --- | --- | --- |
| 00:00:00.216061856 999332 | 00:00:00.249694139 1000000 | 00:00:00.240949127 1000000 |
| 00:00:00.219081074 993259 | 00:00:00.239127756 1000000 | 00:00:00.226352440 1000000 |
| 00:00:00.215822334 992114 | 00:00:00.248496972 1000000 | 00:00:00.246995643 1000000 |
| 00:00:00.221506808 989608 | 00:00:00.239560918 1000000 | 00:00:00.211989393 1000000 |
| 00:00:00.220899165 994043 | 00:00:00.235796576 1000000 | 00:00:00.234099366 1000000 |
| 00:00:00.217702506 992565 | 00:00:00.231499750 1000000 | 00:00:00.236619821 1000000 |
| 00:00:00.213177758 995030 | 00:00:00.242951419 1000000 | 00:00:00.261796132 1000000 |
| 00:00:00.231702350 990755 | 00:00:00.243229460 1000000 | 00:00:00.234265326 1000000 |
| 00:00:00.217356223 994786 | 00:00:00.256804222 1000000 | 00:00:00.248510852 1000000 |
| 00:00:00.219489556 997060 | 00:00:00.243536342 1000000 | 00:00:00.240165962 1000000 |
| avg: 0.219279962 | avg: 0.243069755 | avg: 0.238174406 |

Takeaways:

  • Locks on Crystal 1.11.2 are completely off 😱
  • LSE atomics are incredibly faster than LL/SC with 16 threads (they were similar with 4 threads) 🚀
  • Memory barriers are indeed implicit on AArch64 👍
  • I don't see a noticeable performance difference between lazy set + memory barriers and a proper set with no barriers (neither on acquire nor release) on the Neoverse N1;
  • I'm eager to compare acquire/release vs seq-cst memory orders 👀

NOTE: we might consider enabling LSE by default for AArch64, having a `-Dwithout_lse_atomics` flag, and/or checking the CPU flags to see if the feature is available at compile time.

@jgaskins (Contributor, Author)

Weird. With LSE it was slower on my M1 Mac, but ~18% faster than this PR on an Ampere Arm server on Google Cloud (T2A VM, 8 cores), which is fascinating.

| jgaskins | ysbaddaden |
| --- | --- |
| 00:00:00.408768519 1000000 | 00:00:00.329878465 1000000 |
| 00:00:00.335315272 1000000 | 00:00:00.268396861 1000000 |
| 00:00:00.360981995 1000000 | 00:00:00.272967381 1000000 |
| 00:00:00.324802272 1000000 | 00:00:00.330465706 1000000 |
| 00:00:00.321439072 1000000 | 00:00:00.275313622 1000000 |
| 00:00:00.349576434 1000000 | 00:00:00.364220269 1000000 |
| 00:00:00.439792642 1000000 | 00:00:00.365180429 1000000 |
| 00:00:00.312492590 1000000 | 00:00:00.339958506 1000000 |
| 00:00:00.363270475 1000000 | 00:00:00.338208266 1000000 |
| 00:00:00.523387650 1000000 | 00:00:00.285177182 1000000 |
| avg: 00:00:00.37398269209999996 | avg: 00:00:00.3169766687 |

@HertzDevil (Contributor)

The other part of #13055 is `Crystal::RWLock`, which is only used for garbage collection.

@straight-shoota (Member) left a comment

LGTM then!

@Sija (Contributor) commented Feb 1, 2024

Would be nice to have some spec coverage.

@jgaskins (Contributor, Author) commented Feb 2, 2024

There's a spec for it, but CI doesn't use `-Dpreview_mt`, so it doesn't catch this.

There is one CI entry that uses `-Dpreview_mt`, but it's Linux x86_64-only.
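
For reference, a spec along these lines could exercise the race once the suite is compiled with `-Dpreview_mt` and run with several workers (a hypothetical sketch, not an existing spec file):

```
require "spec"

describe Mutex do
  it "serializes concurrent writers" do
    mutex = Mutex.new
    numbers = [] of Int32
    done = Channel(Nil).new
    fibers = 20
    iterations = 10_000

    fibers.times do
      spawn do
        begin
          iterations.times { mutex.synchronize { numbers << 0 } }
        ensure
          done.send nil # signal completion even if an iteration raises
        end
      end
    end

    fibers.times { done.receive }
    numbers.size.should eq fibers * iterations
  end
end
```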

@jgaskins changed the title from "Fix Mutex on aarch64" to "Add memory barrier to Mutex#unlock on aarch64" on Feb 8, 2024
@straight-shoota added this to the 1.12.0 milestone on Feb 8, 2024
@straight-shoota merged commit db67d71 into crystal-lang:master on Feb 10, 2024
@jgaskins deleted the fix-mutex-on-aarch64 branch on February 12, 2024