Conversation

@jgaskins (Contributor) commented Jan 30, 2024

This solution is the same as the one used in #13050.

The following code is expected to output `1000000`, preceded by the time it took to run:

```
mutex = Mutex.new
numbers = Array(Int32).new(initial_capacity: 1_000_000)
done = Channel(Nil).new
concurrency = 20
iterations = 1_000_000 // concurrency
concurrency.times do
  spawn do
    iterations.times { mutex.synchronize { numbers << 0 } }
  ensure
    done.send nil
  end
end

start = Time.monotonic
concurrency.times { done.receive }
print Time.monotonic - start
print ' '
sleep 100.milliseconds # Wait just a bit longer to be sure the discrepancy isn't due to a *different* race condition
pp numbers.size
```

Before this commit, on an Apple M1 CPU, the array size would be anywhere from 880k to 970k, but I never observed it reach 1M. Here is a sample:

```
$ repeat 20 (CRYSTAL_WORKERS=10 ./mutex_check)
00:00:00.119271625 881352
00:00:00.111249083 936709
00:00:00.102355208 946428
00:00:00.116415166 926724
00:00:00.127152583 899899
00:00:00.097160792 964577
00:00:00.120564958 930859
00:00:00.122803000 917583
00:00:00.093986834 954112
00:00:00.079212333 967772
00:00:00.093168208 953491
00:00:00.102553834 962147
00:00:00.091601625 967304
00:00:00.108157208 954855
00:00:00.080879666 944870
00:00:00.114638042 930429
00:00:00.093617083 956496
00:00:00.112108959 940205
00:00:00.092837875 944993
00:00:00.097882625 916220
```

This indicates that some of the mutex locks were getting through when they should not have been. With this commit, using the exact same parameters (built with `--release -Dpreview_mt` and run with `CRYSTAL_WORKERS=10` to spread out across all 10 cores), these are the results I'm seeing:

```
00:00:00.078898166 1000000
00:00:00.072308084 1000000
00:00:00.047157000 1000000
00:00:00.088043834 1000000
00:00:00.060784625 1000000
00:00:00.067710250 1000000
00:00:00.081070750 1000000
00:00:00.065572208 1000000
00:00:00.065006958 1000000
00:00:00.061041541 1000000
00:00:00.059648291 1000000
00:00:00.078100125 1000000
00:00:00.050676250 1000000
00:00:00.049395875 1000000
00:00:00.069352334 1000000
00:00:00.063897833 1000000
00:00:00.067534333 1000000
00:00:00.070290833 1000000
00:00:00.067361500 1000000
00:00:00.078021833 1000000
```

Note that it's not only correct, but also significantly faster.
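
For illustration, here is a minimal sketch of the idea behind the change, using a hypothetical `ToyLock` rather than the real `Mutex` internals, and assuming Crystal's `Atomic::Ops.fence` intrinsic as the barrier (the actual patch may differ in names and placement):

```
# Sketch only: the store that releases the lock must not be reordered before
# the writes performed inside the critical section, so a full barrier is
# issued right before the relaxed store.
class ToyLock
  @state = Atomic(Int32).new(0)

  def lock : Nil
    while @state.swap(1) != 0 # spin until we observe "unlocked" and claim it
      Fiber.yield
    end
  end

  def unlock : Nil
    Atomic::Ops.fence(:sequentially_consistent, false) # barrier before release
    @state.lazy_set(0)                                 # cheap, relaxed store
  end

  def synchronize
    lock
    begin
      yield
    ensure
      unlock
    end
  end
end
```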

Fixes #13055

@Blacksmoke16 added the kind:bug, topic:stdlib:concurrency, and platform:aarch64 labels on Jan 30, 2024
@straight-shoota (Member)

Are you sure this resolves #13055 entirely and there are no other places that may need barriers?

@beta-ziliani (Member)

👀 @ysbaddaden

@ysbaddaden (Contributor)

@jgaskins

What if you replace the lazy set (`@state.lazy_set(0)`) with an explicit one (`@state.set(0)`)? Do you still need the memory barrier?

Here is for example what the linux kernel source code (v4.4) has to say:

About ARM32:

> A memory barrier is required after we get a lock, and before we release it, because V6 CPUs are assumed to have weakly ordered memory.

I assume this stands for V7 CPUs too.

About ARM64:

> The memory barriers are implicit with the load-acquire and store-release instructions.

We use sequential consistency instead of acquire/release, but that should only impact performance, and seq-cst is stronger than acquire/release anyway.

My understanding is that the atomic is enough as long as we don't break the contract (without a barrier, the CPU may reorder the lazy set before we increment the counter).
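
To make the two release shapes under discussion concrete, here is a rough sketch (hypothetical class and method names; `@state` stands in for the mutex's atomic state):

```
class LockStateExample
  @state = Atomic(Int32).new(1) # 1 = locked, 0 = unlocked

  # Option A (this PR): keep the relaxed store, but fence first so the release
  # cannot be reordered before the critical-section writes.
  def release_with_barrier : Nil
    Atomic::Ops.fence(:sequentially_consistent, false)
    @state.lazy_set(0)
  end

  # Option B (suggested above): a full atomic store, whose ordering already
  # forbids that reordering, at the cost of a heavier store instruction.
  def release_with_atomic_store : Nil
    @state.set(0)
  end
end
```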

@jgaskins (Contributor, Author)

> Are you sure this resolves #13055 entirely and there are no other places that may need barriers?

@straight-shoota I’m sure that it fixes the issues I’ve observed with thread-safety on aarch64 in load tests I’ve performed on my software. I don’t know of a way to prove that it’s fixed in all scenarios.

If you’re referring to the wording in the title of the PR, I can change it to “add memory barriers” as in #13050.

> What if you replace the lazy set (`@state.lazy_set(0)`) with an explicit one (`@state.set(0)`)? Do you still need the memory barrier?

In my tests last night, that did give me the expected values, but was slower. I don’t know how much that matters since correctness > speed (up to a point), but this implementation gave us both.

@ysbaddaden (Contributor)

@jgaskins nice, at least it proves that it's working. The speed improvement with a barrier is weird 🤔

I'd be interested to see the performance impact when using acquire/release semantics on the atomics (without the barrier) instead of sequential consistency 👀

@ysbaddaden (Contributor) commented Jan 30, 2024

We might get better performance by using LSE atomics from ARMv8.1 (e.g. `ldadda`) instead of the legacy LL/SC ones (e.g. `ldaxr` + `stxr`). Disassembling cross-compiled objects, I noticed that LLVM generates LL/SC atomics by default. Apparently we can use `--mattr=+lse` to generate LSE atomics 👀

EDIT: confirmed, by default LLVM will generate LL/SC atomics, but specifying `--mattr=+lse` will use the LSE atomics instead. I'm looking into the fix & performance issue.

@ysbaddaden (Contributor)

I ran the example code from the PR description on a Neoverse-N1 server 🤩 with 16 worker threads.

  • Crystal 1.11.2 is the stock release (no patches)
  • jgaskins patch is this PR (add a memory barrier to mutex)
  • ysbaddaden patch is replacing `#lazy_set` with `#set(0)` and removing the barriers in both `Mutex` and `Crystal::SpinLock`.

With LL/SC atomics (`-Dpreview_mt --release`):

| Crystal 1.11.2 (LL/SC) | jgaskins patch (LL/SC) | ysbaddaden patch (LL/SC) |
| --- | --- | --- |
| 00:00:00.385133632 999870 | 00:00:00.499416919 1000000 | 00:00:00.550988029 1000000 |
| 00:00:00.371160988 999891 | 00:00:00.482909860 1000000 | 00:00:00.490515706 1000000 |
| 00:00:00.452127314 999990 | 00:00:00.343948665 1000000 | 00:00:00.439060716 1000000 |
| 00:00:00.347059963 999991 | 00:00:00.434351488 1000000 | 00:00:00.434563770 1000000 |
| 00:00:00.455184212 999994 | 00:00:00.440151883 1000000 | 00:00:00.452553157 1000000 |
| 00:00:00.484056906 999895 | 00:00:00.526242680 1000000 | 00:00:00.390584986 1000000 |
| 00:00:00.516859382 999990 | 00:00:00.376327340 1000000 | 00:00:00.446540081 1000000 |
| 00:00:00.536798222 999931 | 00:00:00.475414134 1000000 | 00:00:00.468770775 1000000 |
| 00:00:00.451565270 999997 | 00:00:00.426166719 1000000 | 00:00:00.440710567 1000000 |
| 00:00:00.449864220 999828 | 00:00:00.400186963 1000000 | 00:00:00.423722185 1000000 |
| avg: 0.444981010 | avg: 0.440511665 | avg: 0.453800997 |

With LSE atomics (`-Dpreview_mt --release --mattr=+lse`):

| Crystal 1.11.2 (LSE) | jgaskins patch (LSE) | ysbaddaden patch (LSE) |
| --- | --- | --- |
| 00:00:00.216061856 999332 | 00:00:00.249694139 1000000 | 00:00:00.240949127 1000000 |
| 00:00:00.219081074 993259 | 00:00:00.239127756 1000000 | 00:00:00.226352440 1000000 |
| 00:00:00.215822334 992114 | 00:00:00.248496972 1000000 | 00:00:00.246995643 1000000 |
| 00:00:00.221506808 989608 | 00:00:00.239560918 1000000 | 00:00:00.211989393 1000000 |
| 00:00:00.220899165 994043 | 00:00:00.235796576 1000000 | 00:00:00.234099366 1000000 |
| 00:00:00.217702506 992565 | 00:00:00.231499750 1000000 | 00:00:00.236619821 1000000 |
| 00:00:00.213177758 995030 | 00:00:00.242951419 1000000 | 00:00:00.261796132 1000000 |
| 00:00:00.231702350 990755 | 00:00:00.243229460 1000000 | 00:00:00.234265326 1000000 |
| 00:00:00.217356223 994786 | 00:00:00.256804222 1000000 | 00:00:00.248510852 1000000 |
| 00:00:00.219489556 997060 | 00:00:00.243536342 1000000 | 00:00:00.240165962 1000000 |
| avg: 0.219279962 | avg: 0.243069755 | avg: 0.238174406 |

Takeaways:

  • Locks on Crystal 1.11.2 are completely off 😱
  • LSE atomics are incredibly faster than LL/SC with 16 threads (they were similar with 4 threads) 🚀
  • Memory barriers are indeed implicit on AArch64 👍
  • I don't see a noticeable performance difference between lazy set + memory barriers and a proper set with no barriers (neither on acquire nor release) on the Neoverse N1;
  • I'm eager to compare acquire/release vs seq-cst memory orders 👀

NOTE: we might consider enabling LSE by default for AArch64, having a `-Dwithout_lse_atomics` flag, and/or checking the CPU flags to see if the feature is available at compile time.

@jgaskins (Contributor, Author)

Weird. With LSE it was slower on my M1 Mac, but ~18% faster than this PR on an Ampere Arm server on Google Cloud (T2A VM, 8 cores), which is fascinating.

| jgaskins | ysbaddaden |
| --- | --- |
| 00:00:00.408768519 1000000 | 00:00:00.329878465 1000000 |
| 00:00:00.335315272 1000000 | 00:00:00.268396861 1000000 |
| 00:00:00.360981995 1000000 | 00:00:00.272967381 1000000 |
| 00:00:00.324802272 1000000 | 00:00:00.330465706 1000000 |
| 00:00:00.321439072 1000000 | 00:00:00.275313622 1000000 |
| 00:00:00.349576434 1000000 | 00:00:00.364220269 1000000 |
| 00:00:00.439792642 1000000 | 00:00:00.365180429 1000000 |
| 00:00:00.312492590 1000000 | 00:00:00.339958506 1000000 |
| 00:00:00.363270475 1000000 | 00:00:00.338208266 1000000 |
| 00:00:00.523387650 1000000 | 00:00:00.285177182 1000000 |
| avg: 00:00:00.37398269209999996 | avg: 00:00:00.3169766687 |

@HertzDevil (Contributor)

The other part of #13055 is `Crystal::RWLock`, which is only used for garbage collection.

@straight-shoota (Member) left a comment

LGTM then!

@Sija (Contributor) commented Feb 1, 2024

Would be nice to have some spec coverage.

@jgaskins (Contributor, Author) commented Feb 2, 2024

There's a spec for it, but CI doesn't use `-Dpreview_mt`, so it doesn't catch this.

There is one CI entry that uses `-Dpreview_mt`, but it's Linux x86_64-only.
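
For reference, a spec along these lines could exercise the race once the suite is compiled with `-Dpreview_mt` and run with several workers (a hypothetical sketch, not an existing spec file):

```
require "spec"

describe Mutex do
  it "serializes concurrent writers" do
    mutex = Mutex.new
    numbers = [] of Int32
    done = Channel(Nil).new
    fibers = 20
    iterations = 10_000

    fibers.times do
      spawn do
        begin
          iterations.times { mutex.synchronize { numbers << 0 } }
        ensure
          done.send nil # signal completion even if an iteration raises
        end
      end
    end

    fibers.times { done.receive }
    numbers.size.should eq fibers * iterations
  end
end
```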

@jgaskins changed the title from "Fix Mutex on aarch64" to "Add memory barrier to Mutex#unlock on aarch64" on Feb 8, 2024
@straight-shoota added this to the 1.12.0 milestone on Feb 8, 2024
@straight-shoota merged commit db67d71 into crystal-lang:master on Feb 10, 2024
@jgaskins deleted the fix-mutex-on-aarch64 branch on February 12, 2024