[AArch64]: Atomic Exchange Allows Reordering past Acquire Fence #68428

@lukeg101

Description

Consider the following litmus test that exposes buggy behaviour (this is a nasty bug):

C test

{ *x = 0; *y = 0; } // fixed initial state where x and y are 0

P0 (atomic_int* y,atomic_int* x) {
  atomic_store_explicit(x,1,memory_order_relaxed);
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(y,1,memory_order_relaxed);
}
P1 (atomic_int* y,atomic_int* x) {
  atomic_exchange_explicit(y,2,memory_order_release);
  atomic_thread_fence(memory_order_acquire);
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
}

exists (P1:r0=0 /\ y=2)

where 'P1:r0=0' means that thread P1's local variable r0 has the value 0.

When simulating this test under the C/C++ memory model from its initial state, the outcome in the exists clause is forbidden by the source model. The allowed outcomes are:

{ P1:r0=0; y=1; }
{ P1:r0=1; y=1; }
{ P1:r0=1; y=2; }
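
For anyone who wants to hunt for the forbidden outcome on real hardware, here is a minimal sketch of the same litmus test as a runnable C11 program (the threads.h harness and iteration count are my own choices, not part of the original test; thread-creation overhead makes the race window small, so litmus tools such as herd7 or litmus7 are the more reliable way to explore the outcomes):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x, y;
int r0;

int p0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return 0;
}

int p1(void *arg) {
    (void)arg;
    /* The result must be discarded to trigger the buggy codegen. */
    atomic_exchange_explicit(&y, 2, memory_order_release);
    atomic_thread_fence(memory_order_acquire);
    r0 = atomic_load_explicit(&x, memory_order_relaxed);
    return 0;
}

int main(void) {
    for (long i = 0; i < 1000000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        thrd_t t0, t1;
        thrd_create(&t0, p0, NULL);
        thrd_create(&t1, p1, NULL);
        thrd_join(t0, NULL);
        thrd_join(t1, NULL);
        if (r0 == 0 && atomic_load(&y) == 2)
            printf("forbidden outcome {P1:r0=0; y=2} at iteration %ld\n", i);
    }
    return 0;
}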

If we compile this with clang trunk for armv8.2-a using -O2 (https://godbolt.org/z/Grha1988f), we get the following test:

AArch64 test

{ [P1_r0]=0;[x]=0;[y]=0;
  uint64_t %P0_x=x; uint64_t %P0_y=y;
  uint64_t %P1_P1_r0=P1_r0;uint64_t %P1_x=x;uint64_t %P1_y=y }

  P0                |  P1                    ;
   MOV W9,#1        |   MOV W9,#2            ;
   STR W9,[X%P0_x]  |   SWP W9, WZR, [X%P1_y];
   DMB ISH          |   DMB ISHLD            ;
   STR W9,[X%P0_y]  |   LDR W8,[X%P1_x]      ;
   RET              |   STR W8,[X%P1_P1_r0]  ;
                    |   RET                  ;


exists (P1_r0=0 /\ [y]=2)

Under the AArch64 model, this test permits the following outcomes:

{ P1_r0=0; [y]=1; }
{ P1_r0=0; [y]=2; } <-- Forbidden by the source model: a bug!
{ P1_r0=1; [y]=1; }
{ P1_r0=1; [y]=2; }

So what is going on? The SWP instruction, introduced with the LSE extension in Armv8.1-A, is not regarded as performing a read for the purposes of the acquire fence when its destination register is the zero register. In that case the write to y can be reordered past the DMB ISHLD fence, so that the load of x on P1 is observed before the store to y. This yields the outcome {P1_r0=0; y=2} when the load of x on P1 occurs before the write of x on P0, and the write to y on P1 occurs after the store to y on P0.

This is caused by LLVM's dead-register-definitions pass, which rewrites the destination register of SWP to the zero register because the result is not used in the source program. This optimisation is applied at -O1 and above, which is why we do not observe this bug with clang -O0. GCC does not exhibit the bug (see the Godbolt link above), since it leaves the destination register intact.

My advice is to not apply this optimisation to SWP, so that its destination register is left intact, as clang -O0 already does.
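
Until that is fixed, a possible source-level workaround (my own untested sketch, not something from this issue) is to force the result of the exchange to be kept alive, so the dead-register-definitions pass cannot rewrite SWP's destination to WZR:

/* In P1 of the litmus test above: */
int tmp = atomic_exchange_explicit(y, 2, memory_order_release);
__asm__ volatile("" : : "r"(tmp)); /* pretend to read tmp, keeping its def alive */
atomic_thread_fence(memory_order_acquire);
int r0 = atomic_load_explicit(x, memory_order_relaxed);

The empty inline asm consumes tmp in a register, so the compiler can no longer treat the SWP destination as dead.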

The official Arm AArch64 memory model has been updated to reflect this behaviour, which is why I am only catching it now.

So what does this mean? It means there are thread-local optimisations that influence the semantics of concurrent programs. This is a standard reordering bug as far as concurrency is concerned, but it can only be observed when the value of the atomic_exchange is not kept around to inspect. In other words, it is a new class of heisenbug! This raises the question of whether there are more bugs like this.
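
As one speculative candidate (my own extrapolation, not confirmed in this issue): other LSE read-modify-writes whose result is discarded can be lowered to their store-only forms, e.g. LDADD to STADD, which would similarly drop the read that an acquire fence is supposed to order against:

#include <stdatomic.h>

void maybe_affected(atomic_int *y, atomic_int *x, int *out) {
    /* Result discarded: on AArch64 with LSE this may be lowered to STADD,
       a store-only form, raising the same question for the acquire fence. */
    atomic_fetch_add_explicit(y, 1, memory_order_release);
    atomic_thread_fence(memory_order_acquire);
    *out = atomic_load_explicit(x, memory_order_relaxed);
}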
