[AArch64]: Atomic Exchange Allows Reordering past Acquire Fence #68428

@lukeg101

Description

Consider the following litmus test that exposes buggy behaviour (this is a nasty bug):

C test

{ *x = 0; *y = 0; } // fixed initial state where x and y are 0

P0 (atomic_int* y,atomic_int* x) {
  atomic_store_explicit(x,1,memory_order_relaxed);
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(y,1,memory_order_relaxed);
}
P1 (atomic_int* y,atomic_int* x) {
  atomic_exchange_explicit(y,2,memory_order_release);
  atomic_thread_fence(memory_order_acquire);
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
}

exists (P1:r0=0 /\ y=2)

where 'P1:r0=0' means that thread P1's local variable r0 has the value 0.

When simulating this test under the C/C++ memory model from its initial state, the outcome in the exists clause is forbidden by the source model. The allowed outcomes are:

{ P1:r0=0; y=1; }
{ P1:r0=1; y=1; }
{ P1:r0=1; y=2; }
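
For anyone who wants to hunt for the forbidden outcome on real hardware, here is a minimal sketch of the same litmus test as a runnable C11 program (the threads.h harness and iteration count are my own choices, not part of the original test; thread-creation overhead makes the race window small, so litmus tools such as herd7 or litmus7 are the more reliable way to explore the outcomes):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x, y;
int r0;

int p0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return 0;
}

int p1(void *arg) {
    (void)arg;
    /* The result must be discarded to trigger the buggy codegen. */
    atomic_exchange_explicit(&y, 2, memory_order_release);
    atomic_thread_fence(memory_order_acquire);
    r0 = atomic_load_explicit(&x, memory_order_relaxed);
    return 0;
}

int main(void) {
    for (long i = 0; i < 1000000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        thrd_t t0, t1;
        thrd_create(&t0, p0, NULL);
        thrd_create(&t1, p1, NULL);
        thrd_join(t0, NULL);
        thrd_join(t1, NULL);
        if (r0 == 0 && atomic_load(&y) == 2)
            printf("forbidden outcome {P1:r0=0; y=2} at iteration %ld\n", i);
    }
    return 0;
}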

If we compile this with clang trunk for armv8.2-a using -O2 (https://godbolt.org/z/Grha1988f), we get the following test:

AArch64 test

{ [P1_r0]=0;[x]=0;[y]=0;
  uint64_t %P0_x=x; uint64_t %P0_y=y;
  uint64_t %P1_P1_r0=P1_r0;uint64_t %P1_x=x;uint64_t %P1_y=y }

  P0                |  P1                    ;
   MOV W9,#1        |   MOV W9,#2            ;
   STR W9,[X%P0_x]  |   SWP W9, WZR, [X%P1_y];
   DMB ISH          |   DMB ISHLD            ;
   STR W9,[X%P0_y]  |   LDR W8,[X%P1_x]      ;
   RET              |   STR W8,[X%P1_P1_r0]  ;
                    |   RET                  ;


exists (P1_r0=0 /\ [y]=2)

Under the AArch64 model, this test permits the following outcomes:

{ P1_r0=0; [y]=1; }
{ P1_r0=0; [y]=2; } <-- Forbidden by the source model: a bug!
{ P1_r0=1; [y]=1; }
{ P1_r0=1; [y]=2; }

So what is going on? The SWP instruction, introduced with the LSE extension in Armv8.1-A, is not regarded as performing a read for the purposes of the acquire fence when its destination register is the zero register. In that case the write to y can be reordered past the DMB ISHLD fence, so that the load of x on P1 is observed before the store to y. This yields the outcome {P1_r0=0; y=2} when the load of x on P1 occurs before the write of x on P0, and the write to y on P1 occurs after the store to y on P0.

This is caused by LLVM's dead-register-definitions pass, which rewrites the destination register of SWP to the zero register because the result is not used in the source program. This optimisation is applied at -O1 and above, which is why we do not observe this bug with clang -O0. GCC does not exhibit the bug (see the Godbolt link above), since it leaves the destination register intact.

My advice is to not apply this optimisation to SWP, so that its destination register is left intact, as clang -O0 already does.
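
Until that is fixed, a possible source-level workaround (my own untested sketch, not something from this issue) is to force the result of the exchange to be kept alive, so the dead-register-definitions pass cannot rewrite SWP's destination to WZR:

/* In P1 of the litmus test above: */
int tmp = atomic_exchange_explicit(y, 2, memory_order_release);
__asm__ volatile("" : : "r"(tmp)); /* pretend to read tmp, keeping its def alive */
atomic_thread_fence(memory_order_acquire);
int r0 = atomic_load_explicit(x, memory_order_relaxed);

The empty inline asm consumes tmp in a register, so the compiler can no longer treat the SWP destination as dead.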

The official Arm AArch64 memory model has been updated to reflect this behaviour, which is why I am only catching it now.

So what does this mean? It means there are thread-local optimisations that influence the semantics of concurrent programs. This is a standard reordering bug as far as concurrency is concerned, but it can only be observed when the value of the atomic_exchange is not kept around to inspect. In other words, it is a new class of heisenbug! This raises the question of whether there are more bugs like this.
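
As one speculative candidate (my own extrapolation, not confirmed in this issue): other LSE read-modify-writes whose result is discarded can be lowered to their store-only forms, e.g. LDADD to STADD, which would similarly drop the read that an acquire fence is supposed to order against:

#include <stdatomic.h>

void maybe_affected(atomic_int *y, atomic_int *x, int *out) {
    /* Result discarded: on AArch64 with LSE this may be lowered to STADD,
       a store-only form, raising the same question for the acquire fence. */
    atomic_fetch_add_explicit(y, 1, memory_order_release);
    atomic_thread_fence(memory_order_acquire);
    *out = atomic_load_explicit(x, memory_order_relaxed);
}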
