Use fetch_xxx for atomic access if possible

# Tasks

Inspect existing atomic memory access where `compare_exchange` or `fetch_update` is used.

-   Use `fetch_add`, `fetch_sub`, `fetch_and`, `fetch_or`, `fetch_xor`, `swap`, etc. if appropriate.
-   Use `fetch_update` instead of `compare_exchange` if possible.
-   Use `MetadataSpec::fetch_{and,or}_metadata` instead of `MetadataSpec::store_atomic`.

# Rationale

## Unconditional atomic read-modify-write (RMW) operations

Simple unconditional atomic RMW operations, including `fetch_add`, `fetch_sub`, `fetch_and`, `fetch_or`, `fetch_xor`, `swap`, etc., can be compiled into single machine instructions without using loop, if the machine supports such operations directly.  For example,

-   `fetch_or` can be implemented with
    -   x86: `LOCK OR`
    -   ARMv8: `LDSET`
    -   RISC-V: `AMOOR`

They are relatively more intuitive and easier to use than implementing them manually using `fetch_update` or `compare_exchange`.  All of those operations return the old value in the memory before the operation, which is usually what we expect.  What's more, those operations always succeed, which greatly simplifies the control flow.

And they are faster.  If the architecture provides those instructions directly. they are noticeably faster than implementing them manually using `fetch_update` or `compare_exchange`.  Even if the architecture does not have dedicated instructions for those operations, the compiler (or the runtime) provides implementations for those operations that only use one level of loop with LL/SC, whereas implementing them manually in Rust with `fetch_update` and `compare_exchange` will typically result in an unnecessary two-level loop.  (See below for more details)

I (@wks) did some experiments and it shows obvious performance improvement after replacing `compare_exchange` loops with simple `fetch_xxx` operations if the operation happens on hot paths.  See: https://github.com/mmtk/mmtk-core/pull/849#issuecomment-1601023886

## Prefer `fetch_update` over `compare_exchange`

`fetch_update` is more concise than `compare_exchange`.

Compare exchange is usually used in a loop.  For example,

```rust
let a: &AtomicUsize = ...;

loop {
    let old_value = a.load(...);

    if !need_change(old_value) {
        break;
    }

    let new_value = compute(old_value);
    let result = a.compare_exchange_weak(old_value, new_value, ...);
    if result.is_ok() {
        break;
    }
}
```

 The same idea in the code above can be expressed with `fetch_update` more concisely, without an explicit loop.

```rust
let a: &AtomicUsize = ...;

a.fetch_update(/* insert memory orders here */, |old_value| {
    if need_change(old_value) {
        let new_value = compute(old_value);
        Some(new_value)
    } else {
        None
    }
});

// or more concisely
a.fetch_update(/* insert memory orders here */, |old_value| {
    need_change(old_value).then(|old_value| { compute(old_value) })
});
```

`fetch_update` may be implemented faster than `compare_exchange`.  If an architecture uses [load-link/store-conditional (LL/SC)](https://en.wikipedia.org/wiki/Load-link/store-conditional), the same idea as the code above can be implemented as the following pseudo-code:

```rust
let a: &AtomicUsize = ...;

loop {
    let old_value = a.load_link();  // pseudocode. Load from `a` and link `a` with the current processor.  Break previous links of `a`.

    if !need_change(old_value) {
        break;
    }

    let new_value = compute(old_value);
    let successful = a.store_conditional(new_value);  // pseudocode.  Store into `a` only if `a` is still linked to the current processor.
    if successful {
        break;
    }
}
```

Using `fetch_update` gives the compiler the opportunity to do so.  Unfortunately, the `rustc` compiler (currently 1.70) still compiles `fetch_update` into a two-level loop with cmpxchg implemented by the inner loop for both the AArch64 target and the RISCV64gc target.

## Use atomic and/or operation instead of store for metadata access.

Some metadata, such as the mark bits, the forwarding bits and the valid-object (VO) bits, are only one or two bits per object.  We provided atomic load/store operations as well as atomic and/or/xor operations for those metadata.  However, unlike atomic RMW operations on whole bytes, if we are only setting or clearing some bits, then `store` may not be the most efficient tool.

For example, if we only want to write bit 2 and bit 3, we need to

1.  Load the `old_value` from the memory
2.  Mask and construct the `new_value` to be `old_value & !0b00001100 | (bits_to_set << 2)`, assuming `bits_to_set` only has two its.
3.  If the memory still contains `old_value`, store `new_value` to it (i.e. compare exchange); otherwise go back to step 1.

That involves a compare-exchange loop which is inefficient.  However, the same operation can be done with a single `fetch_or` operation:

```rust
metadata_byte.fetch_or(bits_to_set << 2);
```

The `or` operation will not change unrelated bytes.  Similarly, clearing bits can be implemented with `fetch_and`, for example:

```rust
metadata_byte.fetch_and(!(bits_to_clear << 2));
```

The `fetch_xxx` API has been added to metadata (see https://github.com/mmtk/mmtk-core/issues/651).  Now we just need to use it.

# Appendix

The following program can be used to observe the generated assembly code on different architectures.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

#[inline(never)]
#[no_mangle]
fn fetch_and(loc: &mut AtomicU32, num: u32) {
    loc.fetch_and(num, Ordering::SeqCst);
}

#[inline(never)]
#[no_mangle]
fn fetch_and_with_update(loc: &mut AtomicU32, num: u32) {
    loc.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |old| Some(old & num))
        .unwrap();
}

#[inline(never)]
#[no_mangle]
fn fetch_and_with_cas(loc: &mut AtomicU32, num: u32) {
    loop {
        let old = loc.load(Ordering::SeqCst);
        let new = old & num;
        let result = loc.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::SeqCst);
        if result.is_ok() {
            break;
        }
    }
}

fn main() {
    let mut myval = AtomicU32::new(0x12345678);
    fetch_and(&mut myval, 0x55aa55aa);
    println!("{:x}", myval.load(Ordering::SeqCst));

    let mut myval2 = AtomicU32::new(0x12345678);
    fetch_and_with_update(&mut myval2, 0x55aa55aa);
    println!("{:x}", myval2.load(Ordering::SeqCst));

    let mut myval3 = AtomicU32::new(0x12345678);
    fetch_and_with_cas(&mut myval3, 0x55aa55aa);
    println!("{:x}", myval3.load(Ordering::SeqCst));
}
```

Save it as `fetch_xxx_test.rs` and compile it with

```
rustc -O --emit asm --target aarch64-unknown-linux-gnu fetch_xxx_test.rs
```

You can also try other targets, such as `riscv64gc-unknown-linux-gnu`.

You can also try it on https://godbolt.org/, and you need to specify a Rust compiler and set the compiler options to specify optimisation level and target.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use fetch_xxx for atomic access if possible #840

Tasks

Rationale

Unconditional atomic read-modify-write (RMW) operations

Prefer `fetch_update` over `compare_exchange`

Use atomic and/or operation instead of store for metadata access.

Appendix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Use fetch_xxx for atomic access if possible #840

Description

Tasks

Rationale

Unconditional atomic read-modify-write (RMW) operations

Prefer fetch_update over compare_exchange

Use atomic and/or operation instead of store for metadata access.

Appendix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Prefer `fetch_update` over `compare_exchange`