Skip to content

Use fetch_xxx for atomic access if possible #840

Open
@wks

Description

@wks

Tasks

Inspect existing atomic memory access where compare_exchange or fetch_update is used.

  • Use fetch_add, fetch_sub, fetch_and, fetch_or, fetch_xor, swap, etc. if appropriate.
  • Use fetch_update instead of compare_exchange if possible.
  • Use MetadataSpec::fetch_{and,or}_metadata instead of MetadataSpec::store_atomic.

Rationale

Unconditional atomic read-modify-write (RMW) operations

Simple unconditional atomic RMW operations, including fetch_add, fetch_sub, fetch_and, fetch_or, fetch_xor, swap, etc., can be compiled into single machine instructions without using loop, if the machine supports such operations directly. For example,

  • fetch_or can be implemented with
    • x86: LOCK OR
    • ARMv8: LDSET
    • RISC-V: AMOOR

They are relatively more intuitive and easier to use than implementing them manually using fetch_update or compare_exchange. All of those operations return the old value in the memory before the operation, which is usually what we expect. What's more, those operations always succeed, which greatly simplifies the control flow.

And they are faster. If the architecture provides those instructions directly. they are noticeably faster than implementing them manually using fetch_update or compare_exchange. Even if the architecture does not have dedicated instructions for those operations, the compiler (or the runtime) provides implementations for those operations that only use one level of loop with LL/SC, whereas implementing them manually in Rust with fetch_update and compare_exchange will typically result in an unnecessary two-level loop. (See below for more details)

I (@wks) did some experiments and it shows obvious performance improvement after replacing compare_exchange loops with simple fetch_xxx operations if the operation happens on hot paths. See: #849 (comment)

Prefer fetch_update over compare_exchange

fetch_update is more concise than compare_exchange.

Compare exchange is usually used in a loop. For example,

let a: &AtomicUsize = ...;

loop {
    let old_value = a.load(...);

    if !need_change(old_value) {
        break;
    }

    let new_value = compute(old_value);
    let result = a.compare_exchange_weak(old_value, new_value, ...);
    if result.is_ok() {
        break;
    }
}

The same idea in the code above can be expressed with fetch_update more concisely, without an explicit loop.

let a: &AtomicUsize = ...;

a.fetch_update(/* insert memory orders here */, |old_value| {
    if need_change(old_value) {
        let new_value = compute(old_value);
        Some(new_value)
    } else {
        None
    }
});

// or more concisely
a.fetch_update(/* insert memory orders here */, |old_value| {
    need_change(old_value).then(|old_value| { compute(old_value) })
});

fetch_update may be implemented faster than compare_exchange. If an architecture uses load-link/store-conditional (LL/SC), the same idea as the code above can be implemented as the following pseudo-code:

let a: &AtomicUsize = ...;

loop {
    let old_value = a.load_link();  // pseudocode. Load from `a` and link `a` with the current processor.  Break previous links of `a`.

    if !need_change(old_value) {
        break;
    }

    let new_value = compute(old_value);
    let successful = a.store_conditional(new_value);  // pseudocode.  Store into `a` only if `a` is still linked to the current processor.
    if successful {
        break;
    }
}

Using fetch_update gives the compiler the opportunity to do so. Unfortunately, the rustc compiler (currently 1.70) still compiles fetch_update into a two-level loop with cmpxchg implemented by the inner loop for both the AArch64 target and the RISCV64gc target.

Use atomic and/or operation instead of store for metadata access.

Some metadata, such as the mark bits, the forwarding bits and the valid-object (VO) bits, are only one or two bits per object. We provided atomic load/store operations as well as atomic and/or/xor operations for those metadata. However, unlike atomic RMW operations on whole bytes, if we are only setting or clearing some bits, then store may not be the most efficient tool.

For example, if we only want to write bit 2 and bit 3, we need to

  1. Load the old_value from the memory
  2. Mask and construct the new_value to be old_value & !0b00001100 | (bits_to_set << 2), assuming bits_to_set only has two its.
  3. If the memory still contains old_value, store new_value to it (i.e. compare exchange); otherwise go back to step 1.

That involves a compare-exchange loop which is inefficient. However, the same operation can be done with a single fetch_or operation:

metadata_byte.fetch_or(bits_to_set << 2);

The or operation will not change unrelated bytes. Similarly, clearing bits can be implemented with fetch_and, for example:

metadata_byte.fetch_and(!(bits_to_clear << 2));

The fetch_xxx API has been added to metadata (see #651). Now we just need to use it.

Appendix

The following program can be used to observe the generated assembly code on different architectures.

use std::sync::atomic::{AtomicU32, Ordering};

#[inline(never)]
#[no_mangle]
fn fetch_and(loc: &mut AtomicU32, num: u32) {
    loc.fetch_and(num, Ordering::SeqCst);
}

#[inline(never)]
#[no_mangle]
fn fetch_and_with_update(loc: &mut AtomicU32, num: u32) {
    loc.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |old| Some(old & num))
        .unwrap();
}

#[inline(never)]
#[no_mangle]
fn fetch_and_with_cas(loc: &mut AtomicU32, num: u32) {
    loop {
        let old = loc.load(Ordering::SeqCst);
        let new = old & num;
        let result = loc.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::SeqCst);
        if result.is_ok() {
            break;
        }
    }
}

fn main() {
    let mut myval = AtomicU32::new(0x12345678);
    fetch_and(&mut myval, 0x55aa55aa);
    println!("{:x}", myval.load(Ordering::SeqCst));

    let mut myval2 = AtomicU32::new(0x12345678);
    fetch_and_with_update(&mut myval2, 0x55aa55aa);
    println!("{:x}", myval2.load(Ordering::SeqCst));

    let mut myval3 = AtomicU32::new(0x12345678);
    fetch_and_with_cas(&mut myval3, 0x55aa55aa);
    println!("{:x}", myval3.load(Ordering::SeqCst));
}

Save it as fetch_xxx_test.rs and compile it with

rustc -O --emit asm --target aarch64-unknown-linux-gnu fetch_xxx_test.rs

You can also try other targets, such as riscv64gc-unknown-linux-gnu.

You can also try it on https://godbolt.org/, and you need to specify a Rust compiler and set the compiler options to specify optimisation level and target.

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-generalArea: all code base (issues with this label may be divided into more concrete issues)C-refactoringCategory: RefactoringF-good-first-issueCall For Participation: Suitable issues for first-time contributorsG-performanceGoal: PerformanceG-safetyGoal: SafetyP-lowPriority: Low. A low-priority issue won't be scheduled and assigned. Any help is welcome.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions