Description
Tasks
Inspect existing atomic memory access where compare_exchange
or fetch_update
is used.
- Use
fetch_add
,fetch_sub
,fetch_and
,fetch_or
,fetch_xor
,swap
, etc. if appropriate. - Use
fetch_update
instead ofcompare_exchange
if possible. - Use
MetadataSpec::fetch_{and,or}_metadata
instead ofMetadataSpec::store_atomic
.
Rationale
Unconditional atomic read-modify-write (RMW) operations
Simple unconditional atomic RMW operations, including fetch_add
, fetch_sub
, fetch_and
, fetch_or
, fetch_xor
, swap
, etc., can be compiled into single machine instructions without using loop, if the machine supports such operations directly. For example,
fetch_or
can be implemented with- x86:
LOCK OR
- ARMv8:
LDSET
- RISC-V:
AMOOR
- x86:
They are relatively more intuitive and easier to use than implementing them manually using fetch_update
or compare_exchange
. All of those operations return the old value in the memory before the operation, which is usually what we expect. What's more, those operations always succeed, which greatly simplifies the control flow.
And they are faster. If the architecture provides those instructions directly. they are noticeably faster than implementing them manually using fetch_update
or compare_exchange
. Even if the architecture does not have dedicated instructions for those operations, the compiler (or the runtime) provides implementations for those operations that only use one level of loop with LL/SC, whereas implementing them manually in Rust with fetch_update
and compare_exchange
will typically result in an unnecessary two-level loop. (See below for more details)
I (@wks) did some experiments and it shows obvious performance improvement after replacing compare_exchange
loops with simple fetch_xxx
operations if the operation happens on hot paths. See: #849 (comment)
Prefer fetch_update
over compare_exchange
fetch_update
is more concise than compare_exchange
.
Compare exchange is usually used in a loop. For example,
let a: &AtomicUsize = ...;
loop {
let old_value = a.load(...);
if !need_change(old_value) {
break;
}
let new_value = compute(old_value);
let result = a.compare_exchange_weak(old_value, new_value, ...);
if result.is_ok() {
break;
}
}
The same idea in the code above can be expressed with fetch_update
more concisely, without an explicit loop.
let a: &AtomicUsize = ...;
a.fetch_update(/* insert memory orders here */, |old_value| {
if need_change(old_value) {
let new_value = compute(old_value);
Some(new_value)
} else {
None
}
});
// or more concisely
a.fetch_update(/* insert memory orders here */, |old_value| {
need_change(old_value).then(|old_value| { compute(old_value) })
});
fetch_update
may be implemented faster than compare_exchange
. If an architecture uses load-link/store-conditional (LL/SC), the same idea as the code above can be implemented as the following pseudo-code:
let a: &AtomicUsize = ...;
loop {
let old_value = a.load_link(); // pseudocode. Load from `a` and link `a` with the current processor. Break previous links of `a`.
if !need_change(old_value) {
break;
}
let new_value = compute(old_value);
let successful = a.store_conditional(new_value); // pseudocode. Store into `a` only if `a` is still linked to the current processor.
if successful {
break;
}
}
Using fetch_update
gives the compiler the opportunity to do so. Unfortunately, the rustc
compiler (currently 1.70) still compiles fetch_update
into a two-level loop with cmpxchg implemented by the inner loop for both the AArch64 target and the RISCV64gc target.
Use atomic and/or operation instead of store for metadata access.
Some metadata, such as the mark bits, the forwarding bits and the valid-object (VO) bits, are only one or two bits per object. We provided atomic load/store operations as well as atomic and/or/xor operations for those metadata. However, unlike atomic RMW operations on whole bytes, if we are only setting or clearing some bits, then store
may not be the most efficient tool.
For example, if we only want to write bit 2 and bit 3, we need to
- Load the
old_value
from the memory - Mask and construct the
new_value
to beold_value & !0b00001100 | (bits_to_set << 2)
, assumingbits_to_set
only has two its. - If the memory still contains
old_value
, storenew_value
to it (i.e. compare exchange); otherwise go back to step 1.
That involves a compare-exchange loop which is inefficient. However, the same operation can be done with a single fetch_or
operation:
metadata_byte.fetch_or(bits_to_set << 2);
The or
operation will not change unrelated bytes. Similarly, clearing bits can be implemented with fetch_and
, for example:
metadata_byte.fetch_and(!(bits_to_clear << 2));
The fetch_xxx
API has been added to metadata (see #651). Now we just need to use it.
Appendix
The following program can be used to observe the generated assembly code on different architectures.
use std::sync::atomic::{AtomicU32, Ordering};
#[inline(never)]
#[no_mangle]
fn fetch_and(loc: &mut AtomicU32, num: u32) {
loc.fetch_and(num, Ordering::SeqCst);
}
#[inline(never)]
#[no_mangle]
fn fetch_and_with_update(loc: &mut AtomicU32, num: u32) {
loc.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |old| Some(old & num))
.unwrap();
}
#[inline(never)]
#[no_mangle]
fn fetch_and_with_cas(loc: &mut AtomicU32, num: u32) {
loop {
let old = loc.load(Ordering::SeqCst);
let new = old & num;
let result = loc.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::SeqCst);
if result.is_ok() {
break;
}
}
}
fn main() {
let mut myval = AtomicU32::new(0x12345678);
fetch_and(&mut myval, 0x55aa55aa);
println!("{:x}", myval.load(Ordering::SeqCst));
let mut myval2 = AtomicU32::new(0x12345678);
fetch_and_with_update(&mut myval2, 0x55aa55aa);
println!("{:x}", myval2.load(Ordering::SeqCst));
let mut myval3 = AtomicU32::new(0x12345678);
fetch_and_with_cas(&mut myval3, 0x55aa55aa);
println!("{:x}", myval3.load(Ordering::SeqCst));
}
Save it as fetch_xxx_test.rs
and compile it with
rustc -O --emit asm --target aarch64-unknown-linux-gnu fetch_xxx_test.rs
You can also try other targets, such as riscv64gc-unknown-linux-gnu
.
You can also try it on https://godbolt.org/, and you need to specify a Rust compiler and set the compiler options to specify optimisation level and target.