Description
A few months ago, I noticed some problems with the codegen of `leading_zeros` for RISC-V and some other architectures, and tried to fix it in a compiler-builtins PR. Today, I was looking at the function for another reason and decided to see what the codegen currently looks like. I assumed that LLVM used `__clzsi2` for `usize::leading_zeros` and then used a recursive wrapper to extend it to larger integers. It turns out that LLVM sometimes uses its own method and other times calls `__clzsi2`. I double-checked that the codegen in `compiler-builtins` is still correct. To see what I am talking about, run this code in an empty library crate with `cargo rustc --release --target=[target triple] -- --emit=asm` and check the assembly file under `target/[target triple]/release/deps/`:
```rust
#![no_std]

pub fn lz_u128(x: u128) -> usize {
    x.leading_zeros() as usize
}

pub fn actual_lz(x: usize) -> usize {
    x.leading_zeros() as usize
}

fn usize_leading_zeros_riscv(x: usize) -> usize {
    let mut x = x;
    // the number of potential leading zeros
    let mut z = usize::MAX.count_ones() as usize;
    // a temporary
    let mut t: usize;
    #[cfg(target_pointer_width = "64")]
    {
        // If the upper 32 bits of `x` are not all 0, `t` is set to `1 << 5`, otherwise
        // `t` is set to 0.
        t = ((x >= (1 << 32)) as usize) << 5;
        // If `t` was set to `1 << 5`, then the upper 32 bits are shifted down for the
        // next step to process.
        x >>= t;
        // If `t` was set to `1 << 5`, then we subtract 32 from the number of potential
        // leading zeros.
        z -= t;
    }
    #[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))]
    {
        t = ((x >= (1 << 16)) as usize) << 4;
        x >>= t;
        z -= t;
    }
    t = ((x >= (1 << 8)) as usize) << 3;
    x >>= t;
    z -= t;
    t = ((x >= (1 << 4)) as usize) << 2;
    x >>= t;
    z -= t;
    t = ((x >= (1 << 2)) as usize) << 1;
    x >>= t;
    z -= t;
    t = (x >= (1 << 1)) as usize;
    x >>= t;
    z -= t;
    // All bits except the LSB are guaranteed to be zero for this final bisection step.
    // If `x != 0`, then `x == 1` and `z - x` subtracts one potential zero from `z`.
    z - x
}

fn usize_leading_zeros_default(x: usize) -> usize {
    let mut x = x;
    // the number of potential leading zeros
    let mut z = usize::MAX.count_ones() as usize;
    // a temporary
    let mut t: usize;
    #[cfg(target_pointer_width = "64")]
    {
        t = x >> 32;
        if t != 0 {
            z -= 32;
            x = t;
        }
    }
    #[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))]
    {
        t = x >> 16;
        if t != 0 {
            z -= 16;
            x = t;
        }
    }
    t = x >> 8;
    if t != 0 {
        z -= 8;
        x = t;
    }
    t = x >> 4;
    if t != 0 {
        z -= 4;
        x = t;
    }
    t = x >> 2;
    if t != 0 {
        z -= 2;
        x = t;
    }
    // the last two bisections are combined into one conditional
    t = x >> 1;
    if t != 0 {
        z - 2
    } else {
        z - x
    }
}

pub fn expected_lz(x: usize) -> usize {
    if cfg!(any(target_arch = "riscv32", target_arch = "riscv64")) {
        usize_leading_zeros_riscv(x)
    } else {
        usize_leading_zeros_default(x)
    }
}
```
The `expected_lz` function is equivalent to what `compiler-builtins` does, while `actual_lz` shows what a plain `leading_zeros` compiles down to. `lz_u128` demonstrates another issue that I have known about for some time but never got fixed. I think I remember nagisa opening an LLVM issue for it, but I can't find the link.
The worst behavior is on `riscv32i-unknown-none-elf`, where `expected_lz` compiles down to the expected branchless 27 assembly instructions, but `actual_lz` has more assembly instructions and includes a call to `__mulsi3`, which probably makes it about as expensive to compute as a full integer division. `lz_u128` is truly horrifying.
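As far as I can tell, the `__mulsi3` comes from LLVM's generic `ctlz` expansion, which smears the highest set bit downward and then takes a population count, with the popcount itself lowered to the classic multiply-based reduction. A rough Rust rendering of that technique (my reconstruction, not the exact IR LLVM emits):

```rust
// Sketch of a smear-then-popcount clz for a 32-bit value. After the
// smear, `!x` has exactly the leading zeros of `x` set.
fn llvm_style_clz32(mut x: u32) -> u32 {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    // Standard multiply-based popcount: sum per-byte counts with a
    // multiply. On riscv32i (no M extension) the multiply becomes a
    // `__mulsi3` libcall.
    let mut v = !x;
    v = v - ((v >> 1) & 0x5555_5555);
    v = (v & 0x3333_3333) + ((v >> 2) & 0x3333_3333);
    v = (v + (v >> 4)) & 0x0f0f_0f0f;
    v.wrapping_mul(0x0101_0101) >> 24
}
```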
`riscv64gc-unknown-none-elf` does roughly the same thing, and while the multiplication could be expected to run much faster there, `expected_lz` is still much better. Whatever inlining led to `lz_u128` is not good. The correct `lz_u128` looks like what is produced by:
```rust
pub fn lz_u128_expected(x: u128) -> usize {
    let mut x = x;
    let mut z = 0;
    if ((x >> 64) as u64) != 0 {
        // the highest set bit is in the upper 64 bits; count within them
        x >>= 64;
    } else {
        // the upper 64 bits are all zero, contributing 64 leading zeros
        z += 64;
    }
    // note: `usize` needs to have bitwidth 64 here
    z + expected_lz(x as usize)
}
```
`thumbv8m.base-none-eabi` (I decided to use this as a non-RISC-V example that has no hardware clz) uses calls to `__clzsi2` instead of rolling its own thing. Take a look at the number of `__clzsi2` calls and extra instructions `lz_u128` makes vs. what the `lz_u128_expected` I coded above shows should be possible. I suspect that instead of two separate bugs, there is one bug where LLVM defines larger bitwidth clz's in terms of smaller clz's in such a way that it generates horrible code when the smaller clz's get inlined upwards into the larger ones. Whenever LLVM decides to roll its own code instead of calling `__clzsi2`, it starts with some small clz on an integer like `u8`, and by the time `u64` is reached the code is extremely garbled.
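For comparison, the clean halving scheme costs only one small-clz call, one shift, and one compare per doubling of the bitwidth. A minimal sketch of the shape I would expect when extending a 32-bit clz to 64 bits (`u32::leading_zeros` stands in for a `__clzsi2` call here):

```rust
// Sketch only: each doubling of the width adds one compare, one shift,
// and one call to the half-width clz.
fn clz_u64_via_u32(x: u64) -> u32 {
    let hi = (x >> 32) as u32;
    if hi != 0 {
        // highest set bit is in the upper half
        hi.leading_zeros()
    } else {
        // upper half is zero: 32 guaranteed leading zeros plus the lower half's
        32 + (x as u32).leading_zeros()
    }
}
```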
`aarch64-pc-windows-msvc` and x86_64 call hardware clz and do the correct extension for `lz_u128`. `expected_lz` can be ignored there, since `__clzsi2` is only used when a hardware clz cannot be used.
It is important for the embedded architectures to have a better clz, because the performance of that function underpins many other kinds of arithmetic emulation, such as soft floats.
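As one example of that dependence (a minimal sketch, not code from compiler-builtins): soft-float addition and subtraction have to renormalize the significand after a catastrophic cancellation, and the shift amount comes straight from a clz, so a slow clz taxes every such operation.

```rust
/// Sketch: renormalize a nonzero f32-style significand so its highest set
/// bit lands at bit 23 (the implicit-one position), adjusting the exponent
/// to compensate. Rounding of discarded bits is ignored for brevity.
fn renormalize(sig: u32, exp: i32) -> (u32, i32) {
    debug_assert!(sig != 0);
    // `sig.leading_zeros()` is the clz whose codegen this issue is about.
    let shift = sig.leading_zeros() as i32 - 8;
    if shift >= 0 {
        (sig << shift, exp - shift)
    } else {
        (sig >> -shift, exp - shift)
    }
}
```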