Skip to content

Extremely bad codegen of leading_zeros on several architectures #85879

Open
@AaronKutch

Description

@AaronKutch

A few months ago, I noticed some problems with the code gen of leading_zeros for RISCV and some other architectures, and tried to fix it in a compiler-builtins PR. Today, I was looking at the function for another reason and decided to see what the codegen currently looks like. I assumed that LLVM used __clzsi2 for usize::leading_zeros and then used a recursive wrapper to extend it to larger integers. It seems that it is using its own method sometimes, and other times using __clzsi2. I double checked that the codegen in compiler-builtins is still correct. To see what I am talking about, run this code in a empty library crate with cargo rustc --release --target=[target triple] -- --emit=asm and check the assembly file under target/[target triple]/release/deps/:

#![no_std]

pub fn lz_u128(x: u128) -> usize {
    x.leading_zeros() as usize
}

pub fn actual_lz(x: usize) -> usize {
    x.leading_zeros() as usize
}

fn usize_leading_zeros_riscv(x: usize) -> usize {
    let mut x = x;
    // the number of potential leading zeros
    let mut z = usize::MAX.count_ones() as usize;
    // a temporary
    let mut t: usize;
    #[cfg(target_pointer_width = "64")]
    {
        // If the upper 32 bits of `x` are not all 0, `t` is set to `1 << 5`, otherwise
        // `t` is set to 0.
        t = ((x >= (1 << 32)) as usize) << 5;
        // If `t` was set to `1 << 5`, then the upper 32 bits are shifted down for the
        // next step to process.
        x >>= t;
        // If `t` was set to `1 << 5`, then we subtract 32 from the number of potential
        // leading zeros
        z -= t;
    }
    #[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))]
    {
        t = ((x >= (1 << 16)) as usize) << 4;
        x >>= t;
        z -= t;
    }
    t = ((x >= (1 << 8)) as usize) << 3;
    x >>= t;
    z -= t;
    t = ((x >= (1 << 4)) as usize) << 2;
    x >>= t;
    z -= t;
    t = ((x >= (1 << 2)) as usize) << 1;
    x >>= t;
    z -= t;
    t = (x >= (1 << 1)) as usize;
    x >>= t;
    z -= t;
    // All bits except the LSB are guaranteed to be zero for this final bisection step.
    // If `x != 0` then `x == 1` and subtracts one potential zero from `z`.
    z - x
}

fn usize_leading_zeros_default(x: usize) -> usize {
    let mut x = x;
    // the number of potential leading zeros
    let mut z = usize::MAX.count_ones() as usize;
    // a temporary
    let mut t: usize;
    #[cfg(target_pointer_width = "64")]
    {
        t = x >> 32;
        if t != 0 {
            z -= 32;
            x = t;
        }
    }
    #[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))]
    {
        t = x >> 16;
        if t != 0 {
            z -= 16;
            x = t;
        }
    }
    t = x >> 8;
    if t != 0 {
        z -= 8;
        x = t;
    }
    t = x >> 4;
    if t != 0 {
        z -= 4;
        x = t;
    }
    t = x >> 2;
    if t != 0 {
        z -= 2;
        x = t;
    }
    // the last two bisections are combined into one conditional
    t = x >> 1;
    if t != 0 {
        z - 2
    } else {
        z - x
    }
}

pub fn expected_lz(x: usize) -> usize {
    if cfg!(any(target_arch = "riscv32", target_arch = "riscv64")) {
        usize_leading_zeros_riscv(x)
    } else {
        usize_leading_zeros_default(x)
    }
}

The expected_lz function is equivalent to what compiler-builtins does, while actual_lz show what a plain leading_zeros compiles down to. lz_u128 demonstrates another issue that I have known for some time, but never got fixed. I think I remember nagisa opening an LLVM issue for it, but can't find the link.

The worst behavior is on riscv32i-unknown-none-elf where expected_lz compiles down to the expected branchless 27 assembly instructions, but actual_lz has more assembly instructions and includes a call to __mulsi3 which probably results in it being about as expensive to compute as a full integer division. lz_u128 is truly horrifying.

riscv64gc-unknown-none-elf does roughly the same thing, and while the multiplication could be expected to run much faster, expected_lz is still much better. Whatever inlining lead to lz_u128 is not good. The correct lz_u128 looks like what is produced by:

pub fn lz_u128_expected(x: u128) -> usize {
    let mut x = x;
    let mut z = 0;
    if ((x >> 64) as u64) != 0 {
        z += 64;
    } else {
        x >>= 64;
    }
    // note: `usize` needs to have bitwidth 64 here
    z + expected_lz(x as usize) as usize
}

thumbv8m.base-none-eabi (I decided to use this as a non-riscv example that has no hardware clz) uses calls to __clzsi2 instead of rolling its own thing. Take a look at the number of __clzsi2 calls and extra instructions lz_u128 makes vs. what the lz_u128_expected I coded above shows should be possible. I suspect that instead of two separate bugs, there is one bug where LLVM defines larger bitwidth clz's in terms of small clz's in such a way that it generates horrible code when smaller clz's get inlined upwards into larger clz's. Whenever LLVM decides to roll its own code instead of __clzsi2, it starts with some small clz on an integer like u8, and by the time u64 is reached the code is extremely garbled.

aarch64-pc-windows-msvc and x86_64 call hardware clz and do the correct extension for lz_u128. expected_lz can be ignored, since __clzsi2 is only used when a hardware clz cannot be used.

It is important for the embedded architectures to have a better clz, because the performance of that function is the base of many other kinds of arithmetic emulation such as soft floats.

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.A-codegenArea: Code generationC-bugCategory: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions