Skip to content

ptr::align_offset generates surprisingly bad code #75579

Closed

Description

I happened to look at more closely at the generated assembly for some code that uses align_offset, and noticed... align_offset does not compile as efficiently as one might hope for the case of "align pointer to size_of::<usize>()"

For example (ignore that I omit handling of align_offset's error return value):

pub unsafe fn align_offset(p: *const u8) -> *const u8 {
    p.add(p.align_offset(core::mem::size_of::<usize>()))
}

compiles to

example::align_offset:
        movl    %edi, %ecx
        andl    $7, %ecx
        movl    $8, %eax
        subq    %rcx, %rax
        testq   %rcx, %rcx
        cmoveq  %rcx, %rax
        addq    %rdi, %rax
        retq

Whereas performing the same alignment manually (forgive my convoluted way of doing this, my usual pattern is very slightly different and I don't have a memorized idiom for this without going through usize, since, well, I figured I just wanted to use align_offset):

pub unsafe fn manual_align_offset(p: *const u8) -> *const u8 {
    // IIRC just doing `arithmetic(p as usize) as *const u8` makes LLVM sad
    let aligned = ((p as usize + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)) as isize;
    p.offset(aligned - (p as isize))
}

compiles to

example::manual_align_offset:
        leaq    7(%rdi), %rax
        andq    $-8, %rax
        retq

Which is substantially better along a variety of metrics, including but not limited to actual runtime.

Taking a look at the source for align_offset reveals that it uh, well it does some stuff.

/// Any questions go to @nagisa.
#[lang = "align_offset"]
pub(crate) unsafe fn align_offset<T: Sized>(p: *const T, a: usize) -> usize {
/// Calculate multiplicative modular inverse of `x` modulo `m`.
///
/// This implementation is tailored for align_offset and has following preconditions:
///
/// * `m` is a power-of-two;
/// * `x < m`; (if `x ≥ m`, pass in `x % m` instead)
///
/// Implementation of this function shall not panic. Ever.
#[inline]
fn mod_inv(x: usize, m: usize) -> usize {
/// Multiplicative modular inverse table modulo 2⁴ = 16.
///
/// Note, that this table does not contain values where inverse does not exist (i.e., for
/// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.)
const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
/// Modulo for which the `INV_TABLE_MOD_16` is intended.
const INV_TABLE_MOD: usize = 16;
/// INV_TABLE_MOD²
const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;
let table_inverse = INV_TABLE_MOD_16[(x & (INV_TABLE_MOD - 1)) >> 1] as usize;
if m <= INV_TABLE_MOD {
table_inverse & (m - 1)
} else {
// We iterate "up" using the following formula:
//
// $$ xy ≡ 1 (mod 2ⁿ) → xy (2 - xy) ≡ 1 (mod 2²ⁿ) $$
//
// until 2²ⁿ ≥ m. Then we can reduce to our desired `m` by taking the result `mod m`.
let mut inverse = table_inverse;
let mut going_mod = INV_TABLE_MOD_SQUARED;
loop {
// y = y * (2 - xy) mod n
//
// Note, that we use wrapping operations here intentionally – the original formula
// uses e.g., subtraction `mod n`. It is entirely fine to do them `mod
// usize::MAX` instead, because we take the result `mod n` at the end
// anyway.
inverse = inverse.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inverse)));
if going_mod >= m {
return inverse & (m - 1);
}
going_mod = going_mod.wrapping_mul(going_mod);
}
}
}
let stride = mem::size_of::<T>();
let a_minus_one = a.wrapping_sub(1);
let pmoda = p as usize & a_minus_one;
if pmoda == 0 {
// Already aligned. Yay!
return 0;
}
if stride <= 1 {
return if stride == 0 {
// If the pointer is not aligned, and the element is zero-sized, then no amount of
// elements will ever align the pointer.
!0
} else {
a.wrapping_sub(pmoda)
};
}
let smoda = stride & a_minus_one;
// SAFETY: a is power-of-two so cannot be 0. stride = 0 is handled above.
let gcdpow = unsafe { intrinsics::cttz_nonzero(stride).min(intrinsics::cttz_nonzero(a)) };
let gcd = 1usize << gcdpow;
if p as usize & (gcd.wrapping_sub(1)) == 0 {
// This branch solves for the following linear congruence equation:
//
// ` p + so = 0 mod a `
//
// `p` here is the pointer value, `s` - stride of `T`, `o` offset in `T`s, and `a` - the
// requested alignment.
//
// With `g = gcd(a, s)`, and the above asserting that `p` is also divisible by `g`, we can
// denote `a' = a/g`, `s' = s/g`, `p' = p/g`, then this becomes equivalent to:
//
// ` p' + s'o = 0 mod a' `
// ` o = (a' - (p' mod a')) * (s'^-1 mod a') `
//
// The first term is "the relative alignment of `p` to `a`" (divided by the `g`), the second
// term is "how does incrementing `p` by `s` bytes change the relative alignment of `p`" (again
// divided by `g`).
// Division by `g` is necessary to make the inverse well formed if `a` and `s` are not
// co-prime.
//
// Furthermore, the result produced by this solution is not "minimal", so it is necessary
// to take the result `o mod lcm(s, a)`. We can replace `lcm(s, a)` with just a `a'`.
let a2 = a >> gcdpow;
let a2minus1 = a2.wrapping_sub(1);
let s2 = smoda >> gcdpow;
let minusp2 = a2.wrapping_sub(pmoda >> gcdpow);
return (minusp2.wrapping_mul(mod_inv(s2, a2))) & a2minus1;
}
// Cannot be aligned at all.
usize::MAX
}
... Anyway I'm just gonna take at face value that this all is needed since it tracks that sometimes you might need to GCD... And hell, maybe mine misses a case like "p + USIZE - 1 wraps around the address space but the aligned value wouldn't).

Anyway IIUC align_offset is really considered the way forward for all pointer aligning, as miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned... (I have Opinions on this but I'll spare you from a rant).

So, for that and a lot of reasons, I think we'd like the codegen for align_offset to look a lot closer to what I provided at opt-level=3, even if it means special-casing when size_of::<T> == 1... (I mean, I'd also love for it not to generate the whole code mountain in debug builds, but one thing at a time I guess).

Anyway, the function's documentation comment tells me that @nagisa has taken up the sacred burden of "keeper of align_offset's secrets"... I have questions: is this fixable? And if not, is there an interface that we can provide that lets us produce good code here? Am I missing something?

P.S. Godbolt link: https://godbolt.org/z/388Enf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    C-bugCategory: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions