I happened to look more closely at the generated assembly for some code that uses align_offset, and noticed that align_offset does not compile as efficiently as one might hope for the case of "align pointer to size_of::<usize>()".
For example (ignore that I omit handling of align_offset's error return value):
pub unsafe fn align_offset(p: *const u8) -> *const u8 {
    p.add(p.align_offset(core::mem::size_of::<usize>()))
}
compiles to
example::align_offset:
        movl    %edi, %ecx
        andl    $7, %ecx
        movl    $8, %eax
        subq    %rcx, %rax
        testq   %rcx, %rcx
        cmoveq  %rcx, %rax
        addq    %rdi, %rax
        retq
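For what it's worth, the branchy sequence above is computing `(a - p % a) % a` via a cmov; for a power-of-two `a` that whole dance collapses to a single negate-and-mask. A quick equivalence check of the two formulas (my own sketch, not anything from the standard library):

```rust
// Two ways to compute "how many bytes until `p` is 8-byte aligned".
// `offset_branchy` mirrors what the generated assembly above does;
// `offset_masked` is the negate-and-mask that one would hope for.

fn offset_branchy(p: usize) -> usize {
    let r = p % 8;
    if r == 0 { 0 } else { 8 - r }
}

fn offset_masked(p: usize) -> usize {
    // (-p) mod 8; valid because 8 is a power of two.
    p.wrapping_neg() & 7
}

fn main() {
    for p in 0..64usize {
        assert_eq!(offset_branchy(p), offset_masked(p));
    }
    println!("equivalent for all tested addresses");
}
```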
Whereas performing the same alignment manually (forgive my convoluted way of doing this; my usual pattern is very slightly different, and I don't have a memorized idiom for this that avoids going through usize, since, well, I figured I'd just use align_offset):
const USIZE_SIZE: usize = core::mem::size_of::<usize>();

pub unsafe fn manual_align_offset(p: *const u8) -> *const u8 {
    // IIRC just doing `arithmetic(p as usize) as *const u8` makes LLVM sad
    let aligned = ((p as usize + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)) as isize;
    p.offset(aligned - (p as isize))
}
compiles to
example::manual_align_offset:
        leaq    7(%rdi), %rax
        andq    $-8, %rax
        retq
Which is substantially better along a variety of metrics, including but not limited to actual runtime.
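As a sanity check that the masked version actually does what it claims, here's a quick harness (my own sketch; it restates the function so it runs standalone, and assumes an 8-byte usize only in spirit, the asserts hold for any width):

```rust
const USIZE_SIZE: usize = core::mem::size_of::<usize>();

pub unsafe fn manual_align_offset(p: *const u8) -> *const u8 {
    let aligned = ((p as usize + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)) as isize;
    p.offset(aligned - (p as isize))
}

fn main() {
    let buf = [0u8; 64];
    for i in 0..32 {
        let p = unsafe { buf.as_ptr().add(i) };
        let q = unsafe { manual_align_offset(p) };
        // Result is usize-aligned and less than USIZE_SIZE bytes past `p`.
        assert_eq!(q as usize % USIZE_SIZE, 0);
        assert!((q as usize) - (p as usize) < USIZE_SIZE);
    }
    println!("all offsets in range and aligned");
}
```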
Taking a look at the source for align_offset reveals that it, uh, well, it does some stuff.
/// Any questions go to @nagisa.
#[lang = "align_offset"]
pub(crate) unsafe fn align_offset<T: Sized>(p: *const T, a: usize) -> usize {
    /// Calculate multiplicative modular inverse of `x` modulo `m`.
    ///
    /// This implementation is tailored for align_offset and has following preconditions:
    ///
    /// * `m` is a power-of-two;
    /// * `x < m`; (if `x ≥ m`, pass in `x % m` instead)
    ///
    /// Implementation of this function shall not panic. Ever.
    #[inline]
    fn mod_inv(x: usize, m: usize) -> usize {
        /// Multiplicative modular inverse table modulo 2⁴ = 16.
        ///
        /// Note, that this table does not contain values where inverse does not exist (i.e., for
        /// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.)
        const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
        /// Modulo for which the `INV_TABLE_MOD_16` is intended.
        const INV_TABLE_MOD: usize = 16;
        /// INV_TABLE_MOD²
        const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;

        let table_inverse = INV_TABLE_MOD_16[(x & (INV_TABLE_MOD - 1)) >> 1] as usize;
        if m <= INV_TABLE_MOD {
            table_inverse & (m - 1)
        } else {
            // We iterate "up" using the following formula:
            //
            // $$ xy ≡ 1 (mod 2ⁿ) → xy (2 - xy) ≡ 1 (mod 2²ⁿ) $$
            //
            // until 2²ⁿ ≥ m. Then we can reduce to our desired `m` by taking the result `mod m`.
            let mut inverse = table_inverse;
            let mut going_mod = INV_TABLE_MOD_SQUARED;
            loop {
                // y = y * (2 - xy) mod n
                //
                // Note, that we use wrapping operations here intentionally – the original formula
                // uses e.g., subtraction `mod n`. It is entirely fine to do them `mod
                // usize::MAX` instead, because we take the result `mod n` at the end
                // anyway.
                inverse = inverse.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inverse)));
                if going_mod >= m {
                    return inverse & (m - 1);
                }
                going_mod = going_mod.wrapping_mul(going_mod);
            }
        }
    }

    let stride = mem::size_of::<T>();
    let a_minus_one = a.wrapping_sub(1);
    let pmoda = p as usize & a_minus_one;

    if pmoda == 0 {
        // Already aligned. Yay!
        return 0;
    }

    if stride <= 1 {
        return if stride == 0 {
            // If the pointer is not aligned, and the element is zero-sized, then no amount of
            // elements will ever align the pointer.
            !0
        } else {
            a.wrapping_sub(pmoda)
        };
    }

    let smoda = stride & a_minus_one;
    // SAFETY: a is power-of-two so cannot be 0. stride = 0 is handled above.
    let gcdpow = unsafe { intrinsics::cttz_nonzero(stride).min(intrinsics::cttz_nonzero(a)) };
    let gcd = 1usize << gcdpow;

    if p as usize & (gcd.wrapping_sub(1)) == 0 {
        // This branch solves for the following linear congruence equation:
        //
        // ` p + so = 0 mod a `
        //
        // `p` here is the pointer value, `s` - stride of `T`, `o` offset in `T`s, and `a` - the
        // requested alignment.
        //
        // With `g = gcd(a, s)`, and the above asserting that `p` is also divisible by `g`, we can
        // denote `a' = a/g`, `s' = s/g`, `p' = p/g`, then this becomes equivalent to:
        //
        // ` p' + s'o = 0 mod a' `
        // ` o = (a' - (p' mod a')) * (s'^-1 mod a') `
        //
        // The first term is "the relative alignment of `p` to `a`" (divided by the `g`), the second
        // term is "how does incrementing `p` by `s` bytes change the relative alignment of `p`"
        // (again divided by `g`).
        // Division by `g` is necessary to make the inverse well formed if `a` and `s` are not
        // co-prime.
        //
        // Furthermore, the result produced by this solution is not "minimal", so it is necessary
        // to take the result `o mod lcm(s, a)`. We can replace `lcm(s, a)` with just a `a'`.
        let a2 = a >> gcdpow;
        let a2minus1 = a2.wrapping_sub(1);
        let s2 = smoda >> gcdpow;
        let minusp2 = a2.wrapping_sub(pmoda >> gcdpow);
        return (minusp2.wrapping_mul(mod_inv(s2, a2))) & a2minus1;
    }

    // Cannot be aligned at all.
    usize::MAX
}
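The interesting part of the above, to me, is the `mod_inv` lifting: starting from an inverse mod 16 (the table), each application of `y ← y·(2 − x·y)` doubles the number of correct low bits. A standalone sketch of just that step, using `u64` and hypothetical inputs, to convince myself the formula works:

```rust
/// One Newton-style lifting step: if x*y ≡ 1 (mod 2ⁿ),
/// then x*lift(x, y) ≡ 1 (mod 2²ⁿ).
fn lift(x: u64, y: u64) -> u64 {
    // Wrapping ops are fine for the same reason as in the stdlib source:
    // we reduce mod a power of two at the end anyway.
    y.wrapping_mul(2u64.wrapping_sub(x.wrapping_mul(y)))
}

fn main() {
    let x: u64 = 3;
    let mut y: u64 = 11; // 3⁻¹ mod 16, straight out of INV_TABLE_MOD_16
    let mut modulus: u64 = 16;
    while modulus < (1u64 << 32) {
        y = lift(x, y);
        modulus = modulus.wrapping_mul(modulus); // 16 → 256 → 65536 → 2³²
        assert_eq!(x.wrapping_mul(y) % modulus, 1);
    }
    println!("3⁻¹ mod 2³² = {}", y % modulus);
}
```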
... Anyway, I'm just gonna take at face value that all of this is needed, since it tracks that sometimes you might need to GCD... And hell, maybe mine misses a case like "`p + USIZE_SIZE - 1` wraps around the address space, but the aligned value wouldn't have".
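Regarding that wraparound worry: the failure mode is real in the abstract, since the `+ USIZE_SIZE - 1` in my version can overflow near the top of the address space. A quick demonstration on plain integers (hypothetical addresses, not real pointers; my own sketch):

```rust
fn main() {
    const USIZE_SIZE: usize = core::mem::size_of::<usize>();

    // An address a few bytes below the top of the address space.
    let addr = usize::MAX - 2;

    // The `+` in the naive round-up overflows (and would panic in a debug
    // build), so spell it out with wrapping ops to see where it lands:
    let rounded = addr.wrapping_add(USIZE_SIZE - 1) & !(USIZE_SIZE - 1);
    assert_eq!(rounded, 0); // wrapped all the way around to null

    // The offset form at least stays small:
    let offset = addr.wrapping_neg() & (USIZE_SIZE - 1);
    assert_eq!(offset, 3); // but addr + 3 overflows too: there simply is no
                           // in-range aligned address above `addr` here
}
```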
Anyway, IIUC align_offset is really considered the way forward for all pointer aligning, as Miri will throw your code straight into the trash if it catches you dereferencing a pointer that you manually aligned... (I have Opinions on this, but I'll spare you the rant.)
So, for that and a lot of other reasons, I think we'd like the codegen for align_offset to look a lot closer to what I provided at opt-level=3, even if it means special-casing when size_of::<T>() == 1... (I mean, I'd also love for it not to generate the whole code mountain in debug builds, but one thing at a time, I guess.)
Anyway, the function's documentation comment tells me that @nagisa has taken up the sacred burden of "keeper of align_offset's secrets"... I have questions: is this fixable? And if not, is there an interface we could provide that lets us produce good code here? Am I missing something?
P.S. Godbolt link: https://godbolt.org/z/388Enf