Description
I have the following simple SIMD-powered function which tests whether points are inside one of the given bounding boxes:
use std::arch::x86_64::*;

// Number of __m256i vectors per coordinate array; the concrete value here is
// only an example, the issue is not specific to it.
const N: usize = 8;

pub unsafe fn foo(
    x: &[__m256i; N],
    y: &[__m256i; N],
    z: &[__m256i; N],
    bboxes: &[[__m256i; 6]],
) -> [__m256i; N] {
    let mut res = [_mm256_setzero_si256(); N];
    for bbox in bboxes {
        for i in 0..N {
            // Inside the box along x: bbox[0] < x[i] < bbox[1]
            let tx = _mm256_and_si256(
                _mm256_cmpgt_epi32(x[i], bbox[0]),
                _mm256_cmpgt_epi32(bbox[1], x[i]),
            );
            // Inside along y: bbox[2] < y[i] < bbox[3]
            let ty = _mm256_and_si256(
                _mm256_cmpgt_epi32(y[i], bbox[2]),
                _mm256_cmpgt_epi32(bbox[3], y[i]),
            );
            let t = _mm256_and_si256(tx, ty);
            // Inside along z: bbox[4] < z[i] < bbox[5]
            let tz = _mm256_and_si256(
                _mm256_cmpgt_epi32(z[i], bbox[4]),
                _mm256_cmpgt_epi32(bbox[5], z[i]),
            );
            let t = _mm256_and_si256(t, tz);
            // A point counts as a hit if it is inside any of the boxes.
            res[i] = _mm256_or_si256(res[i], t);
        }
    }
    res
}
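For completeness, a call site might look roughly like the sketch below; the wrapper name is made up, and it assumes either that the crate is built with AVX2 enabled (e.g. -C target-feature=+avx2) or that foo is annotated with #[target_feature(enable = "avx2")]:

// Illustrative call site (the wrapper name is hypothetical): check for AVX2 at
// runtime before calling the unsafe intrinsic-based function.
fn classify(
    x: &[__m256i; N],
    y: &[__m256i; N],
    z: &[__m256i; N],
    bboxes: &[[__m256i; 6]],
) -> [__m256i; N] {
    assert!(std::is_x86_feature_detected!("avx2"));
    unsafe { foo(x, y, z, bboxes) }
}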
By inspecting the generated assembly we can see that, for some reason, the function copies the coordinates to the stack and reads them back from there on each iteration instead of loading them through the input pointers. The same behavior can be observed for a function which processes coordinate slices. This caching looks quite redundant, especially considering that noalias is enabled (i.e. the compiler should know that the memory holding the coordinates cannot change during execution of the function).
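The slice-processing variant mentioned above might look roughly like this (this is only a sketch, not the exact code behind the second link; it reuses the import and N from the snippet above, and foo_slices is an illustrative name):

// Hypothetical slice-based variant: same containment test, but the
// coordinates come in as plain &[__m256i] slices instead of fixed-size arrays.
pub unsafe fn foo_slices(
    x: &[__m256i],
    y: &[__m256i],
    z: &[__m256i],
    bboxes: &[[__m256i; 6]],
    res: &mut [__m256i],
) {
    for bbox in bboxes {
        for i in 0..res.len() {
            let tx = _mm256_and_si256(
                _mm256_cmpgt_epi32(x[i], bbox[0]),
                _mm256_cmpgt_epi32(bbox[1], x[i]),
            );
            let ty = _mm256_and_si256(
                _mm256_cmpgt_epi32(y[i], bbox[2]),
                _mm256_cmpgt_epi32(bbox[3], y[i]),
            );
            let tz = _mm256_and_si256(
                _mm256_cmpgt_epi32(z[i], bbox[4]),
                _mm256_cmpgt_epi32(bbox[5], z[i]),
            );
            let t = _mm256_and_si256(_mm256_and_si256(tx, ty), tz);
            res[i] = _mm256_or_si256(res[i], t);
        }
    }
}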
It looks like LLVM correctly hoists the coordinate loads out of the inner loop using its unlimited supply of virtual registers. That is exactly the behavior we want when there are enough physical registers, but when there are not, it spills the virtual register values to the stack instead of re-reading them from their original locations.
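A hand-written variant that expresses the "re-read from the original location" behavior could look like the sketch below. This is only an illustration of the desired codegen, not a verified workaround: foo_reload is not from the issue, it reuses the import and N from the first snippet, and nothing guarantees the optimizer will actually keep the loads inside the inner loop:

// Hypothetical variant that loads the coordinates through the original
// pointers on every inner-loop iteration. Since the inputs are noalias and
// immutable, re-reading them is what we would like the spilled version to do
// instead of round-tripping through the stack.
pub unsafe fn foo_reload(
    x: &[__m256i; N],
    y: &[__m256i; N],
    z: &[__m256i; N],
    bboxes: &[[__m256i; 6]],
) -> [__m256i; N] {
    let mut res = [_mm256_setzero_si256(); N];
    for bbox in bboxes {
        for i in 0..N {
            // Explicit (unaligned) loads from the caller's buffers.
            let xi = _mm256_loadu_si256(x.as_ptr().add(i));
            let yi = _mm256_loadu_si256(y.as_ptr().add(i));
            let zi = _mm256_loadu_si256(z.as_ptr().add(i));
            let tx = _mm256_and_si256(
                _mm256_cmpgt_epi32(xi, bbox[0]),
                _mm256_cmpgt_epi32(bbox[1], xi),
            );
            let ty = _mm256_and_si256(
                _mm256_cmpgt_epi32(yi, bbox[2]),
                _mm256_cmpgt_epi32(bbox[3], yi),
            );
            let tz = _mm256_and_si256(
                _mm256_cmpgt_epi32(zi, bbox[4]),
                _mm256_cmpgt_epi32(bbox[5], zi),
            );
            let t = _mm256_and_si256(_mm256_and_si256(tx, ty), tz);
            res[i] = _mm256_or_si256(res[i], t);
        }
    }
    res
}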
On Rust 1.51 the code from the first link does not have this issue, but the code from the second link still does.
UPD: See this comment for an additional example affecting cryptographic code.