Description
So, I was trying to sum a slice of f32s quickly on stable Rust.
But like pretty much all floating-point reductions, the naive algorithm (floats.iter().sum::<f32>()) does not autovectorize, because its "natural" summation order introduces a serial dependency between successive partial sums. This makes SIMD parallelization illegal in the eyes of a compiler that guarantees bitwise floating-point reproducibility, as rustc does. Fair enough.
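(As a quick illustration of why this reordering is not semantics-preserving: reassociating float additions can change the result, so a compiler that promises bitwise reproducibility cannot do it behind my back.)

// Float addition is not associative: summing in a different order can
// produce a different result, which is why the compiler refuses to reorder.
let (a, b, c) = (1e30f32, -1e30f32, 1.0f32);
assert_eq!((a + b) + c, 1.0); // the "natural" left-to-right order
assert_eq!(a + (b + c), 0.0); // a vector-friendly reordering changes the sum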
I saw this as a good motivation to move to explicit SIMD programming, but did not want to lose hardware portability (or, more precisely, wanted to keep it easy), so I tried to see how close I could get to stdsimd on stable Rust with only autovectorization and a pinch of hardware-specific vector size tuning.
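For reference, the SimdF32 type used in the snippets below is not spelled out here; think of it as something along these lines (a minimal sketch only, with the lane count and alignment as placeholders for the per-target tuning; the assembly shown further down corresponds to 4-lane SSE vectors):

// Hypothetical sketch of the SimdF32 wrapper assumed by the snippets below:
// a fixed-width, suitably aligned array of f32 whose element-wise += is
// simple enough for the compiler to autovectorize. The lane count and
// alignment are per-target tuning knobs, not necessarily the values I used.
#[derive(Clone, Copy, Default)]
#[repr(align(16))]
struct SimdF32([f32; 4]);

impl core::ops::AddAssign for SimdF32 {
    fn add_assign(&mut self, other: Self) {
        for (dst, &src) in self.0.iter_mut().zip(other.0.iter()) {
            *dst += src;
        }
    }
}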
Some hours of trial and error later, I reached a reasonably satisfactory state. In particular, the core of the algorithm...
// Perform concurrent SIMD accumulation
let mut accumulators = [SimdF32::default(); CONCURRENCY];
for chunk in chunks {
    for (accumulator, &vector) in accumulators.iter_mut().zip(chunk) {
        *accumulator += vector;
    }
}
// Merge the SIMD accumulators into one
let mut stride = CONCURRENCY / 2;
while stride > 0 {
    for i in 0..stride {
        accumulators[i] += accumulators[i + stride];
    }
    stride /= 2;
}
let mut accumulator = accumulators[0];
...translated pretty much into the assembly that I would have written by hand, which made me very happy...
.LBB0_17:
addps (%rdi), %xmm1
addps 16(%rdi), %xmm3
addps 32(%rdi), %xmm2
addps 48(%rdi), %xmm4
addq $64, %rdi
addq $4, %rcx
jne .LBB0_17
jmp .LBB0_18
...
.LBB0_18:
addps %xmm4, %xmm3
addps %xmm2, %xmm1
addps %xmm3, %xmm1
Nitpicky as I am, however, I was still a little bit unhappy about the part afterwards, which introduced a chain of serial dependencies that could become a bit long if I were to use a lot of accumulators...
for &vector in remainder {
    accumulator += vector;
}
.LBB0_31:
addps (%rcx), %xmm1
addq $16, %rcx
incq %rax
jne .LBB0_31
...because I knew that in this particular case, there should be an easy way to avoid that, which is to interleave the SIMD accumulator merging with the summation of remaining data.
// Reinterpret input as aligned SIMD vectors + some extra floats.
let (peel, mut vectors, tail) = unsafe { input.align_to::<SimdF32>() };
// Perform concurrent SIMD accumulation, starting at maximal concurrency and
// decreasing once the number of input vectors gets too small
let mut accumulators = [SimdF32::default(); MAX_CONCURRENCY];
let mut concurrency = MAX_CONCURRENCY;
while concurrency > 0 {
    // Set up accumulators and chunked SIMD data according to current concurrency
    let accumulators = &mut accumulators[..concurrency];
    let chunks = vectors.chunks_exact(concurrency);
    let remainder = chunks.remainder();
    // Perform concurrent SIMD accumulation
    for chunk in chunks {
        for (accumulator, &vector) in accumulators.iter_mut().zip(chunk) {
            *accumulator += vector;
        }
    }
    // Halve SIMD concurrency to accumulate remaining data
    concurrency /= 2;
    for i in 0..concurrency {
        accumulators[i] += accumulators[i + concurrency];
    }
    vectors = remainder;
}
let accumulator = accumulators[0];
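(For completeness, the excerpt stops at the SIMD accumulator; the actual function still has to reduce it to a scalar and fold in the peel/tail elements left over by align_to(), roughly as follows, assuming the hypothetical SimdF32 layout sketched earlier:)

// Final scalar reduction (sketch): sum the lanes of the SIMD accumulator,
// then add the unaligned head and tail elements that align_to() set aside.
let mut sum = accumulator.0.iter().sum::<f32>();
sum += peel.iter().sum::<f32>();
sum += tail.iter().sum::<f32>();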
However, much to my surprise, performing this algorithmic optimization leads rustc to heavily pessimize the inner loop code by spilling all but one accumulator on every iteration:
.LBB0_16:
addps (%rax), %xmm1
movaps 16(%rsp), %xmm2
movaps 32(%rsp), %xmm3
movaps 48(%rsp), %xmm4
addps 16(%rax), %xmm2
movaps %xmm2, 16(%rsp)
addps 32(%rax), %xmm3
addps 48(%rax), %xmm4
movaps %xmm3, 32(%rsp)
movaps %xmm4, 48(%rsp)
addq $64, %rax
addq $4, %rdi
jne .LBB0_16
Why would that happen? The only explanation I have is that rustc is somehow unable to prove that the accumulators slice does not alias with the vectors/remainder slices, and thus spills to memory just in case changes to the accumulators could affect the input of the next computations.
But this sounds like a bug to me: given that I have an &mut to the accumulators, my understanding is that rustc should be able to infer that no other code can see the accumulators, and thus they can remain resident in registers for the entire duration of the accumulation loop.
Can someone with more knowledge of how rustc and LLVM do their optimization magic cross-check this and tell if my understanding is correct or if the register spills are truly necessary to preserve the semantics of my code?
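(For what it's worth, if the aliasing hypothesis is right, the workaround I would try is to copy the live accumulators into local bindings around the hot loop, so the additions never go through a &mut [_] that could conceivably alias the input. An untested sketch, shown for a single concurrency level only:)

// Untested workaround sketch, assuming the aliasing hypothesis above: hoist
// the accumulators into locals before the hot loop and write them back after,
// shown here only for concurrency == 4.
if concurrency == 4 {
    let (mut a0, mut a1, mut a2, mut a3) =
        (accumulators[0], accumulators[1], accumulators[2], accumulators[3]);
    for chunk in vectors.chunks_exact(4) {
        a0 += chunk[0];
        a1 += chunk[1];
        a2 += chunk[2];
        a3 += chunk[3];
    }
    accumulators[0] = a0;
    accumulators[1] = a1;
    accumulators[2] = a2;
    accumulators[3] = a3;
}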
Also, this is on stable release 1.57.0. On beta and nightly, the generated code becomes even stranger:
.LBB0_16:
movq (%rdx), %rcx
movq 8(%rdx), %rbx
movd %ebx, %xmm5
shrq $32, %rbx
movd %ecx, %xmm6
shrq $32, %rcx
movd %ecx, %xmm7
punpckldq %xmm6, %xmm7
movd %ebx, %xmm6
punpckldq %xmm5, %xmm6
punpcklqdq %xmm7, %xmm6
addps %xmm6, %xmm2
addps 16(%rdx), %xmm1
addps 32(%rdx), %xmm4
addps 48(%rdx), %xmm3
addq $64, %rdx
addq $4, %rax
jne .LBB0_16
Here, rustc generates the code I would expect for the last three accumulators, but for the first accumulator it goes crazy and emits the least efficient SSE load I have ever seen, reassembling the 128-bit vector out of scalar loads and shuffles (movq/movd/punpck) instead of reading it with a single 16-byte load.
So it seems the aliasing issue got resolved, but was replaced by another issue beyond my comprehension... Here again, compiler expert help would be useful.