Description
Hello, after some discussion on Stack Overflow, it was suggested that I file an issue here.
The following function performs unexpectedly poorly with N = 2.
fn fold<const N: usize>(v: Vec<Array<N>>) -> Vec<Array<N>> {
    let result = v.iter().map(|a1| {
        v.iter().fold(Array::default(), |acc, a2| {
            let d = *a2 - *a1;
            acc + d
        })
    });
    result.collect()
}
Array is a simple wrapper implementing Add and Sub around an array of f32s. The results are the same without that wrapper.
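For readers who do not want to open the repository, here is a minimal sketch of what such a wrapper could look like, together with the function above. This is a guess at the shape of the type, not the exact code from the benchmark: the tuple-struct layout, the derives and the element-wise loops are all assumptions made for illustration.

```rust
use std::ops::{Add, Sub};

// Hypothetical reconstruction of the wrapper described above:
// a newtype around [f32; N]. The real definition lives in the linked repository.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Array<const N: usize>([f32; N]);

impl<const N: usize> Default for Array<N> {
    // All-zero array, used as the initial accumulator of the fold.
    fn default() -> Self {
        Array([0.0; N])
    }
}

impl<const N: usize> Add for Array<N> {
    type Output = Self;

    // Element-wise addition.
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..N {
            out[i] += rhs.0[i];
        }
        Array(out)
    }
}

impl<const N: usize> Sub for Array<N> {
    type Output = Self;

    // Element-wise subtraction.
    fn sub(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..N {
            out[i] -= rhs.0[i];
        }
        Array(out)
    }
}

// Same logic as the function in this report: for every element a1,
// sum the differences (a2 - a1) over all elements a2.
fn fold<const N: usize>(v: Vec<Array<N>>) -> Vec<Array<N>> {
    let result = v.iter().map(|a1| {
        v.iter().fold(Array::<N>::default(), |acc, a2| {
            let d = *a2 - *a1;
            acc + d
        })
    });
    result.collect()
}
```

With this sketch, calling fold on the two elements [1.0, 2.0] and [3.0, 5.0] gives [2.0, 3.0] and [-2.0, -3.0], since each output is the sum of the differences to every element of the input.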
Here is a graph generated with criterion of the benchmark of that function:
Someone on the Stack Overflow discussion suggested that this was because:
"The shl, shr and or operations on rdx and rsi suggest that for N = 2 the two floats are stored in one 64 bit general purpose register, whereas in the other cases the value of a1 is persisted in N separate xmm registers".
Another user suggested that this misoptimization is caused by opt-level=3, which would place the problem on the LLVM side.
Here is the assembly code where a user marked important areas.
Here is a repository of the full code for the benchmark.
This was observed on both stable and nightly.