I tried this code:
pub fn vae_dec_linear(latent: &[f32; 4]) -> [f32; 3] {
    let weights: [[f32; 4]; 3] = [
        [49.5210, 29.0283, -23.9673, -39.4981],
        [41.1373, 42.4951, 24.7349, -50.8279],
        [40.2919, 18.9304, 30.0236, -81.9976],
    ];
    // let bias = [99.9368, 99.8421, 99.5384];
    let bias = [0.0, 0.0, 0.0];
    affine_transform(&weights, &bias, &latent)
}

pub fn affine_transform<
    T: Copy + std::ops::Add<Output = T> + std::ops::Mul<Output = T>,
    const IN: usize,
    const OUT: usize,
>(
    weights: &[[T; IN]; OUT],
    bias: &[T; OUT],
    vec: &[T; IN],
) -> [T; OUT] {
    std::array::from_fn(|i| (0..OUT).fold(bias[i], |acc, j| acc + vec[i] * weights[i][j]))
}
I expected to see this happen: Literally nothing :) That is, no instructions emitted for adding the all-zero bias.
Instead, this happened:
example::vae_dec_linear:
        mov     rax, rdi
        vmovss  xmm0, dword ptr [rsi + 8]
        vmulss  xmm1, xmm0, dword ptr [rip + .LCPI0_0]
        vxorps  xmm2, xmm2, xmm2 ;; here
        vaddss  xmm1, xmm1, xmm2
        vmulss  xmm2, xmm0, dword ptr [rip + .LCPI0_1]
        vmulss  xmm0, xmm0, dword ptr [rip + .LCPI0_2]
        vaddss  xmm1, xmm2, xmm1
        vaddss  xmm0, xmm0, xmm1
        vmovsd  xmm1, qword ptr [rsi]
        vmulps  xmm2, xmm1, xmmword ptr [rip + .LCPI0_3]
        vxorps  xmm3, xmm3, xmm3 ;; and here
        vaddps  xmm2, xmm2, xmm3
        vmulps  xmm3, xmm1, xmmword ptr [rip + .LCPI0_4]
        vmulps  xmm1, xmm1, xmmword ptr [rip + .LCPI0_5]
        vaddps  xmm2, xmm3, xmm2
        vaddsubps xmm1, xmm2, xmm1
        vmovlps qword ptr [rdi], xmm1
        vmovss  dword ptr [rdi + 8], xmm0
        ret
Note the two xor/add sequences, which add zero to xmm1 and xmm2, respectively. This is with -Copt-level=3 -C target-cpu=native.
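A point that may be relevant when reading the sequences above: under IEEE 754 arithmetic, adding +0.0 is not an identity for every input, because -0.0 + 0.0 yields +0.0, while adding -0.0 does preserve the sign of a zero operand. Whether this is what keeps the xor/add pair in the output is an assumption, not something this report establishes. The helper names below are hypothetical and exist only for illustration.

```rust
/// Hypothetical helper: adds positive zero, like the all-zero bias in the report.
pub fn add_positive_zero(x: f32) -> f32 {
    x + 0.0
}

/// Hypothetical helper: adds negative zero instead.
pub fn add_negative_zero(x: f32) -> f32 {
    x + (-0.0)
}

/// Returns true if `x + 0.0` has the same sign bit as `x`.
pub fn sign_preserved_by_positive_zero(x: f32) -> bool {
    add_positive_zero(x).is_sign_negative() == x.is_sign_negative()
}
```

For ordinary nonzero inputs both helpers return the input unchanged; the only difference shows up when the input is -0.0.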
EDIT: Oh, and never mind that the code has a bug. The body of affine_transform should be:
std::array::from_fn(|i| (0..IN).fold(bias[i], |acc, j| acc + vec[j] * weights[i][j]))
Fixing that changes the second sequence to:
vxorps xmm5, xmm5, xmm5
vmulps xmm2, xmm2, xmmword ptr [rip + .LCPI0_5]
vaddps xmm0, xmm0, xmm5
vaddps xmm0, xmm0, xmm2
This is not as easy to spot. The second vaddps also looks suspicious.
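For reference, here is a minimal sketch of the whole function with the corrected line from the EDIT above, so the indexing can be checked against a small case by hand. The `example` function and its inputs are hypothetical, chosen so every intermediate value is exactly representable in f32; the `where` clause is only a restyling of the original bounds.

```rust
use std::ops::{Add, Mul};

/// Computes `weights * vec + bias` for an OUT x IN weight matrix,
/// iterating over the IN columns of each row (the corrected indexing).
pub fn affine_transform<T, const IN: usize, const OUT: usize>(
    weights: &[[T; IN]; OUT],
    bias: &[T; OUT],
    vec: &[T; IN],
) -> [T; OUT]
where
    T: Copy + Add<Output = T> + Mul<Output = T>,
{
    std::array::from_fn(|i| (0..IN).fold(bias[i], |acc, j| acc + vec[j] * weights[i][j]))
}

/// Hypothetical usage with a 2 x 3 matrix (not taken from the report).
pub fn example() -> [f32; 2] {
    let weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
    let bias = [0.5, -0.5];
    let vec = [1.0, 0.0, -1.0];
    // Row 0:  0.5 + 1*1 + 0*2 + (-1)*3 = -1.5
    // Row 1: -0.5 + 1*4 + 0*5 + (-1)*6 = -2.5
    affine_transform(&weights, &bias, &vec)
}
```

Using a non-square matrix makes the original mix-up of IN and OUT visible: with the uncorrected line, the same call would read `vec[i]` rather than `vec[j]` and fold over `0..OUT`.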
Meta
rustc --version --verbose: every version I tried in Compiler Explorer, including nightly.