This code (Zig Godbolt, LLVM Godbolt):
export fn foo(a: @Vector(8, u64), b: @Vector(8, u64), c: @Vector(8, u64)) @Vector(8, u64) {
    const x = a & b;
    const y = a & b & c;
    return x *% y;
}
define dso_local <8 x i64> @foo(<8 x i64> %0, <8 x i64> %1, <8 x i64> %2) local_unnamed_addr {
Entry:
  %3 = and <8 x i64> %1, %0
  %4 = and <8 x i64> %3, %2
  %5 = mul <8 x i64> %4, %3
  ret <8 x i64> %5
}
Compiles to:
vpandq zmm0, zmm1, zmm0
vpandq zmm1, zmm0, zmm2
vpmullq zmm0, zmm1, zmm0
Should be:
vpandq zmm3, zmm1, zmm0
vpternlogq zmm2, zmm1, zmm0, 128
vpmullq zmm0, zmm2, zmm3
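
As a sanity check on the immediate: vpternlogq treats its 8-bit immediate as a truth table indexed by the three input bits, so 128 (0x80) sets only the entry where all three inputs are 1, which is a three-way AND. Below is a minimal scalar sketch in Python that models a single 64-bit lane. The names `ternlog` and `foo_lane` are hypothetical helpers for illustration, not part of any library, and the index ordering (destination bit as the most significant index bit) follows my reading of Intel's description; for AND the ordering does not matter.

```python
MASK64 = (1 << 64) - 1

def ternlog(dst: int, src1: int, src2: int, imm8: int) -> int:
    """Scalar model of one 64-bit lane of vpternlogq (hypothetical helper)."""
    result = 0
    for bit in range(64):
        # The three input bits form a 3-bit index into the imm8 truth table.
        idx = (((dst >> bit) & 1) << 2) | (((src1 >> bit) & 1) << 1) | ((src2 >> bit) & 1)
        result |= ((imm8 >> idx) & 1) << bit
    return result

def foo_lane(a: int, b: int, c: int) -> int:
    """One lane of the proposed sequence: vpandq, vpternlogq, vpmullq."""
    x = a & b                   # vpandq zmm3, zmm1, zmm0
    y = ternlog(c, b, a, 128)   # vpternlogq zmm2, zmm1, zmm0, 128
    return (x * y) & MASK64     # vpmullq keeps the low 64 bits, like *% in Zig
```

For any 64-bit inputs, `foo_lane(a, b, c)` should match `((a & b) * (a & b & c)) & MASK64`, which is what the original Zig function computes per lane.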
In the current assembly, the second vpandq depends on the result of the first one. vpternlogq, on the other hand, reads only the original inputs, so it can execute in parallel with the first vpandq.
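
The dependency argument can be sketched by counting the longest chain of dependent instructions in each sequence. This is only a rough model in Python: the instruction labels and the `chain_depth` helper are hypothetical, and it counts instructions rather than cycles, so the real gain depends on the latencies and port layout of the target microarchitecture.

```python
def chain_depth(deps):
    """Length of the longest dependency chain, counted in instructions."""
    memo = {}

    def depth(node):
        if node not in memo:
            # An instruction sits one level below its deepest producer.
            memo[node] = 1 + max((depth(d) for d in deps[node]), default=0)
        return memo[node]

    return max(depth(n) for n in deps)

# Each instruction maps to the earlier instructions whose results it reads.
current = {"vpandq_1": [], "vpandq_2": ["vpandq_1"], "vpmullq": ["vpandq_1", "vpandq_2"]}
proposed = {"vpandq": [], "vpternlogq": [], "vpmullq": ["vpandq", "vpternlogq"]}

print(chain_depth(current), chain_depth(proposed))  # prints "3 2"
```

Under this model the current sequence has a chain of three dependent instructions, while the proposed one has a chain of two.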