Opened on Apr 2, 2021
It'd be great to get good support for SVE, especially as SVE2 will become standard for ARMv9.
However, early tests with using LLVM vector intrinsics on the A64FX did not go well.
Here is a minimal example on Godbolt, showing a vectorized (but not unrolled) dot product on the A64FX, which has 512-bit vectors.
The problem is that `<8 x double>` gets translated into 4x `<2 x double>` NEON instructions, instead of an SVE instruction.
`v` registers are NEON, and we see that the single `@llvm.fma.v8f64` was broken up into 4 separate `fmla` instructions.
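For reference, here is a minimal sketch of how a single `<8 x double>` fma can be emitted from Julia. It assumes Julia 1.6 or later (for the module form of `Base.llvmcall`); the names `Vec8` and `vfma8` are made up for illustration, and this is not the code behind the Godbolt link.

```julia
# Hypothetical alias: a tuple of 8 VecElement{Float64} is passed to LLVM as <8 x double>.
const Vec8 = NTuple{8,Core.VecElement{Float64}}

# Emit one call to the fixed-width @llvm.fma.v8f64 intrinsic.
function vfma8(a::Vec8, b::Vec8, c::Vec8)
    Base.llvmcall(("""
        declare <8 x double> @llvm.fma.v8f64(<8 x double>, <8 x double>, <8 x double>)

        define <8 x double> @entry(<8 x double> %a, <8 x double> %b, <8 x double> %c) #0 {
        top:
            %r = call <8 x double> @llvm.fma.v8f64(<8 x double> %a, <8 x double> %b, <8 x double> %c)
            ret <8 x double> %r
        }

        attributes #0 = { alwaysinline }
        """, "entry"), Vec8, Tuple{Vec8,Vec8,Vec8}, a, b, c)
end

# Build test vectors: a = (1, 2, ..., 8), b = all 2.0, c = all 1.0.
a = ntuple(i -> Core.VecElement(Float64(i)), 8)
b = ntuple(_ -> Core.VecElement(2.0), 8)
c = ntuple(_ -> Core.VecElement(1.0), 8)

r = vfma8(a, b, c)          # elementwise a * b + c
vals = [x.value for x in r]
```

Running `code_native(vfma8, Tuple{Vec8,Vec8,Vec8})` on the target machine should show whether the resulting `fmla` instructions use `v` (NEON) or `z` (SVE) registers.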
Based on this document, SVE registers would be denoted by `z[0-31]`.
This makes me wonder whether, to actually get intrinsic support for SVE, we'd need to use `<vscale x 2 x double>`, etc., instead?
This isn't compelling in Julia (unlike C/C++/wherever folks distribute binaries), since we're probably compiling for the specific target machine anyway, and can easily find the appropriate vector length using `@llvm.vscale.i64`.
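To make the relationship concrete: on AArch64 SVE, `vscale` is the hardware vector length divided by 128 bits, so a `<vscale x 2 x double>` on a 512-bit machine holds 8 doubles. The helpers below are only a sketch of that arithmetic in plain Julia; the function names are hypothetical and nothing here queries the hardware.

```julia
# SVE vectors come in multiples of a 128-bit granule; vscale counts those granules.
const SVE_GRANULE_BITS = 128

# vscale for a given hardware vector length in bits (e.g. 512 on the A64FX).
sve_vscale(vector_bits::Integer) = div(vector_bits, SVE_GRANULE_BITS)

# N in <vscale x N x T>: how many elements of T fit in one 128-bit granule.
granule_lanes(::Type{T}) where {T} = div(SVE_GRANULE_BITS, 8 * sizeof(T))

# Total number of elements of T in one full SVE register.
total_lanes(::Type{T}, vector_bits::Integer) where {T} =
    sve_vscale(vector_bits) * granule_lanes(T)

vs = sve_vscale(512)             # granules per register on a 512-bit machine
n  = granule_lanes(Float64)      # the "2" in <vscale x 2 x double>
l  = total_lanes(Float64, 512)   # doubles per register, matching <8 x double>
```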
Furthermore, we don't have any way to represent that at the moment. `NTuple{L,Core.VecElement{T}}` corresponds to `<L x T>`, but there's no vscale equivalent.
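As a small illustration of the existing fixed-width mapping, the sketch below passes an `NTuple{4,Core.VecElement{Float32}}` to `Base.llvmcall` as a `<4 x float>` and prints the resulting IR. The names `Vec4f` and `vadd4` are hypothetical, chosen only for this example.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical alias: this tuple type is what llvmcall treats as <4 x float>.
const Vec4f = NTuple{4,Core.VecElement{Float32}}

# The length 4 is baked into both the Julia type and the IR string;
# there is currently no tuple type that would map to <vscale x 4 x float>.
vadd4(a::Vec4f, b::Vec4f) = Base.llvmcall("""
    %res = fadd <4 x float> %0, %1
    ret <4 x float> %res
    """, Vec4f, Tuple{Vec4f,Vec4f}, a, b)

x = ntuple(i -> Core.VecElement(Float32(i)), 4)
y = ntuple(_ -> Core.VecElement(1.0f0), 4)
s = [e.value for e in vadd4(x, y)]

# Capture the LLVM IR Julia generates for vadd4 so the vector type can be inspected.
ir = sprint(io -> code_llvm(io, vadd4, Tuple{Vec4f,Vec4f}))
println(ir)
```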
Anyone have any insight into/knowledge about this?