Description
x86 (through the FMA extension that ships alongside AVX2 on most CPUs) and ARM NEON both provide fused multiply-add instructions, so it would be useful to be able to emit them explicitly through implementations of MulAdd and MulAddAssign. Peak FLOP/s figures are usually quoted assuming FMA, with each fused operation counted as two floating-point operations, so this will likely improve performance on matrix multiplication benchmarks.
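To make the request concrete, here is a minimal sketch of what such implementations could look like. It assumes the traits in question have the same shape as MulAdd and MulAddAssign in the num-traits crate; they are redeclared locally so the snippet compiles on its own. The F32x4 wrapper is a hypothetical stand-in for this project's vector types, not an existing API. The per-lane work is delegated to the standard library's f32::mul_add, which rounds only once; whether that lowers to a hardware FMA instruction depends on the target features enabled at compile time.

```rust
// Local stand-ins mirroring the shape of the num-traits traits (assumption:
// these are the MulAdd / MulAddAssign the issue refers to).
pub trait MulAdd<A = Self, B = Self> {
    type Output;
    /// Computes (self * a) + b with a single rounding step.
    fn mul_add(self, a: A, b: B) -> Self::Output;
}

pub trait MulAddAssign<A = Self, B = Self> {
    /// In-place variant: self = (self * a) + b.
    fn mul_add_assign(&mut self, a: A, b: B);
}

/// Hypothetical 4-lane f32 vector used only for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4(pub [f32; 4]);

impl MulAdd for F32x4 {
    type Output = F32x4;

    fn mul_add(self, a: F32x4, b: F32x4) -> F32x4 {
        // f32::mul_add is the inherent std method; it is fused (one rounding).
        // With the right target features the compiler can turn this loop into
        // packed FMA instructions; otherwise it falls back to a slower routine.
        F32x4(std::array::from_fn(|i| self.0[i].mul_add(a.0[i], b.0[i])))
    }
}

impl MulAddAssign for F32x4 {
    fn mul_add_assign(&mut self, a: F32x4, b: F32x4) {
        *self = (*self).mul_add(a, b);
    }
}
```

With these in place, an accumulation step in a matrix multiplication inner loop could be written as acc = a.mul_add(b, acc), which states the intent to fuse explicitly rather than relying on the optimizer, since Rust does not contract a separate multiply and add into an FMA on its own.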