opened on Jan 4, 2024
Background and motivation
AVX-512 IFMA is supported by Intel on Cannon Lake and newer architectures, and by AMD on Zen 4.
These instructions are known to be useful for cryptography and large-number arithmetic. They can also serve as a faster, though narrower (52-bit), alternative to the VPMULLQ instruction, which is about 5x slower on Intel CPUs than on AMD Zen 4, whereas VPMADD52LUQ finishes in only 4 clock cycles.
API Proposal
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static new bool IsSupported { get; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);

        public new abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }

            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);

            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}
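For reviewers less familiar with the underlying instructions, the per-element behavior of the two proposed methods can be modeled in scalar code. The sketch below is illustrative only: the class name Ifma52Reference is hypothetical and not part of the proposal, and it assumes that a is the 64-bit accumulator while b and c are the multiplicands, matching the operand order of the VPMADD52LUQ and VPMADD52HUQ instructions. It models a single 64-bit lane; the vector methods would apply the same operation to every lane.

```csharp
using System;

// Hypothetical scalar reference model, not part of the proposed API surface.
public static class Ifma52Reference
{
    // Only the low 52 bits of each multiplicand take part in the product.
    private const ulong Mask52 = (1UL << 52) - 1;

    // Models one lane of VPMADD52LUQ:
    // a + low 52 bits of the 104-bit product (b[51:0] * c[51:0]).
    public static ulong MultiplyAdd52Low(ulong a, ulong b, ulong c)
    {
        // Math.BigMul returns the high 64 bits and writes the low 64 bits to 'lo'.
        _ = Math.BigMul(b & Mask52, c & Mask52, out ulong lo);
        return unchecked(a + (lo & Mask52));
    }

    // Models one lane of VPMADD52HUQ:
    // a + bits 103:52 of the 104-bit product (b[51:0] * c[51:0]).
    public static ulong MultiplyAdd52High(ulong a, ulong b, ulong c)
    {
        ulong hi = Math.BigMul(b & Mask52, c & Mask52, out ulong lo);
        // 'hi' holds product bits 127:64 (at most 40 significant bits here),
        // 'lo' holds bits 63:0, so bits 103:52 are assembled from both halves.
        ulong high52 = (hi << 12) | (lo >> 52);
        return unchecked(a + high52);
    }
}
```

Under this model, Ifma52Reference.MultiplyAdd52Low(1, 3, 5) returns 16, and Ifma52Reference.MultiplyAdd52High(10, 1UL << 51, 2) returns 11, because the product 2^52 contributes nothing to the low 52 bits and exactly 1 to the high part. The addition wraps at 64 bits, which is why code built on these instructions typically keeps limbs at 52 bits and uses the remaining 12 bits of each lane as carry headroom.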
API Usage
zmm0 = Avx512Ifma.MultiplyAdd52Low(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.MultiplyAdd52High(zmm1, zmm2, zmm3);
An example of vectorized Montgomery reduction implementations using the equivalent C++ intrinsics:
Alternative Designs
Risks
None