Description
I'm not sure if this is the right place for this proposal, but I would like to suggest some very useful instructions for speeding up cryptography, especially cryptographic and non-cryptographic hashing.
What are the instructions being proposed?
AES-NI (Advanced Encryption Standard New Instructions) is an extended instruction set that accelerates AES encryption / decryption.
v128.aes.enc(a, b)
Perform one round of an AES encryption flow
general:
fn aesenc(v128 a, v128 b) {
    return MixColumns(ShiftRows(SubBytes(a))) ^ b
}
x86: aesenc
ARM: AESMC + AESE + EOR
PPC: vcipher
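For illustration, here is a minimal sketch (not part of the proposal) of how an engine might lower this instruction with existing compiler intrinsics; the function name v128_aes_enc and the feature-macro guards are assumptions:
#if defined(__AES__)                 /* x86 with AES-NI enabled (-maes) */
#include <wmmintrin.h>
static inline __m128i v128_aes_enc(__m128i a, __m128i b) {
    /* aesenc: MixColumns(ShiftRows(SubBytes(a))) ^ b */
    return _mm_aesenc_si128(a, b);
}
#elif defined(__ARM_FEATURE_CRYPTO)  /* ARMv8-A Crypto Extension */
#include <arm_neon.h>
static inline uint8x16_t v128_aes_enc(uint8x16_t a, uint8x16_t b) {
    /* AESE with a zero key applies SubBytes and ShiftRows only;
       AESMC applies MixColumns; EOR applies the round key b. */
    return veorq_u8(vaesmcq_u8(vaeseq_u8(a, vdupq_n_u8(0))), b);
}
#endif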
v128.aes.enc_last(a, b)
Perform the last round of an AES encryption flow
general:
fn aesenc_last(v128 a, v128 b) {
    return ShiftRows(SubBytes(a)) ^ b
}
x86: aesenclast
ARM: AESE + EOR
PPC: vcipherlast
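To show how enc and enc_last compose, here is a hedged sketch of a full AES-128 block encryption written with the equivalent x86 intrinsics; the helper name and the pre-expanded round-key array rk are illustrative assumptions:
#include <wmmintrin.h>

static inline __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
    block = _mm_xor_si128(block, rk[0]);          /* initial AddRoundKey */
    for (int i = 1; i < 10; i++)
        block = _mm_aesenc_si128(block, rk[i]);   /* rounds 1..9: v128.aes.enc */
    return _mm_aesenclast_si128(block, rk[10]);   /* round 10: v128.aes.enc_last */
}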
v128.aes.dec(a, b)
Perform one round of an AES decryption flow
general:
fn aesdec(v128 a, v128 b) {
    return MixColumnsInv(ShiftRowsInv(SubBytesInv(a))) ^ b
}
x86: aesdec
ARM: AESIMC + AESD + EOR
PPC: vncipher
v128.aes.dec_last(a, b)
Perform the last round of an AES decryption flow
general:
fn aesdec_last(v128 a, v128 b) {
    return ShiftRowsInv(SubBytesInv(a)) ^ b
}
x86: aesdeclast
ARM: AESD + EOR
PPC: vncipherlast
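Decryption composes the same way. Note that aesdec on x86 implements the Equivalent Inverse Cipher, so the middle round keys must first be passed through InvMixColumns (aesimc). A hedged sketch with x86 intrinsics, where the array drk is assumed to already hold the transformed decryption keys:
#include <wmmintrin.h>

static inline __m128i aes128_decrypt_block(__m128i block, const __m128i drk[11]) {
    /* drk[0] = rk[0], drk[10] = rk[10], drk[i] = aesimc(rk[i]) for i = 1..9 */
    block = _mm_xor_si128(block, drk[10]);        /* undo the final AddRoundKey */
    for (int i = 9; i > 0; i--)
        block = _mm_aesdec_si128(block, drk[i]);  /* v128.aes.dec */
    return _mm_aesdeclast_si128(block, drk[0]);   /* v128.aes.dec_last */
}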
v128.aes.keygen(a, imm8)
Assist in generating the round keys used for encryption
general:
fn aeskeygen(v128 a, u8 imm8) {
    X3[31:0] = a[127:96]
    X2[31:0] = a[95:64]
    X1[31:0] = a[63:32]
    X0[31:0] = a[31:0]
    RCON[31:0] = ZeroExtend(imm8)
    vDst[31:0] = SubWord(X1)
    vDst[63:32] = RotWord(SubWord(X1)) ^ RCON
    vDst[95:64] = SubWord(X3)
    vDst[127:96] = RotWord(SubWord(X3)) ^ RCON
    return vDst
}
x86: aeskeygenassist
ARM: efficient emulation is possible (see emulating-x86-aes-intrinsics-on-armv8-a); the code below assumes __m128i is defined as a 128-bit vector type, as in that article:
__m128i _mm_aeskeygenassist_si128_arm(__m128i a, const int imm8) {
    a = vaeseq_u8(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from AESE and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}
PPC:
__m128i _mm_aeskeygenassist_si128_ppc(__m128i a, const int imm8) {
    a = __builtin_crypto_vcipherlast(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from vcipherlast and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}
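For context, this is how the keygen primitive is typically combined with shifts and XORs to expand a full AES-128 key schedule; the sketch below uses x86 intrinsics and illustrative helper names:
#include <wmmintrin.h>

static inline __m128i expand_step(__m128i key, __m128i assist) {
    /* broadcast the word holding RotWord(SubWord(X3)) ^ RCON */
    assist = _mm_shuffle_epi32(assist, _MM_SHUFFLE(3, 3, 3, 3));
    /* xor-fold the previous round key into itself, one word at a time */
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, assist);
}

static inline void aes128_key_expansion(__m128i rk[11], __m128i key) {
    rk[0] = key;
    #define ROUND(i, rcon) \
        rk[i] = expand_step(rk[i - 1], _mm_aeskeygenassist_si128(rk[i - 1], rcon))
    ROUND(1, 0x01); ROUND(2, 0x02); ROUND(3, 0x04); ROUND(4, 0x08); ROUND(5, 0x10);
    ROUND(6, 0x20); ROUND(7, 0x40); ROUND(8, 0x80); ROUND(9, 0x1B); ROUND(10, 0x36);
    #undef ROUND
}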
For details about operations such as MixColumns, ShiftRowsInv, etc., see Intel's AES-NI white paper.
How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
- x86: supported by Intel (Westmere, Sandy Bridge/Ivy Bridge, Haswell, Skylake, etc.) and AMD (Jaguar, Puma, Zen 1 and newer).
- ARM: optionally supported on ARMv8-A (ARM Cortex-A30/50/70 cores), Qualcomm 805, Exynos 3 series.
- RISC-V doesn't have such dedicated instructions, but a number of RISC-V chips include integrated AES co-processors, and an extension may be standardized in the future.
- POWER8/9/10 also support this (thanks to @nemequ for pointing that out).
What use cases are there?
- speed up AES encryption / decryption
- fast cryptographic and non-cryptographic hashing. Check these benchmark results: https://github.com/rurban/smhasher/blob/master/README.md; the fastest hash algorithms there all use AES-NI. A toy sketch of the technique follows below.
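As a toy illustration of the hashing use case (not any of the algorithms from the benchmark, and not collision-resistant), one aesenc per 16-byte chunk already gives strong mixing at roughly one instruction per block; all names below are made up for the example:
#include <stdint.h>
#include <string.h>
#include <wmmintrin.h>

static uint64_t aes_mix_hash(const uint8_t *data, size_t len, uint64_t seed) {
    __m128i state = _mm_set1_epi64x((long long)seed);
    __m128i chunk;
    size_t n = len;
    while (n >= 16) {
        memcpy(&chunk, data, 16);                 /* unaligned-safe load */
        state = _mm_aesenc_si128(state, chunk);   /* one AES round as the mixer */
        data += 16; n -= 16;
    }
    uint8_t tail[16] = {0};                       /* zero-padded tail */
    memcpy(tail, data, n);
    memcpy(&chunk, tail, 16);
    state = _mm_aesenc_si128(state, chunk);
    state = _mm_aesenc_si128(state, _mm_set1_epi64x((long long)len));
    state = _mm_aesenclast_si128(state, _mm_setzero_si128());
    return (uint64_t)_mm_cvtsi128_si64(state);    /* take the low 64 bits */
}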