
AES-NI #1433

Open
@MaxGraey

Description


I'm not sure if this is the right place for this proposal, but I would like to suggest some very useful instructions for speeding up cryptography, and especially cryptographic and non-cryptographic hashing.

What are the instructions being proposed?

AES-NI (Advanced Encryption Standard New Instructions) is an extended instruction set that accelerates AES encryption / decryption.

v128.aes.enc(a, b)

Perform one round of an AES encryption flow

general:

fn aesenc(v128 a, v128 b) {
  return MixColumns(ShiftRows(SubBytes(a))) ^ b
}

x86: aesenc
ARM: AESMC + AESE + EOR
PPC: vcipher

v128.aes.enc_last(a, b)

Perform the last round of an AES encryption flow

general:

fn aesenc_last(v128 a, v128 b) {
  return ShiftRows(SubBytes(a)) ^ b
}

x86: aesenclast
ARM: AESE + EOR
PPC: vcipherlast
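
The semantics of the two encryption-round instructions above can be sketched as a plain-C reference model (my sketch, not part of the proposal: the state is 16 bytes in column-major order as in FIPS-197, which matches how x86 `aesenc` interprets its 128-bit operand; the function names are illustrative only):

```c
#include <stdint.h>
#include <string.h>

/* FIPS-197 forward S-box */
static const uint8_t SBOX[256] = {
  0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
  0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
  0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
  0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
  0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
  0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
  0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
  0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
  0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
  0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
  0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
  0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
  0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
  0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
  0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
  0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16
};

/* multiply by x (i.e. by 2) in GF(2^8) with the AES reduction polynomial 0x11b */
static uint8_t xtime(uint8_t a) { return (uint8_t)((a << 1) ^ ((a >> 7) * 0x1b)); }

/* state byte index = row + 4*col (column-major, as in FIPS-197) */
static void sub_bytes(uint8_t s[16]) {
  for (int i = 0; i < 16; i++) s[i] = SBOX[s[i]];
}

/* row r is rotated left by r positions */
static void shift_rows(uint8_t s[16]) {
  uint8_t t[16];
  for (int c = 0; c < 4; c++)
    for (int r = 0; r < 4; r++)
      t[r + 4*c] = s[r + 4*((c + r) % 4)];
  memcpy(s, t, 16);
}

/* each column is multiplied by the fixed polynomial {03}x^3+{01}x^2+{01}x+{02} */
static void mix_columns(uint8_t s[16]) {
  for (int c = 0; c < 4; c++) {
    uint8_t *a = s + 4*c, t[4];
    for (int r = 0; r < 4; r++)
      t[r] = (uint8_t)(xtime(a[r])                             /* 2*a[r]   */
                     ^ xtime(a[(r+1)%4]) ^ a[(r+1)%4]          /* 3*a[r+1] */
                     ^ a[(r+2)%4] ^ a[(r+3)%4]);
    memcpy(a, t, 4);
  }
}

/* v128.aes.enc: MixColumns(ShiftRows(SubBytes(a))) ^ b */
void aesenc(uint8_t a[16], const uint8_t b[16]) {
  sub_bytes(a); shift_rows(a); mix_columns(a);
  for (int i = 0; i < 16; i++) a[i] ^= b[i];
}

/* v128.aes.enc_last: ShiftRows(SubBytes(a)) ^ b */
void aesenc_last(uint8_t a[16], const uint8_t b[16]) {
  sub_bytes(a); shift_rows(a);
  for (int i = 0; i < 16; i++) a[i] ^= b[i];
}
```

A useful sanity check: for an all-zero state and round key, SubBytes maps every byte to 0x63, a uniform state is unchanged by ShiftRows, and MixColumns of a uniform column is the identity, so both rounds yield sixteen 0x63 bytes.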

v128.aes.dec(a, b)

Perform one round of an AES decryption flow

general:

fn aesdec(v128 a, v128 b) {
  return MixColumnsInv(ShiftRowsInv(SubBytesInv(a))) ^ b
}

x86: aesdec
ARM: AESIMC + AESD + EOR
PPC: vncipher

v128.aes.dec_last(a, b)

Perform the last round of an AES decryption flow

general:

fn aesdec_last(v128 a, v128 b) {
  return ShiftRowsInv(SubBytesInv(a)) ^ b
}

x86: aesdeclast
ARM: AESD + EOR
PPC: vncipherlast
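
The last decryption round can be modelled the same way (again my illustrative sketch, not the proposal's encoding). Since the S-box is a permutation of 0..255, its inverse can simply be derived by table inversion, and because SubBytes is bytewise while ShiftRows only permutes byte positions, `aesdec_last` exactly undoes `aesenc_last` when the round key is zero:

```c
#include <stdint.h>
#include <string.h>

/* FIPS-197 forward S-box */
static const uint8_t SBOX[256] = {
  0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
  0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
  0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
  0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
  0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
  0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
  0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
  0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
  0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
  0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
  0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
  0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
  0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
  0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
  0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
  0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16
};

static uint8_t INV_SBOX[256];

/* the S-box is a permutation, so invert it by table lookup */
static void init_inv_sbox(void) {
  for (int i = 0; i < 256; i++) INV_SBOX[SBOX[i]] = (uint8_t)i;
}

/* forward: row r rotates left by r; inverse: row r rotates right by r */
static void shift_rows(uint8_t s[16], int inverse) {
  uint8_t t[16];
  for (int c = 0; c < 4; c++)
    for (int r = 0; r < 4; r++) {
      int src = inverse ? (c + 4 - r) % 4 : (c + r) % 4;
      t[r + 4*c] = s[r + 4*src];
    }
  memcpy(s, t, 16);
}

/* v128.aes.enc_last: ShiftRows(SubBytes(a)) ^ b */
void aesenc_last(uint8_t a[16], const uint8_t b[16]) {
  for (int i = 0; i < 16; i++) a[i] = SBOX[a[i]];
  shift_rows(a, 0);
  for (int i = 0; i < 16; i++) a[i] ^= b[i];
}

/* v128.aes.dec_last: ShiftRowsInv(SubBytesInv(a)) ^ b
   (the two inverse steps commute, so this order matches x86 aesdeclast) */
void aesdec_last(uint8_t a[16], const uint8_t b[16]) {
  for (int i = 0; i < 16; i++) a[i] = INV_SBOX[a[i]];
  shift_rows(a, 1);
  for (int i = 0; i < 16; i++) a[i] ^= b[i];
}
```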

v128.aes.keygen(a, imm8)

Assists in generating the round keys used for encryption

general:

fn aeskeygen(v128 a, u8 imm8) {
  X3[31:0] = a[127:96]
  X2[31:0] = a[95:64]
  X1[31:0] = a[63:32]
  X0[31:0] = a[31:0]
  RCON[31:0] = ZeroExtend(imm8)
  vDst[31:0] = SubWord(X1)
  vDst[63:32] = RotWord(SubWord(X1)) ^ RCON
  vDst[95:64] = SubWord(X3)
  vDst[127:96] = RotWord(SubWord(X3)) ^ RCON
  return vDst
}

x86: aeskeygenassist
ARM: no single instruction, but efficient emulation is possible (see emulating-x86-aes-intrinsics-on-armv8-a):

__m128i _mm_aeskeygenassist_si128_arm(__m128i a, const int imm8) {
    a = vaeseq_u8(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from AESE and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}

PPC:

__m128i _mm_aeskeygenassist_si128_ppc(__m128i a, const int imm8) {
    a = __builtin_crypto_vcipherlast(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from vcipherlast and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}

For details of operations such as MixColumns, ShiftRowsInv, etc., see Intel's white paper.
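
The key-generation pseudocode above can also be modelled in plain C (my reference sketch, viewing the v128 operand as four little-endian 32-bit lanes; SubWord applies the S-box to each byte of a word and RotWord is an 8-bit right rotation, matching the bit ranges in the pseudocode):

```c
#include <stdint.h>

/* FIPS-197 forward S-box */
static const uint8_t SBOX[256] = {
  0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
  0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
  0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
  0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
  0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
  0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
  0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
  0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
  0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
  0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
  0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
  0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
  0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
  0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
  0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
  0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16
};

/* SubWord: apply the S-box to each byte of a 32-bit word */
static uint32_t sub_word(uint32_t w) {
  return  (uint32_t)SBOX[w & 0xff]
       | ((uint32_t)SBOX[(w >> 8)  & 0xff] << 8)
       | ((uint32_t)SBOX[(w >> 16) & 0xff] << 16)
       | ((uint32_t)SBOX[(w >> 24) & 0xff] << 24);
}

/* RotWord([a0,a1,a2,a3]) = [a1,a2,a3,a0]: rotate a little-endian word right by 8 */
static uint32_t rot_word(uint32_t w) { return (w >> 8) | (w << 24); }

/* v128.aes.keygen: a = {X0, X1, X2, X3} as 32-bit lanes (X0 and X2 are unused) */
void aeskeygen(uint32_t dst[4], const uint32_t a[4], uint8_t imm8) {
  uint32_t rcon = imm8;                       /* RCON = ZeroExtend(imm8) */
  dst[0] = sub_word(a[1]);
  dst[1] = rot_word(sub_word(a[1])) ^ rcon;
  dst[2] = sub_word(a[3]);
  dst[3] = rot_word(sub_word(a[3])) ^ rcon;
}
```

For an all-zero input, SubWord yields 0x63636363 in every lane (since the S-box maps 0 to 0x63), so the result is {0x63636363, 0x63636363 ^ rcon, 0x63636363, 0x63636363 ^ rcon}.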

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

  • x86: supported by Intel (Westmere, Sandy/Ivy Bridge, Haswell, Skylake, etc.) and AMD (Jaguar, Puma, and Zen 1 or later).
  • ARM: optionally supported on ARMv8-A (Cortex-A30/50/70-series cores), Qualcomm Snapdragon 805, Exynos 3 series.
  • RISC-V: no such dedicated instructions, although a number of RISC-V chips include integrated AES co-processors; such instructions may be standardized in the future.
  • POWER8/9/10 also support this (thanks to @nemequ for pointing that out).

What use cases are there?
