Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

icp: Port AVX2 implementation of aes-gcm from BoringSSL #17058

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

lowjoel
Copy link

@lowjoel lowjoel commented Feb 15, 2025

Motivation and Context

Zen 3 CPUs support the VAES and VPCLMULDQ instructions which extend the width of each instruction from 128-bits to 256-bits. BoringSSL has recently implemented this version for AES-GCM and it provides up to a 80% speedup. See google/boringssl@3b6e1be.

Description

I've backported the implementation from BoringSSL, adapting code from google/boringssl@3b6e1be (but picking the tip of master), as well as from google/boringssl@62f9751 which changed the primitive signature from 6 arguments to 7 (by not implicitly relying on the address offset of the ghash structure.)

Adaptations for icp (akin to #9749) as well as to use the RET macro for kernel code are in the second commit.

The third and fourth commits combine the use_avx/use_avx2 flags into an enum, allowing toggling of the different implementations that are available. Also, define different values of CAN_USE_GCM_ASM to indicate various levels of compiler support.

How Has This Been Tested?

Compile tested.

I'm now running it on my ZFS-on-root main machine.

@robn has a Wycheproof test set (#17089) that has been merged; these changes pass.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@lowjoel lowjoel marked this pull request as draft February 15, 2025 07:34
@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Feb 15, 2025
@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 2 times, most recently from 940863a to fbfc371 Compare February 15, 2025 08:18
@lowjoel
Copy link
Author

lowjoel commented Feb 15, 2025

Hey @AttilaFueloep I tried to port your changes in #9749 and I can't figure out what you mean by the change with the comment // ICP does not zero key schedule. in the modified assembly sources. I've ported the other two changes, for the round offset in the AES_KEY struct, as well as the OpenSSL vs ICP representation of number of AES rounds. I've also fixed the gcm_init_vpclmulqdq_avx2 method which you mentioned ICP stores H in network order.

@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 3 times, most recently from 0ce5f7b to 1b8867f Compare February 15, 2025 09:00
@robn
Copy link
Member

robn commented Feb 15, 2025

Strong opening move! I've had a little play with it, here's what I got.

Before the final thing, you should run make checkstyle, and fix the errors it throws out.

I added this patch to let me see what it thinks is happening on my test machines:

diff --git module/zcommon/simd_stat.c module/zcommon/simd_stat.c
index d82a88ca9..da557bbb0 100644
--- module/zcommon/simd_stat.c
+++ module/zcommon/simd_stat.c
@@ -117,6 +117,10 @@ simd_stat_kstat_data(char *buf, size_t size, void *data)
 		    "pclmulqdq", zfs_pclmulqdq_available());
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "movbe", zfs_movbe_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vaes", zfs_vaes_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vpclmulqdq", zfs_vpclmulqdq_available());
 
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "osxsave", boot_cpu_has(X86_FEATURE_OSXSAVE));

With that, my old Intel 2019 junker laptop says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    0
vpclmulqdq              0

My much nicer Ryzen 5 from last year says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    1
vpclmulqdq              1

Unfortunately we don't have visibility on the ICP microbenchmarks like we do for checksums and raidz, but we can at least see the options available:

# cat /sys/module/zfs/parameters/icp_gcm_impl
cycle [fastest] avx avx2 generic pclmulqdq

So I'd say it's all wired up right, which is half the fun.

I set it to avx and created a dataset, then unmounted and exported. Then I set it to avx2 and tried to zfs load-key, but that didn't work:

# zfs load-key -a
Enter passphrase for 'tank/enc':
Key load error: Incorrect key provided for 'tank/enc'.

Trying to create the pool and dataset with avx2 selected gets a crash:

# zpool create tank -O encryption=aes-256-gcm -O keyformat=passphrase /home/robn/blk
Enter new passphrase:
Re-enter new passphrase:
Killed

And the kernel has a nice complaint:

[ 2769.438918] BUG: unable to handle page fault for address: ffff9c23c8307e10
[ 2769.439133] #PF: supervisor read access in kernel mode
[ 2769.439290] #PF: error_code(0x0000) - not-present page
[ 2769.439447] PGD 110e01067 P4D 110e01067 PUD 0
[ 2769.439588] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 2769.439733] CPU: 0 PID: 73114 Comm: zpool Tainted: P           OE      6.1.0-25-amd64 #1  Debian 6.1.106-3
[ 2769.440035] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 14.0 10/17/2021
[ 2769.440236] RIP: 0010:aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.440596] Code: 0c 24 c4 e2 71 00 c8 c4 42 7d 5a 18 c4 62 25 00 d8 44 8b 91 78 01 00 00 46 8d 14 95 f0 ff ff ff 4e 8d 5c 91 60 c4 62 7d 5a 09 <c4> 42 7d 5a 13 c5 25 fe 1d d5 ed 0b 00 48 83 fa 7f 0f 8
6 31 03 00
[ 2769.441171] RSP: 0018:ffffb1b6c37df358 EFLAGS: 00010086
[ 2769.441339] RAX: ffffffffc089b140 RBX: ffffb1b6c37df43c RCX: ffff9c214b6eae00
[ 2769.441566] RDX: 0000000000000060 RSI: ffff9c208e918500 RDI: ffff9c208e918500
[ 2769.441793] RBP: 0000000000000006 R08: ffffb1b6c37df430 R09: ffff9c20e8eba3c0
[ 2769.442019] R10: 000000009f3073ec R11: ffff9c23c8307e10 R12: ffffb1b6c37df498
[ 2769.442102] R13: ffffffffc089b140 R14: ffffb1b6c37df420 R15: ffffb1b6c37df488
[ 2769.442102] FS:  00007f2d7ead5840(0000) GS:ffff9c23afc00000(0000) knlGS:0000000000000000
[ 2769.442102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2769.442102] CR2: ffff9c23c8307e10 CR3: 00000001cccaa000 CR4: 00000000003506f0
[ 2769.442102] Call Trace:
[ 2769.442102]  <TASK>
[ 2769.442102]  ? __die_body.cold+0x1a/0x1f
[ 2769.442102]  ? page_fault_oops+0xd2/0x2b0
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? search_bpf_extables+0x5b/0x80
[ 2769.442102]  ? exc_page_fault+0xca/0x170
[ 2769.442102]  ? asm_exc_page_fault+0x22/0x30
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx2+0x2a/0x50 [zfs]
[ 2769.442102]  ? gcm_decrypt_final_avx+0x167/0x460 [zfs]
[ 2769.442102]  ? crypto_update_uio+0xc0/0x120 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_decrypt_atomic+0x1ca/0x310 [zfs]
[ 2769.442102]  ? crypto_decrypt+0x78/0x1c0 [zfs]
[ 2769.442102]  ? zio_do_crypt_uio+0x2ce/0x400 [zfs]
[ 2769.442102]  ? zio_crypt_key_unwrap+0x254/0x480 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? spa_keystore_dsl_key_hold_dd+0x12e/0x280 [zfs]
[ 2769.442102]  ? __kmalloc_node+0x4c/0x150
[ 2769.442102]  ? spa_keystore_create_mapping+0x79/0x200 [zfs]
[ 2769.442102]  ? dsl_dataset_hold_obj_flags+0x48/0x90 [zfs]
[ 2769.442102]  ? dsl_pool_create+0x1bc/0x490 [zfs]
[ 2769.442102]  ? spa_create+0x83d/0xdb0 [zfs]
[ 2769.442102]  ? zfs_ioc_pool_create+0xab/0x310 [zfs]
[ 2769.442102]  ? zfsdev_ioctl_common+0x6a0/0x7c0 [zfs]
[ 2769.442102]  ? __kmalloc_node+0xbf/0x150
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? zfsdev_ioctl+0x4f/0xd0 [zfs]
[ 2769.442102]  ? __x64_sys_ioctl+0x90/0xd0
[ 2769.442102]  ? do_syscall_64+0x55/0xb0

That's all I have time for tonight. This is a good start!

@robn
Copy link
Member

robn commented Feb 15, 2025

Oh the other thing I forgot to add, after loading the module it says fastest, and fails as above. So it is selecting avx2 as fastest, which is good! (we need a stat for it, for sure).

@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 14 times, most recently from ae64131 to c0a1e8e Compare February 16, 2025 07:12
} else {
/*
* Handle the "cycle" implementation by creating avx and
* non-avx contexts alternately.
*/
gcm_ctx->gcm_use_avx = gcm_toggle_avx();
gcm_ctx->gcm_use_avx2 = gcm_toggle_avx2();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure the cycle behaviour here isn't correct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed in #17061.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this PR have a dependency on #17061?

Copy link
Author

@lowjoel lowjoel Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cycle will still work, it just won't toggle the same way (or the assumed way) - is that behaviour documented somewhere? Under what circumstances would people be using cycle?

When this code runs on an AVX2 processor, cycle will only toggle between AVX2 and generic, ignoring AVX - I don't know if that is a deal breaker enough to consider it a "dependency"

AVX only processors should still cycle the same way.

(at least that's the intention - might have to think a bit deeper about this)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tonyhutter I've had another go at the cycle code - since movbe only applies to AVX I've modified the toggling for that to not apply to AVX2. Furthermore, if bswap is needed neither accelerated method will be usable.

@lowjoel lowjoel marked this pull request as ready for review February 16, 2025 23:10
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Feb 16, 2025
@AttilaFueloep
Copy link
Contributor

@lowjoel BTW, did you see #17058 (comment) already?

@robn
Copy link
Member

robn commented Mar 5, 2025

@AttilaFueloep sounds like merging now suits everyone: you get some breathing room, @lowjoel gets his gold star, and fancy folk with AVX2 machines get a nice bump (maybe in time for 2.3.1!)

And sure, once you've got a PR I'd be happy to stick it on the test rig and put it through its paces.

@AttilaFueloep
Copy link
Contributor

@robn That would be nice!

While I have your attention, a review of #14531, once rebased on the refactoring, would be nice as well. To be honest, I was a bit disappointed to see no review feedback on it for years. It has a couple of satisfied users, so I'm trying to get it over the line currently.

@lowjoel
Copy link
Author

lowjoel commented Mar 5, 2025

No, sorry. No Slack, IRC or whatever. Once bitten twice shy.

oops, sorry to hear that.

@lowjoel BTW, did you see #17058 (comment) already?

Thanks, I did, but the comment seemed out of place relative to the other asm comments. If I understand the assembly right, the line where the comment was added was intended to break the loop if we've reached the number of rounds for AES-256, but that seems to disagree with the comment. Maybe I'm misunderstanding something. Or does that affect some other key size?

I'm now also wondering if I'm missing testing for some corner case because that change hasn't been ported. And AFAIK BoringSSL uses the same API as OpenSSL

@AttilaFueloep
Copy link
Contributor

@lowjoel Again, off the top of my head,

break the loop if we've reached the number of rounds for AES-256,

Right, the original code used a zero detection to break the loop, while my change checks for the number of rounds given. Of course this depends on how the assembly is written, it may be totally different in your case.

Let me look it up more thoroughly, it's been six years since I wrote this so my memories are a bit hazy.

@AttilaFueloep
Copy link
Contributor

@lowjoel Ok, had a quick look:

module/icp/asm-x86_64/modes/aesni-gcm-avx2.S

384	movl	240(%rcx),%r10d
385	leal	-20(,%r10,4),%r10d
....
473	cmpl	$24,%r10d
474	jl	.Laes128__func1
475	je	.Laes192__func1

So %r10d seems to hold the number of rounds. 24 seems to be a strange value though, not sure what '%r10d' contains. Anyway you seem to be good since this code doesn't depend on a zeroed keyschedule.

@lowjoel
Copy link
Author

lowjoel commented Mar 5, 2025

@AttilaFueloep

24 seems to be a strange value though, not sure what '%r10d' contains.

I also interpreted it to be number of rounds, 0-based. The instruction leal -20(,%r10,4),%r10d then converts r10 into one of {16,24,32} (had to fix the math for ICP), hence the later cmp against 24 checks for AES-192. r10 is also used to produce r11 which loads the later round key hence the mathing around... I had to stare at it for a good while myself.

Anyway you seem to be good since this code doesn't depend on a zeroed keyschedule.

Thanks!

Copy link
Contributor

@AttilaFueloep AttilaFueloep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quick first pass review of the assembly file only.


#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_X86_64) && defined(__ELF__)
.section .rodata
.align 16

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please always use either .balign or .p2align. Depending on the assembler .align resolves to the former or the latter.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a02c1c1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Comment on lines 47 to 52
.text
.globl gcm_init_vpclmulqdq_avx2
.hidden gcm_init_vpclmulqdq_avx2
.type gcm_init_vpclmulqdq_avx2,@function
.align 32
gcm_init_vpclmulqdq_avx2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.text
.globl gcm_init_vpclmulqdq_avx2
.hidden gcm_init_vpclmulqdq_avx2
.type gcm_init_vpclmulqdq_avx2,@function
.align 32
gcm_init_vpclmulqdq_avx2:
ENTRY_ALIGN(gcm_init_vpclmulqdq_avx2, 32)

Please always use ENTRY_ALIGN.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a02c1c1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

gcm_init_vpclmulqdq_avx2:
.cfi_startproc

_CET_ENDBR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_CET_ENDBR
ENDBR

Dito.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a02c1c1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

RET

.cfi_endproc
.size gcm_init_vpclmulqdq_avx2, . - gcm_init_vpclmulqdq_avx2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.size gcm_init_vpclmulqdq_avx2, . - gcm_init_vpclmulqdq_avx2
SET_SIZE(gcm_init_vpclmulqdq_avx2)

Dito.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a02c1c1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@AttilaFueloep
Copy link
Contributor

@lowjoel

24 seems to be a strange value though, not sure what '%r10d' contains.

I also interpreted it to be number of rounds, 0-based. The instruction leal -20(,%r10,4),%r10d then converts r10 into one of {16,24,32}

Sure, I was just wondering why it's expected to be 24 and not 11. If I read the code correctly %r10 is only used in compare and branch context where you could use both. I may miss something obvious though.

Comment on lines 1184 to 1188
static inline size_t
gcm_simd_get_htab_size(void)
{
return (2 * 6 * 2 * sizeof (uint64_t));
}
Copy link
Contributor

@AttilaFueloep AttilaFueloep Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't accurate. Each implementation has it's own Htab size and we have to handle it.

On lines 678 - 682 above we are allocating the Htab using this function which will return 192. But we need 16*16=256 bytes (const u128 Htable[16]). The reason this works is by coincidence. In the BoringSSL code I see

We do not use Htable[12..15].

So 192 bytes are sufficient. Still it seems like bad style to declare u128 Htable[16] but allocate u128 Htable[12].

Not sure if the removed return 0 for the non-SIMD case is needed, will look tomorrow

Copy link
Author

@lowjoel lowjoel Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the removed return 0 for the non-SIMD case is needed, will look tomorrow

Line 675-676:

if (gcm_ctx->gcm_use_avx == B_TRUE) {
	size_t htab_len = gcm_simd_get_htab_size(gcm_ctx->gcm_use_avx);

I only ever see gcm_simd_get_htab_size called with TRUE.

I agree that relying on the fact that BoringSSL only uses the first 12 bytes is a happy coincidence; I'll look at how to rewire tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the removed return 0 for the non-SIMD case is needed, will look tomorrow
I only ever see gcm_simd_get_htab_size called with TRUE.

Agreed.

I agree that relying on the fact that BoringSSL only uses the first 12 bytes is a happy coincidence; I'll look at how to rewire tomorrow.

Not a big deal. You could either adjust the prototypes to use u128[12] or change gcm_simd_get_htab_size to return 256 for the AVX2 case. I'm fine either way. This is mostly about avoiding readers wondering about the size of Htab.

Copy link
Contributor

@AttilaFueloep AttilaFueloep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the risk of nitpicking, I think we should reconsider the module parameter name. Currently it's called avx2 but the main feature it adds is support for the VAES/VCLMUL extension to AES-NI/CLMUL. So I'd suggest something like vaes-avx2. (Actually it uses AVX2 only due to the fact that Zen 3 has no AVX-512/AVX10.) Thoughts?

I've some more nits but since all of them are in areas I'm refactoring right now it wouldn't make much sense to bring them up here. There's one optimization possibility regarding GCM_AVX_MIN_{DE,EN}CRYPT_BYTES (should both be 16 for VAES). But that's hard to fix and I already solved it locally. So I think it's OK to leave it as is right know.

So modulo Htab size and module parameter name we are good to go I'd say.

@lowjoel
Copy link
Author

lowjoel commented Mar 8, 2025

At the risk of nitpicking, I think we should reconsider the module parameter name. Currently it's called avx2 but the main feature it adds is support for the VAES/VCLMUL extension to AES-NI/CLMUL. So I'd suggest something like vaes-avx2. (Actually it uses AVX2 only due to the fact that Zen 3 has no AVX-512/AVX10.) Thoughts?

Yeah. I think you're right. The commit calls it VAES+AVX2; I should have stuck with it. I'll push this later today

So modulo Htab size and module parameter name we are good to go I'd say.

Added htab refactor in c49a94c - the method seemed to not fit in the program flow any more hence the slightly more aggressive rewrite. Thanks @AttilaFueloep

@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 3 times, most recently from a6b8673 to 3fc6ac8 Compare March 8, 2025 10:14
@lowjoel
Copy link
Author

lowjoel commented Mar 8, 2025

@AttilaFueloep renamed to avx2-vaes in 59fb58a; identifiers in code still remains just avx2 for symmetry with the assembly.

lowjoel and others added 8 commits March 9, 2025 10:06
This uses the AVX2 versions of the AESENC and PCLMULQDQ instructions; on
Zen 3 this provides an up to 80% performance improvement.

Original source:
https://github.com/google/boringssl/blob/13840dd094f9e9c1b00a7368aa25e656554221f1/gen/bcm/aes-gcm-avx2-x86_64-linux.S

See the original BoringSSL commit at
google/boringssl@3b6e1be.

Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
 - Accept GCM H variable in network endianness (ICP convention)
 - Fix round count offset in AES_KEY struct (ICP convention)
 - Use RET macro for kernel code
 - Explicitly use .balign directive
 - Use ENTRY_ALIGN and SET_SIZE macros
 - Use ENDBR macro

Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
The BoringSSL AVX2 implementation allocates uint128_t[16] but only uses
the first 12 elements. Be explicit in the code.

Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants