-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
icp: Port AVX2 implementation of aes-gcm from BoringSSL #17058
base: master
Are you sure you want to change the base?
Conversation
940863a
to
fbfc371
Compare
Hey @AttilaFueloep I tried to port your changes in #9749 and I can't figure out what you mean by the change with the comment |
0ce5f7b
to
1b8867f
Compare
Strong opening move! I've had a little play with it, here's what I got. Before the final thing, you should run I added this patch to let me see what it thinks is happening on my test machines: diff --git module/zcommon/simd_stat.c module/zcommon/simd_stat.c
index d82a88ca9..da557bbb0 100644
--- module/zcommon/simd_stat.c
+++ module/zcommon/simd_stat.c
@@ -117,6 +117,10 @@ simd_stat_kstat_data(char *buf, size_t size, void *data)
"pclmulqdq", zfs_pclmulqdq_available());
off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
"movbe", zfs_movbe_available());
+ off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+ "vaes", zfs_vaes_available());
+ off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+ "vpclmulqdq", zfs_vpclmulqdq_available());
off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
"osxsave", boot_cpu_has(X86_FEATURE_OSXSAVE)); With that, my old Intel 2019 junker laptop says:
My much nicer Ryzen 5 from last year says:
Unfortunately we don't have visibility on the ICP microbenchmarks like we do for checksums and raidz, but we can at least see the options available:
So I'd say it's all wired up right, which is half the fun. I set it to
Trying to create the pool and dataset with
And the kernel has a nice complaint:
That's all I have time for tonight. This is a good start! |
Oh the other thing I forgot to add, after loading the module it says |
ae64131
to
c0a1e8e
Compare
module/icp/algs/modes/gcm.c
Outdated
} else { | ||
/* | ||
* Handle the "cycle" implementation by creating avx and | ||
* non-avx contexts alternately. | ||
*/ | ||
gcm_ctx->gcm_use_avx = gcm_toggle_avx(); | ||
gcm_ctx->gcm_use_avx2 = gcm_toggle_avx2(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure the cycle behaviour here isn't correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed in #17061.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So does this PR have a dependency on #17061?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cycle will still work, it just won't toggle the same way (or the assumed way) - is that behaviour documented somewhere? Under what circumstances would people be using cycle?
When this code runs on an AVX2 processor, cycle will only toggle between AVX2 and generic, ignoring AVX - I don't know if that is a deal breaker enough to consider it a "dependency"
AVX only processors should still cycle the same way.
(at least that's the intention - might have to think a bit deeper about this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tonyhutter I've had another go at the cycle code - since movbe
only applies to AVX I've modified the toggling for that to not apply to AVX2. Furthermore, if bswap is needed neither accelerated method will be usable.
c0a1e8e
to
f0c81fa
Compare
@lowjoel BTW, did you see #17058 (comment) already? |
@AttilaFueloep sounds like merging now suits everyone: you get some breathing room, @lowjoel gets his gold star, and fancy folk with AVX2 machines get a nice bump (maybe in time for 2.3.1!) And sure, once you've got a PR I'd be happy to stick it on the test rig and put it through its paces. |
oops, sorry to hear that.
Thanks, I did, but the comment seemed out of place relative to the other asm comments. If I understand the assembly right, the line where the comment was added was intended to break the loop if we've reached the number of rounds for AES-256, but that seems to disagree with the comment. Maybe I'm misunderstanding something. Or does that affect some other key size? I'm now also wondering if I'm missing testing for some corner case because that change hasn't been ported. And AFAIK BoringSSL uses the same API as OpenSSL |
@lowjoel Again, off the top of my head,
Right, the original code used a zero detection to break the loop, while my change checks for the number of rounds given. Of course this depends on how the assembly is written, it may be totally different in your case. Let me look it up more thoroughly, it's been six years since I wrote this so my memories are a bit hazy. |
@lowjoel Ok, had a quick look: module/icp/asm-x86_64/modes/aesni-gcm-avx2.S 384 movl 240(%rcx),%r10d
385 leal -20(,%r10,4),%r10d
....
473 cmpl $24,%r10d
474 jl .Laes128__func1
475 je .Laes192__func1 So |
I also interpreted it to be number of rounds, 0-based. The instruction
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quick first pass review of the assembly file only.
|
||
#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_X86_64) && defined(__ELF__) | ||
.section .rodata | ||
.align 16 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please always use either .balign
or .p2align
. Depending on the assembler .align
resolves to the former or the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a02c1c1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
.text | ||
.globl gcm_init_vpclmulqdq_avx2 | ||
.hidden gcm_init_vpclmulqdq_avx2 | ||
.type gcm_init_vpclmulqdq_avx2,@function | ||
.align 32 | ||
gcm_init_vpclmulqdq_avx2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.text | |
.globl gcm_init_vpclmulqdq_avx2 | |
.hidden gcm_init_vpclmulqdq_avx2 | |
.type gcm_init_vpclmulqdq_avx2,@function | |
.align 32 | |
gcm_init_vpclmulqdq_avx2: | |
ENTRY_ALIGN(gcm_init_vpclmulqdq_avx2, 32) |
Please always use ENTRY_ALIGN
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a02c1c1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
gcm_init_vpclmulqdq_avx2: | ||
.cfi_startproc | ||
|
||
_CET_ENDBR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_CET_ENDBR | |
ENDBR |
Dito.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a02c1c1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
RET | ||
|
||
.cfi_endproc | ||
.size gcm_init_vpclmulqdq_avx2, . - gcm_init_vpclmulqdq_avx2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.size gcm_init_vpclmulqdq_avx2, . - gcm_init_vpclmulqdq_avx2 | |
SET_SIZE(gcm_init_vpclmulqdq_avx2) | |
Dito.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a02c1c1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Sure, I was just wondering why it's expected to be |
module/icp/algs/modes/gcm.c
Outdated
static inline size_t | ||
gcm_simd_get_htab_size(void) | ||
{ | ||
return (2 * 6 * 2 * sizeof (uint64_t)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change isn't accurate. Each implementation has it's own Htab size and we have to handle it.
On lines 678 - 682 above we are allocating the Htab using this function which will return 192. But we need 16*16=256 bytes (const u128 Htable[16]
). The reason this works is by coincidence. In the BoringSSL code I see
We do not use Htable[12..15].
So 192 bytes are sufficient. Still it seems like bad style to declare u128 Htable[16]
but allocate u128 Htable[12]
.
Not sure if the removed return 0
for the non-SIMD case is needed, will look tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the removed
return 0
for the non-SIMD case is needed, will look tomorrow
Line 675-676:
if (gcm_ctx->gcm_use_avx == B_TRUE) {
size_t htab_len = gcm_simd_get_htab_size(gcm_ctx->gcm_use_avx);
I only ever see gcm_simd_get_htab_size
called with TRUE.
I agree that relying on the fact that BoringSSL only uses the first 12 bytes is a happy coincidence; I'll look at how to rewire tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the removed
return 0
for the non-SIMD case is needed, will look tomorrow
I only ever seegcm_simd_get_htab_size
called with TRUE.
Agreed.
I agree that relying on the fact that BoringSSL only uses the first 12 bytes is a happy coincidence; I'll look at how to rewire tomorrow.
Not a big deal. You could either adjust the prototypes to use u128[12]
or change gcm_simd_get_htab_size
to return 256 for the AVX2 case. I'm fine either way. This is mostly about avoiding readers wondering about the size of Htab.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the risk of nitpicking, I think we should reconsider the module parameter name. Currently it's called avx2
but the main feature it adds is support for the VAES/VCLMUL extension to AES-NI/CLMUL. So I'd suggest something like vaes-avx2
. (Actually it uses AVX2 only due to the fact that Zen 3 has no AVX-512/AVX10.) Thoughts?
I've some more nits but since all of them are in areas I'm refactoring right now it wouldn't make much sense to bring them up here. There's one optimization possibility regarding GCM_AVX_MIN_{DE,EN}CRYPT_BYTES (should both be 16 for VAES). But that's hard to fix and I already solved it locally. So I think it's OK to leave it as is right know.
So modulo Htab size and module parameter name we are good to go I'd say.
Yeah. I think you're right. The commit calls it
Added htab refactor in c49a94c - the method seemed to not fit in the program flow any more hence the slightly more aggressive rewrite. Thanks @AttilaFueloep |
a6b8673
to
3fc6ac8
Compare
@AttilaFueloep renamed to avx2-vaes in 59fb58a; identifiers in code still remains just |
This uses the AVX2 versions of the AESENC and PCLMULQDQ instructions; on Zen 3 this provides an up to 80% performance improvement. Original source: https://github.com/google/boringssl/blob/13840dd094f9e9c1b00a7368aa25e656554221f1/gen/bcm/aes-gcm-avx2-x86_64-linux.S See the original BoringSSL commit at google/boringssl@3b6e1be. Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
- Accept GCM H variable in network endianness (ICP convention) - Fix round count offset in AES_KEY struct (ICP convention) - Use RET macro for kernel code - Explicitly use .balign directive - Use ENTRY_ALIGN and SET_SIZE macros - Use ENDBR macro Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
The BoringSSL AVX2 implementation allocates uint128_t[16] but only uses the first 12 elements. Be explicit in the code. Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Motivation and Context
Zen 3 CPUs support the VAES and VPCLMULDQ instructions which extend the width of each instruction from 128-bits to 256-bits. BoringSSL has recently implemented this version for AES-GCM and it provides up to a 80% speedup. See google/boringssl@3b6e1be.
Description
I've backported the implementation from BoringSSL, adapting code from google/boringssl@3b6e1be (but picking the tip of master), as well as from google/boringssl@62f9751 which changed the primitive signature from 6 arguments to 7 (by not implicitly relying on the address offset of the ghash structure.)
Adaptations for icp (akin to #9749) as well as to use the RET macro for kernel code are in the second commit.
The third and fourth commits combine the use_avx/use_avx2 flags into an enum, allowing toggling of the different implementations that are available. Also, define different values of CAN_USE_GCM_ASM to indicate various levels of compiler support.
How Has This Been Tested?
Compile tested.
I'm now running it on my ZFS-on-root main machine.
@robn has a Wycheproof test set (#17089) that has been merged; these changes pass.
Types of changes
Checklist:
Signed-off-by
.