-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8371450: AES performance improvements for key schedule generation #28188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
👋 Welcome back smemery! A progress list of the required criteria for merging this PR into |
|
@smemery This change is no longer ready for integration - check the PR body for details. |
Webrevs
|
| | ((SBOX[(b1 & 0xF0) >> 4][b1 & 0x0F] & 0xFF) << 16) | ||
| | ((SBOX[(b2 & 0xF0) >> 4][b2 & 0x0F] & 0xFF) << 8) | ||
| | (SBOX[(b3 & 0xF0) >> 4][b3 & 0x0F] & 0xFF); | ||
| | (SBOX[(word & 0xF0) >> 4][word & 0x0F] & 0xFF); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is no need for any of these bytes. Every index can be computed as "(word >> offset) & 0x0F". Actually, if you define SBOX as a 1-dim array, you can index into it with "(word >> offset) & 0xFF".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review. The byte assignments were to avoid three redundant shift operations.
| int widx = i * WB; | ||
| int idx = kLen - widx; | ||
|
|
||
| dw[idx] = TMI0[w[widx] >>> 24] ^ TMI1[(w[widx] >> 16) & 0xFF] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think there would be any benefit to putting w[widx] through w[widx+3] on local int variables? In some cases I've seen where that increases register pressure and can lead to some perf benefits. I'm not sure if this is one of those cases but it seems like you'd only need to reference memory once instead of 4 times per assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe my original changes here utilize a "MergeStore" technique that the compiler optimizes. I've asked @minborg to see if I got this right. To verify the optimization here, I used the separate local int variable technique and saw a 0.7% decrease in benchmark performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, your experiment answers my question. Unless @minborg says otherwise I'm fine with what you have. You're already seeing improvements so that's great.
This fix improves performance in the AES key schedule generation by eliminating an unnecessary object and unnecessary mask in the inverse key schedule.
The micro:org.openjdk.bench.javax.crypto.AESReinit benchmark results are improved by 6.96% for arm64 and 7.79% for x86_64.
Thank you @jnimeh for catching the unnecessary byte mask!
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28188/head:pull/28188$ git checkout pull/28188Update a local copy of the PR:
$ git checkout pull/28188$ git pull https://git.openjdk.org/jdk.git pull/28188/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28188View PR using the GUI difftool:
$ git pr show -t 28188Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28188.diff
Using Webrev
Link to Webrev Comment