-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8353741: Eliminate table lookup in UUID.toString #22928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back swen! A progress list of the required criteria for merging this PR into |
|
@wenshao This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 43 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
@wenshao this pull request can not be integrated into git checkout optim_uuid_to_string_202501
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
f2db2ca to
b550366
Compare
|
Under the x64 architecture, performance is significantly improved. However, on some aarch64 platforms, performance regresses.. The performance numbers are as follows: 1. Scriptgit remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"
# current 010ab70c00b
git checkout 010ab70c00b7c0f417127c050654a381b489d052
make test TEST="micro:java.util.UUIDBench.toString"2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 84.620 ± 15.957 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 130.913 ± 0.111 ops/us +54.70%
3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 84.754 ± 0.291 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 94.817 ± 0.231 ops/us +11.87%4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)-Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
-UUIDBench.toString 20000 thrpt 15 70.288 ± 0.147 ops/us
+Benchmark (size) Mode Cnt Score Error Units
+UUIDBench.toString 20000 thrpt 15 92.088 ± 0.137 ops/us +31.01%5. MacBook M1 Pro (aarch64)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 109.001 ? 0.354 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 80.671 ? 0.722 ops/us -25.99%6. orange_pi5_aarch64 (CPU RK3588S)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 37.752 ± 1.430 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 30.940 ± 1.474 ops/us -18.047. orange_aipro_aarch64 (CPU TAISHANV200M)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 13.764 ± 0.262 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 13.310 ± 0.175 ops/us -3.29% |
Note: Using Long.reverseBytes + Long.expand is faster on x64 and ARMv9. |
|
By stepping through the code of This looks like it might actually do better than Method 2. If inlining and constant folding is happening in the non-intrinsic |
|
The new implementation improves performance on the aarch64 architecture but results in a performance regression on x64. 1. Scriptgit remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"
# current c513087056b
git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
make test TEST="micro:java.util.UUIDBench.toString"2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 94.274 ± 0.452 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 80.241 ± 0.894 ops/us -14.88%
3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 85.323 ± 2.044 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 73.636 ± 0.590 ops/us -13.69%4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 69.286 ± 1.136 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 80.475 ± 0.310 ops/us +16.14%
5. MacBook M1 Pro (aarch64)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 108.254 ? 1.167 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 122.313 ? 0.820 ops/us +12.98%
6. orange_pi5_aarch64 (CPU RK3588S)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 37.783 ± 1.553 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 42.928 ± 2.534 ops/us +13.61%
7. orange_aipro_aarch64 (CPU TAISHANV200M)-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 13.822 ± 0.203 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 18.946 ± 0.156 ops/us +37.07% |
|
The non-intrinsified java code should be able to run as quickly as the hand-inlined one. I think I've found an issue that prevents the code from being constant-folded as expected. C2 seems to not do constant-folding of xor nodes. See #23089 for an attempt at addressing this. |
|
Does this expandNibbles compile to any xor nodes in the GVN? Note that aarch64 has an intrinsic for expand that goes through the vector processing unit. Don't know if that intrinsic or the constant folding takes priority; might need further tweaks to hotspot if that intrinsic comes before the constant folded result and slows things down. |
Webrevs
|
|
https://github.com/wenshao/jdk/tree/optim_uuid_to_string_202501_x2 I also made a version that does not use ByteArrayLittleEndian, but the performance drops by 5% to 10%. By printing the igv file through PrintIdealGraph, I found that the RangeCheck of array access is not eliminated. |
|
Have you considered removing |
I tested it on a MacBook M1 Max, and the performance of using reverseBytes + ByteArrayLittleEndian is about 1% faster. Considering that most of us have little-endian machines, ByteArray.setInt/setLong also needs to do reverseBytes. Doing reverseBytes in hex8 requires 4 Long.reverseBytes in a UUID.toString call. If using ByteArray, on a little-endian machine, a UUID.toString call requires 2 Long.reverseBytes and 4 Integer.reverseBytes. |
|
@wenshao This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the performance comparisons (percentage improvement.)
| * Extract the least significant 4 bytes from the input integer i, convert each byte into its corresponding 2-digit | ||
| * hexadecimal representation, concatenate these hexadecimal strings into one continuous string, and then interpret | ||
| * this string as a hexadecimal number to form and return a long value. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reference describing this technique? Such as the Hacker's Delight citation used in java.lang.Long?
If not a longer descriptive comment would be good for maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This algorithm was researched and implemented by me, and I have added detailed description and comments
|
The latest performance test numbers show that under AMD X64, the performance is improved by about 21%, and under MacBook M1 Pro, the performance is improved by about 7%. performance test scriptsgit remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# origin
git checkout bd99525633e4d3d3f180a6678eedb8780dbb6139
make test TEST="micro:java.util.UUIDBench.toString"
# current
git checkout 878a3cb7c73b777eb5385b3e4d158c998cd9be46
make test TEST="micro:java.util.UUIDBench.toString"aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)-# origin bd99525633e4d3d3f180a6678eedb8780dbb6139
-Benchmark (size) Mode Cnt Score Error Units
-UUIDBench.toString 20000 thrpt 15 93.320 ± 0.359 ops/us
+# current 878a3cb7c73b777eb5385b3e4d158c998cd9be46
+Benchmark (size) Mode Cnt Score Error Units
+UUIDBench.toString 20000 thrpt 15 113.508 ± 1.372 ops/us +21.63%Apple MackBook M1 Pro-# origin bd99525633e4d3d3f180a6678eedb8780dbb6139
-Benchmark (size) Mode Cnt Score Error Units
-UUIDBench.toString 20000 thrpt 15 106.199 ± 0.568 ops/us
+# current 878a3cb7c73b777eb5385b3e4d158c998cd9be46
+Benchmark (size) Mode Cnt Score Error Units
+UUIDBench.toString 20000 thrpt 15 113.911 ± 0.628 ops/us +7.34% |
| * Input: 0x0123456789ABCDEF | ||
| * Output: 0x3031323334353637 ('0','1','2','3','4','5','6','7' in ASCII) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the low 32 bits are used, the example only needs to show the input ox 0x0123457.
Or change the signature to accept an int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the input of Long.expand is long, if the parameter is int, it will cause a redundant conversion process of long -> int -> long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is just the example input and output should be something like:
Input: 0xABCDEF01
Output: 0x3130666564636261
to demonstrate that it is the low-order bits that are used, as well as that the digit order is reversed in the result.
| * @see Long#reverseBytes(long) | ||
| */ | ||
| private static long hex8(long i) { | ||
| // Expand each 4-bit group into 8 bits, spreading them out in the long value: 0xAABBCCDD -> 0x0A0B0C0D0A0B0C0D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be 0xAABBCCDD -> 0xA0A0B0B0C0C0D0D
| * This method processes multiple digits in parallel by treating a long value as eight 8-bit lanes, | ||
| * achieving significantly better performance compared to traditional loop-based conversion. | ||
| * | ||
| * <p>The conversion algorithm works as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the in-line comment describing the algorithm is enough - doesn't need the details in the javadoc as well.
| HexDigits.put4(buf, 24, i2); | ||
| HexDigits.put4(buf, 28, i3 >> 16); | ||
| HexDigits.put4(buf, 32, i3); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest a comment here along the lines of "Although the UUID byte ordering is defined to be big-endian, ByteArrayLittleEndian is used here to optimize for the most common architectures. hex8 reverses the order internally."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for the detailed comments and vectorized hex to string conversion.
|
/integrate |
|
Going to push as commit 796ec5e.
Your commit was automatically rebased without conflicts. |

Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD within a register) instead of table lookup. Eliminating the table lookup can also avoid the performance degradation problem when the cache misses.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22928/head:pull/22928$ git checkout pull/22928Update a local copy of the PR:
$ git checkout pull/22928$ git pull https://git.openjdk.org/jdk.git pull/22928/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 22928View PR using the GUI difftool:
$ git pr show -t 22928Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22928.diff
Using Webrev
Link to Webrev Comment