Skip to content

Conversation

@wenshao
Copy link
Contributor

@wenshao wenshao commented Jan 6, 2025

Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD within a register) instead of table lookup. Eliminating the table lookup can also avoid the performance degradation problem when the cache misses.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8353741: Eliminate table lookup in UUID.toString (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22928/head:pull/22928
$ git checkout pull/22928

Update a local copy of the PR:
$ git checkout pull/22928
$ git pull https://git.openjdk.org/jdk.git pull/22928/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 22928

View PR using the GUI difftool:
$ git pr show -t 22928

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22928.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jan 6, 2025

👋 Welcome back swen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jan 6, 2025

@wenshao This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8353741: Eliminate table lookup in UUID.toString

Reviewed-by: rriggs

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 43 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented Jan 6, 2025

@wenshao this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout optim_uuid_to_string_202501
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Jan 6, 2025
@openjdk
Copy link

openjdk bot commented Jan 6, 2025

@wenshao The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Jan 6, 2025
@wenshao wenshao marked this pull request as draft January 6, 2025 14:06
@wenshao wenshao force-pushed the optim_uuid_to_string_202501 branch from f2db2ca to b550366 Compare January 6, 2025 14:28
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Jan 6, 2025
@wenshao
Copy link
Contributor Author

wenshao commented Jan 6, 2025

Under the x64 architecture, performance is significantly improved. However, on some aarch64 platforms, performance regresses.. The performance numbers are as follows:

1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"

 # current 010ab70c00b
git checkout 010ab70c00b7c0f417127c050654a381b489d052
make test TEST="micro:java.util.UUIDBench.toString"

2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  84.620 ± 15.957  ops/us

+Benchmark           (size)   Mode  Cnt    Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  130.913 ± 0.111  ops/us +54.70%

3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)

-Benchmark           (size)   Mode  Cnt   Score    Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  84.754 ± 0.291  ops/us

+Benchmark           (size)   Mode  Cnt    Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  94.817 ± 0.231  ops/us +11.87%

4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
-UUIDBench.toString   20000  thrpt   15  70.288 ± 0.147  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units
+UUIDBench.toString   20000  thrpt   15  92.088 ± 0.137  ops/us +31.01%

5. MacBook M1 Pro (aarch64)

-Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  109.001 ? 0.354  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  80.671 ? 0.722  ops/us -25.99%

6. orange_pi5_aarch64 (CPU RK3588S)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  37.752 ± 1.430  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  30.940 ± 1.474  ops/us -18.04

7. orange_aipro_aarch64 (CPU TAISHANV200M)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  13.764 ± 0.262  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  13.310 ± 0.175  ops/us -3.29%

@wenshao wenshao marked this pull request as ready for review January 6, 2025 22:51
@wenshao
Copy link
Contributor Author

wenshao commented Jan 8, 2025

// Method 1:
i = Long.reverseBytes(Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL));

// Method 2:
i = ((i & 0xF0000000L) >> 28)
  | ((i & 0xF000000L) >> 16)
  | ((i & 0xF00000L) >> 4)
  | ((i & 0xF0000L) << 8)
  | ((i & 0xF000L) << 20)
  | ((i & 0xF00L) << 32)
  | ((i & 0xF0L) << 44)
  | ((i & 0xFL) << 56);

Note: Using Long.reverseBytes + Long.expand is faster on x64 and ARMv9.
However, on AArch64 with ARMv8, it will be slower compared to the manual unrolling shown in Method 2.
ARMv8 includes Apple M1/M2, AWS Graviton 3; ARMv9.0 includes Apple M3/M4, Aliyun Yitian 710.

@j3graham
Copy link
Contributor

By stepping through the code of Long.expand, and substituting in the constants, I come up with this:

   static long expandNibbles(long i){
        // Inlined version of Long.expand(i,0x0F0F_0F0F_0F0F_0F0FL)
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        
        return i & 0x0F0F_0F0F_0F0F_0F0FL;
    }

This looks like it might actually do better than Method 2. If inlining and constant folding is happening in the non-intrinsic Long.expand I would imagine it would perform comparably to this.

@wenshao
Copy link
Contributor Author

wenshao commented Jan 11, 2025

The new implementation improves performance on the aarch64 architecture but results in a performance regression on x64.

1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"

 # current c513087056b
git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
make test TEST="micro:java.util.UUIDBench.toString"

2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  94.274 ± 0.452  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  80.241 ± 0.894  ops/us -14.88%

3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  85.323 ± 2.044  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  73.636 ± 0.590  ops/us -13.69%

4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  69.286 ± 1.136  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  80.475 ± 0.310  ops/us +16.14%

5. MacBook M1 Pro (aarch64)

-Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  108.254 ? 1.167  ops/us

+Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  122.313 ? 0.820  ops/us +12.98%

6. orange_pi5_aarch64 (CPU RK3588S)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  37.783 ± 1.553  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  42.928 ± 2.534  ops/us +13.61%

7. orange_aipro_aarch64 (CPU TAISHANV200M)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  13.822 ± 0.203  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  18.946 ± 0.156  ops/us +37.07%

@j3graham
Copy link
Contributor

j3graham commented Jan 13, 2025

The non-intrinsified java code should be able to run as quickly as the hand-inlined one.

I think I've found an issue that prevents the code from being constant-folded as expected. C2 seems to not do constant-folding of xor nodes.

See #23089 for an attempt at addressing this.

@liach
Copy link
Member

liach commented Jan 14, 2025

Does this expandNibbles compile to any xor nodes in the GVN?

Note that aarch64 has an intrinsic for expand that goes through the vector processing unit. Don't know if that intrinsic or the constant folding takes priority; might need further tweaks to hotspot if that intrinsic comes before the constant folded result and slows things down.

@j3graham
Copy link
Contributor

There are no XOR nodes in expandNibbles
image

@j3graham
Copy link
Contributor

With regard to the aarch64 vector instrinsic, I don't have access to an aarch64 to try it on (I'm faking it x64 by disabling the intrinsic). @wenshao would it be possible for you to try the Long.expand version of this patch with the patch from #23089 to see how aarch64 performs?

@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 4, 2025
@mlbridge
Copy link

mlbridge bot commented Apr 4, 2025

@wenshao
Copy link
Contributor Author

wenshao commented Apr 8, 2025

https://github.com/wenshao/jdk/tree/optim_uuid_to_string_202501_x2

I also made a version that does not use ByteArrayLittleEndian, but the performance drops by 5% to 10%. By printing the igv file through PrintIdealGraph, I found that the RangeCheck of array access is not eliminated.

pr_22928_igv.zip

@j3graham
Copy link
Contributor

j3graham commented Apr 8, 2025

Have you considered removing reverseBytes and using ByteArray instead of ByteArrayLittleEndian?

@wenshao
Copy link
Contributor Author

wenshao commented Apr 10, 2025

Have you considered removing reverseBytes and using ByteArray instead of ByteArrayLittleEndian?

I tested it on a MacBook M1 Max, and the performance of using reverseBytes + ByteArrayLittleEndian is about 1% faster. Considering that most of us have little-endian machines, ByteArray.setInt/setLong also needs to do reverseBytes.

Doing reverseBytes in hex8 requires 4 Long.reverseBytes in a UUID.toString call. If using ByteArray, on a little-endian machine, a UUID.toString call requires 2 Long.reverseBytes and 4 Integer.reverseBytes.

@wenshao wenshao changed the title 8353741: Improve UUID.toString performance by using SIMD within a register instead of table lookup 8353741: Eliminate table lookup in UUID.toString May 2, 2025
@bridgekeeper
Copy link

bridgekeeper bot commented May 8, 2025

@wenshao This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Copy link
Contributor

@RogerRiggs RogerRiggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the performance comparisons (percentage improvement.)

* Extract the least significant 4 bytes from the input integer i, convert each byte into its corresponding 2-digit
* hexadecimal representation, concatenate these hexadecimal strings into one continuous string, and then interpret
* this string as a hexadecimal number to form and return a long value.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reference describing this technique? Such as the Hacker's Delight citation used in java.lang.Long?
If not a longer descriptive comment would be good for maintainability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This algorithm was researched and implemented by me, and I have added detailed description and comments

@wenshao
Copy link
Contributor Author

wenshao commented May 19, 2025

The latest performance test numbers show that under AMD X64, the performance is improved by about 21%, and under MacBook M1 Pro, the performance is improved by about 7%.

performance test scripts

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# origin
git checkout bd99525633e4d3d3f180a6678eedb8780dbb6139
make test TEST="micro:java.util.UUIDBench.toString"

# current
git checkout 878a3cb7c73b777eb5385b3e4d158c998cd9be46
make test TEST="micro:java.util.UUIDBench.toString"

aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

-# origin  bd99525633e4d3d3f180a6678eedb8780dbb6139
-Benchmark           (size)   Mode  Cnt   Score   Error   Units
-UUIDBench.toString   20000  thrpt   15  93.320 ± 0.359  ops/us

+# current 878a3cb7c73b777eb5385b3e4d158c998cd9be46
+Benchmark           (size)   Mode  Cnt    Score   Error   Units
+UUIDBench.toString   20000  thrpt   15  113.508 ± 1.372  ops/us +21.63%

Apple MackBook M1 Pro

-# origin  bd99525633e4d3d3f180a6678eedb8780dbb6139
-Benchmark           (size)   Mode  Cnt    Score   Error   Units
-UUIDBench.toString   20000  thrpt   15  106.199 ± 0.568  ops/us

+# current 878a3cb7c73b777eb5385b3e4d158c998cd9be46
+Benchmark           (size)   Mode  Cnt    Score   Error   Units
+UUIDBench.toString   20000  thrpt   15  113.911 ± 0.628  ops/us +7.34%

Comment on lines 527 to 528
* Input: 0x0123456789ABCDEF
* Output: 0x3031323334353637 ('0','1','2','3','4','5','6','7' in ASCII)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the low 32 bits are used, the example only needs to show the input ox 0x0123457.
Or change the signature to accept an int.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the input of Long.expand is long, if the parameter is int, it will cause a redundant conversion process of long -> int -> long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is just the example input and output should be something like:

Input: 0xABCDEF01
Output: 0x3130666564636261

to demonstrate that it is the low-order bits that are used, as well as that the digit order is reversed in the result.

* @see Long#reverseBytes(long)
*/
private static long hex8(long i) {
// Expand each 4-bit group into 8 bits, spreading them out in the long value: 0xAABBCCDD -> 0x0A0B0C0D0A0B0C0D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be 0xAABBCCDD -> 0xA0A0B0B0C0C0D0D

* This method processes multiple digits in parallel by treating a long value as eight 8-bit lanes,
* achieving significantly better performance compared to traditional loop-based conversion.
*
* <p>The conversion algorithm works as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the in-line comment describing the algorithm is enough - doesn't need the details in the javadoc as well.

HexDigits.put4(buf, 24, i2);
HexDigits.put4(buf, 28, i3 >> 16);
HexDigits.put4(buf, 32, i3);

Copy link
Contributor

@j3graham j3graham May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a comment here along the lines of "Although the UUID byte ordering is defined to be big-endian, ByteArrayLittleEndian is used here to optimize for the most common architectures. hex8 reverses the order internally."

Copy link
Contributor

@RogerRiggs RogerRiggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for the detailed comments and vectorized hex to string conversion.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 21, 2025
@wenshao
Copy link
Contributor Author

wenshao commented May 22, 2025

/integrate

@openjdk
Copy link

openjdk bot commented May 22, 2025

Going to push as commit 796ec5e.
Since your change was applied there have been 71 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 22, 2025
@openjdk openjdk bot closed this May 22, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 22, 2025
@openjdk
Copy link

openjdk bot commented May 22, 2025

@wenshao Pushed as commit 796ec5e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core-libs core-libs-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

4 participants