-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8333893: Optimization for StringBuilder append boolean & null #19626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into |
@wenshao This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 12 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
1. Compare with the master branch
2. Benchmark Commandsmake test TEST="micro:java.lang.StringBuilders.toStringCharWithBool8"
make test TEST="micro:java.lang.StringBuilders.toStringCharWithNull8" 3. Benchmark NumbersThe performance numbers under MacBookPro M1 Pro are as follows: -Benchmark Mode Cnt Score Error Units #master (a6fc2f8)
-StringBuilders.toStringCharWithBool8Latin1 avgt 15 7.371 ? 0.003 ns/op
-StringBuilders.toStringCharWithBool8Utf16 avgt 15 9.613 ? 0.018 ns/op
-StringBuilders.toStringCharWithNull8Latin1 avgt 15 7.071 ? 0.003 ns/op
-StringBuilders.toStringCharWithNull8Utf16 avgt 15 9.296 ? 0.016 ns/op
+Benchmark Mode Cnt Score Error Units #current (5e815)
+StringBuilders.toStringCharWithBool8Latin1 avgt 15 6.515 ? 0.121 ns/op +11.61%
+StringBuilders.toStringCharWithBool8Utf16 avgt 15 8.654 ? 0.035 ns/op +9.97%
+StringBuilders.toStringCharWithNull8Latin1 avgt 15 5.550 ? 0.010 ns/op +21.51%
+StringBuilders.toStringCharWithNull8Utf16 avgt 15 8.108 ? 0.041 ns/op +12.77% |
1. Compare with unsafe branch
I think the performance of the Unsafe branch may be the best data for the C2 optimizer. @eme64 can help me see if C2 can do it? 2. Benchmark Commandsmake test TEST="micro:java.lang.StringBuilders.toStringCharWithBool8"
make test TEST="micro:java.lang.StringBuilders.toStringCharWithNull8" 3. Implementation of Unsafe Branchclass AbstractStringBuilder {
static final Unsafe UNSAFE = Unsafe.getUnsafe();
static final int NULL_LATIN1;
static final int TRUE_LATIN1;
static final int FALS_LATIN1;
static final long NULL_UTF16;
static final long TRUE_UTF16;
static final long FALS_UTF16;
static {
byte[] bytes4 = new byte[4];
byte[] bytes8 = new byte[8];
bytes4[0] = 'n';
bytes4[1] = 'u';
bytes4[2] = 'l';
bytes4[3] = 'l';
NULL_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
NULL_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);
bytes4[0] = 't';
bytes4[1] = 'r';
bytes4[2] = 'u';
bytes4[3] = 'e';
TRUE_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
TRUE_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);
bytes4[0] = 'f';
bytes4[1] = 'a';
bytes4[2] = 'l';
bytes4[3] = 's';
FALS_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
FALS_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);
}
private AbstractStringBuilder appendNull() {
ensureCapacityInternal(count + 4);
int count = this.count;
byte[] val = this.value;
if (isLatin1()) {
UNSAFE.putInt(
val,
Unsafe.ARRAY_BYTE_BASE_OFFSET + count,
NULL_LATIN1);
} else {
UNSAFE.putLong(
val,
Unsafe.ARRAY_BYTE_BASE_OFFSET + (count << 1),
NULL_UTF16);
}
this.count = count + 4;
return this;
}
public AbstractStringBuilder append(boolean b) {
int count = this.count;
int spaceNeeded = count + (b ? 4 : 5);
ensureCapacityInternal(spaceNeeded);
byte[] val = this.value;
if (isLatin1()) {
UNSAFE.putInt(
val,
Unsafe.ARRAY_BYTE_BASE_OFFSET + count,
b ? TRUE_LATIN1 : FALS_LATIN1);
if (!b) {
val[count + 4] = 'e';
}
} else {
UNSAFE.putLong(
val,
Unsafe.ARRAY_BYTE_BASE_OFFSET + (count << 1),
b ? TRUE_UTF16 : FALS_UTF16);
if (!b) {
StringUTF16.putChar(val, count + 4, 'e');
}
}
this.count = spaceNeeded;
return this;
}
} 4. Benchmark NumbersThe performance numbers under MacBookPro M1 Pro are as follows: -Benchmark Mode Cnt Score Error Units # unsafe (adc220)
-StringBuilders.toStringCharWithBool8Latin1 avgt 15 6.415 ? 0.061 ns/op
-StringBuilders.toStringCharWithBool8Utf16 avgt 15 7.307 ? 0.013 ns/op
-StringBuilders.toStringCharWithNull8Latin1 avgt 15 5.443 ? 0.011 ns/op
-StringBuilders.toStringCharWithNull8Utf16 avgt 15 6.944 ? 0.102 ns/op
+Benchmark Mode Cnt Score Error Units #current (5e815)
+StringBuilders.toStringCharWithBool8Latin1 avgt 15 6.515 ? 0.121 ns/op -1.55%
+StringBuilders.toStringCharWithBool8Utf16 avgt 15 8.654 ? 0.035 ns/op -18.44%
+StringBuilders.toStringCharWithNull8Latin1 avgt 15 5.550 ? 0.010 ns/op -1.96%
+StringBuilders.toStringCharWithNull8Utf16 avgt 15 8.108 ? 0.041 ns/op -16.76% |
Webrevs
|
I think for that C2 JIT to work, we need to merge the |
Have you tried to see if the optimization actually was done/taken? You can use the http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html @liach I don't think it makes a difference if it is |
You are right, I was thinking about the case where you have 2 short variables, you should combine them into a long explicitly for C2 to generate a 4-byte write, otherwise it would be 2 2-bytes. Omitted the constant part which already eliminates this restriction. |
@eme64 It seems that when the following code uses StringUTF16.putChar, C2's optimization is not as good as the manual merging and storage effect. class AbstractStringBuilder {
private AbstractStringBuilder appendNull() {
// ...
StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l');
// ...
}
public AbstractStringBuilder append(boolean b) {
// ...
StringUTF16.putCharsAt(val, count, 't', 'r', 'u', 'e');
// ...
StringUTF16.putCharsAt(val, count, 'f', 'a', 'l', 's', 'e');
// ...
}
}
class StringUTF16 {
public static void putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
putChar(value, i , c1);
putChar(value, i + 1, c2);
putChar(value, i + 2, c3);
putChar(value, i + 3, c4);
}
@IntrinsicCandidate
// intrinsic performs no bounds checks
static void putChar(byte[] val, int index, int c) {
assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
index <<= 1;
val[index++] = (byte)(c >> HI_BYTE_SHIFT);
val[index] = (byte)(c >> LO_BYTE_SHIFT);
}
} The code for manually merging storage is as follows, without using Unsafe: class AbstractStringBuilder {
static final long NULL_UTF16;
static final long TRUE_UTF16;
static final long FALS_UTF16;
static {
byte[] bytes = new byte[8];
StringUTF16.putCharsAt(bytes, 0, 'n', 'u', 'l', 'l');
NULL_UTF16 = getLong(bytes, 0);
StringUTF16.putCharsAt(bytes, 0, 't', 'r', 'u', 'e');
TRUE_UTF16 = getLong(bytes, 0);
StringUTF16.putCharsAt(bytes, 0, 'f', 'a', 'l', 's');
FALS_UTF16 = getLong(bytes, 0);
}
private static long getLong(byte[] bytes, int offset) {
return (((long)bytes[offset ] & 0xff) ) |
(((long)bytes[offset + 1] & 0xff) << 8) |
(((long)bytes[offset + 2] & 0xff) << 16) |
(((long)bytes[offset + 3] & 0xff) << 24) |
(((long)bytes[offset + 4] & 0xff) << 32) |
(((long)bytes[offset + 5] & 0xff) << 40) |
(((long)bytes[offset + 6] & 0xff) << 48) |
(((long)bytes[offset + 7] & 0xff) << 56);
}
private static void setLong(byte[] array, int offset, long value) {
array[offset] = (byte) value;
array[offset + 1] = (byte) (value >> 8);
array[offset + 2] = (byte) (value >> 16);
array[offset + 3] = (byte) (value >> 24);
array[offset + 4] = (byte) (value >> 32);
array[offset + 5] = (byte) (value >> 40);
array[offset + 6] = (byte) (value >> 48);
array[offset + 7] = (byte) (value >> 56);
}
private AbstractStringBuilder appendNull() {
int count = this.count;
ensureCapacityInternal(count + 4);
byte[] val = this.value;
if (isLatin1()) {
val[count ] = 'n';
val[count + 1] = 'u';
val[count + 2] = 'l';
val[count + 3] = 'l';
} else {
setLong(val, count, NULL_UTF16);
}
this.count = count + 4;
return this;
}
public AbstractStringBuilder append(boolean b) {
int count = this.count;
int spaceNeeded = count + (b ? 4 : 5);
ensureCapacityInternal(spaceNeeded);
byte[] val = this.value;
if (isLatin1()) {
if (b) {
val[count ] = 't';
val[count + 1] = 'r';
val[count + 2] = 'u';
val[count + 3] = 'e';
} else {
val[count ] = 'f';
val[count + 1] = 'a';
val[count + 2] = 'l';
val[count + 3] = 's';
val[count + 4] = 'e';
}
} else {
setLong(val, count, b ? TRUE_UTF16 : FALS_UTF16);
if (!b) {
StringUTF16.putChar(val, count + 4, 'e');
}
}
this.count = spaceNeeded;
return this;
}
} The getLong/setLong methods here can be optimized and merged by C2. Maybe we need a public class that does not use Unsafe to implement these methods. They are needed in many places. Maybe it is appropriate to improve them based on ByteArray/ByteArrayLittleEndian. |
From a compiler point of view, it might be better to enhance the 8318446 optimization so that it recognizes code patterns that use increment. |
Whether increments are handled or not, I think the |
As I asked above, you will need to provide some evidence / generated assembly / perf data, and logs from And please also try @cl4es advide here: And sure, maybe you need some public API for setting multiple bytes at once, which the |
Edited to remove the extra long PrintAssembly information to make the page easier to read |
@wenshao This is just an assembly dump. You need to have some profiling data that tells you where the time is spent. I'm not going to do the analysis work for you, I'm sorry. I gave you some pointers as how to do that. If you have more questions about how to do that, feel free to ask. You also have not provided the Can you investigate WHY there is a performance difference? Which |
@eme64 How to TraceMergeStores? |
@wenshao Have you grepped it in the code base? |
Also: the assembly that you pasted: does it have the original stores, or do you see that the stores were actually merged? |
@wenshao have you published info about |
@eme64 TraceMergeStores are here, but I can't understand these assembly codes JavaCodeclass AbstractStringBuilder {
private AbstractStringBuilder appendNull() {
int count = this.count;
ensureCapacityInternal(count + 4);
byte[] val = this.value;
if (isLatin1()) {
val[count ] = 'n';
val[count + 1] = 'u';
val[count + 2] = 'l';
val[count + 3] = 'l';
} else {
StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l');
}
this.count = count + 4;
return this;
}
}
class StringUTF16 {
public static void putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
putChar(value, i , c1);
putChar(value, i + 1, c2);
putChar(value, i + 2, c3);
putChar(value, i + 3, c4);
}
} TraceMergeStores
|
It'd be interesting to check performance on this micro with #19970 alone |
I ran performance tests on MaxBook M1 (aarch64) and aliyun c8a (AMD CPU x64). There was no significant performance difference between pr #19970 and master, but pr #19626 combined with #19970 significantly improved performance. Scriptgit remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# master
git checkout 85582d7a88bd5f79f5991ce22bc3bc75e514aec9
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"
# pr 19970
git checkout 3b89956957085e134a05c05876f40b85d949227e
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"
# pr 19626 + 19970
git checkout 58dae7888eceb1c61243f658b67c208e6c30f7f2
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"
MacBook M1 Max Performance Numbers# master
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 6.950 ? 0.027 ns/op
# pr 19970
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 6.945 ? 0.008 ns/op
# pr 19626 + 19970
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 6.441 ? 0.059 ns/op AMD x64 Performance Numbers# master
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 17.522 ± 8.113 ns/op
# pr 19970
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 17.487 ± 8.127 ns/op
# pr 19626 + 19970
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 5.983 ± 0.113 ns/op |
Hmm, I would have hoped for Perhaps we'd need to minimally change from
(I think it would be good to explore if we can to trigger the optimization without resorting to |
Simplifying the implementation of appendNull can improve performance, but it is still not as good as using Unsafe.putByte. git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# master
git checkout 85582d7a88bd5f79f5991ce22bc3bc75e514aec9
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"
# pr 19970
git checkout 3b89956957085e134a05c05876f40b85d949227e
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"
# pr 19970 + simplify append null
git checkout a43be33a6cc67ac72058d1819ee3008fb6f76211
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"
# pr 19626 + 19970
git checkout 58dae7888eceb1c61243f658b67c208e6c30f7f2
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"
MacBook M1 Max Performance Numbers# master
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 6.950 ? 0.027 ns/op
# pr 19970
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 6.945 ? 0.008 ns/op
# pr 19970 + simplify append null
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 6.766 ? 0.012 ns/op
# pr 19626 + 19970
Benchmark Mode Cnt Score Error Units
StringBuilders.appendWithNull8Latin1 avgt 15 6.441 ? 0.059 ns/op |
There is no array out-of-bounds check when using Unsafe. In the appendNull method, there is already a call to ensureCapacityInternal. It is also safe to use Unsafe in this scenario. |
@wenshao @cl4es I've investigated a little with the benchmark. RangeCheck smearing could be the issue here. It turns out that RangeChecks are smeared (a kind of elimination in straight-line code) in the same phase as the MergeStores happens ( Of course if you do it all with Maybe I can somehow delay the MergeStores a little more. But I have nice solution in mind - only hacks so far 😅 . And maybe the analysis is not complete and there are other factors. I have lots of other tasks now, so I might have to put this a little off again. @wenshao what would be nice is if you could add some IR tests for these kinds of examples. Add IR rules that currently pass, and add some comments what we would prefer for better performance. That would help me immensely once I can get back to it. You could even just integrate the IR tests now already, so that I can then adjust them when I improve the optimization further. That way you would get credit for adding the tests. You could add them to I'm putting my Test2.java here for reference.
At times we first merge 2 StoreB -> 1 StoreC. And then in a second step 2 StoreC -> StoreI. See below the snippets.
|
Not sure why the RangeCheck smearing issue only appreas here now. Maybe because the graph is bigger? Maybe because of other things in the graph that change the order of things? IDK yet. |
Thanks for checking @eme64 - up to you if and when you want to try and tackle such an improvement, if it can be done in a clean way. And as you say there might be other factors at play here. Perhaps things are confounded by how the code is structured before and after; this PR outlines the array stores to a separate method which might affect loop optimization passes if that method is first compiled separately then inlined. While I don't think the point fix in this PR is all that urgent this has been lingering for a while and I'm sure @wenshao wants to integrate and move on. We can always integrate then back out the |
I'm not the library guy, so I'm not going to block this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the evaluations. I think we can use this unsafe version for now, and use direct writes once JIT can reliably eliminate bound checks.
This PR requires PR #19970 to support Unsafe.putByte MergeStore for better performance. I will add more test scenarios to TestMergeStores.java or MergeStoreBench later. |
FYI #19970 is now integrated - thanks for the patience :) |
It has been tested that mergeStore can work after the master branch is merged |
/integrate |
Going to push as commit 5890d94.
Your commit was automatically rebased without conflicts. |
I have concerns with the changes here, because of what it can cause when an (incorrect) program is sharing a StringBuilder between multiple threads. Even without any reordering, a thread could start an The previous code would have thrown an exception at an explicit or implicit bounds check, but with the changes here that'd no longer happen. For example
|
AbstractStringBuilder's value/coder/count are not volatile. If you use StringBuilder, you may not get the correct result. If you want thread safety, you should use StringBuffer. |
Sure, I already mentioned such program would be incorrect, but one thing is an incorrect result in the StringBuilder and another one is writing outside the bounds of the array into other parts (possibly objects) of the heap, or even outside, causing a VM crash. |
This indeed a genuine issue and needs to be addressed. Thank you @altrisi for noticing and raising this. |
After PR #16245, C2 optimizes stores into primitive arrays by combining values into larger stores.
This PR rewrites the code of appendNull and append(boolean) methods so that these two methods can be optimized by C2.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19626/head:pull/19626
$ git checkout pull/19626
Update a local copy of the PR:
$ git checkout pull/19626
$ git pull https://git.openjdk.org/jdk.git pull/19626/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19626
View PR using the GUI difftool:
$ git pr show -t 19626
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19626.diff
Webrev
Link to Webrev Comment