8333893: Optimization for StringBuilder append boolean & null #19626

wenshao · 2024-06-10T12:12:58Z

After PR #16245, C2 optimizes stores into primitive arrays by combining values into larger stores.

This PR rewrites the code of appendNull and append(boolean) methods so that these two methods can be optimized by C2.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8333893: Optimization for StringBuilder append boolean & null (Enhancement - P4)

Reviewers

Chen Liang (@liach - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19626/head:pull/19626
$ git checkout pull/19626

Update a local copy of the PR:
$ git checkout pull/19626
$ git pull https://git.openjdk.org/jdk.git pull/19626/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19626

View PR using the GUI difftool:
$ git pr show -t 19626

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19626.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2024-06-10T12:13:37Z

👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-06-10T12:13:48Z

@wenshao This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8333893: Optimization for StringBuilder append boolean & null

Reviewed-by: liach

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 12 new commits pushed to the master branch:

07f550b: 8340818: Add a new jtreg test root to test the generated documentation
27ef6c9: 8341470: BigDecimal.stripTrailingZeros() optimization
5d5d88a: 8339570: Add Tidy build support for JDK tests
239d84a: 8342578: GHA: RISC-V: Bootstrap using Debian snapshot is still failing
aa060f2: 8342334: CDS: Scratch mirrors should not point to dead klasses
680dc5d: 8342496: C2/Shenandoah: SEGV in compiled code when running jcstress
8f2b23b: 8341407: C2: assert(main_limit == cl->limit() || get_ctrl(main_limit) == new_limit_ctrl) failed: wrong control for added limit
21682bc: 8342612: Increase memory usage of compiler/c2/TestScalarReplacementMaxLiveNodes.java
d61f56a: 8342287: C2 fails with "assert(is_IfTrue()) failed: invalid node class: IfFalse" due to Template Assertion Predicate with two UCTs
76ae072: 8342579: RISC-V: C2: Cleanup effect of killing flag register for call instructs
... and 2 more: https://git.openjdk.org/jdk/compare/8591109419efc8f71544a98bdb04a48cb1afc47e...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2024-06-10T12:14:03Z

@wenshao The following labels will be automatically applied to this pull request:

core-libs
hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

wenshao · 2024-06-10T12:22:35Z

1. Compare with the master branch

master (a6fc2f8) https://github.com/wenshao/jdk/tree/upstream_master_a6fc2f8
current (5e815) https://github.com/wenshao/jdk/tree/optim_str_builder_append_202406

2. Benchmark Commands

make test TEST="micro:java.lang.StringBuilders.toStringCharWithBool8"
make test TEST="micro:java.lang.StringBuilders.toStringCharWithNull8"

3. Benchmark Numbers

The performance numbers under MacBookPro M1 Pro are as follows:

-Benchmark                                   Mode  Cnt  Score   Error  Units #master (a6fc2f8)
-StringBuilders.toStringCharWithBool8Latin1  avgt   15  7.371 ? 0.003  ns/op
-StringBuilders.toStringCharWithBool8Utf16   avgt   15  9.613 ? 0.018  ns/op
-StringBuilders.toStringCharWithNull8Latin1  avgt   15  7.071 ? 0.003  ns/op
-StringBuilders.toStringCharWithNull8Utf16   avgt   15  9.296 ? 0.016  ns/op

+Benchmark                                   Mode  Cnt  Score   Error  Units #current (5e815)
+StringBuilders.toStringCharWithBool8Latin1  avgt   15  6.515 ? 0.121  ns/op +11.61%
+StringBuilders.toStringCharWithBool8Utf16   avgt   15  8.654 ? 0.035  ns/op +9.97%
+StringBuilders.toStringCharWithNull8Latin1  avgt   15  5.550 ? 0.010  ns/op +21.51%
+StringBuilders.toStringCharWithNull8Utf16   avgt   15  8.108 ? 0.041  ns/op +12.77%

wenshao · 2024-06-10T13:04:08Z

1. Compare with unsafe branch

current (5e815) https://github.com/wenshao/jdk/tree/optim_str_builder_append_202406
unsafe (adc220) https://github.com/wenshao/jdk/tree/optim_str_builder_append_202406_unsafe

I think the performance of the Unsafe branch may be the best data for the C2 optimizer. @eme64 can help me see if C2 can do it?

2. Benchmark Commands

make test TEST="micro:java.lang.StringBuilders.toStringCharWithBool8"
make test TEST="micro:java.lang.StringBuilders.toStringCharWithNull8"

3. Implementation of Unsafe Branch

class AbstractStringBuilder {
    static final Unsafe UNSAFE = Unsafe.getUnsafe();

    static final int NULL_LATIN1;
    static final int TRUE_LATIN1;
    static final int FALS_LATIN1;

    static final long NULL_UTF16;
    static final long TRUE_UTF16;
    static final long FALS_UTF16;

    static {
        byte[] bytes4 = new byte[4];
        byte[] bytes8 = new byte[8];

        bytes4[0] = 'n';
        bytes4[1] = 'u';
        bytes4[2] = 'l';
        bytes4[3] = 'l';
        NULL_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
        StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
        NULL_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);

        bytes4[0] = 't';
        bytes4[1] = 'r';
        bytes4[2] = 'u';
        bytes4[3] = 'e';
        TRUE_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
        StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
        TRUE_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);

        bytes4[0] = 'f';
        bytes4[1] = 'a';
        bytes4[2] = 'l';
        bytes4[3] = 's';
        FALS_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
        StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
        FALS_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);
    }

    private AbstractStringBuilder appendNull() {
        ensureCapacityInternal(count + 4);
        int count = this.count;
        byte[] val = this.value;
        if (isLatin1()) {
            UNSAFE.putInt(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + count,
                    NULL_LATIN1);
        } else {
            UNSAFE.putLong(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + (count << 1),
                    NULL_UTF16);
        }
        this.count = count + 4;
        return this;
    }

    public AbstractStringBuilder append(boolean b) {
        int count = this.count;
        int spaceNeeded = count + (b ? 4 : 5);
        ensureCapacityInternal(spaceNeeded);
        byte[] val = this.value;
        if (isLatin1()) {
            UNSAFE.putInt(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + count,
                    b ? TRUE_LATIN1 : FALS_LATIN1);
            if (!b) {
                val[count + 4] = 'e';
            }
        } else {
            UNSAFE.putLong(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + (count << 1),
                    b ? TRUE_UTF16 : FALS_UTF16);
            if (!b) {
                StringUTF16.putChar(val, count + 4, 'e');
            }
        }
        this.count = spaceNeeded;
        return this;
    }
}

4. Benchmark Numbers

The performance numbers under MacBookPro M1 Pro are as follows:

-Benchmark                                   Mode  Cnt  Score   Error  Units # unsafe (adc220)
-StringBuilders.toStringCharWithBool8Latin1  avgt   15  6.415 ? 0.061  ns/op
-StringBuilders.toStringCharWithBool8Utf16   avgt   15  7.307 ? 0.013  ns/op
-StringBuilders.toStringCharWithNull8Latin1  avgt   15  5.443 ? 0.011  ns/op
-StringBuilders.toStringCharWithNull8Utf16   avgt   15  6.944 ? 0.102  ns/op

+Benchmark                                   Mode  Cnt  Score   Error  Units #current (5e815)
+StringBuilders.toStringCharWithBool8Latin1  avgt   15  6.515 ? 0.121  ns/op -1.55%
+StringBuilders.toStringCharWithBool8Utf16   avgt   15  8.654 ? 0.035  ns/op -18.44%
+StringBuilders.toStringCharWithNull8Latin1  avgt   15  5.550 ? 0.010  ns/op -1.96%
+StringBuilders.toStringCharWithNull8Utf16   avgt   15  8.108 ? 0.041  ns/op -16.76%

mlbridge · 2024-06-10T13:47:22Z

Webrevs

liach · 2024-06-10T14:20:03Z

I think for that C2 JIT to work, we need to merge the 't' 'r' 'u' 'e' ascii bytes into an int constant. Same for false.

eme64 · 2024-06-10T14:32:46Z

@wenshao

I think the performance of the Unsafe branch may be the best data for the C2 optimizer. @eme64 can help me see if C2 can do it?

Have you tried to see if the optimization actually was done/taken? You can use the TraceMergeStores, flag. Can you present the generated assembly code of the benchmarks, and explain the difference based on the generated assembly code? You can run JMH penchmarks with perf. These two blogs may help you:

http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html
https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_meet_jmh_prof_perfasm

@liach I don't think it makes a difference if it is int or byte constants. Or what exactly is the code change you are proposing?

liach · 2024-06-10T15:56:48Z

You are right, I was thinking about the case where you have 2 short variables, you should combine them into a long explicitly for C2 to generate a 4-byte write, otherwise it would be 2 2-bytes. Omitted the constant part which already eliminates this restriction.

wenshao · 2024-06-10T23:10:05Z

@eme64 It seems that when the following code uses StringUTF16.putChar, C2's optimization is not as good as the manual merging and storage effect.

class AbstractStringBuilder {
    private AbstractStringBuilder appendNull() {
    	// ...
		StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l');
		// ...
	}

	public AbstractStringBuilder append(boolean b) {
	    // ...
		StringUTF16.putCharsAt(val, count, 't', 'r', 'u', 'e');
		// ...
		StringUTF16.putCharsAt(val, count, 'f', 'a', 'l', 's', 'e');
		// ...
	}
}

class StringUTF16 {
   public static void putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
        putChar(value, i    , c1);
        putChar(value, i + 1, c2);
        putChar(value, i + 2, c3);
        putChar(value, i + 3, c4);
    }

    @IntrinsicCandidate
    // intrinsic performs no bounds checks
    static void putChar(byte[] val, int index, int c) {
        assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
        index <<= 1;
        val[index++] = (byte)(c >> HI_BYTE_SHIFT);
        val[index]   = (byte)(c >> LO_BYTE_SHIFT);
    }
}

The code for manually merging storage is as follows, without using Unsafe:

class AbstractStringBuilder {
    static final long NULL_UTF16;
    static final long TRUE_UTF16;
    static final long FALS_UTF16;

    static {
        byte[] bytes = new byte[8];

        StringUTF16.putCharsAt(bytes, 0, 'n', 'u', 'l', 'l');
        NULL_UTF16 = getLong(bytes, 0);

        StringUTF16.putCharsAt(bytes, 0, 't', 'r', 'u', 'e');
        TRUE_UTF16 = getLong(bytes, 0);

        StringUTF16.putCharsAt(bytes, 0, 'f', 'a', 'l', 's');
        FALS_UTF16 = getLong(bytes, 0);
    }

    private static long getLong(byte[] bytes, int offset) {
        return (((long)bytes[offset    ] & 0xff)      ) |
               (((long)bytes[offset + 1] & 0xff) <<  8) |
               (((long)bytes[offset + 2] & 0xff) << 16) |
               (((long)bytes[offset + 3] & 0xff) << 24) |
               (((long)bytes[offset + 4] & 0xff) << 32) |
               (((long)bytes[offset + 5] & 0xff) << 40) |
               (((long)bytes[offset + 6] & 0xff) << 48) |
               (((long)bytes[offset + 7] & 0xff) << 56);
    }

    private static void setLong(byte[] array, int offset, long value) {
        array[offset]     = (byte)  value;
        array[offset + 1] = (byte) (value >> 8);
        array[offset + 2] = (byte) (value >> 16);
        array[offset + 3] = (byte) (value >> 24);
        array[offset + 4] = (byte) (value >> 32);
        array[offset + 5] = (byte) (value >> 40);
        array[offset + 6] = (byte) (value >> 48);
        array[offset + 7] = (byte) (value >> 56);
    }

    private AbstractStringBuilder appendNull() {
        int count = this.count;
        ensureCapacityInternal(count + 4);
        byte[] val = this.value;
        if (isLatin1()) {
            val[count    ] = 'n';
            val[count + 1] = 'u';
            val[count + 2] = 'l';
            val[count + 3] = 'l';
        } else {
            setLong(val, count, NULL_UTF16);
        }
        this.count = count + 4;
        return this;
    }

    public AbstractStringBuilder append(boolean b) {
        int count = this.count;
        int spaceNeeded = count + (b ? 4 : 5);
        ensureCapacityInternal(spaceNeeded);
        byte[] val = this.value;
        if (isLatin1()) {
            if (b) {
                val[count    ] = 't';
                val[count + 1] = 'r';
                val[count + 2] = 'u';
                val[count + 3] = 'e';
            } else {
                val[count    ] = 'f';
                val[count + 1] = 'a';
                val[count + 2] = 'l';
                val[count + 3] = 's';
                val[count + 4] = 'e';
            }
        } else {
            setLong(val, count, b ? TRUE_UTF16 : FALS_UTF16);
            if (!b) {
                StringUTF16.putChar(val, count + 4, 'e');
            }
        }
        this.count = spaceNeeded;
        return this;
    }
}

The getLong/setLong methods here can be optimized and merged by C2. Maybe we need a public class that does not use Unsafe to implement these methods. They are needed in many places. Maybe it is appropriate to improve them based on ByteArray/ByteArrayLittleEndian.

dean-long · 2024-06-10T23:37:52Z

From a compiler point of view, it might be better to enhance the 8318446 optimization so that it recognizes code patterns that use increment.
For a libraries point of view, this proposed change probably needs some comments, so it doesn't get accidentally changed in the future in a way that breaks the optimization. How would we prevent that? The situation seems fragile to me.

cl4es · 2024-06-10T23:49:33Z

Whether increments are handled or not, I think the StringUTF16.putChar intrinsic might be getting in the way here and prevent merging consecutive "char" stores into 4 or 8 byte stores. It would be interesting to get numbers for the putChar version with the intrinsic disabled (-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_putCharStringU).

eme64 · 2024-06-11T06:27:13Z

@wenshao

@eme64 It seems that when the following code uses StringUTF16.putChar, C2's optimization is not as good as the manual merging and storage effect.

As I asked above, you will need to provide some evidence / generated assembly / perf data, and logs from TraceMergeStores. I currently do not have time to produce these myself, and I think they would be crucial to determine where the missing performance has gone. See my earlier comment:
#19626 (comment)

And please also try @cl4es advide here:
#19626 (comment)

And sure, maybe you need some public API for setting multiple bytes at once, which the MergeStores optimization can optimize. I'm a C2 engineer, so I leave that up to the library folks ;)

wenshao · 2024-06-11T09:17:00Z

Edited to remove the extra long PrintAssembly information to make the page easier to read

eme64 · 2024-06-11T09:32:14Z

@wenshao This is just an assembly dump. You need to have some profiling data that tells you where the time is spent. I'm not going to do the analysis work for you, I'm sorry. I gave you some pointers as how to do that. If you have more questions about how to do that, feel free to ask. You also have not provided the TraceMergeStores log yet, as I asked you.

Can you investigate WHY there is a performance difference? Which loads and branches etc are generated?

wenshao · 2024-06-11T09:39:29Z

@eme64 How to TraceMergeStores?

eme64 · 2024-06-11T09:43:07Z

@wenshao Have you grepped it in the code base?
grep TraceMergeStores src/hotspot/share/ -r
Hence, use enable flag with -XX:+TraceMergeStores

eme64 · 2024-06-11T09:50:25Z

Also: the assembly that you pasted: does it have the original stores, or do you see that the stores were actually merged?

wenshao · 2024-06-11T12:11:21Z

(f96cde4e) (0cbaa5ac) The previous overriding of StringUTF16.putChar method did not improve performance, so I reverted to the original version.

I have collected information on TraceMergeStores, but I don't have enough knowledge to analyze it yet

https://github.com/wenshao/jdk/wiki/pr19626_appendNull_0

eme64 · 2024-06-11T15:35:47Z

@wenshao have you published info about TraceMergeStores somewhere? It is very well possible that the optimization does not apply in your code. Then you would need to dig into the VM code and see why.

wenshao · 2024-06-11T15:55:18Z

@eme64 TraceMergeStores are here, but I can't understand these assembly codes

JavaCode

class AbstractStringBuilder {
     private AbstractStringBuilder appendNull() {
        int count = this.count;
        ensureCapacityInternal(count + 4);
        byte[] val = this.value;
        if (isLatin1()) {
            val[count    ] = 'n';
            val[count + 1] = 'u';
            val[count + 2] = 'l';
            val[count + 3] = 'l';
        } else {
            StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l');
        }
        this.count = count + 4;
        return this;
    }
}

class StringUTF16 {
    public static void putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
        putChar(value, i    , c1);
        putChar(value, i + 1, c2);
        putChar(value, i + 2, c3);
        putChar(value, i + 3, c4);
    }
}

TraceMergeStores

[TraceMergeStores]: Replace
  78  StoreB  === 61 7 76 12  [[ 103 89 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !jvms: StringUTF16::putChar @ bci:43 (line 71)
 103  StoreB  === 88 78 101 80  [[ 18 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !jvms: StringUTF16::putChar @ bci:52 (line 72)
[TraceMergeStores]: with
  12  Parm  === 3  [[ 78 66 38 80 107 113 ]] Parm2: int !jvms: StringUTF16::putChar @ bci:-1 (line 69)
 113  StoreC  === 88 7 76 12  [[ ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5;

cl4es · 2024-10-19T23:20:37Z

After PR 19970, the performance has been significantly improved. Below are the performance numbers for AMD CPU (x64)

It'd be interesting to check performance on this micro with #19970 alone

wenshao · 2024-10-20T07:51:02Z

I ran performance tests on MaxBook M1 (aarch64) and aliyun c8a (AMD CPU x64). There was no significant performance difference between pr #19970 and master, but pr #19626 combined with #19970 significantly improved performance.

Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# master
git checkout 85582d7a88bd5f79f5991ce22bc3bc75e514aec9
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19970
git checkout 3b89956957085e134a05c05876f40b85d949227e
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19626 + 19970
git checkout 58dae7888eceb1c61243f658b67c208e6c30f7f2
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

MacBook M1 Max Performance Numbers

# master
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.950 ? 0.027  ns/op

# pr 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.945 ? 0.008  ns/op

# pr 19626 + 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.441 ? 0.059  ns/op

AMD x64 Performance Numbers

# master
Benchmark                             Mode  Cnt   Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  17.522 ± 8.113  ns/op

# pr 19970
Benchmark                             Mode  Cnt   Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  17.487 ± 8.127  ns/op

# pr 19626 + 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  5.983 ± 0.113  ns/op

cl4es · 2024-10-20T20:19:19Z

Hmm, I would have hoped for appendNull the pre-existing code would have allowed for merging stores with #19970. Can you run with +TraceMergeStores on the #19970 branch?

Perhaps we'd need to minimally change from count++ increments to constant offsets:

            val[count] = 'n';
            val[count + 1] = 'u';
            val[count + 2] = 'l';
            val[count + 3] = 'l';

(I think it would be good to explore if we can to trigger the optimization without resorting to Unsafe. Any ideas, @eme64?)

wenshao · 2024-10-20T22:46:57Z

Simplifying the implementation of appendNull can improve performance, but it is still not as good as using Unsafe.putByte.

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# master
git checkout 85582d7a88bd5f79f5991ce22bc3bc75e514aec9
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19970
git checkout 3b89956957085e134a05c05876f40b85d949227e
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"


# pr 19970 + simplify append null
git checkout a43be33a6cc67ac72058d1819ee3008fb6f76211
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19626 + 19970
git checkout 58dae7888eceb1c61243f658b67c208e6c30f7f2
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

MacBook M1 Max Performance Numbers

# master
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.950 ? 0.027  ns/op

# pr 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.945 ? 0.008  ns/op

# pr 19970 + simplify append null
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.766 ? 0.012  ns/op

# pr 19626 + 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.441 ? 0.059  ns/op

wenshao · 2024-10-21T01:29:16Z

There is no array out-of-bounds check when using Unsafe. In the appendNull method, there is already a call to ensureCapacityInternal. It is also safe to use Unsafe in this scenario.

eme64 · 2024-10-21T07:12:02Z

@wenshao @cl4es I've investigated a little with the benchmark.

RangeCheck smearing could be the issue here.

It turns out that RangeChecks are smeared (a kind of elimination in straight-line code) in the same phase as the MergeStores happens (post_loop_opts_phase). So it seems to depend on the order of processing in the worklist, if we first remove the RC or try to merge the stores. If a RC is somewhere still stuck between stores, then merging does not quite work as hoped.

Of course if you do it all with Unsafe, then those RC are not there any more.

Maybe I can somehow delay the MergeStores a little more. But I have nice solution in mind - only hacks so far 😅 . And maybe the analysis is not complete and there are other factors.

I have lots of other tasks now, so I might have to put this a little off again.

@wenshao what would be nice is if you could add some IR tests for these kinds of examples. Add IR rules that currently pass, and add some comments what we would prefer for better performance. That would help me immensely once I can get back to it. You could even just integrate the IR tests now already, so that I can then adjust them when I improve the optimization further. That way you would get credit for adding the tests. You could add them to test/hotspot/jtreg/compiler/c2/TestMergeStores.java in a similar style.

I'm putting my Test2.java here for reference.

public class Test2 {
    public static void main(String[] args) {
         int len = 0;
         for (int i = 0; i < 100_000; i++) {
             len = appendWithNull8Latin1();
         }
         System.out.println(len);
    }

    public static int appendWithNull8Latin1() {
        StringBuilder buf = new StringBuilder("Latin1 string");
        buf.setLength(0);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        return buf.length();
    }
}

/oracle-work/jdk-fork5/build/linux-x64-debug/jdk/bin/java -ea -XX:+UnlockDiagnosticVMOptions -XX:+MergeStores -XX:CompileCommand=compileonly,Test2::appendWithNull8Latin1 -XX:CompileCommand=TraceMergeStores,Test2::*,BASIC,SUCCESS -Xbatch Test2.java

At times we first merge 2 StoreB -> 1 StoreC. And then in a second step 2 StoreC -> StoreI. See below the snippets.

[TraceMergeStores] MergePrimitiveStores::run:   612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] expect no use: None
[TraceMergeStores] expect def: Found[735 StoreB, RC]
[TraceMergeStores] find def: Found[735 StoreB, RC]
[TraceMergeStores] found RangeCheck, stop traversal.
[TraceMergeStores] found:
    0-->   612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->   735  StoreB  === 1038 156 1040 381  [[ 612 388 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:34 (line 643) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] truncated:
    0-->   612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->   735  StoreB  === 1038 156 1040 381  [[ 612 388 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:34 (line 643) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: Replace
  735  StoreB  === 1038 156 1040 381  [[ 612 388 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:34 (line 643) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
  612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: with
 2204  ConI  === 0  [[ 2205 ]]  #int:30062
 2205  StoreC  === 902 156 1040 2204  [[ ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11;

[TraceMergeStores] MergePrimitiveStores::run:  2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] expect no use: None
[TraceMergeStores] expect def: Found[2205 StoreC, no-RC]
[TraceMergeStores] find def: Found[2205 StoreC, no-RC]
[TraceMergeStores] find def: None
[TraceMergeStores] found:
    0-->  2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->  2205  StoreC  === 902 156 1040 2204  [[ 2268 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[612] !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] truncated:
    0-->  2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->  2205  StoreC  === 902 156 1040 2204  [[ 2268 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[612] !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: Replace
 2205  StoreC  === 902 156 1040 2204  [[ 2268 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[612] !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
 2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: with
 2234  ConI  === 0  [[ 2235 2255 2274 2276 ]]  #int:1819047278
 2276  StoreI  === 902 156 1040 2234  [[ ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11;
[TraceMergeStores] MergePrimitiveStores::run:  1314  StoreB  === 1213 884 1585 285  [[ 977 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:29 (line 15)

eme64 · 2024-10-21T07:58:42Z

Not sure why the RangeCheck smearing issue only appreas here now. Maybe because the graph is bigger? Maybe because of other things in the graph that change the order of things? IDK yet.

cl4es · 2024-10-21T08:03:36Z

Thanks for checking @eme64 - up to you if and when you want to try and tackle such an improvement, if it can be done in a clean way. And as you say there might be other factors at play here. Perhaps things are confounded by how the code is structured before and after; this PR outlines the array stores to a separate method which might affect loop optimization passes if that method is first compiled separately then inlined.

While I don't think the point fix in this PR is all that urgent this has been lingering for a while and I'm sure @wenshao wants to integrate and move on. We can always integrate then back out the Unsafe stuff once/if the phases have been disentangled. Or we keep experimenting to try and see if we can get it to behave ideally without Unsafe here and now. WDYT? @liach?

eme64 · 2024-10-21T08:48:53Z

I'm not the library guy, so I'm not going to block this PR.
But having good tests and benchmarks are crucial, and very time-consuming to come up with for us compiler engineers. So these examples here are very valuable, and I think the time is not wasted working more on it.

liach

Thanks for the evaluations. I think we can use this unsafe version for now, and use direct writes once JIT can reliably eliminate bound checks.

wenshao · 2024-10-21T12:47:37Z

This PR requires PR #19970 to support Unsafe.putByte MergeStore for better performance. I will add more test scenarios to TestMergeStores.java or MergeStoreBench later.

eme64 · 2024-11-05T13:38:24Z

FYI #19970 is now integrated - thanks for the patience :)

wenshao · 2024-11-05T15:04:20Z

It has been tested that mergeStore can work after the master branch is merged

wenshao · 2024-11-05T15:04:44Z

/integrate

openjdk · 2024-11-05T15:05:37Z

Going to push as commit 5890d94.
Since your change was applied there have been 227 commits pushed to the master branch:

c33a8f5: 8343314: Move common properties from jpackage jtreg test declarations to TEST.properties file
16feeb7: 8343547: Restore accidentally removed annotations in LambdaForm from ClassFile API port
f62fc48: 8342498: Add test for Allocation elimination after use as alignment reference by SuperWord
f3671be: 8335392: C2 MergeStores: enhanced pointer parsing
4fc6d41: 8341194: [REDO] Implement C2 VectorizedHashCode on AArch64
abf2dc7: 8343298: Improve stability of runtime/cds/DeterministicDump.java test
dafa2e5: 8343124: Tests fails with java.lang.IllegalAccessException: class com.sun.javatest.regtest.agent.MainWrapper$MainTask cannot access
0f7dd98: 8251926: PPC: Remove an unused variable in assembler_ppc.cpp
cd91a44: 8343549: SeededSecureRandomTest needn't be in a package
20f3aaf: 8343471: RISC-V: compiler/cpuflags/TestAESIntrinsicsOnUnsupportedConfig.java fails after JDK-8334999
... and 217 more: https://git.openjdk.org/jdk/compare/8591109419efc8f71544a98bdb04a48cb1afc47e...master

Your commit was automatically rebased without conflicts.

openjdk · 2024-11-05T15:05:46Z

@wenshao Pushed as commit 5890d94.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

altrisi · 2025-02-01T12:38:22Z

I have concerns with the changes here, because of what it can cause when an (incorrect) program is sharing a StringBuilder between multiple threads.

Even without any reordering, a thread could start an append(boolean|null), pass the ensureCapacityInternal call, then another thread up the count by enough to no longer have enough capacity. Then the first thread could proceed to read the new (no longer enough!) count, and write without any bounds checks outside of the array. Easy to reproduce with a debugger and some breakpoints.

The previous code would have thrown an exception at an explicit or implicit bounds check, but with the changes here that'd no longer happen.

For example

(initial state)
value.length = 16
count = 11

Thread 1	Thread 2
append(true)
-ensureCap(11+4)
--nothing to do
	append("str")
	- ensureCap(11+3)
	-- nothing to do
	- ...
	- this.count <- 14
- cnt <- this.count (14!)
- val <- this.value
- val[cnt] <-u 't'
- val[cnt+1] <-u 'r'
- val[cnt+2 (16!)] <-u 'u'
- ...

wenshao · 2025-02-01T13:42:29Z

AbstractStringBuilder's value/coder/count are not volatile. If you use StringBuilder, you may not get the correct result. If you want thread safety, you should use StringBuffer.

altrisi · 2025-02-01T15:17:26Z

Sure, I already mentioned such program would be incorrect, but one thing is an incorrect result in the StringBuilder and another one is writing outside the bounds of the array into other parts (possibly objects) of the heap, or even outside, causing a VM crash.

jaikiran · 2025-02-03T14:57:57Z

The previous code would have thrown an exception at an explicit or implicit bounds check, but with the changes here that'd no longer happen.

This indeed a genuine issue and needs to be addressed. Thank you @altrisi for noticing and raising this.

Optimization for StringBuilder append(boolean) & appendNull

5e815b6

openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Jun 10, 2024

wenshao changed the title ~~Optimization for StringBuilder append boolean & appendNull~~ Optimization for StringBuilder append boolean & null Jun 10, 2024

wenshao changed the title ~~Optimization for StringBuilder append boolean & null~~ 8333893: Optimization for StringBuilder append boolean & null Jun 10, 2024

openjdk bot added the rfr Pull request is ready for review label Jun 10, 2024

wenshao added 2 commits June 11, 2024 13:20

optimize to combining values into larger stores

0cbaa5a

bug fix

f96cde4

revert

27a3050

liach approved these changes Oct 21, 2024

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Oct 21, 2024

openjdk bot added the integrated Pull request has been integrated label Nov 5, 2024

openjdk bot closed this Nov 5, 2024

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 5, 2024

wenshao mentioned this pull request Nov 5, 2024

8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt #21907

Closed

3 tasks

wenshao mentioned this pull request Dec 28, 2024

[Backport]String concatenation & StringBuilder appendNull/appendBoolean dragonwell-project/dragonwell21#160

Merged

jaikiran mentioned this pull request Feb 3, 2025

8349183: [BACKOUT] Optimization for StringBuilder append boolean & null #23420

Closed

3 tasks

This was referenced Feb 3, 2025

StringLatin1 & StringUTF16::putCharsAt checkBoundsBeginEnd #23423

Closed

8315585: Optimization for decimal to string #23310

Draft

8333893: Optimization for StringBuilder append boolean & null #19626

8333893: Optimization for StringBuilder append boolean & null #19626

Uh oh!

Conversation

wenshao commented Jun 10, 2024 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewers

Reviewing

Webrev

Uh oh!

bridgekeeper bot commented Jun 10, 2024

Uh oh!

openjdk bot commented Jun 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openjdk bot commented Jun 10, 2024

Uh oh!

wenshao commented Jun 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Compare with the master branch

2. Benchmark Commands

3. Benchmark Numbers

Uh oh!

wenshao commented Jun 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Compare with unsafe branch

2. Benchmark Commands

3. Implementation of Unsafe Branch

4. Benchmark Numbers

Uh oh!

mlbridge bot commented Jun 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

liach commented Jun 10, 2024

Uh oh!

eme64 commented Jun 10, 2024

Uh oh!

liach commented Jun 10, 2024

Uh oh!

wenshao commented Jun 10, 2024

Uh oh!

dean-long commented Jun 10, 2024

Uh oh!

cl4es commented Jun 10, 2024

Uh oh!

eme64 commented Jun 11, 2024

Uh oh!

wenshao commented Jun 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eme64 commented Jun 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wenshao commented Jun 11, 2024

Uh oh!

eme64 commented Jun 11, 2024

Uh oh!

eme64 commented Jun 11, 2024

Uh oh!

wenshao commented Jun 11, 2024

Uh oh!

eme64 commented Jun 11, 2024

Uh oh!

wenshao commented Jun 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

JavaCode

TraceMergeStores

Uh oh!

cl4es commented Oct 19, 2024

Uh oh!

wenshao commented Oct 20, 2024

Script

MacBook M1 Max Performance Numbers

AMD x64 Performance Numbers

Uh oh!

cl4es commented Oct 20, 2024

Uh oh!

wenshao commented Jun 10, 2024 •

edited by openjdk bot

Loading

openjdk bot commented Jun 10, 2024 •

edited

Loading

wenshao commented Jun 10, 2024 •

edited

Loading

wenshao commented Jun 10, 2024 •

edited

Loading

mlbridge bot commented Jun 10, 2024 •

edited

Loading

wenshao commented Jun 11, 2024 •

edited

Loading

eme64 commented Jun 11, 2024 •

edited

Loading

wenshao commented Jun 11, 2024 •

edited

Loading

eme64 commented Oct 21, 2024 •

edited

Loading

jaikiran commented Feb 3, 2025 •

edited

Loading