Skip to content

8333893: Optimization for StringBuilder append boolean & null #19626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 26 commits into from

Conversation

wenshao
Copy link
Contributor

@wenshao wenshao commented Jun 10, 2024

After PR #16245, C2 optimizes stores into primitive arrays by combining values ​​into larger stores.

This PR rewrites the code of appendNull and append(boolean) methods so that these two methods can be optimized by C2.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8333893: Optimization for StringBuilder append boolean & null (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19626/head:pull/19626
$ git checkout pull/19626

Update a local copy of the PR:
$ git checkout pull/19626
$ git pull https://git.openjdk.org/jdk.git pull/19626/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19626

View PR using the GUI difftool:
$ git pr show -t 19626

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19626.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 10, 2024

👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jun 10, 2024

@wenshao This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8333893: Optimization for StringBuilder append boolean & null

Reviewed-by: liach

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 12 new commits pushed to the master branch:

  • 07f550b: 8340818: Add a new jtreg test root to test the generated documentation
  • 27ef6c9: 8341470: BigDecimal.stripTrailingZeros() optimization
  • 5d5d88a: 8339570: Add Tidy build support for JDK tests
  • 239d84a: 8342578: GHA: RISC-V: Bootstrap using Debian snapshot is still failing
  • aa060f2: 8342334: CDS: Scratch mirrors should not point to dead klasses
  • 680dc5d: 8342496: C2/Shenandoah: SEGV in compiled code when running jcstress
  • 8f2b23b: 8341407: C2: assert(main_limit == cl->limit() || get_ctrl(main_limit) == new_limit_ctrl) failed: wrong control for added limit
  • 21682bc: 8342612: Increase memory usage of compiler/c2/TestScalarReplacementMaxLiveNodes.java
  • d61f56a: 8342287: C2 fails with "assert(is_IfTrue()) failed: invalid node class: IfFalse" due to Template Assertion Predicate with two UCTs
  • 76ae072: 8342579: RISC-V: C2: Cleanup effect of killing flag register for call instructs
  • ... and 2 more: https://git.openjdk.org/jdk/compare/8591109419efc8f71544a98bdb04a48cb1afc47e...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented Jun 10, 2024

@wenshao The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Jun 10, 2024
@wenshao
Copy link
Contributor Author

wenshao commented Jun 10, 2024

1. Compare with the master branch

  1. master (a6fc2f8) https://github.com/wenshao/jdk/tree/upstream_master_a6fc2f8
  2. current (5e815) https://github.com/wenshao/jdk/tree/optim_str_builder_append_202406

2. Benchmark Commands

make test TEST="micro:java.lang.StringBuilders.toStringCharWithBool8"
make test TEST="micro:java.lang.StringBuilders.toStringCharWithNull8"

3. Benchmark Numbers

The performance numbers under MacBookPro M1 Pro are as follows:

-Benchmark                                   Mode  Cnt  Score   Error  Units #master (a6fc2f8)
-StringBuilders.toStringCharWithBool8Latin1  avgt   15  7.371 ? 0.003  ns/op
-StringBuilders.toStringCharWithBool8Utf16   avgt   15  9.613 ? 0.018  ns/op
-StringBuilders.toStringCharWithNull8Latin1  avgt   15  7.071 ? 0.003  ns/op
-StringBuilders.toStringCharWithNull8Utf16   avgt   15  9.296 ? 0.016  ns/op

+Benchmark                                   Mode  Cnt  Score   Error  Units #current (5e815)
+StringBuilders.toStringCharWithBool8Latin1  avgt   15  6.515 ? 0.121  ns/op +11.61%
+StringBuilders.toStringCharWithBool8Utf16   avgt   15  8.654 ? 0.035  ns/op +9.97%
+StringBuilders.toStringCharWithNull8Latin1  avgt   15  5.550 ? 0.010  ns/op +21.51%
+StringBuilders.toStringCharWithNull8Utf16   avgt   15  8.108 ? 0.041  ns/op +12.77%

@wenshao
Copy link
Contributor Author

wenshao commented Jun 10, 2024

1. Compare with unsafe branch

  1. current (5e815) https://github.com/wenshao/jdk/tree/optim_str_builder_append_202406
  2. unsafe (adc220) https://github.com/wenshao/jdk/tree/optim_str_builder_append_202406_unsafe

I think the performance of the Unsafe branch may be the best data for the C2 optimizer. @eme64 can help me see if C2 can do it?

2. Benchmark Commands

make test TEST="micro:java.lang.StringBuilders.toStringCharWithBool8"
make test TEST="micro:java.lang.StringBuilders.toStringCharWithNull8"

3. Implementation of Unsafe Branch

class AbstractStringBuilder {
    static final Unsafe UNSAFE = Unsafe.getUnsafe();

    static final int NULL_LATIN1;
    static final int TRUE_LATIN1;
    static final int FALS_LATIN1;

    static final long NULL_UTF16;
    static final long TRUE_UTF16;
    static final long FALS_UTF16;

    static {
        byte[] bytes4 = new byte[4];
        byte[] bytes8 = new byte[8];

        bytes4[0] = 'n';
        bytes4[1] = 'u';
        bytes4[2] = 'l';
        bytes4[3] = 'l';
        NULL_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
        StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
        NULL_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);

        bytes4[0] = 't';
        bytes4[1] = 'r';
        bytes4[2] = 'u';
        bytes4[3] = 'e';
        TRUE_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
        StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
        TRUE_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);

        bytes4[0] = 'f';
        bytes4[1] = 'a';
        bytes4[2] = 'l';
        bytes4[3] = 's';
        FALS_LATIN1 = UNSAFE.getInt(bytes4, Unsafe.ARRAY_BYTE_BASE_OFFSET);
        StringUTF16.inflate(bytes4, 0, bytes8, 0, 4);
        FALS_UTF16 = UNSAFE.getLong(bytes8, Unsafe.ARRAY_BYTE_BASE_OFFSET);
    }

    private AbstractStringBuilder appendNull() {
        ensureCapacityInternal(count + 4);
        int count = this.count;
        byte[] val = this.value;
        if (isLatin1()) {
            UNSAFE.putInt(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + count,
                    NULL_LATIN1);
        } else {
            UNSAFE.putLong(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + (count << 1),
                    NULL_UTF16);
        }
        this.count = count + 4;
        return this;
    }

    public AbstractStringBuilder append(boolean b) {
        int count = this.count;
        int spaceNeeded = count + (b ? 4 : 5);
        ensureCapacityInternal(spaceNeeded);
        byte[] val = this.value;
        if (isLatin1()) {
            UNSAFE.putInt(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + count,
                    b ? TRUE_LATIN1 : FALS_LATIN1);
            if (!b) {
                val[count + 4] = 'e';
            }
        } else {
            UNSAFE.putLong(
                    val,
                    Unsafe.ARRAY_BYTE_BASE_OFFSET + (count << 1),
                    b ? TRUE_UTF16 : FALS_UTF16);
            if (!b) {
                StringUTF16.putChar(val, count + 4, 'e');
            }
        }
        this.count = spaceNeeded;
        return this;
    }
}

4. Benchmark Numbers

The performance numbers under MacBookPro M1 Pro are as follows:

-Benchmark                                   Mode  Cnt  Score   Error  Units # unsafe (adc220)
-StringBuilders.toStringCharWithBool8Latin1  avgt   15  6.415 ? 0.061  ns/op
-StringBuilders.toStringCharWithBool8Utf16   avgt   15  7.307 ? 0.013  ns/op
-StringBuilders.toStringCharWithNull8Latin1  avgt   15  5.443 ? 0.011  ns/op
-StringBuilders.toStringCharWithNull8Utf16   avgt   15  6.944 ? 0.102  ns/op

+Benchmark                                   Mode  Cnt  Score   Error  Units #current (5e815)
+StringBuilders.toStringCharWithBool8Latin1  avgt   15  6.515 ? 0.121  ns/op -1.55%
+StringBuilders.toStringCharWithBool8Utf16   avgt   15  8.654 ? 0.035  ns/op -18.44%
+StringBuilders.toStringCharWithNull8Latin1  avgt   15  5.550 ? 0.010  ns/op -1.96%
+StringBuilders.toStringCharWithNull8Utf16   avgt   15  8.108 ? 0.041  ns/op -16.76%

@wenshao wenshao changed the title Optimization for StringBuilder append boolean & appendNull Optimization for StringBuilder append boolean & null Jun 10, 2024
@wenshao wenshao changed the title Optimization for StringBuilder append boolean & null 8333893: Optimization for StringBuilder append boolean & null Jun 10, 2024
@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 10, 2024
@mlbridge
Copy link

mlbridge bot commented Jun 10, 2024

@liach
Copy link
Member

liach commented Jun 10, 2024

I think for that C2 JIT to work, we need to merge the 't' 'r' 'u' 'e' ascii bytes into an int constant. Same for false.

@eme64
Copy link
Contributor

eme64 commented Jun 10, 2024

@wenshao

I think the performance of the Unsafe branch may be the best data for the C2 optimizer. @eme64 can help me see if C2 can do it?

Have you tried to see if the optimization actually was done/taken? You can use the TraceMergeStores, flag. Can you present the generated assembly code of the benchmarks, and explain the difference based on the generated assembly code? You can run JMH penchmarks with perf. These two blogs may help you:

http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html
https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_meet_jmh_prof_perfasm

@liach I don't think it makes a difference if it is int or byte constants. Or what exactly is the code change you are proposing?

@liach
Copy link
Member

liach commented Jun 10, 2024

You are right, I was thinking about the case where you have 2 short variables, you should combine them into a long explicitly for C2 to generate a 4-byte write, otherwise it would be 2 2-bytes. Omitted the constant part which already eliminates this restriction.

@wenshao
Copy link
Contributor Author

wenshao commented Jun 10, 2024

@eme64 It seems that when the following code uses StringUTF16.putChar, C2's optimization is not as good as the manual merging and storage effect.

class AbstractStringBuilder {
    private AbstractStringBuilder appendNull() {
    	// ...
		StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l');
		// ...
	}

	public AbstractStringBuilder append(boolean b) {
	    // ...
		StringUTF16.putCharsAt(val, count, 't', 'r', 'u', 'e');
		// ...
		StringUTF16.putCharsAt(val, count, 'f', 'a', 'l', 's', 'e');
		// ...
	}
}

class StringUTF16 {
   public static void putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
        putChar(value, i    , c1);
        putChar(value, i + 1, c2);
        putChar(value, i + 2, c3);
        putChar(value, i + 3, c4);
    }

    @IntrinsicCandidate
    // intrinsic performs no bounds checks
    static void putChar(byte[] val, int index, int c) {
        assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
        index <<= 1;
        val[index++] = (byte)(c >> HI_BYTE_SHIFT);
        val[index]   = (byte)(c >> LO_BYTE_SHIFT);
    }
}

The code for manually merging storage is as follows, without using Unsafe:

class AbstractStringBuilder {
    static final long NULL_UTF16;
    static final long TRUE_UTF16;
    static final long FALS_UTF16;

    static {
        byte[] bytes = new byte[8];

        StringUTF16.putCharsAt(bytes, 0, 'n', 'u', 'l', 'l');
        NULL_UTF16 = getLong(bytes, 0);

        StringUTF16.putCharsAt(bytes, 0, 't', 'r', 'u', 'e');
        TRUE_UTF16 = getLong(bytes, 0);

        StringUTF16.putCharsAt(bytes, 0, 'f', 'a', 'l', 's');
        FALS_UTF16 = getLong(bytes, 0);
    }

    private static long getLong(byte[] bytes, int offset) {
        return (((long)bytes[offset    ] & 0xff)      ) |
               (((long)bytes[offset + 1] & 0xff) <<  8) |
               (((long)bytes[offset + 2] & 0xff) << 16) |
               (((long)bytes[offset + 3] & 0xff) << 24) |
               (((long)bytes[offset + 4] & 0xff) << 32) |
               (((long)bytes[offset + 5] & 0xff) << 40) |
               (((long)bytes[offset + 6] & 0xff) << 48) |
               (((long)bytes[offset + 7] & 0xff) << 56);
    }

    private static void setLong(byte[] array, int offset, long value) {
        array[offset]     = (byte)  value;
        array[offset + 1] = (byte) (value >> 8);
        array[offset + 2] = (byte) (value >> 16);
        array[offset + 3] = (byte) (value >> 24);
        array[offset + 4] = (byte) (value >> 32);
        array[offset + 5] = (byte) (value >> 40);
        array[offset + 6] = (byte) (value >> 48);
        array[offset + 7] = (byte) (value >> 56);
    }

    private AbstractStringBuilder appendNull() {
        int count = this.count;
        ensureCapacityInternal(count + 4);
        byte[] val = this.value;
        if (isLatin1()) {
            val[count    ] = 'n';
            val[count + 1] = 'u';
            val[count + 2] = 'l';
            val[count + 3] = 'l';
        } else {
            setLong(val, count, NULL_UTF16);
        }
        this.count = count + 4;
        return this;
    }

    public AbstractStringBuilder append(boolean b) {
        int count = this.count;
        int spaceNeeded = count + (b ? 4 : 5);
        ensureCapacityInternal(spaceNeeded);
        byte[] val = this.value;
        if (isLatin1()) {
            if (b) {
                val[count    ] = 't';
                val[count + 1] = 'r';
                val[count + 2] = 'u';
                val[count + 3] = 'e';
            } else {
                val[count    ] = 'f';
                val[count + 1] = 'a';
                val[count + 2] = 'l';
                val[count + 3] = 's';
                val[count + 4] = 'e';
            }
        } else {
            setLong(val, count, b ? TRUE_UTF16 : FALS_UTF16);
            if (!b) {
                StringUTF16.putChar(val, count + 4, 'e');
            }
        }
        this.count = spaceNeeded;
        return this;
    }
}

The getLong/setLong methods here can be optimized and merged by C2. Maybe we need a public class that does not use Unsafe to implement these methods. They are needed in many places. Maybe it is appropriate to improve them based on ByteArray/ByteArrayLittleEndian.

@dean-long
Copy link
Member

From a compiler point of view, it might be better to enhance the 8318446 optimization so that it recognizes code patterns that use increment.
For a libraries point of view, this proposed change probably needs some comments, so it doesn't get accidentally changed in the future in a way that breaks the optimization. How would we prevent that? The situation seems fragile to me.

@cl4es
Copy link
Member

cl4es commented Jun 10, 2024

Whether increments are handled or not, I think the StringUTF16.putChar intrinsic might be getting in the way here and prevent merging consecutive "char" stores into 4 or 8 byte stores. It would be interesting to get numbers for the putChar version with the intrinsic disabled (-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_putCharStringU).

@eme64
Copy link
Contributor

eme64 commented Jun 11, 2024

@wenshao

@eme64 It seems that when the following code uses StringUTF16.putChar, C2's optimization is not as good as the manual merging and storage effect.

As I asked above, you will need to provide some evidence / generated assembly / perf data, and logs from TraceMergeStores. I currently do not have time to produce these myself, and I think they would be crucial to determine where the missing performance has gone. See my earlier comment:
#19626 (comment)

And please also try @cl4es advide here:
#19626 (comment)

And sure, maybe you need some public API for setting multiple bytes at once, which the MergeStores optimization can optimize. I'm a C2 engineer, so I leave that up to the library folks ;)

@wenshao
Copy link
Contributor Author

wenshao commented Jun 11, 2024

Edited to remove the extra long PrintAssembly information to make the page easier to read

@eme64
Copy link
Contributor

eme64 commented Jun 11, 2024

@wenshao This is just an assembly dump. You need to have some profiling data that tells you where the time is spent. I'm not going to do the analysis work for you, I'm sorry. I gave you some pointers as how to do that. If you have more questions about how to do that, feel free to ask. You also have not provided the TraceMergeStores log yet, as I asked you.

Can you investigate WHY there is a performance difference? Which loads and branches etc are generated?

@wenshao
Copy link
Contributor Author

wenshao commented Jun 11, 2024

@eme64 How to TraceMergeStores?

@eme64
Copy link
Contributor

eme64 commented Jun 11, 2024

@wenshao Have you grepped it in the code base?
grep TraceMergeStores src/hotspot/share/ -r
Hence, use enable flag with -XX:+TraceMergeStores

@eme64
Copy link
Contributor

eme64 commented Jun 11, 2024

Also: the assembly that you pasted: does it have the original stores, or do you see that the stores were actually merged?

@wenshao
Copy link
Contributor Author

wenshao commented Jun 11, 2024

(f96cde4e) (0cbaa5ac) The previous overriding of StringUTF16.putChar method did not improve performance, so I reverted to the original version.

I have collected information on TraceMergeStores, but I don't have enough knowledge to analyze it yet

https://github.com/wenshao/jdk/wiki/pr19626_appendNull_0

@eme64
Copy link
Contributor

eme64 commented Jun 11, 2024

@wenshao have you published info about TraceMergeStores somewhere? It is very well possible that the optimization does not apply in your code. Then you would need to dig into the VM code and see why.

@wenshao
Copy link
Contributor Author

wenshao commented Jun 11, 2024

@eme64 TraceMergeStores are here, but I can't understand these assembly codes

JavaCode

class AbstractStringBuilder {
     private AbstractStringBuilder appendNull() {
        int count = this.count;
        ensureCapacityInternal(count + 4);
        byte[] val = this.value;
        if (isLatin1()) {
            val[count    ] = 'n';
            val[count + 1] = 'u';
            val[count + 2] = 'l';
            val[count + 3] = 'l';
        } else {
            StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l');
        }
        this.count = count + 4;
        return this;
    }
}

class StringUTF16 {
    public static void putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
        putChar(value, i    , c1);
        putChar(value, i + 1, c2);
        putChar(value, i + 2, c3);
        putChar(value, i + 3, c4);
    }
}

TraceMergeStores

[TraceMergeStores]: Replace
  78  StoreB  === 61 7 76 12  [[ 103 89 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !jvms: StringUTF16::putChar @ bci:43 (line 71)
 103  StoreB  === 88 78 101 80  [[ 18 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !jvms: StringUTF16::putChar @ bci:52 (line 72)
[TraceMergeStores]: with
  12  Parm  === 3  [[ 78 66 38 80 107 113 ]] Parm2: int !jvms: StringUTF16::putChar @ bci:-1 (line 69)
 113  StoreC  === 88 7 76 12  [[ ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5;

@cl4es
Copy link
Member

cl4es commented Oct 19, 2024

After PR 19970, the performance has been significantly improved. Below are the performance numbers for AMD CPU (x64)

It'd be interesting to check performance on this micro with #19970 alone

@wenshao
Copy link
Contributor Author

wenshao commented Oct 20, 2024

I ran performance tests on MaxBook M1 (aarch64) and aliyun c8a (AMD CPU x64). There was no significant performance difference between pr #19970 and master, but pr #19626 combined with #19970 significantly improved performance.

Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# master
git checkout 85582d7a88bd5f79f5991ce22bc3bc75e514aec9
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19970
git checkout 3b89956957085e134a05c05876f40b85d949227e
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19626 + 19970
git checkout 58dae7888eceb1c61243f658b67c208e6c30f7f2
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

MacBook M1 Max Performance Numbers

# master
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.950 ? 0.027  ns/op

# pr 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.945 ? 0.008  ns/op

# pr 19626 + 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.441 ? 0.059  ns/op

AMD x64 Performance Numbers

# master
Benchmark                             Mode  Cnt   Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  17.522 ± 8.113  ns/op

# pr 19970
Benchmark                             Mode  Cnt   Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  17.487 ± 8.127  ns/op

# pr 19626 + 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  5.983 ± 0.113  ns/op

@cl4es
Copy link
Member

cl4es commented Oct 20, 2024

Hmm, I would have hoped for appendNull the pre-existing code would have allowed for merging stores with #19970. Can you run with +TraceMergeStores on the #19970 branch?

Perhaps we'd need to minimally change from count++ increments to constant offsets:

            val[count] = 'n';
            val[count + 1] = 'u';
            val[count + 2] = 'l';
            val[count + 3] = 'l';

(I think it would be good to explore if we can to trigger the optimization without resorting to Unsafe. Any ideas, @eme64?)

@wenshao
Copy link
Contributor Author

wenshao commented Oct 20, 2024

Simplifying the implementation of appendNull can improve performance, but it is still not as good as using Unsafe.putByte.

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# master
git checkout 85582d7a88bd5f79f5991ce22bc3bc75e514aec9
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19970
git checkout 3b89956957085e134a05c05876f40b85d949227e
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"


# pr 19970 + simplify append null
git checkout a43be33a6cc67ac72058d1819ee3008fb6f76211
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

# pr 19626 + 19970
git checkout 58dae7888eceb1c61243f658b67c208e6c30f7f2
make test TEST="micro:java.lang.StringBuilders.appendWithNull8Latin1"

MacBook M1 Max Performance Numbers

# master
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.950 ? 0.027  ns/op

# pr 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.945 ? 0.008  ns/op

# pr 19970 + simplify append null
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.766 ? 0.012  ns/op

# pr 19626 + 19970
Benchmark                             Mode  Cnt  Score   Error  Units
StringBuilders.appendWithNull8Latin1  avgt   15  6.441 ? 0.059  ns/op

@wenshao
Copy link
Contributor Author

wenshao commented Oct 21, 2024

There is no array out-of-bounds check when using Unsafe. In the appendNull method, there is already a call to ensureCapacityInternal. It is also safe to use Unsafe in this scenario.

@eme64
Copy link
Contributor

eme64 commented Oct 21, 2024

@wenshao @cl4es I've investigated a little with the benchmark.

RangeCheck smearing could be the issue here.

image

It turns out that RangeChecks are smeared (a kind of elimination in straight-line code) in the same phase as the MergeStores happens (post_loop_opts_phase). So it seems to depend on the order of processing in the worklist, if we first remove the RC or try to merge the stores. If a RC is somewhere still stuck between stores, then merging does not quite work as hoped.

Of course if you do it all with Unsafe, then those RC are not there any more.

Maybe I can somehow delay the MergeStores a little more. But I have nice solution in mind - only hacks so far 😅 . And maybe the analysis is not complete and there are other factors.

I have lots of other tasks now, so I might have to put this a little off again.

@wenshao what would be nice is if you could add some IR tests for these kinds of examples. Add IR rules that currently pass, and add some comments what we would prefer for better performance. That would help me immensely once I can get back to it. You could even just integrate the IR tests now already, so that I can then adjust them when I improve the optimization further. That way you would get credit for adding the tests. You could add them to test/hotspot/jtreg/compiler/c2/TestMergeStores.java in a similar style.

I'm putting my Test2.java here for reference.

public class Test2 {
    public static void main(String[] args) {
         int len = 0;
         for (int i = 0; i < 100_000; i++) {
             len = appendWithNull8Latin1();
         }
         System.out.println(len);
    }

    public static int appendWithNull8Latin1() {
        StringBuilder buf = new StringBuilder("Latin1 string");
        buf.setLength(0);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        buf.append((String) null);
        return buf.length();
    }
}

/oracle-work/jdk-fork5/build/linux-x64-debug/jdk/bin/java -ea -XX:+UnlockDiagnosticVMOptions -XX:+MergeStores -XX:CompileCommand=compileonly,Test2::appendWithNull8Latin1 -XX:CompileCommand=TraceMergeStores,Test2::*,BASIC,SUCCESS -Xbatch Test2.java

At times we first merge 2 StoreB -> 1 StoreC. And then in a second step 2 StoreC -> StoreI. See below the snippets.

[TraceMergeStores] MergePrimitiveStores::run:   612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] expect no use: None
[TraceMergeStores] expect def: Found[735 StoreB, RC]
[TraceMergeStores] find def: Found[735 StoreB, RC]
[TraceMergeStores] found RangeCheck, stop traversal.
[TraceMergeStores] found:
    0-->   612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->   735  StoreB  === 1038 156 1040 381  [[ 612 388 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:34 (line 643) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] truncated:
    0-->   612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->   735  StoreB  === 1038 156 1040 381  [[ 612 388 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:34 (line 643) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: Replace
  735  StoreB  === 1038 156 1040 381  [[ 612 388 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:34 (line 643) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
  612  StoreB  === 902 735 935 390  [[ 283 392 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: with
 2204  ConI  === 0  [[ 2205 ]]  #int:30062
 2205  StoreC  === 902 156 1040 2204  [[ ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11;
[TraceMergeStores] MergePrimitiveStores::run:  2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] expect no use: None
[TraceMergeStores] expect def: Found[2205 StoreC, no-RC]
[TraceMergeStores] find def: Found[2205 StoreC, no-RC]
[TraceMergeStores] find def: None
[TraceMergeStores] found:
    0-->  2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->  2205  StoreC  === 902 156 1040 2204  [[ 2268 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[612] !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores] truncated:
    0-->  2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
    1-->  2205  StoreC  === 902 156 1040 2204  [[ 2268 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[612] !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: Replace
 2205  StoreC  === 902 156 1040 2204  [[ 2268 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[612] !jvms: AbstractStringBuilder::appendNull @ bci:42 (line 644) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
 2268  StoreC  === 902 2205 613 2267  [[ 77 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !orig=[159] !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:83 (line 21)
[TraceMergeStores]: with
 2234  ConI  === 0  [[ 2235 2255 2274 2276 ]]  #int:1819047278
 2276  StoreI  === 902 156 1040 2234  [[ ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11;
[TraceMergeStores] MergePrimitiveStores::run:  1314  StoreB  === 1213 884 1585 285  [[ 977 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=11;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=11; !jvms: AbstractStringBuilder::appendNull @ bci:58 (line 646) AbstractStringBuilder::append @ bci:5 (line 588) StringBuilder::append @ bci:2 (line 179) Test2::appendWithNull8Latin1 @ bci:29 (line 15)

@eme64
Copy link
Contributor

eme64 commented Oct 21, 2024

Not sure why the RangeCheck smearing issue only appreas here now. Maybe because the graph is bigger? Maybe because of other things in the graph that change the order of things? IDK yet.

@cl4es
Copy link
Member

cl4es commented Oct 21, 2024

Thanks for checking @eme64 - up to you if and when you want to try and tackle such an improvement, if it can be done in a clean way. And as you say there might be other factors at play here. Perhaps things are confounded by how the code is structured before and after; this PR outlines the array stores to a separate method which might affect loop optimization passes if that method is first compiled separately then inlined.

While I don't think the point fix in this PR is all that urgent this has been lingering for a while and I'm sure @wenshao wants to integrate and move on. We can always integrate then back out the Unsafe stuff once/if the phases have been disentangled. Or we keep experimenting to try and see if we can get it to behave ideally without Unsafe here and now. WDYT? @liach?

@eme64
Copy link
Contributor

eme64 commented Oct 21, 2024

I'm not the library guy, so I'm not going to block this PR.
But having good tests and benchmarks are crucial, and very time-consuming to come up with for us compiler engineers. So these examples here are very valuable, and I think the time is not wasted working more on it.

Copy link
Member

@liach liach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the evaluations. I think we can use this unsafe version for now, and use direct writes once JIT can reliably eliminate bound checks.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 21, 2024
@wenshao
Copy link
Contributor Author

wenshao commented Oct 21, 2024

This PR requires PR #19970 to support Unsafe.putByte MergeStore for better performance. I will add more test scenarios to TestMergeStores.java or MergeStoreBench later.

@eme64
Copy link
Contributor

eme64 commented Nov 5, 2024

FYI #19970 is now integrated - thanks for the patience :)

@wenshao
Copy link
Contributor Author

wenshao commented Nov 5, 2024

It has been tested that mergeStore can work after the master branch is merged

@wenshao
Copy link
Contributor Author

wenshao commented Nov 5, 2024

/integrate

@openjdk
Copy link

openjdk bot commented Nov 5, 2024

Going to push as commit 5890d94.
Since your change was applied there have been 227 commits pushed to the master branch:

  • c33a8f5: 8343314: Move common properties from jpackage jtreg test declarations to TEST.properties file
  • 16feeb7: 8343547: Restore accidentally removed annotations in LambdaForm from ClassFile API port
  • f62fc48: 8342498: Add test for Allocation elimination after use as alignment reference by SuperWord
  • f3671be: 8335392: C2 MergeStores: enhanced pointer parsing
  • 4fc6d41: 8341194: [REDO] Implement C2 VectorizedHashCode on AArch64
  • abf2dc7: 8343298: Improve stability of runtime/cds/DeterministicDump.java test
  • dafa2e5: 8343124: Tests fails with java.lang.IllegalAccessException: class com.sun.javatest.regtest.agent.MainWrapper$MainTask cannot access
  • 0f7dd98: 8251926: PPC: Remove an unused variable in assembler_ppc.cpp
  • cd91a44: 8343549: SeededSecureRandomTest needn't be in a package
  • 20f3aaf: 8343471: RISC-V: compiler/cpuflags/TestAESIntrinsicsOnUnsupportedConfig.java fails after JDK-8334999
  • ... and 217 more: https://git.openjdk.org/jdk/compare/8591109419efc8f71544a98bdb04a48cb1afc47e...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 5, 2024
@openjdk openjdk bot closed this Nov 5, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 5, 2024
@openjdk
Copy link

openjdk bot commented Nov 5, 2024

@wenshao Pushed as commit 5890d94.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@altrisi
Copy link
Contributor

altrisi commented Feb 1, 2025

I have concerns with the changes here, because of what it can cause when an (incorrect) program is sharing a StringBuilder between multiple threads.

Even without any reordering, a thread could start an append(boolean|null), pass the ensureCapacityInternal call, then another thread up the count by enough to no longer have enough capacity. Then the first thread could proceed to read the new (no longer enough!) count, and write without any bounds checks outside of the array. Easy to reproduce with a debugger and some breakpoints.

The previous code would have thrown an exception at an explicit or implicit bounds check, but with the changes here that'd no longer happen.

For example
(initial state)
value.length = 16
count = 11
Thread 1 Thread 2
append(true)
-ensureCap(11+4)
--nothing to do
append("str")
- ensureCap(11+3)
-- nothing to do
- ...
- this.count <- 14
- cnt <- this.count (14!)
- val <- this.value
- val[cnt] <-u 't'
- val[cnt+1] <-u 'r'
- val[cnt+2 (16!)] <-u 'u'
- ...

@wenshao
Copy link
Contributor Author

wenshao commented Feb 1, 2025

AbstractStringBuilder's value/coder/count are not volatile. If you use StringBuilder, you may not get the correct result. If you want thread safety, you should use StringBuffer.

@altrisi
Copy link
Contributor

altrisi commented Feb 1, 2025

Sure, I already mentioned such program would be incorrect, but one thing is an incorrect result in the StringBuilder and another one is writing outside the bounds of the array into other parts (possibly objects) of the heap, or even outside, causing a VM crash.

@jaikiran
Copy link
Member

jaikiran commented Feb 3, 2025

The previous code would have thrown an exception at an explicit or implicit bounds check, but with the changes here that'd no longer happen.

This indeed a genuine issue and needs to be addressed. Thank you @altrisi for noticing and raising this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

7 participants