8252847: Optimize primitive arrayCopy stubs using AVX-512 masked instructions #61
|
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
|
@jatin-bhateja The following label will be automatically applied to this pull request: When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the |
Webrevs
|
|
/label add hotspot-compiler-dev |
|
@jatin-bhateja |
| BasicType type, int offset, bool use64byteVector) { | ||
| assert(MaxVectorSize >= 32, "vector length < 32"); | ||
| use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0; | ||
| if (use64byteVector == false) { |
Change to "!use64byteVector"
| void MacroAssembler::copy64_avx(Register dst, Register src, XMMRegister xmm, int offset, bool use64byteVector) { | ||
| assert(MaxVectorSize == 64 || MaxVectorSize == 32, "vector length mismatch"); | ||
| use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0; | ||
| if (use64byteVector == false) { |
Change to "!use64byteVector"
| void MacroAssembler::copy64_conjoint_avx(Register dst, Register src, XMMRegister xmm, int offset, bool use64byteVector) { | ||
| assert(MaxVectorSize == 64 || MaxVectorSize == 32, "vector length mismatch"); | ||
| use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0; | ||
| if (use64byteVector == false) { |
Change to "!use64byteVector"
|
|
||
| if (!FLAG_IS_DEFAULT(AVX3Threshold)) { | ||
| if (!is_power_of_2(AVX3Threshold)) { | ||
| if (AVX3Threshold !=0 && !is_power_of_2(AVX3Threshold)) { |
Missing space before '0'
| void MacroAssembler::copy64_masked_avx(Register dst, Register src, XMMRegister xmm, | ||
| KRegister mask, Register length, Register temp, | ||
| BasicType type, int offset, bool use64byteVector) { | ||
| assert(MaxVectorSize >= 32, "vector length < 32"); |
Why does "MaxVectorSize >= 32" imply that "vector length < 32"?
This assert appears in multiple locations.
| KRegister mask, Register length, Register temp, | ||
| BasicType type, int offset, bool use64byteVector) { | ||
| assert(MaxVectorSize >= 32, "vector length < 32"); | ||
| use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0; |
When do you expect AVX3Threshold to be 0?
As of now, only when the user explicitly passes -XX:AVX3Threshold=0; the default value of AVX3Threshold is 4096.
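To make the quoted line `use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0;` easier to reason about, here is a minimal standalone sketch of that decision. The function name and its parameters are hypothetical stand-ins for the VM flags; it models only the quoted expression, not the rest of the stub.

```cpp
#include <cassert>

// Hypothetical standalone model. max_vector_size and avx3_threshold stand in
// for the MaxVectorSize and AVX3Threshold VM flags; requested stands in for
// the use64byteVector argument passed by the caller.
bool effective_use_64byte_vector(bool requested, int max_vector_size, int avx3_threshold) {
    assert(max_vector_size >= 32 && "vector length must be at least 32 bytes");
    // Mirrors: use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0;
    return requested || (max_vector_size > 32 && avx3_threshold == 0);
}
```

Under this model, an explicit threshold of 0 forces 64-byte vectors whenever they are available, while the default of 4096 leaves the caller's choice untouched.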
I don't like that you put a special meaning on AVX3Threshold=0 and then have to add additional checks for it in the places where you check that it is a power of 2. And you don't check this setting in the new tests.
Actually, checking for 0 and power of 2 should be done by the flag's constraint. See CodeEntryAlignmentConstraintFunc as an example.
There is also this strange relation with MaxVectorSize. Also, we should consider the power level switch for 64-byte AVX3 vectors. Does it make sense to use them if the array length is small (< 4096 default)?
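The validity rule being discussed (0 or a power of 2) can be sketched as a single predicate. This is a standalone illustration with hypothetical helper names; it is not the HotSpot flag constraint API and does not reproduce CodeEntryAlignmentConstraintFunc.

```cpp
#include <cstdint>

// Hypothetical helper: true for 1, 2, 4, 8, ...
bool is_power_of_2(uint64_t v) {
    return v != 0 && (v & (v - 1)) == 0;
}

// Hypothetical predicate for an AVX3Threshold-like value:
// accepts 0 (treated as "no threshold") or any power of two.
bool avx3_threshold_is_valid(uint64_t value) {
    return value == 0 || is_power_of_2(value);
}
```

Centralizing the rule in one predicate is the point of the suggestion: the individual call sites would no longer need their own `!= 0` checks.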
I like that you have extracted the avx512 stub code from the rest - that makes it a lot more readable! Overall the new code feels easy to understand and read.
I found one more minor issue (appears in four places).
My only concern is that it's getting hard to follow under what circumstances avx3 instructions are used:
Could it be the case that different thresholds are needed for when you are using avx3 instructions with 32 or 64 byte vectors? Are we sure all variants are tested?
Also - have you thought about supporting oop-copies? You only have to call the BarrierSetAssembler::arraycopy_prologue/epilogue like in the old versions. It's not a requirement for me to approve this - but an encouragement for a future patch.
| address generate_conjoint_int_oop_copy(bool aligned, bool is_oop, address nooverlap_target, | ||
| address *entry, const char *name, | ||
| bool dest_uninitialized = false) { | ||
| if (VM_Version::supports_avx512vlbw() && false == is_oop && MaxVectorSize >= 32) { |
"false == is_oop" => !is_oop
| // | ||
| address generate_disjoint_long_oop_copy(bool aligned, bool is_oop, address *entry, | ||
| const char *name, bool dest_uninitialized = false) { | ||
| if (VM_Version::supports_avx512vlbw() && false == is_oop && MaxVectorSize >= 32) { |
false == is_oop => !is_oop
| address generate_conjoint_long_oop_copy(bool aligned, bool is_oop, | ||
| address nooverlap_target, address *entry, | ||
| const char *name, bool dest_uninitialized = false) { | ||
| if (VM_Version::supports_avx512vlbw() && false == is_oop && MaxVectorSize >= 32) { |
false == is_oop => !is_oop
The following two runtime flags influence the implementation:
The following general rules were followed during implementation:
Thus, 32-byte vectors do not need any threshold, since they execute at the maximum frequency level. tier1, tier2 and tier3 testing did not show any new issues with the changes.
We may not see a significant performance improvement, considering that the prologue and epilogue barriers do considerable processing over object arrays.
Yes, that's a good pointer. I can explore extending the existing GC barriers for array copy as a separate next step.
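As background for the masked stubs discussed in this thread, the following is a scalar model of a masked copy: a bit mask selects which lanes of a vector move are actually written, so a short tail can be handled with one partial move instead of a scalar loop. All names here are hypothetical, the lane width of 32 bytes is an assumption chosen for illustration, and the code emulates the idea in plain C++ rather than showing the stub's actual instruction sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Builds a lane mask with the low `count` bits set (count in 0..64).
uint64_t low_bits_mask(unsigned count) {
    return count >= 64 ? ~uint64_t{0} : ((uint64_t{1} << count) - 1);
}

// Scalar emulation of one masked move over `lanes` byte lanes:
// lane i is read and written only if bit i of `mask` is set.
void masked_copy_bytes(uint8_t* dst, const uint8_t* src, uint64_t mask, unsigned lanes) {
    assert(lanes <= 64);
    for (unsigned i = 0; i < lanes; ++i) {
        if (mask & (uint64_t{1} << i)) {
            dst[i] = src[i];
        }
    }
}

// Copies `length` bytes in full 32-lane chunks, then finishes the
// remaining 0..31 bytes with a single partially masked move.
void copy_with_masked_tail(uint8_t* dst, const uint8_t* src, size_t length) {
    const unsigned lanes = 32;  // assumed lane count for this sketch
    size_t done = 0;
    for (; length - done >= lanes; done += lanes) {
        masked_copy_bytes(dst + done, src + done, low_bits_mask(lanes), lanes);
    }
    const unsigned tail = static_cast<unsigned>(length - done);
    if (tail != 0) {
        masked_copy_bytes(dst + done, src + done, low_bits_mask(tail), lanes);
    }
}
```

Because lanes whose mask bit is clear are neither read nor written, the tail move never touches memory past the end of either array.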
Thanks for the clarification and update!
Reviewed.
|
@jatin-bhateja This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 415 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
@jatin-bhateja Can you put summary of performance improvement into JBS? |
…opy for reference types.
Yes, I have added the summary to JBS |
Hi @vnkozlov , @neliasso |
Looks good to me!
The main concern, which is not addressed clearly in these changes, is ZMM usage, which will lower the CPU frequency and cause a performance regression for small arrays.
That is why AVX3Threshold is set to 4096 bytes by default.
Allowing and checking for an AVX3Threshold value of 0 contradicts that. It would be nice to have a clear comment/explanation about that.
I also propose to use Flag constraint() functionality for checking AVX3Threshold value instead of runtime checks everywhere. Separate RFE, please.
| } | ||
|
|
||
| void Assembler::evmovdqu(XMMRegister dst, KRegister mask, Address src, int vector_len, int type) { | ||
| assert(VM_Version::supports_avx512vlbw(), ""); |
I suggest adding an assert to these two new instructions to check the 'type' value, to make sure only expected types are passed.
| assert(VM_Version::supports_avx512vlbw(), ""); | ||
| InstructionMark im(this); | ||
| bool wide = type == T_SHORT || type == T_LONG || type == T_CHAR; | ||
| bool bwinstr = type == T_BYTE || type == T_SHORT || type == T_CHAR; |
'bwinstr' is used only once. You may as well directly set 'prefix' here. (Same in second instruction).
| void Assembler::evmovdqu(XMMRegister dst, KRegister mask, Address src, int vector_len, int type) { | ||
| assert(VM_Version::supports_avx512vlbw(), ""); | ||
| InstructionMark im(this); | ||
| bool wide = type == T_SHORT || type == T_LONG || type == T_CHAR; |
It looks strange, but it is correct (I looked at the existing evmovdqu* instructions). Maybe reorder so that T_LONG is last.
Have you considered replacing the existing evmovdqu* instructions with these two?
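The `wide` and `bwinstr` expressions quoted above amount to picking one of the four element-size forms of the unaligned move (`vmovdqu8`, `vmovdqu16`, `vmovdqu32`, `vmovdqu64`). Here is a rough standalone sketch of that mapping, combined with the type check suggested earlier. `ElemType`, `MoveEncoding` and both functions are hypothetical names for illustration, and treating `Float` as unexpected is an assumption of the sketch, not something stated in the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for HotSpot's BasicType. Float is included only to
// show a value this sketch treats as unexpected.
enum class ElemType { Byte, Short, Char, Int, Long, Float };

struct MoveEncoding {
    bool wide;             // models the `wide` flag from the quoted code
    bool bwinstr;          // models `bwinstr`: byte/word form vs dword/qword form
    const char* mnemonic;  // corresponding AVX-512 move mnemonic
};

// Assumption: only the integral element types are expected here.
bool is_expected_type(ElemType t) {
    return t != ElemType::Float;
}

MoveEncoding select_encoding(ElemType t) {
    assert(is_expected_type(t) && "unexpected element type");
    switch (t) {
        case ElemType::Byte:  return {false, true,  "vmovdqu8"};
        case ElemType::Short:
        case ElemType::Char:  return {true,  true,  "vmovdqu16"};
        case ElemType::Int:   return {false, false, "vmovdqu32"};
        case ElemType::Long:  return {true,  false, "vmovdqu64"};
        default:              return {false, false, "invalid"};
    }
}
```

Written this way, it is easier to see why T_SHORT, T_CHAR and T_LONG share the `wide` flag even though they differ in element size: the flag distinguishes the larger of the two sizes within each instruction family.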
| __ jcc(Assembler::greater, L_copy_8_bytes); // Copy trailing qwords | ||
| } | ||
|
|
||
| #ifndef PRODUCT |
macroAssembler_x86.hpp has become big. Maybe we should start thinking about splitting the arraycopy stubs into a separate file.
But let's do that in another change. It is good that the AVX3 case is separated out in this change; it makes it easy to follow.
agree
| enum platform_dependent_constants { | ||
| code_size1 = 20000 LP64_ONLY(+10000), // simply increase if too small (assembler will crash if too small) | ||
| code_size2 = 35300 LP64_ONLY(+11400) // simply increase if too small (assembler will crash if too small) | ||
| code_size2 = 35300 LP64_ONLY(+21400) // simply increase if too small (assembler will crash if too small) |
This is a big increase in size!
|
|
||
| if (!FLAG_IS_DEFAULT(AVX3Threshold)) { | ||
| if (!is_power_of_2(AVX3Threshold)) { | ||
| if (AVX3Threshold != 0 && !is_power_of_2(AVX3Threshold)) { |
Consider the flag's constraint() instead of these runtime checks. Separate RFE, please.
| super(TYPE); | ||
|
|
||
| assert CodeUtil.isPowerOf2(useAVX3Threshold) : "AVX3Threshold must be power of 2"; | ||
| assert useAVX3Threshold == 0 || CodeUtil.isPowerOf2(useAVX3Threshold) : "AVX3Threshold must be power of 2"; |
You would need to upstream Graal changes.
| * @summary Optimize arrayCopy using AVX-512 masked instructions. | ||
| * | ||
| * @run main/othervm/timeout=600 -XX:-TieredCompilation -Xbatch -XX:+IgnoreUnrecognizedVMOptions | ||
| * -XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:ArrayCopyPartialInlineSize=0 -XX:MaxVectorSize=32 -XX:+UnlockDiagnosticVMOptions |
The ArrayCopyPartialInlineSize flag is not defined in these changes.
It seems they need to be included in 8252848 changes.
Hi @vnkozlov, I have updated the pull request to cover your comments. Kindly review.
New RFE JDK-8253721 has been created for AVX3Threshold flag related changes (PR-394).
Yes, this looks better. Reviewed. Before pushing let me test it. I will let you know results.
|
hs-tier1-3 testing passed on x86 (all OSs). |
|
/integrate |
|
@jatin-bhateja Since your change was applied there have been 420 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit 4b5ac3a. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Summary:
JMH Results:
System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
Baseline : http://cr.openjdk.java.net/~jbhateja/8252847/JMH_results/ArrayCopy_AVX3_Stubs_Baseline.txt
WithOpt : http://cr.openjdk.java.net/~jbhateja/8252847/JMH_results/ArrayCopy_AVX3_Stubs_WithOpts.txt
Download
$ git fetch https://git.openjdk.java.net/jdk pull/61/head:pull/61
$ git checkout pull/61