-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8336245: AArch64: remove extra register copy when converting from long to pointer #20157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…g to pointer In the cases like: ``` UNSAFE.putLong(address + off1 + 1030, lseed); UNSAFE.putLong(address + 1023, lseed); UNSAFE.putLong(address + off2 + 1001, lseed); ``` Unsafe intrinsifies direct memory access using a long as the base address, generating a `CastX2P` node converting long to pointer in C2. Then we get optoassembly code like: ``` ldr R10, [R15, openjdk#120] # int ! Field: address ldr R11, [R16, openjdk#136] # int ! Field: off1 ldr R12, [R16, openjdk#144] # int ! Field: off2 add R11, R11, R10 mov R11, R11 # long -> ptr add R12, R12, R10 mov R10, R10 # long -> ptr add R11, R11, openjdk#1030 # ptr str R17, [R11] # int add R10, R10, openjdk#1023 # ptr str R17, [R10] # int mov R10, R12 # long -> ptr add R10, R10, openjdk#1001 # ptr str R17, [R10] # int ``` In aarch64, the conversion from long to pointer could be a nop but C2 doesn't know it. On the existing code, we do nothing for `mov dst src` only when `dst` == `src` [1], then we have assembly: ``` ldr x10, [x15,openjdk#120] ldp x11, x12, [x16,openjdk#136] add x11, x11, x10 add x12, x12, x10 add x11, x11, #0x406 str x17, [x11] add x10, x10, #0x3ff str x17, [x10] mov x10, x12 <--- extra register copy add x10, x10, #0x3e9 str x17, [x10] ``` There is still one extra register copy, which we're trying to remove in this patch. This patch folds `CastX2P` into memory operands by introducing `indirectX2P` and `indOffX2P`. We also create a new opclass `iRegPorL2P` to remove extra copies from `CastX2P` in pointer addition. Tier 1~3 passed on aarch64. No obvious change in size of libjvm.so [1] https://github.com/openjdk/jdk/blob/5c612c230b0a852aed5fd36e58b82ebf2e1838af/src/hotspot/cpu/aarch64/aarch64.ad#L7906
|
👋 Welcome back fgao! A progress list of the required criteria for merging this PR into |
|
@fg1417 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 173 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
/contributor add @theRealAph |
|
@fg1417 |
| if (t->isa_intptr_t() && offset != 0 && offset != Type::OffsetBot) { | ||
|
|
||
| if (t->isa_intptr_t() && | ||
| #if !defined(AARCH64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After applying the operand "IndirectX2P", we may have some patterns like:
str val, [CastX2P base]
The code path here will resolve the base, which is actually a intptr, not a ptr, and the offset is 0.
I guess the code here was intended to support [base, offset], where base can be a intptr but offset can not be 0. I'm not sure why there is such a limitation that offset can not be 0, maybe for some old machines?
I don't think the limitation is applied to aarch64 machines now. So I unblock it for aarch64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the other way around. Isn't this code saying that if the address is an intptr + a nonzero offset, then the returned type is bottom, ie nothing? What effect does this change have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review! Yeah, this code says if the address is an intptr + a nonzero offset, then return TypeRawPtr::BOTTOM. Then it continues the verification.
Without the change here, an intptr + a zero offset would fail to assert on next lines,
jdk/src/hotspot/share/opto/machnode.cpp
Lines 409 to 413 in a96de6d
| // be conservative if we do not recognize the type | |
| if (tp == nullptr) { | |
| assert(false, "this path may produce not optimal code"); | |
| return TypePtr::BOTTOM; | |
| } |
AFAIK, the return value TypeRawPtr::BOTTOM represents raw access to memory here. And an intptr + a zero offset is also a valid raw access, so I unblock it here. WDYT? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learn something every day, I guess. It's been a long while since I looked, but I expected "pointer to anything" to be TOP, not BOTTOM. Thinking some more, a TypeRawPtr must be casted to a usable physical type in order to use it, so "pointer to nothing" makes more sense than "pointer to anything".
This patch forces `CastX2P` to be a two-address instruction, so that C2 could allocate the same register for dst and src. Then we can remove the instruction completely in the assembly. The motivation comes from some cast operations like `castPP`. The difference for ADLC between `castPP` and `CastX2P` lies in that `CastX2P` always has different types for dst and src. We can force ADLC to generate an extra `two_adr()` for `CastX2P` like it does automatically for `castPP`, which could tell register allocator that the instruction needs the same register for dst and src. However, sometimes, RA and GCM in C2 can't work as we expected. For example, we have Assembly on the existing code: ``` ldp x10, x11, [x17,openjdk#136] add x10, x10, x15 add x11, x11, x10 ldr x12, [x17,openjdk#152] str x16, [x10] add x10, x12, x15 str x16, [x11] str x16, [x10] ``` After applying the patch, the assembly is: ``` ldr x10, [x16,openjdk#136] <--- 1 add x10, x10, x15 ldr x11, [x16,openjdk#144] <--- 2 mov x13, x10 <--- 3 str x17, [x13] ldr x12, [x16,openjdk#152] add x10, x11, x10 str x17, [x10] add x10, x12, x15 str x17, [x10] ``` C2 generate a totally extra mov, see 3, and we even lost the chance to merge load pair, see 1 and 2. That's terrible. Although this scenario would disappear after combining with openjdk#20157, I'm still not sure if this patch is worthwhile.
|
#20159 is also to fix the same issue. Please feel free to review the draft PR. Thanks. |
|
This will need quite a lot of testing, perhaps higher tiers and jcstress. You can test these two PRs together. |
Thanks for approval, @theRealAph. I'll test jcstress on my local. Could you help review and test these two PRs with higher tiers please? @TobiHartmann @vnkozlov Thanks. |
|
Sure, I'll run this through our testing and report back once it passed. |
|
All tests passed. |
@TobiHartmann thanks for your testing! Jcstress also passed. |
|
Can I have a second review please :-) |
| operand immLOffset() | ||
| %{ | ||
| predicate(Address::offset_ok_for_immed(n->get_long(), 0)); | ||
| predicate(n->get_long() >= -256 && n->get_long() <= 65520); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this using hard wired constants rather than using Address::offset_ok_for_immed?
Also, why is the constant value 65520?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Address::offset_ok_for_immed is too restrictive: we want a predicate that is the superset of all possible address offsets.
jshell> ((1<<12)-1) <<4
$3 ==> 65520
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I realise that this is 16 less than 65536. However, there are two things I don't follow.
In the original code immLoffset was only used to define indOffLN i.e. a long offset used with a narrow pointer. The use of Address::offset_ok_for_immed(n->get_long(), 0) in the predicate limited narrow pointer offsets to -256 <= offset <= (2^12 - 1). With this change the top end of the range is now (2^12 - 1) << 4. I am wondering why that is appropriate?
The change allows immLOffset to be used in the definition of indOffX2P. I am not clear why indOffX2P is not just defined using the existing operand immLoffset16 which has as its predicate Address::offset_ok_for_immed(n->get_long(), 4). The only difference I can see is that the alternative predicate used here will accept a positive offset that is not 16 byte aligned. Is that the intention of the redefinition? Again, why is that appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change allows immLOffset to be used in the definition of indOffX2P. I am not clear why indOffX2P is not just defined using the existing operand immLoffset16 which has as its predicate Address::offset_ok_for_immed(n->get_long(), 4).
After this change, immLOffset is a more general-purpose type than immLoffset16. immLOffset matches all possible address offsets, along with some impossible ones. For example, it matches all of the misaligned Unsafe accesses at any offset, regardless of operand size. In the (rare) event that an operand size and offset don't fit a single instruction, we'll split the instruction when we emit it.
After this patch there will be a few rare cases where we have a regression in code size, but it's worth it for the simplicity and the size of the matcher logic, which would otherwise explode. I don't expect any significant regression in execution time.
This patch is not the last word on the matter; later patches may well further reduce the number of integer offset types in a similar way. I don't think that many of the offsetL/I/X/P types do anything useful, and we'd probably profit from removing them, but that's another patch for anther day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see. The use of immLoffset as defined in this patch is actually correct for narrow oops and, indeed, for all other address base types. It allows for all possible offsets that might fit into a load an immediate slot. Whether we can legitimately encode the operand offset as an immediate or need instead to use an auxiliary add does not actually depend on the type of the address base but on the size of the datum fetched by the indirect load that consumes the operand. So, an indirect operand with offset 4098 would be too big to encode in an ldrb, fine to encode in an ldrh and invalid for encoding in an ldrw or ldrx because it is not suitably aligned.
That does imply we should get rid of the other (now redundant) immLoffset operands. However, we can do that in a follow-up patch because it is not what this fix is addressing
|
@theRealAph @adinn Thanks for your reviews! I'll integrate it. /integrate |
|
Going to push as commit d10afa2.
Your commit was automatically rebased without conflicts. |
In the cases like:
Unsafe intrinsifies direct memory access using a long as the base address, generating a
CastX2Pnode converting long to pointer in C2. Then we get optoassembly code like:In aarch64, the conversion from long to pointer could be a nop but C2 doesn't know it. On the existing code, we do nothing for
mov dst srconly whendst==src[1], then we have assembly:There is still one extra register copy, which we're trying to remove in this patch.
This patch folds
CastX2Pinto memory operands by introducingindirectX2PandindOffX2P. We also create a new opclassiRegPorL2Pto remove extra copies fromCastX2Pin pointer addition.Tier 1~3 passed on aarch64. No obvious change in size of libjvm.so
[1]
jdk/src/hotspot/cpu/aarch64/aarch64.ad
Line 7906 in 5c612c2
Progress
Issue
Reviewers
Contributors
<aph@openjdk.org>Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20157/head:pull/20157$ git checkout pull/20157Update a local copy of the PR:
$ git checkout pull/20157$ git pull https://git.openjdk.org/jdk.git pull/20157/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20157View PR using the GUI difftool:
$ git pr show -t 20157Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20157.diff
Webrev
Link to Webrev Comment