Skip to content

Conversation

@fg1417
Copy link

@fg1417 fg1417 commented Jul 12, 2024

In the cases like:

  UNSAFE.putLong(address + off1 + 1030, lseed);
  UNSAFE.putLong(address + 1023, lseed);
  UNSAFE.putLong(address + off2 + 1001, lseed);

Unsafe intrinsifies direct memory access using a long as the base address, generating a CastX2P node converting long to pointer in C2. Then we get optoassembly code like:

  ldr  R10, [R15, #120]    # int ! Field: address
  ldr  R11, [R16, #136]    # int ! Field: off1
  ldr  R12, [R16, #144]    # int ! Field: off2
  add  R11, R11, R10
  mov R11, R11    # long -> ptr
  add  R12, R12, R10
  mov R10, R10    # long -> ptr
  add R11, R11, #1030    # ptr
  str  R17, [R11]    # int
  add R10, R10, #1023    # ptr
  str  R17, [R10]    # int
  mov R10, R12    # long -> ptr
  add R10, R10, #1001    # ptr
  str  R17, [R10]    # int

In aarch64, the conversion from long to pointer could be a nop but C2 doesn't know it. On the existing code, we do nothing for mov dst src only when dst == src [1], then we have assembly:

  ldr    x10, [x15,#120]
  ldp    x11, x12, [x16,#136]
  add    x11, x11, x10
  add    x12, x12, x10
  add    x11, x11, #0x406
  str    x17, [x11]
  add    x10, x10, #0x3ff
  str    x17, [x10]
  mov    x10, x12  <--- extra register copy
  add    x10, x10, #0x3e9
  str    x17, [x10]

There is still one extra register copy, which we're trying to remove in this patch.

This patch folds CastX2P into memory operands by introducing indirectX2P and indOffX2P. We also create a new opclass iRegPorL2P to remove extra copies from CastX2P in pointer addition.

Tier 1~3 passed on aarch64. No obvious change in size of libjvm.so

[1]

instruct castX2P(iRegPNoSp dst, iRegL src) %{


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8336245: AArch64: remove extra register copy when converting from long to pointer (Enhancement - P4)

Reviewers

Contributors

  • Andrew Haley <aph@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20157/head:pull/20157
$ git checkout pull/20157

Update a local copy of the PR:
$ git checkout pull/20157
$ git pull https://git.openjdk.org/jdk.git pull/20157/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20157

View PR using the GUI difftool:
$ git pr show -t 20157

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20157.diff

Webrev

Link to Webrev Comment

…g to pointer

In the cases like:
```
  UNSAFE.putLong(address + off1 + 1030, lseed);
  UNSAFE.putLong(address + 1023, lseed);
  UNSAFE.putLong(address + off2 + 1001, lseed);
```

Unsafe intrinsifies direct memory access using a long as
the base address, generating a `CastX2P` node converting
long to pointer in C2. Then we get optoassembly code like:
```
  ldr  R10, [R15, openjdk#120]    # int ! Field: address
  ldr  R11, [R16, openjdk#136]    # int ! Field: off1
  ldr  R12, [R16, openjdk#144]    # int ! Field: off2
  add  R11, R11, R10
  mov R11, R11    # long -> ptr
  add  R12, R12, R10
  mov R10, R10    # long -> ptr
  add R11, R11, openjdk#1030    # ptr
  str  R17, [R11]    # int
  add R10, R10, openjdk#1023    # ptr
  str  R17, [R10]    # int
  mov R10, R12    # long -> ptr
  add R10, R10, openjdk#1001    # ptr
  str  R17, [R10]    # int
```

In aarch64, the conversion from long to pointer could be
a nop but C2 doesn't know it. On the existing code, we
do nothing for `mov dst src` only when `dst` == `src` [1],
then we have assembly:
```
  ldr    x10, [x15,openjdk#120]
  ldp    x11, x12, [x16,openjdk#136]
  add    x11, x11, x10
  add    x12, x12, x10
  add    x11, x11, #0x406
  str    x17, [x11]
  add    x10, x10, #0x3ff
  str    x17, [x10]
  mov    x10, x12  <--- extra register copy
  add    x10, x10, #0x3e9
  str    x17, [x10]
```

There is still one extra register copy, which we're trying
to remove in this patch.

This patch folds `CastX2P` into memory operands by introducing
`indirectX2P` and `indOffX2P`. We also create a new opclass
`iRegPorL2P` to remove extra copies from `CastX2P` in pointer
addition.

Tier 1~3 passed on aarch64. No obvious change in size
of libjvm.so

[1] https://github.com/openjdk/jdk/blob/5c612c230b0a852aed5fd36e58b82ebf2e1838af/src/hotspot/cpu/aarch64/aarch64.ad#L7906
@bridgekeeper
Copy link

bridgekeeper bot commented Jul 12, 2024

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 12, 2024

@fg1417 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8336245: AArch64: remove extra register copy when converting from long to pointer

Co-authored-by: Andrew Haley <aph@openjdk.org>
Reviewed-by: aph, adinn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 173 new commits pushed to the master branch:

  • 34ee06f: 8332850: javac crashes if container for repeatable annotation is not found
  • 00b5348: 8337192: [BACKOUT] JDK-8336098 G1: Refactor G1RebuildRSAndScrubTask
  • 3baff2a: 8335393: C2: assert(!had_error) failed: bad dominance
  • 8081f87: 8334342: Add MergeStore JMH benchmarks
  • 9d87918: 8335191: RISC-V: verify perf of chacha20
  • 6e228ce: 8336254: Virtual thread implementation + test updates
  • d3e51da: 8334085: Test crash: assert(thread->held_monitor_count() == 0) failed: Must be
  • 0898ab7: 8336741: Optimize LocalTime.toString with StringBuilder.repeat
  • 24f67d0: 8334232: Optimize C1 classes layout
  • 021c2c3: 8337067: Test runtime/classFileParserBug/Bad_NCDFE_Msg.java won't compile
  • ... and 163 more: https://git.openjdk.org/jdk/compare/fb9a227e02ebf826edb762283e15dd7e402f8433...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 12, 2024
@openjdk
Copy link

openjdk bot commented Jul 12, 2024

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Jul 12, 2024
@fg1417
Copy link
Author

fg1417 commented Jul 12, 2024

/contributor add @theRealAph

@openjdk
Copy link

openjdk bot commented Jul 12, 2024

@fg1417
Contributor Andrew Haley <aph@openjdk.org> successfully added.

if (t->isa_intptr_t() && offset != 0 && offset != Type::OffsetBot) {

if (t->isa_intptr_t() &&
#if !defined(AARCH64)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After applying the operand "IndirectX2P", we may have some patterns like:

str val, [CastX2P base]

The code path here will resolve the base, which is actually a intptr, not a ptr, and the offset is 0.

I guess the code here was intended to support [base, offset], where base can be a intptr but offset can not be 0. I'm not sure why there is such a limitation that offset can not be 0, maybe for some old machines?

I don't think the limitation is applied to aarch64 machines now. So I unblock it for aarch64.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the other way around. Isn't this code saying that if the address is an intptr + a nonzero offset, then the returned type is bottom, ie nothing? What effect does this change have?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review! Yeah, this code says if the address is an intptr + a nonzero offset, then return TypeRawPtr::BOTTOM. Then it continues the verification.

Without the change here, an intptr + a zero offset would fail to assert on next lines,

// be conservative if we do not recognize the type
if (tp == nullptr) {
assert(false, "this path may produce not optimal code");
return TypePtr::BOTTOM;
}

AFAIK, the return value TypeRawPtr::BOTTOM represents raw access to memory here. And an intptr + a zero offset is also a valid raw access, so I unblock it here. WDYT? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I learn something every day, I guess. It's been a long while since I looked, but I expected "pointer to anything" to be TOP, not BOTTOM. Thinking some more, a TypeRawPtr must be casted to a usable physical type in order to use it, so "pointer to nothing" makes more sense than "pointer to anything".

@mlbridge
Copy link

mlbridge bot commented Jul 12, 2024

Webrevs

fg1417 added a commit to fg1417/jdk that referenced this pull request Jul 12, 2024
This patch forces `CastX2P` to be a two-address instruction,
so that C2 could allocate the same register for dst and
src. Then we can remove the instruction completely in the
assembly.

The motivation comes from some cast operations like `castPP`.
The difference for ADLC between `castPP` and `CastX2P` lies in
that `CastX2P` always has different types for dst and src.
We can force ADLC to generate an extra `two_adr()` for `CastX2P`
like it does automatically for `castPP`, which could tell register
allocator that the instruction needs the same register for dst
and src.

However, sometimes, RA and GCM in C2 can't work as we expected.

For example, we have Assembly on the existing code:
```
  ldp    x10, x11, [x17,openjdk#136]
  add    x10, x10, x15
  add    x11, x11, x10
  ldr    x12, [x17,openjdk#152]
  str    x16, [x10]
  add    x10, x12, x15
  str    x16, [x11]
  str    x16, [x10]
```

After applying the patch, the assembly is:
```
  ldr    x10, [x16,openjdk#136]  <--- 1
  add    x10, x10, x15
  ldr    x11, [x16,openjdk#144]  <--- 2
  mov    x13, x10         <--- 3
  str    x17, [x13]
  ldr    x12, [x16,openjdk#152]
  add    x10, x11, x10
  str    x17, [x10]
  add    x10, x12, x15
  str    x17, [x10]
```

C2 generate a totally extra mov, see 3, and we even lost the chance
to merge load pair, see 1 and 2. That's terrible.

Although this scenario would disappear after combining with
openjdk#20157, I'm
still not sure if this patch is worthwhile.
@fg1417
Copy link
Author

fg1417 commented Jul 12, 2024

#20159 is also to fix the same issue. Please feel free to review the draft PR. Thanks.

@theRealAph
Copy link
Contributor

theRealAph commented Jul 15, 2024

This will need quite a lot of testing, perhaps higher tiers and jcstress. You can test these two PRs together.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 15, 2024
@fg1417
Copy link
Author

fg1417 commented Jul 17, 2024

#20159 is also to fix the same issue. Please feel free to review the draft PR. Thanks.

This will need quite a lot of testing, perhaps higher tiers and jcstress. You can test these two PRs together.

Thanks for approval, @theRealAph. I'll test jcstress on my local.

Could you help review and test these two PRs with higher tiers please? @TobiHartmann @vnkozlov Thanks.

@TobiHartmann
Copy link
Member

Sure, I'll run this through our testing and report back once it passed.

@TobiHartmann
Copy link
Member

All tests passed.

@fg1417
Copy link
Author

fg1417 commented Jul 22, 2024

All tests passed.

@TobiHartmann thanks for your testing!

Jcstress also passed.

@fg1417
Copy link
Author

fg1417 commented Jul 25, 2024

Can I have a second review please :-)

operand immLOffset()
%{
predicate(Address::offset_ok_for_immed(n->get_long(), 0));
predicate(n->get_long() >= -256 && n->get_long() <= 65520);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this using hard wired constants rather than using Address::offset_ok_for_immed?

Also, why is the constant value 65520?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Address::offset_ok_for_immed is too restrictive: we want a predicate that is the superset of all possible address offsets.

jshell> ((1<<12)-1) <<4
$3 ==> 65520

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I realise that this is 16 less than 65536. However, there are two things I don't follow.

In the original code immLoffset was only used to define indOffLN i.e. a long offset used with a narrow pointer. The use of Address::offset_ok_for_immed(n->get_long(), 0) in the predicate limited narrow pointer offsets to -256 <= offset <= (2^12 - 1). With this change the top end of the range is now (2^12 - 1) << 4. I am wondering why that is appropriate?

The change allows immLOffset to be used in the definition of indOffX2P. I am not clear why indOffX2P is not just defined using the existing operand immLoffset16 which has as its predicate Address::offset_ok_for_immed(n->get_long(), 4). The only difference I can see is that the alternative predicate used here will accept a positive offset that is not 16 byte aligned. Is that the intention of the redefinition? Again, why is that appropriate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change allows immLOffset to be used in the definition of indOffX2P. I am not clear why indOffX2P is not just defined using the existing operand immLoffset16 which has as its predicate Address::offset_ok_for_immed(n->get_long(), 4).

After this change, immLOffset is a more general-purpose type than immLoffset16. immLOffset matches all possible address offsets, along with some impossible ones. For example, it matches all of the misaligned Unsafe accesses at any offset, regardless of operand size. In the (rare) event that an operand size and offset don't fit a single instruction, we'll split the instruction when we emit it.

After this patch there will be a few rare cases where we have a regression in code size, but it's worth it for the simplicity and the size of the matcher logic, which would otherwise explode. I don't expect any significant regression in execution time.

This patch is not the last word on the matter; later patches may well further reduce the number of integer offset types in a similar way. I don't think that many of the offsetL/I/X/P types do anything useful, and we'd probably profit from removing them, but that's another patch for anther day.

Copy link
Contributor

@adinn adinn Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see. The use of immLoffset as defined in this patch is actually correct for narrow oops and, indeed, for all other address base types. It allows for all possible offsets that might fit into a load an immediate slot. Whether we can legitimately encode the operand offset as an immediate or need instead to use an auxiliary add does not actually depend on the type of the address base but on the size of the datum fetched by the indirect load that consumes the operand. So, an indirect operand with offset 4098 would be too big to encode in an ldrb, fine to encode in an ldrh and invalid for encoding in an ldrw or ldrx because it is not suitably aligned.

That does imply we should get rid of the other (now redundant) immLoffset operands. However, we can do that in a follow-up patch because it is not what this fix is addressing

@fg1417
Copy link
Author

fg1417 commented Jul 26, 2024

@theRealAph @adinn Thanks for your reviews!

I'll integrate it.

/integrate

@openjdk
Copy link

openjdk bot commented Jul 26, 2024

Going to push as commit d10afa2.
Since your change was applied there have been 184 commits pushed to the master branch:

  • 7f11935: 8337167: StringSize deduplication
  • 487450c: 8216471: GTK LnF: Frame is clipped and does not show JTable,Tooltip and JTree demo in SwingSet2 demo
  • 29f0f17: 8336879: Always true condition 'img != null' in GTKPainter.paintPopupMenuBackground
  • ee839b7: 8337239: Fix simple -Wzero-as-null-pointer-constant warnings in classfile code
  • 0584af2: 8336685: Shenandoah: Remove experimental incremental update mode
  • 8c28f25: 8337124: (fs) sun.nio.fs.WindowsSecurity.enablePrivilege should pin when continuations supported
  • b5b5a5b: 8336667: IAE in DerInputStream.toByteArray
  • cf0d9e0: 8337037: compiler internal options are not printing the stacktrace after a compiler crash
  • 81ed028: 8337060: Test java/foreign/TestConcurrentClose.java failed: IllegalStateException: SegmentAccessor::doAccess method not being compiled
  • e74edba: 8336640: Shenandoah: Parallel worker use in parallel_heap_region_iterate
  • ... and 174 more: https://git.openjdk.org/jdk/compare/fb9a227e02ebf826edb762283e15dd7e402f8433...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 26, 2024
@openjdk openjdk bot closed this Jul 26, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 26, 2024
@openjdk
Copy link

openjdk bot commented Jul 26, 2024

@fg1417 Pushed as commit d10afa2.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

4 participants