8338442: AArch64: Clean up IndOffXX type and let legitimize_address() fix out-of-range operands #16991
Conversation
On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use the unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by:

```
cast<64> (32-bit compressed reference) + field_offset
```

When the `field_offset` is an immediate, on the aarch64 platform the unscaled decoding pattern matches perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by LDR/STR instructions. But for certain data widths, not all immediates can be encoded in the instruction field of LDR/STR [1]. The ranges differ as data widths vary.

For example, when we try to load a value of long type at an offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matched by `operand indOffIN()`. But for 64-bit LDR/STR, a signed immediate byte offset must be in the range -256 to 255, or a positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [2]. `1030` can't be encoded in the instruction field, so after matching, when we check the instruction encoding, the assertion fails.

In this patch, we filter out invalid immediates when deciding whether the current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit and 64-bit data types separately. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches the wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead.

Since 8-bit and 16-bit LDR/STR instructions also support the unscaled decoding pattern, we add the addressing mode to the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`.

We also remove the unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse.

Tier 1-3 passed on aarch64.
[1] https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33 [2] https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en
👋 Welcome back fgao! A progress list of the required criteria for merging this PR into
Webrevs
This is a complex fix for a corner case that can trivially be fixed via legitimize_address. Unless we can find something simpler, we should do so.
One question: is this only about misaligned loads?
It looks to me that it also handles aligned loads with too-large offsets.
After this change,

Yes. E.g., for long type, before the patch it only accepted aligned offsets ranging from 0 to 4095; after the patch, it accepts aligned offsets ranging from 0 to 32760.

Removed them in the new commit. Thanks!
I think this patch is excessive for the problem and introduces a lot of code duplication. Maybe it would be simpler, smaller, and faster to check for what we need:

@theRealAph , your patch only works if when
Yes, it is, but clearly that does not scale, leading to a great profusion of operand kinds.
Sure, that would be better still. Best of all would IMO be for each kind of memory access to check its offset operand by calling
@fg1417 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@fg1417 what is the state on this? The example here may look like a strange edge-case. But with my plans this may become more common: I will merge stores, which can then look like unaligned stores of a larger type.

The problem with this PR is that the code is way too complex for such a simple problem. The port is correct as it is, in the release build. The only problem is an assertion. We could simply remove that assertion, but if it were me I'd fix the problem properly. Both @dean-long and I have suggested ways to improve this patch with less code. If @fg1417 decides to drop this PR I'll fix it.
Sorry, I can't work on this right now. @theRealAph could you help to push the changes please? Thanks.
@theRealAph do you want to fix this? Otherwise I'll just push a PR removing the assert, since it blocks this: #16245

I've just started working on it. If you like, push a PR removing the assert, and then my fix will re-insert it.
I'm withdrawing my objection to this PR. I've had a good look at some alternatives, and none of them are any better. I've reluctantly concluded that, given the design of C2, there's no better way to fix it. My apologies to @fg1417 . You were right.
@theRealAph thanks for your kind review. All the comments inspired me a lot and helped me think more. Also, thanks for your help on fixing it!
@fg1417 are you still working on this?
The best thing to do now is to remove the failing assertion.
On 6/6/24 13:42, Fei Gao wrote:
We know that, on many Arm cores, Store μOPs are split into address and
or a pair of instructions
Of course, a pair of instructions occupies twice as much icache space,
Thanks for your kind explanation @theRealAph . That makes a lot of sense to me. I'll continue processing this pull request to implement it.

Hi @theRealAph @dean-long , the new commit implemented the scheme discussed before. It keeps two general operands. Could you help review it please? Thanks!

That looks good. I'll do a more detailed review tomorrow.
Thanks. This is a good cleanup.
I've tried a bunch of different things, and I haven't seen any measurable performance regressions.
But I do see this code expansion for an unaligned store long:
0x000000010fca2210: mov x2, x22
;; legitimize_address {
0x000000010fca2214: add x8, x2, #0x3ff
;; } legitimize_address
0x000000010fca2218: str x4, [x8]
instead of, currently,
0x000000010e5024f0: add x3, x22, #0x3ff
0x000000010e5024f4: str x2, [x3]
Huh? What on Earth can have caused that added register copy? It looks like C2 has decided that the indOffL operand must clobber its input register x22, but it can't be that.
Looking at OptoAssembly for the same test, I see:
21c + mov R11, R23 # long -> ptr
220 str R12, [R11, #1023] # int
whereas existing code has OptoAssembly
0184 + add R2, R6, #1023 # ptr
0188 str R4, [R2] # int
At a guess, the add operation in the existing code does the long->ptr conversions as well as the addition.
This is very annoying. It's not terribly important, though.
I think this fixes it.
Hi @theRealAph , thanks for your work and kind explanation! I tried to write a small case to reproduce the problem you found, with
The existing code has the following OptoAssembly:
Because we do nothing for
We can still see the extra register copy. So, I guess it is not caused by this PR and it existed before? Maybe, as you said, C2 needs to be told that the conversion from long->ptr is a nop.
After applying the PR, we have:
After applying the PR with your fix, we have:
I'll update the PR with your fix soon. Thanks!
If so then might it be better to relocate Andrew's patch to a separate issue? We may want to backport it independently of this change.

Thanks for your review @adinn . That sounds quite worthwhile. I'm afraid we can't relocate the fix directly, since it depends on part of my cleanup in this PR. I'll try to pull out the common part to make the separate fix simple and clean. Thanks.
And there may be a more powerful way to fix it than my suggestion. I'm thinking of something analogous to

Hi @theRealAph @adinn, I proposed two pull requests, #20157 and #20159, for the issue. #20159 is a draft because I'm not sure if it's worthwhile. I'd appreciate it if you could help review them. Thanks!
Rebased to the latest JDK after merging d10afa2. I don't think the failure of the linux-x64 build is caused by this PR. Can I have a review please? Thanks.
Hi @theRealAph @adinn @dean-long , can I have a review please? Thanks :)
Looks good. That's a nice simplification and cleanup.
@theRealAph thanks for your approval!
Looks good.
This PR is still referring to JDK-8319690 which got integrated. Should this refer to a new RFE?
@TobiHartmann @fg1417 Yes, the bug as such is already "fixed" (I had removed the assert because it was not necessary for correctness). I guess this should be an RFE. Either as a cleanup or performance improvement (if it can be measured).
@dean-long thanks for your approval! Thanks for your suggestions @eme64 @TobiHartmann. Updated with a new RFE :) I'll integrate it.
/integrate
Going to push as commit 3859131.
Your commit was automatically rebased without conflicts.
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16991/head:pull/16991
$ git checkout pull/16991
Update a local copy of the PR:
$ git checkout pull/16991
$ git pull https://git.openjdk.org/jdk.git pull/16991/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16991
View PR using the GUI difftool:
$ git pr show -t 16991
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16991.diff
Webrev
Link to Webrev Comment