8253049: Enhance itable_stub for AArch64 and x86_64 #128
|
Hi @kuaiwei, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user kuaiwei" as summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing |
|
@kuaiwei The following label will be automatically applied to this pull request: When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the |
|
/covered |
|
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated! |
|
/label add hotspot-compiler |
|
@kuaiwei Usage:
|
|
/label add hotspot-compiler |
|
@kuaiwei |
Webrevs
|
|
Mailing list message from Vladimir Ivanov on hotspot-dev: Hi Kevin, Very interesting observations. I like the idea to optimize for the case Fusing 2 passes over the itable into one does look attractive, but I'm I'm curious what kind of benchmarks you used and what are the One suggestion about the implementation: src/hotspot/cpu/x86/macroAssembler_x86.cpp: +void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass, I'd like to avoid having 2 independent implementations of itable lookup What MacroAssembler::lookup_interface_method(..., true As a possible path forward, you could introduce the fast path check Then you could refactor MacroAssembler::lookup_interface_method() to Best regards, On 14.09.2020 13:52, kuaiwei wrote: |
|
Mailing list message from Kuai Wei on hotspot-compiler-dev: Hi Vladimir, Thanks for your review. I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java. My tests do not inline interface methods, and most of the CPU time is spent in itable_stub. aarch64: === testStubPoly5 === === testSlowStubPoly3 === === testSlowStubPoly5 === x86: === testStubPoly5 === === testSlowStubPoly3 === === testSlowStubPoly5 === I think lookup_interface_method can be reused as the fast path. It is also used by templateTable::invoke_interface and generate_method_handle_dispatch. Thanks, |
|
Mailing list message from Vladimir Ivanov on hotspot-dev:
Good, thanks for the numbers. I'm curious have you observed any I'm asking because linear scan is already far from optimal when there
Frankly speaking, I'd like to avoid the duplication. Also, absence of guarantees about order of interfaces in the itable And speaking of the overall approach (as it is implemented now), IMO But I'm happy to change my mind if the rewritten implementation makes it (FTR subtype checks suffer from a similar problem: unless Best regards,
|
|
Mailing list message from Andrew Haley on hotspot-dev: On 15/09/2020 10:02, Vladimir Ivanov wrote:
Indeed. When I first came to HotSpot after working on GCJ for years The code improvements look to be fairly minor. -- |
|
Mailing list message from Kuai Wei on hotspot-dev: Thanks for your quick reply.
Good, thanks for the numbers. I'm curious have you observed any I'm asking because linear scan is already far from optimal when there
Kevin: itable_stub was found to be hot in several online applications, so I started to work on this. I don't have a chance to verify it online right now, so I use microbenchmarks to verify it. I will
Frankly speaking, I'd like to avoid the duplication.
Kevin: OK, I will try to merge them.
Also, absence of guarantees about order of interfaces in the itable
Kevin: I use a counter for matching. If it reaches zero, the iteration can exit early.
And speaking of the overall approach (as it is implemented now), IMO
Kevin: I agree we can improve the itable design. My initial thought is that the JVM may reorder the itable at a safepoint. I can take that as a follow-up optimization.
But I'm happy to change my mind if the rewritten implementation makes it (FTR subtype checks suffer from a similar problem: unless
Regards,
|
|
Mailing list message from Vladimir Ivanov on hotspot-dev:
FTR Erik? has been looking into rewriting virtual dispatch logic: http://openjdk.java.net/jeps/8221828 Best regards,
|
|
Mailing list message from Vladimir Ivanov on hotspot-dev:
That's unfortunate. It would be very helpful to confirm the results of
Good. Thanks for the clarification. Alternatively, you could use 2 bits in the temp register to code the Or even explicitly encode the state in the code as an automaton by Also, on naming: I find it hard to reason about the logic. As an example: movptr(method_result, Address(recv_klass, holder_klass,
Well, I would definitely prefer to avoid additional runtime changes (to Best regards,
|
|
@kuaiwei This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
|
@kuaiwei This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! |
r18 should not be used because it is reserved as the platform register. Linux is fine with userspace using it, but Windows and, more recently, macOS ( openjdk/jdk11u-dev#301 (comment) ) actually use it on the kernel side.
The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here: https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404
I haven't seen this particular instance cause any issues in practice _yet_, presumably because it is hard to align the stars in order to trigger a problem (between the stp and ldp of r18, a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macros, and those cause trouble as explained in the link above.
Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
  0x0000000138809b00: ldr x2, [x12, #16]
  0x0000000138809b04: ldrh w2, [x2, #44]
  0x0000000138809b08: add x24, x20, x2, uxtx #3
  0x0000000138809b0c: sub x24, x24, #0x8
[...]
  0x0000000138809fa4: stp x16, x17, [sp, #128]
  0x0000000138809fa8: stp x18, x19, [sp, #144]
  0x0000000138809fac: stp x20, x21, [sp, #160]
[...]
  0x0000000138809fc0: stp x30, xzr, [sp, #240]
  0x0000000138809fc4: mov x0, x28
  ;; 0x10864ACCC
  0x0000000138809fc8: mov x9, #0xaccc // #44236
  0x0000000138809fcc: movk x9, #0x864, lsl #16
  0x0000000138809fd0: movk x9, #0x1, lsl #32
  0x0000000138809fd4: blr x9
  0x0000000138809fd8: ldp x2, x3, [sp, #16]
[...]
  0x0000000138809ff4: ldp x16, x17, [sp, #128]
  0x0000000138809ff8: ldp x18, x19, [sp, #144]
  0x0000000138809ffc: ldp x20, x21, [sp, #160]
```
After:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes
--------------------------------------------------------------------------------
  0x0000000108e4db00: ldr x2, [x12, #16]
  0x0000000108e4db04: ldrh w2, [x2, #44]
  0x0000000108e4db08: add x24, x20, x2, uxtx #3
  0x0000000108e4db0c: sub x24, x24, #0x8
[...]
  0x0000000108e4dfa4: stp x16, x17, [sp, #128]
  0x0000000108e4dfa8: stp x19, x20, [sp, #144]
  0x0000000108e4dfac: stp x21, x22, [sp, #160]
[...]
  0x0000000108e4dfbc: stp x29, x30, [sp, #224]
  0x0000000108e4dfc0: mov x0, x28
  ;; 0x107E4A06C
  0x0000000108e4dfc4: mov x9, #0xa06c // #41068
  0x0000000108e4dfc8: movk x9, #0x7e4, lsl #16
  0x0000000108e4dfcc: movk x9, #0x1, lsl #32
  0x0000000108e4dfd0: blr x9
  0x0000000108e4dfd4: ldp x2, x3, [sp, #16]
[...]
  0x0000000108e4dff0: ldp x16, x17, [sp, #128]
  0x0000000108e4dff4: ldp x19, x20, [sp, #144]
  0x0000000108e4dff8: ldp x21, x22, [sp, #160]
[...]
```
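For illustration only, here is a minimal sketch of the mask arithmetic behind this change. It assumes that bit `i` of the spill mask selects register `r<i>`, which is what the `0x7fffffff` (== `r0-r30`) note suggests. The helper names `without_platform_reg` and `registers_in` are hypothetical and are not part of HotSpot's MacroAssembler.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: bit i of a spill mask stands for general-purpose register r<i>.
constexpr std::uint32_t kAllGprs = 0x7fffffffu;  // r0-r30
constexpr unsigned kPlatformReg = 18;            // r18, reserved on Windows and macOS

// Hypothetical helper: remove the platform register from a spill mask.
constexpr std::uint32_t without_platform_reg(std::uint32_t mask) {
    return mask & ~(std::uint32_t{1} << kPlatformReg);
}

// Hypothetical helper: list the register numbers selected by a mask,
// in ascending order.
std::vector<unsigned> registers_in(std::uint32_t mask) {
    std::vector<unsigned> regs;
    for (unsigned r = 0; r < 32; ++r) {
        if (mask & (std::uint32_t{1} << r)) {
            regs.push_back(r);
        }
    }
    return regs;
}

// Clearing bit 18 of 0x7fffffff leaves 0x7ffbffff.
static_assert(without_platform_reg(kAllGprs) == 0x7ffbffffu,
              "r18 must be absent from the reduced mask");
```

With r18 dropped, the mask selects 30 registers instead of 31, and x17 is followed directly by x19, which matches the `stp x19, x20` pair in the "After" listing.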
Restore looks like this now:
```
0x0000000106e4dfcc: movk x9, #0x5e4, lsl #16
0x0000000106e4dfd0: movk x9, #0x1, lsl #32
0x0000000106e4dfd4: blr x9
0x0000000106e4dfd8: ldp x2, x3, [sp, #16]
0x0000000106e4dfdc: ldp x4, x5, [sp, #32]
0x0000000106e4dfe0: ldp x6, x7, [sp, #48]
0x0000000106e4dfe4: ldp x8, x9, [sp, #64]
0x0000000106e4dfe8: ldp x10, x11, [sp, #80]
0x0000000106e4dfec: ldp x12, x13, [sp, #96]
0x0000000106e4dff0: ldp x14, x15, [sp, #112]
0x0000000106e4dff4: ldp x16, x17, [sp, #128]
0x0000000106e4dff8: ldp x0, x1, [sp], #144
0x0000000106e4dffc: ldp xzr, x19, [sp], #16
0x0000000106e4e000: ldp x22, x23, [sp, #16]
0x0000000106e4e004: ldp x24, x25, [sp, #32]
0x0000000106e4e008: ldp x26, x27, [sp, #48]
0x0000000106e4e00c: ldp x28, x29, [sp, #64]
0x0000000106e4e010: ldp x30, xzr, [sp, #80]
0x0000000106e4e014: ldp x20, x21, [sp], #96
0x0000000106e4e018: ldur x12, [x29, #-24]
0x0000000106e4e01c: ldr x22, [x12, #16]
0x0000000106e4e020: add x22, x22, #0x30
0x0000000106e4e024: ldr x8, [x28, #8]
```
Currently, itable_stub walks the InstanceKlass's itable twice to look up a method entry: the resolved klass is used for type checking, and the method holder klass is used to find the method entry. In many cases we observed that the resolved klass is the same as the holder klass, so the itable stub can be improved on that basis. If they are the same klass, the stub uses a fast loop that checks only one klass. If not, a slow loop checks both klasses.
Even when the slow loop is taken, the new implementation can be better than the old one in some cases, because the new stub only needs to walk the itable once, which reduces memory operations.
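The idea can be sketched in plain C++ as follows. This is a simplified model under stated assumptions, not the stub code from this patch: `Klass`, `Method`, `ItableEntry`, `lookup_two_pass` and `lookup_one_pass` are hypothetical names, the itable is modeled as a vector in which each interface appears at most once, and a `nullptr` result stands in for the failure path. The slow loop uses a countdown of outstanding matches, as mentioned in the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model: one itable entry per implemented interface.
struct Klass  { const char* name; };
struct Method { const char* name; };

struct ItableEntry {
    const Klass* interface_klass;        // interface implemented by the receiver
    std::vector<const Method*> methods;  // methods of that interface, by index
};

using Itable = std::vector<ItableEntry>;

// Baseline: one scan for the type check, a second scan for the method.
const Method* lookup_two_pass(const Itable& itable, const Klass* resolved,
                              const Klass* holder, std::size_t index) {
    assert(resolved != nullptr && holder != nullptr);
    bool implements_resolved = false;
    for (const ItableEntry& e : itable) {
        if (e.interface_klass == resolved) { implements_resolved = true; break; }
    }
    if (!implements_resolved) return nullptr;  // type check failed
    for (const ItableEntry& e : itable) {
        if (e.interface_klass == holder) return e.methods.at(index);
    }
    return nullptr;
}

// Fused: a single scan, with a cheaper loop when both klasses are the same.
const Method* lookup_one_pass(const Itable& itable, const Klass* resolved,
                              const Klass* holder, std::size_t index) {
    assert(resolved != nullptr && holder != nullptr);
    if (resolved == holder) {
        // Fast loop: one comparison per entry covers both purposes.
        for (const ItableEntry& e : itable) {
            if (e.interface_klass == holder) return e.methods.at(index);
        }
        return nullptr;
    }
    // Slow loop: look for both klasses in the same scan and stop as soon as
    // the countdown of outstanding matches reaches zero.
    int remaining = 2;
    const ItableEntry* holder_entry = nullptr;
    for (const ItableEntry& e : itable) {
        if (e.interface_klass == resolved) {
            --remaining;
        } else if (e.interface_klass == holder) {
            holder_entry = &e;
            --remaining;
        }
        if (remaining == 0) return holder_entry->methods.at(index);
    }
    return nullptr;  // at least one of the two klasses was not found
}
```

In this model both functions return the same result for every input; the fused version only differs in how many entries it touches.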
bug: https://bugs.openjdk.java.net/browse/JDK-8253049
Download
$ git fetch https://git.openjdk.java.net/jdk pull/128/head:pull/128
$ git checkout pull/128