Skip to content

Conversation

@kuaiwei
Copy link
Contributor

@kuaiwei kuaiwei commented Sep 11, 2020

Now itable_stub will go through instanceKlass's itable twice to look up a method entry. resolved klass is used for type checking and method holder klass is used to find method entry. In many cases , we observed resolved klass is as same as holder klass. So we can improve itable stub based on it. If they are same klass, stub uses a fast loop to check only one klass. If not, a slow loop is used to checking both klasses.

Even entering in slow loop, new implementation can be better than old one in some cases. Because new stub just need go through itable once and reduce memory operations.

bug: https://bugs.openjdk.java.net/browse/JDK-8253049


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8253049: Enhance itable_stub for AArch64 and x86_64

Download

$ git fetch https://git.openjdk.java.net/jdk pull/128/head:pull/128
$ git checkout pull/128

@bridgekeeper bridgekeeper bot added the oca Needs verification of OCA signatory status label Sep 11, 2020
@bridgekeeper
Copy link

bridgekeeper bot commented Sep 11, 2020

Hi @kuaiwei, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user kuaiwei" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk
Copy link

openjdk bot commented Sep 11, 2020

@kuaiwei The following label will be automatically applied to this pull request: hotspot.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 11, 2020
@kuaiwei
Copy link
Contributor Author

kuaiwei commented Sep 11, 2020

/covered

@bridgekeeper bridgekeeper bot added the oca-verify Needs verification of OCA signatory status label Sep 11, 2020
@bridgekeeper
Copy link

bridgekeeper bot commented Sep 11, 2020

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

@kuaiwei
Copy link
Contributor Author

kuaiwei commented Sep 11, 2020

/label add hotspot-compiler

@openjdk
Copy link

openjdk bot commented Sep 11, 2020

@kuaiwei Usage: /label <add|remove> [label[, label, ...]]wherelabel` is an additional classification that should be applied to this PR. These labels are valid:

  • serviceability
  • hotspot
  • sound
  • hotspot-compiler
  • kulla
  • i18n
  • shenandoah
  • jdk
  • javadoc
  • 2d
  • security
  • swing
  • hotspot-runtime
  • jmx
  • build
  • nio
  • beans
  • core-libs
  • compiler
  • net
  • hotspot-gc
  • hotspot-jfr
  • awt

@kuaiwei
Copy link
Contributor Author

kuaiwei commented Sep 11, 2020

/label add hotspot-compiler

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Sep 11, 2020
@openjdk
Copy link

openjdk bot commented Sep 11, 2020

@kuaiwei
The hotspot-compiler label was successfully added.

@bridgekeeper bridgekeeper bot removed oca Needs verification of OCA signatory status oca-verify Needs verification of OCA signatory status labels Sep 14, 2020
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 14, 2020
@mlbridge
Copy link

mlbridge bot commented Sep 14, 2020

Webrevs

@mlbridge
Copy link

mlbridge bot commented Sep 14, 2020

Mailing list message from Vladimir Ivanov on hotspot-dev:

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

@mlbridge
Copy link

mlbridge bot commented Sep 15, 2020

Mailing list message from Kuai Wei on hotspot-compiler-dev:

Hi Vladimir,

Thanks for your review.

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

aarch64:
=== testStubPoly3 ===
orig: 38430308.215 38438769.040 38325616.152
opt : 39425275.311 39626194.985 39374242.065

=== testStubPoly5 ===
orig: 23227433.053 23210843.937 23212518.073
opt : 23805995.657 23797837.061 23861764.978

=== testSlowStubPoly3 ===
orig: 30838750.839 30886603.202 30841314.152
opt : 36166775.967 36242733.807 36041506.263

=== testSlowStubPoly5 ===
orig: 18713218.115 18706994.686 18686729.040
opt : 21827549.808 21836822.173 21861920.069

x86:
=== testStubPoly3 ===
orig: 36339726.912 36322863.060 36363196.132
opt : 38631086.341 38465649.400 38466044.926

=== testStubPoly5 ===
orig: 22240149.674 22218724.450 22225970.358
opt : 23498941.840 23454580.221 23497053.570

=== testSlowStubPoly3 ===
orig: 28693696.199 28700714.257 28587900.429
opt : 34187319.519 34171321.762 34138648.599

=== testSlowStubPoly5 ===
orig: 17388480.977 17389247.386 17177206.666
opt : 20697609.518 20771108.051 20699215.655

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Thanks,
Kevin

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?14?(???) 22:10
To:kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

@mlbridge
Copy link

mlbridge bot commented Sep 15, 2020

Mailing list message from Vladimir Ivanov on hotspot-dev:

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

Good, thanks for the numbers. I'm curious have you observed any
improvements on larger scale benchmarks or real world apps?

I'm asking because linear scan is already far from optimal when there
are many superinterfaces present.

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Frankly speaking, I'd like to avoid the duplication.

Also, absence of guarantees about order of interfaces in the itable
complicates things: REFC and DECC can be encountered in arbitrary order
and the pass should take that into account. For example, I don't see
early exit on success in slow variant, so every lookup has to go through
the whole itable irrespective of whether it succeeds or fails. I
attribute that to the complications induced by aforementioned aspect.

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

But I'm happy to change my mind if the rewritten implementation makes it
easier to reason about the code.

(FTR subtype checks suffer from a similar problem: unless
Klass::_secondary_super_cache catches it, subtype check for an interface
does a linear scan over _secondary_supers array.)

Best regards,
Vladimir Ivanov

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?14?(???) 22:10
To:kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

@mlbridge
Copy link

mlbridge bot commented Sep 15, 2020

Mailing list message from Andrew Haley on hotspot-dev:

On 15/09/2020 10:02, Vladimir Ivanov wrote:

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

Indeed. When I first came to HotSpot after working on GCJ for years
I was very surprised to see a linear scan used for interface dispatch.

The code improvements look to be fairly minor.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@mlbridge
Copy link

mlbridge bot commented Sep 15, 2020

Mailing list message from Kuai Wei on hotspot-dev:

Thanks for your quick reply.

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

Good, thanks for the numbers. I'm curious have you observed any
improvements on larger scale benchmarks or real world apps?

I'm asking because linear scan is already far from optimal when there
are many superinterfaces present.

Kevin: itable_stub was found hot on several online applications. So I started to work on this. Now I don't have chance to verify it online. So I uses microbenchmarks to verify. I will
test with some benchmarks.

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Frankly speaking, I'd like to avoid the duplication.

Kevin: Ok, I will try to merge them.

Also, absence of guarantees about order of interfaces in the itable
complicates things: REFC and DECC can be encountered in arbitrary order
and the pass should take that into account. For example, I don't see
early exit on success in slow variant, so every lookup has to go through
the whole itable irrespective of whether it succeeds or fails. I
attribute that to the complications induced by aforementioned aspect.

Kevin: I use a counter for matching. If it reaches zero, the iteration can exit early.

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

Kevin: I agree we can improve itable design. My initial think is jvm may reorder itable at safepoint. I can take it as a follow up optimization.

But I'm happy to change my mind if the rewritten implementation makes it
easier to reason about the code.

(FTR subtype checks suffer from a similar problem: unless
Klass::_secondary_super_cache catches it, subtype check for an interface
does a linear scan over _secondary_supers array.)

Regards,
Kevin

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?15?(???) 17:03
To:??(??) <kuaiwei.kw at alibaba-inc.com>; hotspot-dev <hotspot-dev-retn at openjdk.java.net>; kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

Good, thanks for the numbers. I'm curious have you observed any
improvements on larger scale benchmarks or real world apps?

I'm asking because linear scan is already far from optimal when there
are many superinterfaces present.

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Frankly speaking, I'd like to avoid the duplication.

Also, absence of guarantees about order of interfaces in the itable
complicates things: REFC and DECC can be encountered in arbitrary order
and the pass should take that into account. For example, I don't see
early exit on success in slow variant, so every lookup has to go through
the whole itable irrespective of whether it succeeds or fails. I
attribute that to the complications induced by aforementioned aspect.

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

But I'm happy to change my mind if the rewritten implementation makes it
easier to reason about the code.

(FTR subtype checks suffer from a similar problem: unless
Klass::_secondary_super_cache catches it, subtype check for an interface
does a linear scan over _secondary_supers array.)

Best regards,
Vladimir Ivanov

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?14?(???) 22:10
To:kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

@mlbridge
Copy link

mlbridge bot commented Sep 15, 2020

Mailing list message from Vladimir Ivanov on hotspot-dev:

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

Indeed. When I first came to HotSpot after working on GCJ for years
I was very surprised to see a linear scan used for interface dispatch.

FTR Erik? has been looking into rewriting virtual dispatch logic:

http://openjdk.java.net/jeps/8221828

Best regards,
Vladimir Ivanov

The code improvements look to be fairly minor.

@mlbridge
Copy link

mlbridge bot commented Sep 15, 2020

Mailing list message from Vladimir Ivanov on hotspot-dev:

????I?updated?my?test?cases?in?test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java?.?My?tests?will?not?inline?interface?methods?and?most?cpu?are?used?by?itable_stub.
?every?test?will?run?10?warmup?iterations?and?5?measure?iterations?for?one?score.?I?took?3?score?for?every?test.
????Below?is?test?result?on?my?machines,?it?looks?slow?loop?has?more?improvement?than?origin?one.

Good,?thanks?for?the?numbers.?I'm?curious?have?you?observed?any
improvements?on?larger?scale?benchmarks?or?real?world?apps?

I'm?asking?because?linear?scan?is?already?far?from?optimal?when?there
are?many?superinterfaces?present.

Kevin: itable_stub was found hot on several online applications. So I
started to work on this. Now I don't have chance to verify it online. So
I uses microbenchmarks to verify. I will
test with some benchmarks.

That's unfortunate. It would be very helpful to confirm the results of
the micro-benchmarks (nano-, in this particular case).

????I?think?lookup_interface_method?can?be?reused?as?fast?path.?And?it?is?also?used?by?templateTable::invoke_interface?and?generate_method_handle_dispatch.
?My?implementation?in?slow?path?need?more?registers?(6?registers?so?far),?I?need?to?check?if?there's?register?conflict?in?these?methods.?I'd?like?to?keep?a?separate
?slow?path?implementation.?How?do?you?think?about?it?

Frankly?speaking,?I'd?like?to?avoid?the?duplication.

Kevin: Ok, I will try to merge them.

Also,?absence?of?guarantees?about?order?of?interfaces?in?the?itable
complicates?things:?REFC?and?DECC?can?be?encountered?in?arbitrary?order
and?the?pass?should?take?that?into?account.?For?example,?I?don't?see
early?exit?on?success?in?slow?variant,?so?every?lookup?has?to?go?through
the?whole?itable?irrespective?of?whether?it?succeeds?or?fails.?I
attribute?that?to?the?complications?induced?by?aforementioned?aspect.

Kevin: I use a counter for matching. If it reaches zero, the iteration
can exit early.

Good. Thanks for the clarification.

Alternatively, you could use 2 bits in the temp register to code the
state. IMO it's clearer and more robust w.r.t. possible bugs.

Or even explicitly encode the state in the code as an automaton by
generating 3 loop variants (check REFC + check DECC + check both).
But IMO it falls into over-engineering category :-)

Also, on naming: I find it hard to reason about the logic.
Registers are re-used for different purposes and the names don't help at
all (even adds to the confusion).

As an example:

movptr(method_result, Address(recv_klass, holder_klass,
Address::times_1));

And?speaking?of?the?overall?approach?(as?it?is?implemented?now),?IMO
increased?complexity?doesn't?worth?it.?If?interface?calls?become?a
bottleneck,?the?problem?lies?not?in?itable?stub,?but?the?overall?design
which?requires?linear?scan?over?itables.?It's?better?to?put?the?effort
there?than?micro-optimizing?the?stub.

Kevin: I agree we can improve itable design. My initial think is jvm may
reorder itable at safepoint. I can take it as a follow up optimization.

Well, I would definitely prefer to avoid additional runtime changes (to
sort interfaces in itables and verify their order later) just to support
minor improvements in itable stubs.

Best regards,
Vladimir Ivanov

\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
From\:Vladimir Ivanov \<vladimir\.x\.ivanov at oracle\.com>
Send Time\:2020\?9\?15\?\(\?\?\?\) 17\:03
To\:\?\?\(\?\?\) \<kuaiwei\.kw at alibaba\-inc\.com>\; hotspot\-dev
\<hotspot\-dev\-retn at openjdk\.java\.net>\; kuaiwei
\<github\.com\+1981974\+kuaiwei at openjdk\.java\.net>\; hotspot\-dev
\<hotspot\-dev at openjdk\.java\.net>\; hotspot\-compiler\-dev
\<hotspot\-compiler\-dev at openjdk\.java\.net>
Subject\:Re\: RFR\: 8253049\: Enhance itable\_stub for AArch64 and x86\_64


 >\?\?\?\?I\?updated\?my\?test\?cases\?in\?test\/micro\/org\/openjdk\/bench\/vm\/compiler\/InterfaceCalls\.java\?\.\?My\?tests\?will\?not\?inline\?interface\?methods\?and\?most\?cpu\?are\?used\?by\?itable\_stub\.
 >\?every\?test\?will\?run\?10\?warmup\?iterations\?and\?5\?measure\?iterations\?for\?one\?score\.\?I\?took\?3\?score\?for\?every\?test\.
 >\?\?\?\?Below\?is\?test\?result\?on\?my\?machines\,\?it\?looks\?slow\?loop\?has\?more\?improvement\?than\?origin\?one\.

Good\,\?thanks\?for\?the\?numbers\.\?I\'m\?curious\?have\?you\?observed\?any
improvements\?on\?larger\?scale\?benchmarks\?or\?real\?world\?apps\?

I\'m\?asking\?because\?linear\?scan\?is\?already\?far\?from\?optimal\?when\?there
are\?many\?superinterfaces\?present\.

 >\?\?\?\?I\?think\?lookup\_interface\_method\?can\?be\?reused\?as\?fast\?path\.\?And\?it\?is\?also\?used\?by\?templateTable\:\:invoke\_interface\?and\?generate\_method\_handle\_dispatch\.
 >\?My\?implementation\?in\?slow\?path\?need\?more\?registers\?\(6\?registers\?so\?far\)\,\?I\?need\?to\?check\?if\?there\'s\?register\?conflict\?in\?these\?methods\.\?I\'d\?like\?to\?keep\?a\?separate
 >\?slow\?path\?implementation\.\?How\?do\?you\?think\?about\?it\?

Frankly\?speaking\,\?I\'d\?like\?to\?avoid\?the\?duplication\.

Also\,\?absence\?of\?guarantees\?about\?order\?of\?interfaces\?in\?the\?itable
complicates\?things\:\?REFC\?and\?DECC\?can\?be\?encountered\?in\?arbitrary\?order
and\?the\?pass\?should\?take\?that\?into\?account\.\?For\?example\,\?I\?don\'t\?see
early\?exit\?on\?success\?in\?slow\?variant\,\?so\?every\?lookup\?has\?to\?go\?through

the\?whole\?itable\?irrespective\?of\?whether\?it\?succeeds\?or\?fails\.\?I
attribute\?that\?to\?the\?complications\?induced\?by\?aforementioned\?aspect\.

And\?speaking\?of\?the\?overall\?approach\?\(as\?it\?is\?implemented\?now\)\,\?IMO
increased\?complexity\?doesn\'t\?worth\?it\.\?If\?interface\?calls\?become\?a
bottleneck\,\?the\?problem\?lies\?not\?in\?itable\?stub\,\?but\?the\?overall\?design
which\?requires\?linear\?scan\?over\?itables\.\?It\'s\?better\?to\?put\?the\?effort
there\?than\?micro\-optimizing\?the\?stub\.

But\?I\'m\?happy\?to\?change\?my\?mind\?if\?the\?rewritten\?implementation\?makes\?it

easier\?to\?reason\?about\?the\?code\.

\(FTR\?subtype\?checks\?suffer\?from\?a\?similar\?problem\:\?unless
Klass\:\:\_secondary\_super\_cache\?catches\?it\,\?subtype\?check\?for\?an\?interface

does\?a\?linear\?scan\?over\?\_secondary\_supers\?array\.\)

Best\?regards\,
Vladimir\?Ivanov


 >\?\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
 >\?From\:Vladimir\?Ivanov\?\<vladimir\.x\.ivanov at oracle\.com>
 >\?Send\?Time\:2020\?9\?14\?\(\?\?\?\)\?22\:10
 >\?To\:kuaiwei\?\<github\.com\+1981974\+kuaiwei at openjdk\.java\.net>\;\?hotspot\-dev\?\<hotspot\-dev at openjdk\.java\.net>\;\?hotspot\-compiler\-dev\?\<hotspot\-compiler\-dev at openjdk\.java\.net>
 >\?Subject\:Re\:\?RFR\:\?8253049\:\?Enhance\?itable\_stub\?for\?AArch64\?and\?x86\_64
 >
 >\?Hi\?Kevin\,
 >
 >\?Very\?interesting\?observations\.\?I\?like\?the\?idea\?to\?optimize\?for\?the\?case
 >\?when\?REFC\?\=\=\?DECC\.
 >
 >\?Fusing\?2\?passes\?over\?the\?itable\?into\?one\?does\?look\?attractive\,\?but\?I\'m
 >\?not\?sure\?the\?proposed\?variant\?is\?correct\.\?I\?suggest\?to\?split\?the\?patch
 >\?into\?2\?enhancements\?and\?handle\?them\?separately\.
 >
 >\?I\'m\?curious\?what\?kind\?of\?benchmarks\?you\?used\?and\?what\?are\?the
 >\?improvements\?observed\?with\?the\?patch\.
 >
 >\?One\?suggestion\?about\?the\?implementation\:
 >
 >\?src\/hotspot\/cpu\/x86\/macroAssembler\_x86\.cpp\:
 >
 >\?\+void\?MacroAssembler\:\:lookup\_interface\_method\_in\_stub\(Register\?recv\_klass\,
 >
 >\?I\'d\?like\?to\?avoid\?having\?2\?independent\?implementations\?of\?itable\?lookup
 >\?\(MacroAssembler\:\:lookup\_interface\_method\_in\_stub\(\)\?and
 >\?MacroAssembler\:\:lookup\_interface\_method\(\)\)\.\?It\?would\?be\?nice\?to\?keep\?the
 >\?implementation\?unified\?between\?itable\?and\?MethodHandle\?linkToInterface
 >\?linker\?stubs\.
 >
 >\?What\?MacroAssembler\:\:lookup\_interface\_method\(\.\.\.\,\?true
 >\?\/\*return\_method\*\/\)\?does\?is\?interface\?method\?lookup\?w\/o\?proper\?subtype
 >\?check\?and\?it\?is\?equivalent\?to\?fast\?loop\?in
 >\?MacroAssembler\:\:lookup\_interface\_method\_in\_stub\(\)\.
 >
 >\?As\?a\?possible\?path\?forward\,\?you\?could\?introduce\?the\?fast\?path\?check
 >\?first\?by\?moving\?the\?fast\?path\?check\?into
 >\?VtableStubs\:\:create\_itable\_stub\(\)\?and\?guard\?the\?first\?path\?over\?the
 >\?itable\.\?It\?would\?make\?the\?type\?checking\?pass\?over\?itable\?optional\?based
 >\?on\?runtime\?check\.
 >
 >\?Then\?you\?could\?refactor\?MacroAssembler\:\:lookup\_interface\_method\(\)\?to
 >\?optionally\?do\?REFC\?and\?DECC\?checks\?on\?every\?iteration\?and\?migrate
 >\?VtableStubs\:\:create\_itable\_stub\(\)\?\?and
 >\?MethodHandles\:\:generate\_method\_handle\_dispatch\(\)\?to\?it\.
 >
 >\?Best\?regards\,
 >\?Vladimir\?Ivanov
 >
 >\?On\?14\.09\.2020\?13\:52\,\?kuaiwei\?wrote\:
https\:\/\/webrevs\.openjdk\.java\.net\/\?repo\=jdk\&pr\=128\&range\=00
https\:\/\/git\.openjdk\.java\.net\/jdk\?pull\/128\/head\:pull\/128
 > 

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 13, 2020

@kuaiwei This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 10, 2020

@kuaiwei This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it!

@bridgekeeper bridgekeeper bot closed this Nov 10, 2020
openjdk-notifier bot pushed a commit that referenced this pull request Oct 5, 2021
r18 should not be used as it is reserved as platform register. Linux is
fine with userspace using it, but Windows and also recently macOS (
openjdk/jdk11u-dev#301 (comment) )
are actually using it on the kernel side.

The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to
specify which registers to spill; fortunately this helper is only used
here:
https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404

I haven't seen causing this particular instance any issues in practice
_yet_, presumably because it looks hard to align the stars in order to
trigger a problem (between stp and ldp of r18 a transition to kernel
space must happen *and* the kernel needs to do something with r18). But
jdk11u-dev has more usages of the `::pusha`/`::popa` macro and that
causes troubles as explained in the link above.

Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
  0x0000000138809b00:   ldr x2, [x12, #16]
  0x0000000138809b04:   ldrh    w2, [x2, #44]
  0x0000000138809b08:   add x24, x20, x2, uxtx #3
  0x0000000138809b0c:   sub x24, x24, #0x8
[...]
  0x0000000138809fa4:   stp x16, x17, [sp, #128]
  0x0000000138809fa8:   stp x18, x19, [sp, #144]
  0x0000000138809fac:   stp x20, x21, [sp, #160]
[...]
  0x0000000138809fc0:   stp x30, xzr, [sp, #240]
  0x0000000138809fc4:   mov x0, x28
 ;; 0x10864ACCC
  0x0000000138809fc8:   mov x9, #0xaccc                 // #44236
  0x0000000138809fcc:   movk    x9, #0x864, lsl #16
  0x0000000138809fd0:   movk    x9, #0x1, lsl #32
  0x0000000138809fd4:   blr x9
  0x0000000138809fd8:   ldp x2, x3, [sp, #16]
[...]
  0x0000000138809ff4:   ldp x16, x17, [sp, #128]
  0x0000000138809ff8:   ldp x18, x19, [sp, #144]
  0x0000000138809ffc:   ldp x20, x21, [sp, #160]
```

After:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes

--------------------------------------------------------------------------------
  0x0000000108e4db00:   ldr x2, [x12, #16]
  0x0000000108e4db04:   ldrh    w2, [x2, #44]
  0x0000000108e4db08:   add x24, x20, x2, uxtx #3
  0x0000000108e4db0c:   sub x24, x24, #0x8
[...]
  0x0000000108e4dfa4:   stp x16, x17, [sp, #128]
  0x0000000108e4dfa8:   stp x19, x20, [sp, #144]
  0x0000000108e4dfac:   stp x21, x22, [sp, #160]
[...]
  0x0000000108e4dfbc:   stp x29, x30, [sp, #224]
  0x0000000108e4dfc0:   mov x0, x28
 ;; 0x107E4A06C
  0x0000000108e4dfc4:   mov x9, #0xa06c                 // #41068
  0x0000000108e4dfc8:   movk    x9, #0x7e4, lsl #16
  0x0000000108e4dfcc:   movk    x9, #0x1, lsl #32
  0x0000000108e4dfd0:   blr x9
  0x0000000108e4dfd4:   ldp x2, x3, [sp, #16]
[...]
  0x0000000108e4dff0:   ldp x16, x17, [sp, #128]
  0x0000000108e4dff4:   ldp x19, x20, [sp, #144]
  0x0000000108e4dff8:   ldp x21, x22, [sp, #160]
[...]
```
lewurm added a commit to lewurm/openjdk that referenced this pull request Oct 6, 2021
Restore looks like this now:
```
  0x0000000106e4dfcc:   movk    x9, #0x5e4, lsl openjdk#16
  0x0000000106e4dfd0:   movk    x9, #0x1, lsl openjdk#32
  0x0000000106e4dfd4:   blr x9
  0x0000000106e4dfd8:   ldp x2, x3, [sp, openjdk#16]
  0x0000000106e4dfdc:   ldp x4, x5, [sp, openjdk#32]
  0x0000000106e4dfe0:   ldp x6, x7, [sp, openjdk#48]
  0x0000000106e4dfe4:   ldp x8, x9, [sp, openjdk#64]
  0x0000000106e4dfe8:   ldp x10, x11, [sp, openjdk#80]
  0x0000000106e4dfec:   ldp x12, x13, [sp, openjdk#96]
  0x0000000106e4dff0:   ldp x14, x15, [sp, openjdk#112]
  0x0000000106e4dff4:   ldp x16, x17, [sp, openjdk#128]
  0x0000000106e4dff8:   ldp x0, x1, [sp], openjdk#144
  0x0000000106e4dffc:   ldp xzr, x19, [sp], openjdk#16
  0x0000000106e4e000:   ldp x22, x23, [sp, openjdk#16]
  0x0000000106e4e004:   ldp x24, x25, [sp, openjdk#32]
  0x0000000106e4e008:   ldp x26, x27, [sp, openjdk#48]
  0x0000000106e4e00c:   ldp x28, x29, [sp, openjdk#64]
  0x0000000106e4e010:   ldp x30, xzr, [sp, openjdk#80]
  0x0000000106e4e014:   ldp x20, x21, [sp], openjdk#96
  0x0000000106e4e018:   ldur    x12, [x29, #-24]
  0x0000000106e4e01c:   ldr x22, [x12, openjdk#16]
  0x0000000106e4e020:   add x22, x22, #0x30
  0x0000000106e4e024:   ldr x8, [x28, openjdk#8]
```
pfirmstone pushed a commit to pfirmstone/jdk-with-authorization that referenced this pull request Nov 18, 2024
pfirmstone pushed a commit to pfirmstone/jdk-with-authorization that referenced this pull request Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review

Development

Successfully merging this pull request may close these issues.

1 participant