8291555: Implement alternative fast-locking scheme #10907

rkennke · 2022-10-28T20:17:37Z

This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity).

What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.

This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does not overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.

The lock-stack is fixed size, currently with 8 elements. According to my experiments with various workloads, this covers the vast majority of workloads (in-fact, most workloads seem to never exceed 5 active locks per thread at a time). We check for overflow in the fast-paths and when the lock-stack is full, we take the slow-path, which would inflate the lock to a monitor. That case should be very rare.

In contrast to stack-locking, fast-locking does not support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking.

One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread.

As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.

This change enables to simplify (and speed-up!) a lot of code:

The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR

Also, and I might be mistaken here, this new lightweight locking would make synchronized work better with Loom: Because the lock-records are no longer scattered across the stack, but instead are densely packed into the lock-stack, it should be easy for a vthread to save its lock-stack upon unmounting and restore it when re-mounting. However, I am not sure about this, and this PR does not attempt to implement that support.

Testing:

tier1 x86_64 x aarch64 x +UseFastLocking
tier2 x86_64 x aarch64 x +UseFastLocking
tier3 x86_64 x aarch64 x +UseFastLocking
tier4 x86_64 x aarch64 x +UseFastLocking
tier1 x86_64 x aarch64 x -UseFastLocking
tier2 x86_64 x aarch64 x -UseFastLocking
tier3 x86_64 x aarch64 x -UseFastLocking
tier4 x86_64 x aarch64 x -UseFastLocking
Several real-world applications have been tested with this change in tandem with Lilliput without any problems, yet

Performance

Simple Microbenchmark

The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops.

	x86_64	aarch64
-UseFastLocking	20.651	20.764
+UseFastLocking	18.896	18.908

Renaissance

	x86_64			aarch64
	stack-locking	fast-locking		stack-locking	fast-locking
AkkaUct	841.884	836.948	0.59%	1475.774	1465.647	0.69%
Reactors	11041.427	11181.451	-1.25%	11381.751	11521.318	-1.21%
Als	1367.183	1359.358	0.58%	1678.103	1688.067	-0.59%
ChiSquare	577.021	577.398	-0.07%	986.619	988.063	-0.15%
GaussMix	817.459	819.073	-0.20%	1154.293	1155.522	-0.11%
LogRegression	598.343	603.371	-0.83%	638.052	644.306	-0.97%
MovieLens	8248.116	8314.576	-0.80%	7569.219	7646.828	-1.01%%
NaiveBayes	587.607	581.608	1.03%	541.583	550.059	-1.54%
PageRank	3260.553	3263.472	-0.09%	4376.405	4381.101	-0.11%
FjKmeans	979.978	976.122	0.40%	774.312	771.235	0.40%
FutureGenetic	2187.369	2183.271	0.19%	2685.722	2689.056	-0.12%
ParMnemonics	2434.551	2468.763	-1.39%	4278.225	4263.863	0.34%
Scrabble	111.882	111.768	0.10%	151.796	153.959	-1.40%
RxScrabble	210.252	211.38	-0.53%	310.116	315.594	-1.74%
Dotty	750.415	752.658	-0.30%	1033.636	1036.168	-0.24%
ScalaDoku	3072.05	3051.2	0.68%	3711.506	3690.04	0.58%
ScalaKmeans	211.427	209.957	0.70%	264.38	265.788	-0.53%
ScalaStmBench7	1017.795	1018.869	-0.11%	1088.182	1092.266	-0.37%
Philosophers	6450.124	6565.705	-1.76%	12017.964	11902.559	0.97%
FinagleChirper	3953.623	3972.647	-0.48%	4750.751	4769.274	-0.39%
FinagleHttp	3970.526	4005.341	-0.87%	5294.125	5296.224	-0.04%

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Change requires CSR request JDK-8305999 to be approved
Commit message must refer to an issue

Issues

JDK-8291555: Implement alternative fast-locking scheme
JDK-8305999: Add experimental -XX:LockingMode flag (CSR)

Reviewers

Daniel D. Daugherty (@dcubed-ojdk - Reviewer)
Thomas Stuefe (@tstuefe - Reviewer)
Aleksey Shipilev (@shipilev - Reviewer)
David Holmes (@dholmes-ora - Reviewer)
Dean Long (@dean-long - Reviewer) ⚠️ Review applies to 423dbcdb

Contributors

Fei Yang <fyang@openjdk.org>
Thomas Stuefe <stuefe@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/10907/head:pull/10907
$ git checkout pull/10907

Update a local copy of the PR:
$ git checkout pull/10907
$ git pull https://git.openjdk.org/jdk.git pull/10907/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10907

View PR using the GUI difftool:
$ git pr show -t 10907

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10907.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2022-10-28T20:21:02Z

👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2022-10-28T20:24:09Z

⚠️ @rkennke This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

openjdk · 2022-10-28T20:24:55Z

@rkennke The following labels will be automatically applied to this pull request:

hotspot
serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk · 2022-10-29T20:56:53Z

@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8291555-v2
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

src/hotspot/share/runtime/thread.cpp

src/hotspot/share/runtime/objectMonitor.hpp

test/hotspot/jtreg/compiler/floatingpoint/TestFloatSyncJNIArgs.java

RealFYang · 2022-11-15T11:29:13Z

Hi, I have also updated the riscv part from #10590. Could you please put these changes in this PR? Thanks.
I can run non-trivial benchmark workloads with +UseFastLocking on linux-riscv64 platform. I will try to perform more tests as needed.

10907-riscv-v1.txt

dcubed-ojdk · 2023-05-08T15:31:27Z

Mach5 Tier[1-8] of v77 with forced-fast-locking results look good.
Mach5 Tier[1-8] of v77 with default-stack-locking results also look good.

I do still have to check in with Eric Caspole about the performance testing
of the baseline versus the default-stack-locking configuration. We did that
testing with a baseline of jdk-21+21-1704 and the v66 version of the patch
in default-stack-locking configuration. Eric also did testing of the v66 version
of the patch with forced-fast-locking, but those results are not a gate for
determining whether this patch gets integrated.

dcubed-ojdk · 2023-05-08T16:01:03Z

I discussed the perf testing results with Eric Caspole and here's our summary:

Promotion performance testing was done on the jdk-21+21-1704 baseline, the v66 patch with
default-stack-locking and the v66 patch with forced-fast-locking. Comparing the baseline with
default-stack-locking: 62 improvements and 54 regressions, none of the improvements or
regressions are statistically significant.

Comparing the baseline with forced-fast-locking: 68 improvements and 177 regressions, none
of the improvements or regressions are statistically significant.

Comparing default-stack-locking with forced-fast-locking: 45 improvements, 173 regressions,
none of the improvements or regressions are statistically significant.

Eric Caspole and I are both "go" from a performance testing POV! (This is for the stack-locking
as the default configuration.)

dcubed-ojdk

Thumbs up (still).

dcubed-ojdk · 2023-05-08T16:06:58Z

Checking platform specific code review coverage:
@dean-long and @dcubed-ojdk did the review of the arm32 and aarch64 changes.
@dholmes-ora and @dcubed-ojdk did the review of the X64/X86 changes.

Was there a specific reviewer for the RISC-V changes?

rkennke · 2023-05-08T16:08:34Z

Checking platform specific code review coverage: @dean-long and @dcubed-ojdk did the review of the arm32 and aarch64 changes. @dholmes-ora and @dcubed-ojdk did the review of the X64/X86 changes.

Was there a specific reviewer for the RISC-V changes?

No, not really. @RealFYang contributed the code and would also be the only guy that I know who would review it ;-)

dcubed-ojdk · 2023-05-08T16:09:44Z

Okay. I'm good with that decision (but we don't have RISC-V in our CI)...

tstuefe · 2023-05-08T16:43:24Z

Checking platform specific code review coverage: @dean-long and @dcubed-ojdk did the review of the arm32 and aarch64 changes. @dholmes-ora and @dcubed-ojdk did the review of the X64/X86 changes.
Was there a specific reviewer for the RISC-V changes?

No, not really. @RealFYang contributed the code and would also be the only guy that I know who would review it ;-)

The RiscV changes look okay to me.

tstuefe

LGTM

shipilev

All right, this looks generally good to me. There might be a need to touch up this code going forward, but probably in separate PRs, to avoid invalidating the testing.

shipilev · 2023-05-08T17:04:29Z

src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp

-  // On MP platforms the next load could return a 'stale' value if the memory location has been modified by another thread.
-  // That would be acceptable as ether CAS or slow case path is taken in that case.
+  if (LockingMode == LM_LIGHTWEIGHT) {
+    log_trace(fastlock)("C1_MacroAssembler::lock fast");


Here and later: I don't think we need these log statements?

They had been helpful during development, to verify code generation. Going forward they can be scrapped, but lets do that in a follow-up PR.

rkennke · 2023-05-08T17:50:08Z

Alright, let's ship! Hurray! Thanks everybody, that's been a long one!
/integrate

openjdk · 2023-05-08T17:52:13Z

Going to push as commit 7f6358a.
Since your change was applied there have been 40 commits pushed to the master branch:

4116b10: 8306461: ObjectInputStream::readObject() should handle negative array sizes without throwing NegativeArraySizeExceptions
93ee19f: 8305486: Add split() variants that keep the delimiters to String and j.u.r.Pattern
ad90fb6: 8293786: HttpClient will not send more than 64 kb of data from the 2nd request in http2
5a259d8: 8303153: Native interpreter frame missing mirror
4b02956: 8054022: HttpURLConnection timeouts with Expect: 100-Continue and no chunking
9f34e4f: 8304434: [AIX] Update minimum xlclang version
d2e0e53: 8307604: gcc12 based Alpine build broken build after JDK-8307301
e91f0d3: 8307571: Remove unused SomeConstants in WatcherThread class
26755a9: 8306408: Fix the format of several tables in building.md
64c0962: 8307569: Build with gcc8 is broken after JDK-8307301
... and 30 more: https://git.openjdk.org/jdk/compare/f143bf7c4554a689f17c373ea5d99b68dd518b2f...master

Your commit was automatically rebased without conflicts.

openjdk · 2023-05-08T17:52:24Z

@rkennke Pushed as commit 7f6358a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

rkennke and others added 8 commits October 24, 2022 20:40

Initial stuff

5c0e46f

8295849: Consolidate Threads::owning_thread*

84639ab

Merge branch 'JDK-8295849' into JDK-8291555-v2

a0dc5f1

Fix OM::has_owner()

ca225c9

Improve condition in OM::has_owner()

04c780d

Fix has_owner() condition

37fa31b

Merge branch 'JDK-8295849' into JDK-8291555-v2

be0e9be

Implement most runtime and aarch64 parts

1abb143

rkennke mentioned this pull request Oct 28, 2022

8291555: Replace stack-locking with fast-locking #10590

Closed

8 tasks

openjdk bot added serviceability serviceability-dev@openjdk.org hotspot hotspot-dev@openjdk.org labels Oct 28, 2022

rkennke added 2 commits October 29, 2022 16:51

Couple of fixes

229a745

More fixes

adcedbf

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Oct 29, 2022

rkennke added 4 commits October 30, 2022 13:03

Implement SA parts, and more fixes

7d30304

Implement x86_64 parts

8d3a209

More fixes

535c776

Merge remote-tracking branch 'upstream/master' into JDK-8291555-v2

b28fca5

openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Nov 8, 2022

rkennke added 2 commits November 8, 2022 13:39

Fix tabs

7510c2d

Fix x86_32

3430353

dholmes-ora reviewed Nov 13, 2022

View reviewed changes

src/hotspot/share/runtime/thread.cpp Outdated Show resolved Hide resolved

robehn reviewed Nov 14, 2022

View reviewed changes

src/hotspot/share/runtime/objectMonitor.hpp Outdated Show resolved Hide resolved

robehn reviewed Nov 14, 2022

View reviewed changes

test/hotspot/jtreg/compiler/floatingpoint/TestFloatSyncJNIArgs.java Outdated Show resolved Hide resolved

Start of optimization to avoid checking lock-stack size

6876a3a

openjdk bot added ready Pull request is ready to be integrated and removed ready Pull request is ready to be integrated labels May 8, 2023

dcubed-ojdk approved these changes May 8, 2023

View reviewed changes

tstuefe approved these changes May 8, 2023

View reviewed changes

openjdk bot added ready Pull request is ready to be integrated and removed ready Pull request is ready to be integrated labels May 8, 2023

shipilev approved these changes May 8, 2023

View reviewed changes

openjdk bot added ready Pull request is ready to be integrated and removed ready Pull request is ready to be integrated labels May 8, 2023

openjdk bot added the integrated Pull request has been integrated label May 8, 2023

openjdk bot closed this May 8, 2023

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 8, 2023

rkennke mentioned this pull request May 12, 2023

8305895: Implementation: JEP 450: Compact Object Headers (Experimental) #13961

Closed

20 tasks

8291555: Implement alternative fast-locking scheme #10907

8291555: Implement alternative fast-locking scheme #10907

Uh oh!

Conversation

rkennke commented Oct 28, 2022 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance

Simple Microbenchmark

Renaissance

Progress

Issues

Reviewers

Contributors

Reviewing

Webrev

Uh oh!

bridgekeeper bot commented Oct 28, 2022

Uh oh!

openjdk bot commented Oct 28, 2022

Uh oh!

openjdk bot commented Oct 28, 2022

Uh oh!

openjdk bot commented Oct 29, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

RealFYang commented Nov 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dcubed-ojdk commented May 8, 2023

Uh oh!

dcubed-ojdk commented May 8, 2023

Uh oh!

dcubed-ojdk left a comment

Choose a reason for hiding this comment

Uh oh!

dcubed-ojdk commented May 8, 2023

Uh oh!

rkennke commented May 8, 2023

Uh oh!

dcubed-ojdk commented May 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tstuefe commented May 8, 2023

Uh oh!

tstuefe left a comment

Choose a reason for hiding this comment

Uh oh!

shipilev left a comment

Choose a reason for hiding this comment

Uh oh!

shipilev May 8, 2023

Choose a reason for hiding this comment

Uh oh!

tstuefe May 8, 2023

Choose a reason for hiding this comment

Uh oh!

rkennke commented May 8, 2023

Uh oh!

openjdk bot commented May 8, 2023

Uh oh!

openjdk bot commented May 8, 2023

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

15 participants

rkennke commented Oct 28, 2022 •

edited by openjdk bot

Loading

RealFYang commented Nov 15, 2022 •

edited

Loading

dcubed-ojdk commented May 8, 2023 •

edited

Loading