Skip to content

Conversation

@rkennke
Copy link
Contributor

@rkennke rkennke commented Oct 28, 2022

This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity).

What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.

This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does not overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.

The lock-stack is fixed size, currently with 8 elements. According to my experiments with various workloads, this covers the vast majority of workloads (in-fact, most workloads seem to never exceed 5 active locks per thread at a time). We check for overflow in the fast-paths and when the lock-stack is full, we take the slow-path, which would inflate the lock to a monitor. That case should be very rare.

In contrast to stack-locking, fast-locking does not support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking.

One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread.

As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.

This change enables to simplify (and speed-up!) a lot of code:

  • The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
  • Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR

Also, and I might be mistaken here, this new lightweight locking would make synchronized work better with Loom: Because the lock-records are no longer scattered across the stack, but instead are densely packed into the lock-stack, it should be easy for a vthread to save its lock-stack upon unmounting and restore it when re-mounting. However, I am not sure about this, and this PR does not attempt to implement that support.

Testing:

  • tier1 x86_64 x aarch64 x +UseFastLocking
  • tier2 x86_64 x aarch64 x +UseFastLocking
  • tier3 x86_64 x aarch64 x +UseFastLocking
  • tier4 x86_64 x aarch64 x +UseFastLocking
  • tier1 x86_64 x aarch64 x -UseFastLocking
  • tier2 x86_64 x aarch64 x -UseFastLocking
  • tier3 x86_64 x aarch64 x -UseFastLocking
  • tier4 x86_64 x aarch64 x -UseFastLocking
  • Several real-world applications have been tested with this change in tandem with Lilliput without any problems, yet

Performance

Simple Microbenchmark

The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops.

x86_64 aarch64
-UseFastLocking 20.651 20.764
+UseFastLocking 18.896 18.908

Renaissance

  x86_64       aarch64    
  stack-locking fast-locking     stack-locking fast-locking  
AkkaUct 841.884 836.948 0.59%   1475.774 1465.647 0.69%
Reactors 11041.427 11181.451 -1.25%   11381.751 11521.318 -1.21%
Als 1367.183 1359.358 0.58%   1678.103 1688.067 -0.59%
ChiSquare 577.021 577.398 -0.07%   986.619 988.063 -0.15%
GaussMix 817.459 819.073 -0.20%   1154.293 1155.522 -0.11%
LogRegression 598.343 603.371 -0.83%   638.052 644.306 -0.97%
MovieLens 8248.116 8314.576 -0.80%   7569.219 7646.828 -1.01%%
NaiveBayes 587.607 581.608 1.03%   541.583 550.059 -1.54%
PageRank 3260.553 3263.472 -0.09%   4376.405 4381.101 -0.11%
FjKmeans 979.978 976.122 0.40%   774.312 771.235 0.40%
FutureGenetic 2187.369 2183.271 0.19%   2685.722 2689.056 -0.12%
ParMnemonics 2434.551 2468.763 -1.39%   4278.225 4263.863 0.34%
Scrabble 111.882 111.768 0.10%   151.796 153.959 -1.40%
RxScrabble 210.252 211.38 -0.53%   310.116 315.594 -1.74%
Dotty 750.415 752.658 -0.30%   1033.636 1036.168 -0.24%
ScalaDoku 3072.05 3051.2 0.68%   3711.506 3690.04 0.58%
ScalaKmeans 211.427 209.957 0.70%   264.38 265.788 -0.53%
ScalaStmBench7 1017.795 1018.869 -0.11%   1088.182 1092.266 -0.37%
Philosophers 6450.124 6565.705 -1.76%   12017.964 11902.559 0.97%
FinagleChirper 3953.623 3972.647 -0.48%   4750.751 4769.274 -0.39%
FinagleHttp 3970.526 4005.341 -0.87%   5294.125 5296.224 -0.04%

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Change requires CSR request JDK-8305999 to be approved
  • Commit message must refer to an issue

Issues

  • JDK-8291555: Implement alternative fast-locking scheme
  • JDK-8305999: Add experimental -XX:LockingMode flag (CSR)

Reviewers

Contributors

  • Fei Yang <fyang@openjdk.org>
  • Thomas Stuefe <stuefe@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/10907/head:pull/10907
$ git checkout pull/10907

Update a local copy of the PR:
$ git checkout pull/10907
$ git pull https://git.openjdk.org/jdk.git pull/10907/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10907

View PR using the GUI difftool:
$ git pr show -t 10907

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10907.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 28, 2022

👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Oct 28, 2022

⚠️ @rkennke This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@openjdk
Copy link

openjdk bot commented Oct 28, 2022

@rkennke The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added serviceability serviceability-dev@openjdk.org hotspot hotspot-dev@openjdk.org labels Oct 28, 2022
@openjdk
Copy link

openjdk bot commented Oct 29, 2022

@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8291555-v2
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Oct 29, 2022
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Nov 8, 2022
@RealFYang
Copy link
Member

RealFYang commented Nov 15, 2022

Hi, I have also updated the riscv part from #10590. Could you please put these changes in this PR? Thanks.
I can run non-trivial benchmark workloads with +UseFastLocking on linux-riscv64 platform. I will try to perform more tests as needed.

10907-riscv-v1.txt

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed ready Pull request is ready to be integrated labels May 8, 2023
@dcubed-ojdk
Copy link
Member

Mach5 Tier[1-8] of v77 with forced-fast-locking results look good.
Mach5 Tier[1-8] of v77 with default-stack-locking results also look good.

I do still have to check in with Eric Caspole about the performance testing
of the baseline versus the default-stack-locking configuration. We did that
testing with a baseline of jdk-21+21-1704 and the v66 version of the patch
in default-stack-locking configuration. Eric also did testing of the v66 version
of the patch with forced-fast-locking, but those results are not a gate for
determining whether this patch gets integrated.

@dcubed-ojdk
Copy link
Member

I discussed the perf testing results with Eric Caspole and here's our summary:

Promotion performance testing was done on the jdk-21+21-1704 baseline, the v66 patch with
default-stack-locking and the v66 patch with forced-fast-locking. Comparing the baseline with
default-stack-locking: 62 improvements and 54 regressions, none of the improvements or
regressions are statistically significant.

Comparing the baseline with forced-fast-locking: 68 improvements and 177 regressions, none
of the improvements or regressions are statistically significant.

Comparing default-stack-locking with forced-fast-locking: 45 improvements, 173 regressions,
none of the improvements or regressions are statistically significant.

Eric Caspole and I are both "go" from a performance testing POV! (This is for the stack-locking
as the default configuration.)

Copy link
Member

@dcubed-ojdk dcubed-ojdk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thumbs up (still).

@dcubed-ojdk
Copy link
Member

Checking platform specific code review coverage:
@dean-long and @dcubed-ojdk did the review of the arm32 and aarch64 changes.
@dholmes-ora and @dcubed-ojdk did the review of the X64/X86 changes.

Was there a specific reviewer for the RISC-V changes?

@rkennke
Copy link
Contributor Author

rkennke commented May 8, 2023

Checking platform specific code review coverage: @dean-long and @dcubed-ojdk did the review of the arm32 and aarch64 changes. @dholmes-ora and @dcubed-ojdk did the review of the X64/X86 changes.

Was there a specific reviewer for the RISC-V changes?

No, not really. @RealFYang contributed the code and would also be the only guy that I know who would review it ;-)

@dcubed-ojdk
Copy link
Member

dcubed-ojdk commented May 8, 2023

Okay. I'm good with that decision (but we don't have RISC-V in our CI)...

@tstuefe
Copy link
Member

tstuefe commented May 8, 2023

Checking platform specific code review coverage: @dean-long and @dcubed-ojdk did the review of the arm32 and aarch64 changes. @dholmes-ora and @dcubed-ojdk did the review of the X64/X86 changes.
Was there a specific reviewer for the RISC-V changes?

No, not really. @RealFYang contributed the code and would also be the only guy that I know who would review it ;-)

The RiscV changes look okay to me.

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed ready Pull request is ready to be integrated labels May 8, 2023
Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, this looks generally good to me. There might be a need to touch up this code going forward, but probably in separate PRs, to avoid invalidating the testing.

// On MP platforms the next load could return a 'stale' value if the memory location has been modified by another thread.
// That would be acceptable as ether CAS or slow case path is taken in that case.
if (LockingMode == LM_LIGHTWEIGHT) {
log_trace(fastlock)("C1_MacroAssembler::lock fast");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and later: I don't think we need these log statements?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They had been helpful during development, to verify code generation. Going forward they can be scrapped, but lets do that in a follow-up PR.

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed ready Pull request is ready to be integrated labels May 8, 2023
@rkennke
Copy link
Contributor Author

rkennke commented May 8, 2023

Alright, let's ship! Hurray! Thanks everybody, that's been a long one!
/integrate

@openjdk
Copy link

openjdk bot commented May 8, 2023

Going to push as commit 7f6358a.
Since your change was applied there have been 40 commits pushed to the master branch:

  • 4116b10: 8306461: ObjectInputStream::readObject() should handle negative array sizes without throwing NegativeArraySizeExceptions
  • 93ee19f: 8305486: Add split() variants that keep the delimiters to String and j.u.r.Pattern
  • ad90fb6: 8293786: HttpClient will not send more than 64 kb of data from the 2nd request in http2
  • 5a259d8: 8303153: Native interpreter frame missing mirror
  • 4b02956: 8054022: HttpURLConnection timeouts with Expect: 100-Continue and no chunking
  • 9f34e4f: 8304434: [AIX] Update minimum xlclang version
  • d2e0e53: 8307604: gcc12 based Alpine build broken build after JDK-8307301
  • e91f0d3: 8307571: Remove unused SomeConstants in WatcherThread class
  • 26755a9: 8306408: Fix the format of several tables in building.md
  • 64c0962: 8307569: Build with gcc8 is broken after JDK-8307301
  • ... and 30 more: https://git.openjdk.org/jdk/compare/f143bf7c4554a689f17c373ea5d99b68dd518b2f...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 8, 2023
@openjdk openjdk bot closed this May 8, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 8, 2023
@openjdk
Copy link

openjdk bot commented May 8, 2023

@rkennke Pushed as commit 7f6358a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.