
Conversation

@rkennke (Contributor) commented Oct 6, 2022

This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation, and because of the very racy nature of stack-locking, this turns out to be very complex and involves a variant of the inflation protocol to ensure that the object header is stable.

What the original stack-locking does is basically push a stack-lock, which consists only of the displaced header, onto the stack, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
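
To illustrate, here is a minimal, hypothetical sketch of that protocol in plain C++. The types, field names and bit constants are illustrative only, not HotSpot's actual declarations:

```cpp
// Hypothetical sketch of classic stack-locking; not HotSpot's actual code.
#include <atomic>
#include <cstdint>

struct Object {
  std::atomic<uintptr_t> header_;        // the mark word
};

struct BasicLock {                       // lives in the owner's stack frame
  uintptr_t displaced_header_;
};

const uintptr_t LOCK_MASK = 0x3;         // lowest two bits
const uintptr_t UNLOCKED  = 0x1;         // 01: neutral (unlocked) mark

bool stack_lock(Object* obj, BasicLock* lock) {
  uintptr_t mark = obj->header_.load(std::memory_order_relaxed);
  if ((mark & LOCK_MASK) == UNLOCKED) {
    lock->displaced_header_ = mark;      // save the real header on the stack
    // CAS a pointer to the stack location into the header; stack addresses
    // are aligned, so the lowest two bits 00 tag the object 'stack-locked'
    // and the upper bits identify the owning thread's stack.
    return obj->header_.compare_exchange_strong(
        mark, reinterpret_cast<uintptr_t>(lock));
  }
  return false;                          // contended or recursive: slow path
}
```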

This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does not overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure: basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which threads own which locks. Most importantly, the most common question, 'does the current thread own me?', is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection), where it is OK to do more complex operations. The lock-stack is also a new set of GC roots and is scanned during thread scanning, possibly concurrently, via the normal protocols.
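
A minimal sketch of the reversed scheme, again with purely illustrative names; the real lock-stack lives in JavaThread, has proper bounds handling, and is scanned as a GC root:

```cpp
// Hypothetical sketch of fast-locking with a thread-local lock-stack;
// the LockStack type and its bounds are illustrative only.
#include <atomic>
#include <cstdint>

struct Object { std::atomic<uintptr_t> header_; };

const uintptr_t LOCK_MASK   = 0x3;
const uintptr_t UNLOCKED    = 0x1;       // 01: neutral mark
const uintptr_t FAST_LOCKED = 0x0;       // 00: fast-locked, upper bits intact

struct LockStack {
  Object* elems_[8];                     // typically only 3-5 entries in use
  int top_ = 0;

  // (a real implementation would check bounds and inflate when full)
  void push(Object* o) { elems_[top_++] = o; }

  bool contains(Object* o) const {       // "does the current thread own me?"
    for (int i = 0; i < top_; i++)
      if (elems_[i] == o) return true;
    return false;
  }

  void remove(Object* o) {               // usually o is already on top
    for (int i = top_ - 1; i >= 0; i--) {
      if (elems_[i] == o) {
        for (int j = i; j < top_ - 1; j++) elems_[j] = elems_[j + 1];
        top_--;
        return;
      }
    }
  }
};

thread_local LockStack lock_stack;

bool fast_lock(Object* obj) {
  uintptr_t mark = obj->header_.load(std::memory_order_relaxed);
  if ((mark & LOCK_MASK) == UNLOCKED) {
    // Flip the low bits 01 -> 00; no stack pointer is stored, so the rest
    // of the header (hash bits, GC bits) stays usable.
    uintptr_t locked = (mark & ~LOCK_MASK) | FAST_LOCKED;
    if (obj->header_.compare_exchange_strong(mark, locked)) {
      lock_stack.push(obj);              // ownership recorded per-thread
      return true;
    }
  }
  return false;                          // already locked: inflate to monitor
}
```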

In contrast to stack-locking, fast-locking does not support recursive locking (yet). When recursive locking is attempted, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.

One complication is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread.
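
A sketch of that hand-over, with hypothetical names that do not match HotSpot's real ObjectMonitor fields; it only shows the idea:

```cpp
// Hypothetical sketch of the ANONYMOUS_OWNER hand-over.
#include <atomic>
#include <cstdint>

struct Thread;
static Thread* const ANONYMOUS_OWNER = reinterpret_cast<Thread*>(0x1);

struct ObjectMonitor {
  std::atomic<Thread*> owner_{nullptr};
};

// Called by a contending thread that found the object fast-locked: it
// cannot tell which thread owns the lock, so it records a marker.
void inflate_fast_locked(ObjectMonitor* m) {
  m->owner_.store(ANONYMOUS_OWNER, std::memory_order_release);
}

// Called by the thread leaving a synchronized block.
void monitor_exit(ObjectMonitor* m, Thread* self) {
  if (m->owner_.load(std::memory_order_relaxed) == ANONYMOUS_OWNER) {
    // Only the real owner can reach this exit path, so "anonymous" must
    // mean us: fix the owner, then exit normally, waking the contender.
    m->owner_.store(self, std::memory_order_relaxed);
  }
  // ... regular monitor-exit protocol follows ...
}
```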

As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer), and 2. the workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is OK, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can have a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.

This change makes it possible to simplify (and speed up!) a lot of code:

  • The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
  • Accessing the hashcode can now always be done in the fast path, once the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This will be implemented in a separate PR (see the sketch after this list).
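
A hypothetical sketch of that hashcode fast path; the bit layout, constants and helper names are illustrative, not HotSpot's real mark-word layout:

```cpp
// Hypothetical sketch of the always-fast hashcode lookup.
#include <cstdint>

const uintptr_t LOCK_MASK   = 0x3;
const uintptr_t MONITOR_TAG = 0x2;       // 10: header holds ObjectMonitor*
const uintptr_t HASH_SHIFT  = 8;         // illustrative hash position
const uintptr_t HASH_MASK   = 0x7FFFFFFF;

struct ObjectMonitor { uintptr_t displaced_header_; };

uintptr_t hash_from_mark(uintptr_t mark) {
  if ((mark & LOCK_MASK) == MONITOR_TAG) {
    // Monitor-locked: reach through to the displaced header. Safe because
    // Java threads participate in the monitor deflation protocol, so the
    // monitor cannot be deflated under us.
    ObjectMonitor* m = reinterpret_cast<ObjectMonitor*>(mark & ~LOCK_MASK);
    mark = m->displaced_header_;
  }
  // Neutral (01) and fast-locked (00) headers keep the hash bits in place.
  return (mark >> HASH_SHIFT) & HASH_MASK;  // 0 means not yet installed
}
```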

Benchmarks

All benchmarks are run on server-class metal machines. The JVM settings are always: -Xmx20g -Xms20g -XX:+UseParallelGC. All benchmark scores are in ms/op; less is better.

DaCapo/AArch64

Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It uses the DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude the cassandra, h2o & kafka benchmarks because of incompatibility with JDK 20. Benchmarks that showed results far off the baseline or showed high variance have been repeated, and I am reporting the results with the most bias against fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean, what is it actually doing that a change in locking shows a >30% performance difference?

benchmark baseline fast-locking % size
avrora 27859 27563 1.07% large
batik 20786 20847 -0.29% large
biojava 27421 27334 0.32% default
eclipse 59918 60522 -1.00% large
fop 3670 3678 -0.22% default
graphchi 2088 2060 1.36% default
h2 297391 291292 2.09% huge
jme 8762 8877 -1.30% default
jython 18938 18878 0.32% default
luindex 1339 1325 1.06% default
lusearch 918 936 -1.92% default
pmd 58291 58423 -0.23% large
sunflow 32617 24961 30.67% large
tomcat 25481 25992 -1.97% large
tradebeans 314640 311706 0.94% huge
tradesoap 107473 110246 -2.52% huge
xalan 6047 5882 2.81% default
zxing 970 926 4.75% default

DaCapo/x86_64

The following measurements have been taken on an Intel Xeon Scalable Processor (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above.

benchmark baseline fast-locking % size
avrora 127690 126749 0.74% large
batik 12736 12641 0.75% large
biojava 15423 15404 0.12% default
eclipse 41174 41498 -0.78% large
fop 2184 2172 0.55% default
graphchi 1579 1560 1.22% default
h2 227614 230040 -1.05% huge
jme 8591 8398 2.30% default
jython 13473 13356 0.88% default
luindex 824 813 1.35% default
lusearch 962 968 -0.62% default
pmd 40827 39654 2.96% large
sunflow 53362 43475 22.74% large
tomcat 27549 28029 -1.71% large
tradebeans 190757 190994 -0.12% huge
tradesoap 68099 67934 0.24% huge
xalan 7969 8178 -2.56% default
zxing 1176 1148 2.44% default

Renaissance/AArch64

This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above, with the same JVM settings.

benchmark baseline fast-locking %
AkkaUct 2558.832 2513.594 1.80%
Reactors 14715.626 14311.246 2.83%
Als 1851.485 1869.622 -0.97%
ChiSquare 1007.788 1003.165 0.46%
GaussMix 1157.491 1149.969 0.65%
LogRegression 717.772 733.576 -2.15%
MovieLens 7916.181 8002.226 -1.08%
NaiveBayes 395.296 386.611 2.25%
PageRank 4294.939 4346.333 -1.18%
FjKmeans 496.076 493.873 0.45%
FutureGenetic 2578.504 2589.255 -0.42%
Mnemonics 4898.886 4903.689 -0.10%
ParMnemonics 4260.507 4210.121 1.20%
Scrabble 139.37 138.312 0.76%
RxScrabble 320.114 322.651 -0.79%
Dotty 1056.543 1068.492 -1.12%
ScalaDoku 3443.117 3449.477 -0.18%
ScalaKmeans 259.384 258.648 0.28%
Philosophers 24333.311 23438.22 3.82%
ScalaStmBench7 1102.43 1115.142 -1.14%
FinagleChirper 6814.192 6853.38 -0.57%
FinagleHttp 4762.902 4807.564 -0.93%

Renaissance/x86_64

benchmark baseline fast-locking %
AkkaUct 1117.185 1116.425 0.07%
Reactors 11561.354 11812.499 -2.13%
Als 1580.838 1575.318 0.35%
ChiSquare 459.601 467.109 -1.61%
GaussMix 705.944 685.595 2.97%
LogRegression 659.944 656.428 0.54%
MovieLens 7434.303 7592.271 -2.08%
NaiveBayes 413.482 417.369 -0.93%
PageRank 3259.233 3276.589 -0.53%
FjKmeans 946.429 938.991 0.79%
FutureGenetic 1760.672 1815.272 -3.01%
ParMnemonics 2016.917 2033.101 -0.80%
Scrabble 147.996 150.084 -1.39%
RxScrabble 177.755 177.956 -0.11%
Dotty 673.754 683.919 -1.49%
ScalaDoku 2193.562 1958.419 12.01%
ScalaKmeans 165.376 168.925 -2.10%
ScalaStmBench7 1080.187 1049.184 2.95%
Philosophers 14268.449 13308.87 7.21%
FinagleChirper 4722.13 4688.3 0.72%
FinagleHttp 3497.241 3605.118 -2.99%

Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK 20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and will probably address by running them much more often).

I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious).

Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here.

Testing

  • tier1 (x86_64, aarch64, x86_32)
  • tier2 (x86_64, aarch64)
  • tier3 (x86_64, aarch64)
  • tier4 (x86_64, aarch64)
  • jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • 8291555: Replace stack-locking with fast-locking

Contributors

  • Aleksey Shipilev <shade@openjdk.org>
  • Fei Yang <fyang@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590
$ git checkout pull/10590

Update a local copy of the PR:
$ git checkout pull/10590
$ git pull https://git.openjdk.org/jdk pull/10590/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10590

View PR using the GUI difftool:
$ git pr show -t 10590

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10590.diff

@bridgekeeper (bot) commented Oct 6, 2022

👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk bot added the "rfr" (Pull request is ready for review) label on Oct 6, 2022
@openjdk (bot) commented Oct 6, 2022

@rkennke The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@rkennke (Contributor, Author) commented Oct 17, 2022

/contributor fyang

@openjdk (bot) commented Oct 17, 2022

@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout fast-locking
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk bot added the "merge-conflict" (Pull request has merge conflict with target branch) label and removed the "rfr" (Pull request is ready for review) label on Oct 17, 2022
@openjdk (bot) commented Oct 17, 2022

@rkennke Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@rkennke (Contributor, Author) commented Oct 18, 2022

/contributor add fyang

@openjdk (bot) commented Oct 18, 2022

@rkennke
Contributor Fei Yang <fyang@openjdk.org> successfully added.

openjdk bot added the "rfr" (Pull request is ready for review) label and removed the "merge-conflict" (Pull request has merge conflict with target branch) label on Oct 24, 2022
@robehn (Contributor) commented Oct 24, 2022

First, the "SharedRuntime::complete_monitor_locking_C" crash does not reproduce.

Secondly, a question/suggestion:
Many recursive cases do not interleave locks, meaning the recursive enter will happen with the lock/oop already at the top of the lock-stack.
Why not peek at the top lock/oop in the lock-stack: if it is the current one, just push it again and the locking is done (instead of inflating)?
(Exit would need to check whether this is the last entry and then do a proper exit.)

Or are you worried about the size of the lock-stack?

@dcubed-ojdk (Member) commented Oct 27, 2022

This PR has been in "merge-conflict" state for about 10 days. When do you plan to merge again with the jdk/jdk repo?

Update: Ignore this. For some reason the "removed merge-conflict" from four days ago didn't show up until I refreshed the browser (again)...

@rose00 (Contributor) commented Oct 27, 2022

Secondly, a question/suggestion: Many recursive cases do not interleave locks, meaning the recursive enter will happen with the lock/oop already at the top of the lock-stack. Why not peek at the top lock/oop in the lock-stack: if it is the current one, just push it again and the locking is done (instead of inflating)? (Exit would need to check whether this is the last entry and then do a proper exit.)

The CJM paper (Dice/Kogan 2021) mentions a "nesting" counter for this purpose. I suspect that a real counter is overkill, and the "unary" representation Robbin mentions would be fine, especially if there were a point (when the per-thread stack gets too big) at which we go and inflate anyway.

The CJM paper suggests a full search of the per-thread array to detect the recursive condition, but again I like Robbin's idea of checking only the most recent lock record.

So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. And there could be a depth limit as well. Any sequence of held locks not expressible within those limitations could go to inflation as a backup.
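
A minimal sketch of that adjacent-only idea, under the stated limitations; all names and the depth limit are hypothetical and not part of this PR:

```cpp
// Hypothetical sketch of adjacent-only recursive lock records.
struct Object;

struct LockStack {
  static const int LIMIT = 8;            // depth limit; beyond it, inflate
  Object* elems_[LIMIT];
  int top_ = 0;

  // Recursive enter: only the most recent record is checked, so
  // [ A A A B C C ] is representable but [ A B A ] is not.
  bool try_recursive_enter(Object* o) {
    if (top_ > 0 && top_ < LIMIT && elems_[top_ - 1] == o) {
      elems_[top_++] = o;                // "unary" nesting count
      return true;
    }
    return false;                        // interleaved or full: inflate
  }

  // Recursive exit: returns true if this was the outermost entry, in
  // which case the caller must do the real header CAS to unlock.
  bool exit_is_last(Object* o) {
    top_--;
    return top_ == 0 || elems_[top_ - 1] != o;
  }
};
```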

@dholmes-ora (Member) commented

So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example.
@rose00 why only adjacently? Nested locking can be interleaved on different monitors.

@mlbridge (bot) commented Oct 28, 2022

Mailing list message from Remi Forax on serviceability-dev:

[In reply to John R Rose's comment above, "Re: RFR: 8291555: Replace stack-locking with fast-locking [v7]".]

Hi John,
a certainly stupid question: I have some trouble seeing how this can be implemented, given that, because of lock coarsening (+ maybe OSR), the number of times a lock is held differs between the interpreted code and the compiled code.

Rémi

@robehn (Contributor) commented Oct 28, 2022

I have some trouble seeing how this can be implemented, given that, because of lock coarsening (+ maybe OSR), the number of times a lock is held differs between the interpreted code and the compiled code.

Correct me if I'm wrong: only C2 eliminates locks, and C2 only compiles if there is properly structured locking.
This should mean that when we restore the eliminated locks in deopt, we can inflate the recursive locks which are no longer interleaved and restructure the lock-stack accordingly.

Is there another situation than deopt where it would matter?

@rkennke (Contributor, Author) commented Oct 28, 2022

FYI: I am working on an alternative PR for this that makes fast-locking optional and opt-in behind an experimental switch. It will also be much less invasive (no structural changes except those absolutely necessary, no cleanups) and thus easier to handle.
Update: The WIP PR is #10907; it is not yet complete, but has the most relevant parts in place and runs on aarch64.

@rkennke (Contributor, Author) commented Nov 11, 2022

@dholmes-ora and all: I have prepared an alternative PR #10907 that implements fast-locking behind a new experimental flag and preserves the current stack-locking behavior as the default setting. It is currently implemented and tested on x86* and aarch64 arches. It is also less invasive because it keeps everything structurally the same (i.e. no method signature changes, no stack layout changes, etc.). On the downside, it also means we cannot have any of the associated cleanups and optimizations yet, but those are minor anyway. Also, there is still the risk that I make a mistake with the necessary factoring-out of the current implementation. If we agree that this should be the way to go, then I would close this PR and continue work on #10907.

@dholmes-ora (Member) commented

@rkennke not unexpectedly I greatly prefer the optional and opt-in version in PR #10907.

@rose00 (Contributor) commented Nov 14, 2022

So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example.
@rose00 why only adjacently? Nested locking can be interleaved on different monitors.

Yes it can; you can have nesting A, B, A. But the thread-based fast-locking list might not cover that case. If it were restricted to only adjacent records in the way I sketched, it would need to use a different, slower technique for the A, B, A case. The trade-off is that if you only allow adjacent recursive locks on the list, you don't need to search the list beyond the first element, to detect re-locking. Dunno if that pencils out to a real advantage, though, since the fallback is slow.

@rose00 (Contributor) commented Nov 14, 2022

FTR I agree with Holmes that a conditional opt-in is better. While we are uncertain of the viability of the new scheme (which FTR I like!) for all our customers, we need to have a dynamic selection of the technique, so we can turn it on and off.

Off by default at first, then later on by default, then on with no option at all if all goes well (which I hope it does). Perhaps Lilliput can have it turned on by default, and throw an error if (for some reason) the user tries to turn it off again.

That's the way we phased in, and then phased out, biased locking, and it seems to me that this is a closely similar situation. Eventually, if all goes well, we can remove the stack locking code, as we did with biased locking.

For more details about that long-running saga, one might look at the history of -XX:+UseBiasedLocking in the source base, perhaps starting with $ git log -S UseBiasedLocking.

@rkennke (Contributor, Author) commented Nov 15, 2022

So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example.
@rose00 why only adjacently? Nested locking can be interleaved on different monitors.

Yes it can; you can have nesting A, B, A. But the thread-based fast-locking list might not cover that case. If it were restricted to only adjacent records in the way I sketched, it would need to use a different, slower technique for the A, B, A case. The trade-off is that if you only allow adjacent recursive locks on the list, you don't need to search the list beyond the first element, to detect re-locking. Dunno if that pencils out to a real advantage, though, since the fallback is slow.

TBH, I don't currently think that making fast-locking recursive is very important. In fact, the need for fast-locking appears somewhat questionable to begin with - the scenario where it performs better than OM-locking is rather narrow and really only relevant for legacy code. Stack-locking and fast-locking only help workloads that 1. do lots of uncontended, e.g. single-threaded, locking and 2. churn lots of monitor objects. It is not enough to use a single Vector a lot - the cost of allocating the OM would soon be amortized by lots of OM action. In order for stack-/fast-locking to be useful, you have to have a workload that keeps allocating new lock objects and uses them only once or very few times. For example, I have seen this in OpenJDK's XML code, where the XSLT compiler would generate code that uses an ungodly amount of StringBuffers (this probably warrants a separate fix). Now, where would recursive locking support in the fast-locking path be useful? I have yet to see a workload that suffers because of a lack of recursive locking support. Implementing recursive fast-locking means we'd have to add code to the fast path, and that would affect non-recursive locking as well. I'd rather keep the implementation simple and fast.

@bridgekeeper (bot) commented Dec 13, 2022

@rkennke This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@rkennke (Contributor, Author) commented Dec 14, 2022

Closing this in favour of #10907.


Labels

  • hotspot (hotspot-dev@openjdk.org)
  • rfr (Pull request is ready for review)
  • serviceability (serviceability-dev@openjdk.org)
  • shenandoah (shenandoah-dev@openjdk.org)
