8291555: Replace stack-locking with fast-locking #10590
Conversation
Added tag jdk-20+17 for changeset 79ccc79
👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into …
@rkennke The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
/contributor fyang
@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout fast-locking
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@rkennke Syntax:
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
/contributor add fyang
First, the "SharedRuntime::complete_monitor_locking_C" crash does not reproduce. Secondly, a question/suggestion: worried about the size of the lock-stack?
This PR has been in "merge-conflict" state for about 10 days. When do you … Update: Ignore this. For some reason the "removed merge-conflict" … from four days …
The CJM paper (Dice/Kogan 2021) mentions a "nesting" counter for this purpose. I suspect that a real counter is overkill, and the "unary" representation Robbin mentions would be fine, especially if there were a point (when the per-thread stack gets too big) at which we go and inflate anyway. The CJM paper suggests a full search of the per-thread array to detect the recursive condition, but again I like Robbin's idea of checking only the most recent lock record. So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. And there could be a depth limit as well. Any sequence of held locks not expressible within those limitations could go to inflation as a backup.
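A minimal sketch of what such a per-thread lock-record array with adjacent-only recursion could look like (all names, the capacity, and the layout here are made up for illustration; this is not the PR's actual code):

#include <cassert>

typedef void* oop;   // stand-in for a HotSpot object reference

// Hypothetical per-thread lock records, allowing only adjacent repetitions
// like [ A A A B C C ]. Only the most recent record is checked for recursion.
class LockRecords {
  static const int CAPACITY = 8;   // arbitrary depth limit for the sketch
  oop _elems[CAPACITY];
  int _top;

public:
  LockRecords() : _top(0) {}

  // Recursion check: look only at the most recently pushed record.
  bool is_recursive(oop obj) const {
    return _top > 0 && _elems[_top - 1] == obj;
  }

  // Record an acquisition; returns false when the depth limit is reached,
  // in which case the caller would inflate to a full monitor instead.
  bool push(oop obj) {
    if (_top == CAPACITY) return false;
    _elems[_top++] = obj;
    return true;
  }

  // Non-adjacent patterns like [ A B A ] are not representable here; that
  // case would fall back to the slower technique (inflation) mentioned above.

  void pop(oop obj) {
    assert(_top > 0 && _elems[_top - 1] == obj);
    _elems[--_top] = nullptr;
  }
};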
Mailing list message from Remi Forax on serviceability-dev: ----- Original Message -----
Hi John, … Rémi
Correct me if I'm wrong: only C2 eliminates locks, and C2 only compiles if there is proper structured locking. Is there any situation other than deopt where it would matter?
FYI: I am working on an alternative PR for this that makes fast-locking optional and opt-in behind an experimental switch. It will also be much less invasive (no structural changes except where absolutely necessary, no cleanups) and thus easier to handle.
@dholmes-ora and all: I have prepared an alternative PR #10907 that implements the fast-locking behind a new experimental flag, and preserves the current stack-locking behavior as the default setting. It is currently implemented and tested on x86* and aarch64 arches. It is also less invasive because it keeps everything structurally the same (i.e. no method signature changes, no stack layout changes, etc). On the downside, it also means we cannot have any of the associated cleanups and optimizations yet, but those are minor anyway. Also, there is still the risk that I make a mistake with the necessary factoring-out of the current implementation. If we agree that this should be the way to go, then I would close this PR and continue work on #10907.
Yes it can; you can have nesting A, B, A. But the thread-based fast-locking list might not cover that case. If it were restricted to only adjacent records in the way I sketched, it would need to use a different, slower technique for the A, B, A case. The trade-off is that if you only allow adjacent recursive locks on the list, you don't need to search the list beyond the first element, to detect re-locking. Dunno if that pencils out to a real advantage, though, since the fallback is slow.
FTR I agree with Holmes that a conditional opt-in is better. While we are uncertain of the viability of the new scheme (which FTR I like!) for all our customers, we need to have a dynamic selection of the technique, so we can turn it on and off. Off by default at first, then later on by default, then on with no option at all if all goes well (which I hope it does). Perhaps Lilliput can have it turned on by default, and throw an error if (for some reason) the user tries to turn it off again. That's the way we phased in, and then phased out, biased locking, and it seems to me that this is a closely similar situation. Eventually, if all goes well, we can remove the stack locking code, as we did with biased locking. For more details about that long-running saga, one might look at the history of …
TBH, I don't currently think that making fast-locking recursive is very important. In fact, the need for fast-locking appears somewhat questionable to begin with - the scenario where it performs better than OM-locking is rather narrow and really only relevant for legacy code. Stack-locking and fast-locking only help workloads that 1. do lots of uncontended (e.g. single-threaded) locking and 2. churn lots of monitor objects. It is not enough to use a single Vector a lot - the cost of allocating the OM would soon be amortized by lots of OM action. In order for stack-/fast-locking to be useful, you have to have a workload that keeps allocating new lock objects and uses them only once or very few times. For example, I have seen this in OpenJDK's XML code, where the XSLT compiler would generate code that uses an ungodly amount of StringBuffers (this probably warrants a separate fix). Now, where would recursive locking support for the fast-locking path be useful? I have yet to see a workload that suffers because of a lack of recursive locking support. Implementing recursive fast-locking means we'd have to add code to the fast path, and that would affect non-recursive locking as well. I'd rather keep the implementation simple and fast.
@rkennke This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Closing this in favour of #10907. |
This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for, and deal with, this situation. And because of the very racy nature, this turns out to be very complex and involves a variant of the inflation protocol to ensure that the object header is stable.
What the original stack-locking does is basically to push a stack-lock, which consists only of the displaced header, onto the stack, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
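For comparison, here is a very rough sketch of that stack-locking fast path; the mark-word bit patterns and all type and function names are simplified placeholders, not HotSpot's actual code:

#include <atomic>
#include <cstdint>

typedef uintptr_t markWord;    // placeholder for the object header word

struct BasicLock {             // lives in the owning thread's stack frame
  markWord displaced_header;   // saved copy of the object's original header
};

struct ObjectHeader {
  std::atomic<markWord> mark;
};

// Sketch of the stack-locking fast path: save the header into the on-stack
// lock record, then CAS a pointer to that stack location into the header.
bool stack_lock(ObjectHeader* obj, BasicLock* lock_on_stack) {
  markWord old_mark = obj->mark.load();
  if ((old_mark & 0x3) != 0x1) {      // 01 = unlocked in this sketch
    return false;                     // already locked or inflated: slow path
  }
  lock_on_stack->displaced_header = old_mark;
  // A stack address is aligned, so its lowest two bits are 00, which doubles
  // as the 'stack-locked' tag. The pointer can later be used to identify the
  // owning thread, because it points into that thread's stack.
  markWord locked_mark = reinterpret_cast<markWord>(lock_on_stack);
  return obj->mark.compare_exchange_strong(old_mark, locked_mark);
}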
This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does not overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
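The fast-locking counterpart could be sketched roughly like this, reusing the placeholder ObjectHeader/markWord types from the sketch above (again, names and layout are illustrative only, not the PR's actual LockStack implementation):

// Small per-thread array of owned objects ("lock stack"); in practice it
// typically holds only a few (3-5) elements.
struct LockStack {
  static const int CAPACITY = 8;   // arbitrary for the sketch
  ObjectHeader* _elems[CAPACITY];
  int _top = 0;

  void push(ObjectHeader* o) { _elems[_top++] = o; }
  bool full() const          { return _top == CAPACITY; }

  // "Does the current thread own me?" is a short linear scan of this
  // thread's own array.
  bool contains(const ObjectHeader* o) const {
    for (int i = 0; i < _top; i++) {
      if (_elems[i] == o) return true;
    }
    return false;
  }
};

// Sketch of the fast-locking fast path: CAS the lowest two header bits to 00
// ('fast-locked') and record ownership in the current thread's lock stack.
bool fast_lock(ObjectHeader* obj, LockStack* lock_stack) {
  if (lock_stack->full()) return false;        // fall back to inflation
  markWord old_mark = obj->mark.load();
  if ((old_mark & 0x3) != 0x1) return false;   // not unlocked: slow path
  markWord locked_mark = old_mark & ~(markWord)0x3;
  if (!obj->mark.compare_exchange_strong(old_mark, locked_mark)) {
    return false;                              // lost the race: slow path
  }
  lock_stack->push(obj);
  return true;
}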
In contrast to stack-locking, fast-locking does not support recursive locking (yet). When a recursive lock is encountered, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing it over to the contending thread.
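A rough sketch of that hand-over, with ANONYMOUS_OWNER as a special marker value (the real ObjectMonitor protocol is far more involved; everything here is simplified for illustration):

#include <atomic>

struct Thread;

struct Monitor {
  static Thread* const ANONYMOUS_OWNER;   // marker: "owned, but by whom?"
  std::atomic<Thread*> owner{nullptr};
};
Thread* const Monitor::ANONYMOUS_OWNER = reinterpret_cast<Thread*>(1);

// Contending thread: it found the object fast-locked but cannot tell which
// thread owns it, so it inflates and records the anonymous-owner marker.
void inflate_fast_locked(Monitor* m) {
  m->owner.store(Monitor::ANONYMOUS_OWNER);
}

// Owning thread, at monitorexit: if it observes ANONYMOUS_OWNER, the owner
// must be itself, so it fixes the owner field and then exits properly,
// handing the monitor over to the contending thread.
void monitor_exit(Monitor* m, Thread* self) {
  if (m->owner.load() == Monitor::ANONYMOUS_OWNER) {
    m->owner.store(self);   // fix up ownership before the real exit
  }
  // ... normal monitor exit: clear the owner and wake any waiting threads.
  m->owner.store(nullptr);
}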
As an alternative, I considered removing stack-locking altogether, and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
This change makes it possible to simplify (and speed up!) a lot of code: …
Benchmarks
All benchmarks are run on server-class metal machines. The JVM settings are always:
-Xmx20g -Xms20g -XX:+UseParallelGC. All benchmarks are ms/ops, less is better.
DaCapo/AArch64
Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using the DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude the cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated, and I am reporting results with the most bias against fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean, what is it actually doing such that a change in locking shows a >30% perf difference?
DaCapo/x86_64
The following measurements have been taken on an Intel Xeon Scalable Processor (Cascade Lake 8252C) machine (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above apply.
Renaissance/AArch64
This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above, with the same JVM settings.
Renaissance/x86_64
Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing by running them much more often).
I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious).
Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here.
Testing
Contributors
<shade@openjdk.org>, <fyang@openjdk.org>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590
$ git checkout pull/10590
Update a local copy of the PR:
$ git checkout pull/10590
$ git pull https://git.openjdk.org/jdk pull/10590/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10590
View PR using the GUI difftool:
$ git pr show -t 10590
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10590.diff