8291555: Implement alternative fast-locking scheme #10907
Conversation
👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into…
@rkennke this pull request can not be integrated into master due to merge conflicts. To resolve them, run the following commands in a local clone:

$ git checkout JDK-8291555-v2
$ git fetch https://git.openjdk.org/jdk master
$ git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
$ git commit -m "Merge master"
$ git push
test/hotspot/jtreg/compiler/floatingpoint/TestFloatSyncJNIArgs.java
Hi, I have also updated the riscv part from #10590. Could you please put these changes in this PR? Thanks.

Mach5 Tier[1-8] of v77 with forced-fast-locking results look good. I do still have to check in with Eric Caspole about the performance testing.
I discussed the perf testing results with Eric Caspole and here's our summary:
- Promotion performance testing was done on the jdk-21+21-1704 baseline, the v66 patch with…
- Comparing the baseline with forced-fast-locking: 68 improvements and 177 regressions, none…
- Comparing default-stack-locking with forced-fast-locking: 45 improvements, 173 regressions,…

Eric Caspole and I are both "go" from a performance testing POV! (This is for the stack-locking…)
Thumbs up (still).
Checking platform specific code review coverage: Was there a specific reviewer for the RISC-V changes?

No, not really. @RealFYang contributed the code and would also be the only guy that I know who would review it ;-)

Okay. I'm good with that decision (but we don't have RISC-V in our CI)...

The RiscV changes look okay to me.
LGTM
All right, this looks generally good to me. There might be a need to touch up this code going forward, but probably in separate PRs, to avoid invalidating the testing.
// On MP platforms the next load could return a 'stale' value if the memory location has been modified by another thread.
// That would be acceptable as either CAS or slow case path is taken in that case.
if (LockingMode == LM_LIGHTWEIGHT) {
  log_trace(fastlock)("C1_MacroAssembler::lock fast");
Here and later: I don't think we need these log statements?
They had been helpful during development, to verify code generation. Going forward they can be scrapped, but let's do that in a follow-up PR.
Alright, let's ship! Hurray! Thanks everybody, that's been a long one!
Going to push as commit 7f6358a.
Your commit was automatically rebased without conflicts.
This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding overloading the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation when trying to access the mark word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity.)
What the original stack-locking does is basically push a stack-lock, which consists only of the displaced header, onto the stack, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
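To make the two schemes easier to compare, here is a minimal Java model of the stack-locking protocol described above. All class, field, and method names are invented for illustration; HotSpot implements this in C++ and generated assembly, and the real mark word carries more state (hash, age, monitor pointers).

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of legacy stack-locking (illustrative, not HotSpot code).
// Low two mark-word bits: 01 = unlocked, 00 = stack-locked, in which case the
// header holds a word-aligned pointer to an on-stack lock record that stores
// the displaced header. The caller stands in for that lock record here.
class StackLockModel {
    static final long UNLOCKED = 0b01;
    final AtomicLong header = new AtomicLong(UNLOCKED);

    // Try to stack-lock: the caller saves the displaced header in its "lock
    // record" and we CAS a (fake) aligned stack address, whose low bits are
    // naturally 00, into the header.
    boolean tryLock(long lockRecordAddress) {
        assert (lockRecordAddress & 0b11) == 0 : "stack addresses are word-aligned";
        long displaced = header.get();
        if ((displaced & 0b11) != UNLOCKED) return false; // already locked -> slow path
        return header.compareAndSet(displaced, lockRecordAddress);
    }

    // Unlock: restore the displaced header saved in the lock record.
    void unlock(long displacedHeader) {
        header.set(displacedHeader);
    }
}
```

The key property this models is that the owner is identified by the stack address in the header, which is exactly the overloading of the mark word that the PR removes.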
This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does not overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
The lock-stack is fixed size, currently with 8 elements. According to my experiments with various workloads, this covers the vast majority of workloads (in-fact, most workloads seem to never exceed 5 active locks per thread at a time). We check for overflow in the fast-paths and when the lock-stack is full, we take the slow-path, which would inflate the lock to a monitor. That case should be very rare.
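The fast-locking protocol and lock-stack described in the last two paragraphs can be sketched in Java as follows. This is a simplified model with invented names, not HotSpot code: the real fast path is emitted by the JIT compilers, and the slow path (inflation to a monitor) is only marked here by returning false. The 8-slot capacity matches the fixed size mentioned above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of the new fast-locking scheme (illustrative, not HotSpot
// code). The header's low two bits go from 01 (unlocked) to 00 (fast-locked);
// ownership is recorded by pushing the oop onto a per-thread lock-stack.
class FastLockModel {
    static final long UNLOCKED = 0b01;
    static final int LOCK_STACK_CAPACITY = 8; // fixed size, as in the PR

    final AtomicLong header = new AtomicLong(UNLOCKED);

    static final class LockStack {
        final Object[] oops = new Object[LOCK_STACK_CAPACITY];
        int top = 0;
        boolean push(Object o) {         // false -> full, caller takes slow path
            if (top == LOCK_STACK_CAPACITY) return false;
            oops[top++] = o;
            return true;
        }
        void pop(Object o) {
            assert top > 0 && oops[top - 1] == o;
            oops[--top] = null;
        }
        boolean contains(Object o) {     // "does the current thread own me?"
            for (int i = 0; i < top; i++) if (oops[i] == o) return true;
            return false;
        }
    }

    // Fast path for monitorenter: check for overflow, then CAS 01 -> 00.
    boolean fastLock(Object obj, LockStack ls) {
        long mark = header.get();
        if ((mark & 0b11) != UNLOCKED) return false; // locked/inflated -> slow path
        if (!ls.push(obj)) return false;             // lock-stack full -> inflate
        if (header.compareAndSet(mark, mark & ~0b11L)) return true;
        ls.pop(obj);                                 // lost the race -> slow path
        return false;
    }

    // Fast path for monitorexit: pop the oop and restore the unlocked bits.
    void fastUnlock(Object obj, LockStack ls) {
        ls.pop(obj);
        header.set(header.get() | UNLOCKED);
    }
}
```

Note how `contains` answers the common ownership question with a short linear scan, while the header itself no longer says *which* thread owns the lock.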
In contrast to stack-locking, fast-locking does not support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing it over to the contending thread.
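The ANONYMOUS_OWNER handover can be sketched like this. This is an illustrative model with invented names; the real protocol lives in HotSpot's ObjectMonitor code and involves far more synchronization than shown here.

```java
// Sketch of the ANONYMOUS_OWNER handover (illustrative, not HotSpot code).
// A contending thread inflates a fast-locked object without knowing who owns
// it; it records a marker, and the real owner fixes it up on monitorexit.
class MonitorModel {
    static final Object ANONYMOUS_OWNER = new Object();
    volatile Object owner; // null = unowned

    // Called by a contending thread that finds the object fast-locked: the
    // owner is unknown (only the owner's own lock-stack records it).
    void inflateFromFastLocked() {
        owner = ANONYMOUS_OWNER;
    }

    // Called on monitorexit by whichever thread actually holds the lock.
    void exit(Thread current) {
        if (owner == ANONYMOUS_OWNER) {
            // Observing the marker on exit proves 'current' is the owner:
            // fix up the owner field, then exit properly.
            owner = current;
        }
        assert owner == current;
        owner = null; // hand the monitor over to the contending thread
    }
}
```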
As an alternative, I considered removing stack-locking altogether, and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
This change enables us to simplify (and speed up!) a lot of code:
Also, and I might be mistaken here, this new lightweight locking would make synchronized work better with Loom: Because the lock-records are no longer scattered across the stack, but instead are densely packed into the lock-stack, it should be easy for a vthread to save its lock-stack upon unmounting and restore it when re-mounting. However, I am not sure about this, and this PR does not attempt to implement that support.
Testing:
Performance
Simple Microbenchmark
The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found [here](https://github.com/rkennke/fastlockbench). Numbers are in ns/op.
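The linked benchmark is JMH-based. For readers without JMH set up, a crude self-contained stand-in might look like the following; this is not the linked benchmark, and numbers from a naive timing loop like this are only indicative (no warmup control, no dead-code guarantees like JMH's Blackhole).

```java
// Crude uncontended monitorenter/monitorexit timing loop (illustrative only;
// the PR's actual numbers come from a proper JMH benchmark).
class LockMicrobench {
    static long timeUncontendedLock(int iterations) {
        Object lock = new Object();
        long[] sink = new long[1];             // crude attempt to keep the loop body live
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            synchronized (lock) {              // monitorenter + monitorexit
                sink[0] += i;
            }
        }
        long elapsed = System.nanoTime() - start;
        if (sink[0] == 42) System.out.println(sink[0]); // consume the sink
        return elapsed / iterations;           // approximate ns/op
    }
}
```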
Renaissance
Progress
Issues
Reviewers
Contributors
<fyang@openjdk.org>
<stuefe@openjdk.org>

Reviewing

Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/10907/head:pull/10907
$ git checkout pull/10907

Update a local copy of the PR:
$ git checkout pull/10907
$ git pull https://git.openjdk.org/jdk.git pull/10907/head

Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10907

View PR using the GUI difftool:
$ git pr show -t 10907

Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10907.diff
Webrev
Link to Webrev Comment