8341491: Reserve and commit memory operations should be protected by NMT lock #24084
Conversation
👋 Welcome back roberttoyonaga! A progress list of the required criteria for merging this PR into
@roberttoyonaga This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 831 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@stefank, @tstuefe) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type
@roberttoyonaga The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
/label serviceability
Ping @JoKern65 for AIX
I don't see why we need to extend the lock to be held over the reserve/commit/alloc part.
If we only extend the lock on the release side, then it looks like we get the required exclusion:
lock
1.1 Thread_1 releases range_A.
1.2 Thread_1 tells NMT "range_A has been released".
unlock
2.1 Thread_2 reserves (the now free) range_A.
lock
2.2 Thread_2 tells NMT "range_A is reserved".
unlock
We can get ordering (1.1) (2.1) (1.2) (2.2) but we can't get (1.1) (2.1) (2.2) (1.2).
And isn't this locking scheme exactly what the current code is using? Have you seen an issue that this proposed PR intends to solve? If there is such a problem I wonder if there's just a missing lock extension in one of the "release" operations.
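To make the exclusion argument concrete, here is a minimal, self-contained model of that asymmetric scheme using std::mutex; pd_reserve, pd_release and the nmt_record_* helpers are hypothetical stand-ins, not the real HotSpot entry points:

#include <cstddef>
#include <mutex>

static std::mutex nmt_lock;                                 // models the NMT lock
static void* pd_reserve(size_t)                { return reinterpret_cast<void*>(0x1000); }
static bool  pd_release(void*, size_t)         { return true; }
static void  nmt_record_reserve(void*, size_t) { /* NMT booking */ }
static void  nmt_record_release(void*, size_t) { /* NMT booking */ }

// Release side: the lock spans both (1.1) and (1.2), so no other thread's
// NMT booking can interleave between them.
bool release_memory(void* addr, size_t bytes) {
  std::lock_guard<std::mutex> guard(nmt_lock);
  bool ok = pd_release(addr, bytes);               // (1.1)
  if (ok) nmt_record_release(addr, bytes);         // (1.2)
  return ok;
}

// Reserve side: (2.1) happens outside the lock; only (2.2) is under it.
// (2.1) may interleave with (1.1)/(1.2), but (2.2) cannot, which is why
// (1.1) (2.1) (2.2) (1.2) cannot happen, while (1.1) (2.1) (1.2) (2.2) can.
void* reserve_memory(size_t bytes) {
  void* addr = pd_reserve(bytes);                  // (2.1)
  if (addr != nullptr) {
    std::lock_guard<std::mutex> guard(nmt_lock);
    nmt_record_reserve(addr, bytes);               // (2.2)
  }
  return addr;
}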
What about the case where one thread reserves a range and another thread releases it?
1. Thread A reserves a range.
2. Thread B releases that range (before Thread A has told NMT about the reservation).
3. Thread B tells NMT the range has been released.
4. Thread A tells NMT the range is reserved.
This would either result in an assert in NMT at step 3, when releasing a range NMT does not know, or in an incorrectly booked range in step 4 without asserts. Am I making a thinking error somewhere?
Hi @stefank, I think you're right about (1.1) (2.1) (2.2) (1.2) being prevented by the current implementation. Is there a reason that the current implementation only does the wider locking for release/uncommit? Maybe (2.1) (1.1) (1.2) (2.2) isn't much of an issue since it's unlikely that another thread would uncommit/release the same base address shortly after it's committed/reserved?
I haven't seen that race in the wild; I just noticed that the memory operations weren't protected and thought that it could be a problem.
In a scenario like that, doesn't Thread A have to communicate somehow that Thread B can now start using and/or releasing that reservation? It sounds like you would have a race condition if you don't have synchronization to hand over the reservation from one thread to another. I would expect that such communication would be placed after the NMT booking in Thread A.
Is there any code that we know of that doesn't fit into a synchronization pattern similar to the above? I can think of some contrived example where Thread B asks the OS for memory mappings and uses that to ascertain that a pre-determined address has been reserved, and how that could lead to an incorrect booking as you described, but do we really have code like that? If we do, should we really have code like that? Are there some other patterns that I'm missing? Related to this, I talked to @xmas92 and he mused about swapping the order of the release and NMT release booking as a way to shrink the lock scope for the release operation:
As long as we hold the reservation, no other thread can re-reserve the reservation, so Thread B can take its time to first perform the NMT release booking under the lock, and then perform the release without the lock. If another thread (say Thread C) manages to re-reserve the memory, it sounds reasonable that the NMT release booking should have already completed. If we were to try this out, we would have to verify/ensure that the release/reserve pairs perform enough synchronization to make this work. Axel can probably correct me if I mischaracterized what he wrote.
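Sketching that reordering with the same placeholder names as in the model above (only the booking stays under the lock; the OS release moves outside it):

// Reordered release: book the NMT release first, under the lock, then do the
// OS release outside the lock. Until pd_release() returns, the OS still
// considers the range mapped, so no other thread can re-reserve it and race
// with the booking above.
bool release_memory_reordered(void* addr, size_t bytes) {
  {
    std::lock_guard<std::mutex> guard(nmt_lock);
    nmt_record_release(addr, bytes);   // NMT booking completes first
  }
  return pd_release(addr, bytes);      // OS release, no lock held
}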
I'm very curious to find out if anyone knows how this could happen without a race condition hand-over from one thread to another. (See my answer to Stüfe).
OK. Let's see if anyone finds a hole in my arguments given above.
Hi @stefank, I think that's true. But if the release/uncommit does not complete successfully we would need to readjust the accounting afterward. To do that we would need to retrieve the original memtag (at least for reserved regions) and potentially need to retrieve the original callsite data (if we're in detailed mode).
When does a release/uncommit fail? Would that be a JVM bug? What state is the memory in when such a failure happens? Do we even know if the memory is still committed if an uncommit fails? If I look at the man page for munmap it only fails if you pass in incorrect values, which sounds like a JVM bug to me. I don't understand why we don't treat that as a fatal error OR make sure that all call-sites handle that error, which they don't do today.
From what I can tell, it doesn't look like that's happening anywhere; someone else may know better, though. Similarly, for uncommit, the base address must be passed over from somewhere else in the JVM, so relying on some external synchronization seems reasonable here too. If this problem scenario is not present in the current code and it's not expected to become a possibility in the future, then I suppose there's no reason to guard against it. Maybe just a comment explaining the reasoning is good enough (and a warning not to use such patterns).
On Windows, VirtualFree also looks like it only fails if an invalid set of arguments is passed. So if os::pd_release fails it's probably a JVM bug. Uncommit uses mmap, which could fail for a larger variety of reasons. Some reasons are out of the JVM's control. For example: "The number of mapped regions would exceed an implementation-defined limit (per process or per system)." See here
If release/uncommit fails, then it would be hard to know what state the target memory is in. If the arguments are invalid (bad base address), the target region may not even be allocated. Or, in the case of uncommit, if the base address is not aligned, maybe the target committed region does indeed exist but the uncommit still fails. So it would be hard to determine how to readjust the NMT accounting afterward.
I think release/uncommit failures should be handled by the callers. Currently, uncommit failure is handled in most places by the caller; release failure mostly seems not to be. Since, at least for uncommit, we could sometimes fail for valid reasons, I think we shouldn't fail fatally in the os:: functions.
Right. And that failure is fatal, so there should be no need to fix any NMT bookkeeping for that.
Agreed. And this would be a pre-existing problem already. If a release/uncommit fails, then we have similar issues for that as well.
I would like to drill a bit deeper into this. Do you have any concrete examples of an uncommit failure that should not be handled as a fatal error?
Stefan, your analysis sounds reasonable. Don't see a hole. The original issue was from me, I think, but I've never seen that variant in real life. So, I am fine with leaving that scenario out.
I second @stefank here. Uncommit can fail, ironically, with an ENOMEM: if the uncommit punches a hole into a committed region, this would cause a new VMA on the kernel side. This may fail if we run against the limit for VMAs. Forgot what it was, some sysconf setting. All of this is Linux-specific, though. I don't think this should be unconditionally a fatal error, since the allocator (whatever it is) can decide to re-commit the region later, and thus "self-heal" itself.
This happens when we hit the /proc/sys/vm/max_map_count limit, and this immediately crashes the JVM.
Is this referring to failures when we hit the max_map_count limit? I'm not convinced that you can recover from that without immediately hitting the same issue somewhere else in the code. Or maybe you are thinking about some of the other reasons for the uncommit to fail?
Hi @stefank, Are you referring to the difficulty in determining the original allocation as being the pre-existing problem? I think that only becomes an issue if we decide to swap the order of NMT booking and the memory release/uncommit (assuming we don't just fail fatally). Since we don't need to readjust currently, if there's a failure we can just leave everything as it is.
My thinking is that if there is a failure you don't know what state the OS left the memory in. So, you don't know whether the memory was in fact unmapped as requested, or if it was left intact, or even something in-between. So, if you don't do the matching NMT bookkeeping there will be a mismatch between the state of the memory and what has been bookkept in NMT.
The above example shows code that assumes that it is OK to fail uncommitting and continue. I'm trying to figure out if that assumption is true. So, what I meant was that I was looking for a concrete example of a failure mode of uncommit that would be an acceptable (safe) failure to continue executing from. That is, a valid failure that doesn't mess up the memory in an unpredictable/unknowable way.
Tonight we tested this PR on AIX and it failed in the gtest with Internal Error (os_aix.cpp:1917), pid=26476938, tid=258. This will happen if a
Thank you for running the tests on AIX. I've excluded the file mapping tests that don't meet that criterion on AIX.
I ran the tests over the weekend again and now they passed.
I think this looks good to me, but please seek feedback from others as well.
I've added a couple of suggestions. None of them are required, but I think they would be nice to do.
src/hotspot/share/runtime/os.cpp
// The scope of NmtVirtualMemoryLocker covers both pd_uncommit_memory and record_virtual_memory_uncommit because
// these operations must happen atomically to avoid races. For example, if Thread_1 were to uncommit a region, but
// before Thread_1 can update NMT accounting, Thread_2 commits overlapping the same region and updates NMT,
// then Thread_1 finishes updating NMT. This would result in NMT perceiving part of the region being uncommitted,
// when it is actually committed. The opposite scenario is not guarded against. pd_commit_memory and
// record_virtual_memory_commit do not happen atomically. We assume that there is some external synchronization
// that prevents a region from being uncommitted before it is finished being committed.
It's not a requirement, but you get kudos from me if you keep comment lines below 80 characters. I typically don't insist on code staying below 80 characters, but comments tend to be nicer if they are.
OK, I'll try to shorten these comments a bit.
Co-authored-by: Stefan Karlsson <stefan.karlsson@oracle.com>
Thank you @stefank for the feedback. I've applied your suggestions. @tstuefe, when you have time, can you please have another look at this? Based on the discussion above, I've reverted the changes to the locking scope in favor of new comments explaining the asymmetrical locking and warning against patterns that lead to races. The new tests that are still relevant are kept.
Okay, thank you for doing this. I am fine with this version.
Thank you everyone for the review!
/integrate
/sponsor
Going to push as commit 44c5aca.
Your commit was automatically rebased without conflicts.
@tstuefe @roberttoyonaga Pushed as commit 44c5aca. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Update:
After some discussion it was decided that it's not necessary to expand the lock scope for reserve/commit. Instead, we are opting to add comments explaining the reasons for locking and warning against the conditions that could lead to races. Some of the new tests can be kept because they are general enough to be useful outside of this context.
Summary:
This PR makes memory operations atomic with NMT accounting.
The problem:
In memory-related functions like os::commit_memory and os::reserve_memory, the OS memory operations are currently done before acquiring the NMT mutex, and the virtual memory accounting is done later in MemTracker, after the lock has been acquired. Doing the memory operations outside of the lock scope can lead to races:
1.1 Thread_1 releases range_A.
1.2 Thread_1 tells NMT "range_A has been released".
2.1 Thread_2 reserves (the now free) range_A.
2.2 Thread_2 tells NMT "range_A is reserved".
Since the sequence (1.1) (1.2) is not atomic, if Thread_2 begins operating after (1.1), we can have (1.1) (2.1) (2.2) (1.2). The OS sees two valid subsequent calls (release range_A, followed by map range_A). But NMT sees "reserve range_A", then "release range_A", and is now out of sync with the OS.
Solution:
Where memory operations such as reserve, commit, or release virtual memory happen, I've expanded the scope of NmtVirtualMemoryLocker to protect both the NMT accounting and the memory operation itself.
Other notes:
I also simplified this pattern, found in many places (sketched approximately below):
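(In the sketch, pd_some_operation stands in for whichever platform call a given site makes, and MemTracker::record_some_operation for the matching NMT record call; the argument lists are illustrative only.)

if (MemTracker::enabled()) {
  // Take the NMT lock only when NMT is enabled, then record the result.
  NmtVirtualMemoryLocker nvml;
  result = pd_some_operation(addr, bytes);
  if (result != nullptr) {
    MemTracker::record_some_operation(result, bytes);
  }
} else {
  result = pd_some_operation(addr, bytes);
}

To:

// NmtVirtualMemoryLocker is now a no-op when NMT is disabled, and the record
// call itself checks MemTracker::enabled() and the nullptr result.
NmtVirtualMemoryLocker nvml;
result = pd_some_operation(addr, bytes);
MemTracker::record_some_operation(result, bytes);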
This is possible because NmtVirtualMemoryLocker now checks MemTracker::enabled(), and MemTracker::record_some_operation already checks MemTracker::enabled() and checks against nullptr. This refactoring previously wasn't possible because ThreadCritical was used before #22745 introduced NmtVirtualMemoryLocker.
I considered moving the locking and NMT accounting down into platform-specific code, e.g. locking around { munmap() + MemTracker::record }. The hope was that this would help reduce the size of the critical section. However, I found that the OS-specific "pd_" functions are already short and to the point, so doing this wasn't reducing the lock scope very much. Instead it just makes the code messier by having to maintain the locking and NMT accounting in each platform-specific implementation.
In many places I've done minor refactoring by relocating calls to MemTracker in order to tighten the locking scope.
In some OS-specific code (such as os::map_memory_to_file), I've replaced calls to os::release_memory with os::pd_release_memory. This is to avoid NmtVirtualMemoryLocker reentrancy (see the reentrancy sketch at the end of the Testing section below).
In a few places (such as VirtualMemoryTracker::add_reserved_region) I have replaced tty with defaultStream::output_stream(). Otherwise NmtVirtualMemory_lock would be acquired out of rank order with tty_lock.
Testing:
One concern, due to the expanded critical section, is reentrancy: NmtVirtualMemoryLocker is a HotSpot mutex and is not reentrant. I've added new tests in test_os.cpp and test_virtualMemoryTracker.cpp that try to exercise any usages of NMT that weren't already exercised by existing tests.
tier1 passes on Linux and Windows. I do not have an AIX machine to test on. Can someone please help run the tests on AIX?
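To illustrate the reentrancy concern (and why the map_memory_to_file error path calls os::pd_release_memory directly), here is a minimal, self-contained model; the names below are placeholders built on std::mutex, not the real HotSpot code:

#include <cstddef>
#include <mutex>

static std::mutex nmt_lock;                              // models NmtVirtualMemoryLocker
static bool pd_release(void*, size_t) { return true; }   // placeholder for os::pd_release_memory
static void nmt_record_release(void*, size_t) {}         // placeholder for the NMT booking

// Models os::release_memory: takes the NMT lock around the OS call and the booking.
bool release_memory(void* addr, size_t bytes) {
  std::lock_guard<std::mutex> guard(nmt_lock);
  bool ok = pd_release(addr, bytes);
  if (ok) nmt_record_release(addr, bytes);
  return ok;
}

// Models an error path that already holds the lock. Calling release_memory()
// here would try to acquire nmt_lock a second time and deadlock, because
// std::mutex (like a HotSpot Mutex) is not reentrant; the cleanup therefore
// calls the unlocked pd_release() directly.
bool mapping_failed_cleanup(void* addr, size_t bytes) {
  std::lock_guard<std::mutex> guard(nmt_lock);
  return pd_release(addr, bytes);                        // not release_memory()
}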
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24084/head:pull/24084
$ git checkout pull/24084
Update a local copy of the PR:
$ git checkout pull/24084
$ git pull https://git.openjdk.org/jdk.git pull/24084/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24084
View PR using the GUI difftool:
$ git pr show -t 24084
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24084.diff
Using Webrev
Link to Webrev Comment