8341491: Reserve and commit memory operations should be protected by NMT lock #24084
Conversation
👋 Welcome back roberttoyonaga! A progress list of the required criteria for merging this PR into
@roberttoyonaga This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 831 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@stefank, @tstuefe) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type
@roberttoyonaga The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
/label serviceability
Ping @JoKern65 for AIX
I don't see why we need to extend the lock to be held over the reserve/commit/alloc part.
If we only extend the lock on the release side, then it looks like we get the required exclusion:
lock
1.1 Thread_1 releases range_A.
1.2 Thread_1 tells NMT "range_A has been released".
unlock
2.1 Thread_2 reserves (the now free) range_A.
lock
2.2 Thread_2 tells NMT "range_A is reserved".
unlock
We can get ordering (1.1) (2.1) (1.2) (2.2) but we can't get (1.1) (2.1) (2.2) (1.2).
And isn't this locking scheme exactly what the current code is using? Have you seen an issue that this proposed PR intends to solve? If there is such a problem I wonder if there's just a missing lock extension in one of the "release" operations.
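To make the exclusion argument concrete, here is a minimal, self-contained model of that asymmetric scheme using std::mutex; pd_reserve, pd_release and the nmt_record_* helpers are hypothetical stand-ins, not the real HotSpot entry points:

#include <cstddef>
#include <mutex>

static std::mutex nmt_lock;                                 // models the NMT lock
static void* pd_reserve(size_t)                { return reinterpret_cast<void*>(0x1000); }
static bool  pd_release(void*, size_t)         { return true; }
static void  nmt_record_reserve(void*, size_t) { /* NMT booking */ }
static void  nmt_record_release(void*, size_t) { /* NMT booking */ }

// Release side: the lock spans both (1.1) and (1.2), so no other thread's
// NMT booking can interleave between them.
bool release_memory(void* addr, size_t bytes) {
  std::lock_guard<std::mutex> guard(nmt_lock);
  bool ok = pd_release(addr, bytes);               // (1.1)
  if (ok) nmt_record_release(addr, bytes);         // (1.2)
  return ok;
}

// Reserve side: (2.1) happens outside the lock; only (2.2) is under it.
// (2.1) may interleave with (1.1)/(1.2), but (2.2) cannot, which is why
// (1.1) (2.1) (2.2) (1.2) cannot happen, while (1.1) (2.1) (1.2) (2.2) can.
void* reserve_memory(size_t bytes) {
  void* addr = pd_reserve(bytes);                  // (2.1)
  if (addr != nullptr) {
    std::lock_guard<std::mutex> guard(nmt_lock);
    nmt_record_reserve(addr, bytes);               // (2.2)
  }
  return addr;
}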
What about the case where one thread reserves a range and another thread releases it?
1. Thread A reserves a range.
2. Thread B releases that range (before Thread A has told NMT about the reservation).
3. Thread B tells NMT the range has been released.
4. Thread A tells NMT the range is reserved.
This would either result in an assert in NMT at step 3, when releasing a range NMT does not know, or in an incorrectly booked range in step 4 without asserts. Am I making a thinking error somewhere?
Hi @stefank, I think you're right about (1.1) (2.1) (2.2) (1.2) being prevented by the current implementation. Is there a reason that the current implementation only does the wider locking for release/uncommit? Maybe (2.1) (1.1) (1.2) (2.2) isn't much of an issue since it's unlikely that another thread would uncommit/release the same base address shortly after it's committed/reserved?
I haven't seen that race in the wild; I just noticed that the memory operations weren't protected and thought that it could be a problem.
In a scenario like that, doesn't Thread A have to communicate somehow that Thread B can now start using and/or releasing that reservation? It sounds like you would have a race condition if you don't have synchronization to hand over the reservation from one thread to another. I would expect that such communication would be placed after the NMT booking in Thread A.
Is there any code that we know of that doesn't fit into a synchronization pattern similar to the above? I can think of some contrived example where Thread B asks the OS for memory mappings and uses that to ascertain that a pre-determined address has been reserved, and how that could lead to an incorrect booking as you described, but do we really have code like that? If we do, should we really have code like that? Are there some other patterns that I'm missing? Related to this, I talked to @xmas92 and he mused about swapping the order of the release and NMT release booking as a way to shrink the lock scope for the release operation:
As long as we hold the reservation, no other thread can re-reserve the reservation, so Thread B can take its time to first perform the NMT release booking under the lock, and then perform the release without the lock. If another thread (say Thread C) manages to re-reserve the memory, it sounds reasonable that the NMT release booking should have already completed. If we were to try this out, we would have to verify/ensure that the release/reserve pairs perform enough synchronization to make this work. Axel can probably correct me if I mischaracterized what he wrote.
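Sketching that reordering with the same placeholder names as in the model above (only the booking stays under the lock; the OS release moves outside it):

// Reordered release: book the NMT release first, under the lock, then do the
// OS release outside the lock. Until pd_release() returns, the OS still
// considers the range mapped, so no other thread can re-reserve it and race
// with the booking above.
bool release_memory_reordered(void* addr, size_t bytes) {
  {
    std::lock_guard<std::mutex> guard(nmt_lock);
    nmt_record_release(addr, bytes);   // NMT booking completes first
  }
  return pd_release(addr, bytes);      // OS release, no lock held
}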
I'm very curious to find out if anyone knows how this could happen without a race condition hand-over from one thread to another. (See my answer to Stüfe).
OK. Let's see if anyone finds a hole in my arguments given above.
Hi @stefank, I think that's true. But if the release/uncommit does not complete successfully we would need to readjust the accounting afterward. To do that we would need to retrieve the original memtag (at least for reserved regions) and potentially need to retrieve the original callsite data (if we're in detailed mode).
When does a release/uncommit fail? Would that be a JVM bug? What state is the memory in when such a failure happens? Do we even know if the memory is still committed if an uncommit fails? If I look at the man page for munmap it only fails if you pass in incorrect values, which sounds like a JVM bug to me. I don't understand why we don't treat that as a fatal error OR make sure that all call-sites handle that error, which they don't do today.
From what I can tell, it doesn't look like that's happening anywhere; someone else may know better, though. Similarly, for uncommit, the base address must be passed over from somewhere else in the JVM, so relying on some external synchronization seems reasonable here too. If this problem scenario is not present in the current code and it's not expected to become a possibility in the future, then I suppose there's no reason to guard against it. Maybe just a comment explaining the reasoning is good enough (and a warning not to use such patterns).
On Windows, VirtualFree also looks like it only fails if an invalid set of arguments is passed. So if os::pd_release fails it's probably a JVM bug. Uncommit uses mmap, which could fail for a larger variety of reasons. Some reasons are out of the JVM's control. For example: "The number of mapped regions would exceed an implementation-defined limit (per process or per system)." See here
If release/uncommit fails, then it would be hard to know what state the target memory is in. If the arguments are invalid (bad base address), the target region may not even be allocated. Or, in the case of uncommit, if the base address is not aligned, maybe the target committed region does indeed exist but the uncommit still fails. So it would be hard to determine how to readjust the NMT accounting afterward.
I think release/uncommit failures should be handled by the callers. Currently, uncommit failure is handled in most places by the caller; release failure mostly seems not to be. Since, at least for uncommit, we could sometimes fail for valid reasons, I think we shouldn't fail fatally in the os:: functions.
Right. And that failure is fatal, so there should be no need to fix any NMT bookkeeping for that.
Agreed. And this would be a pre-existing problem already. If a release/uncommit fails, then we have similar issues for that as well.
I would like to drill a bit deeper into this. Do you have any concrete examples of an uncommit failure that should not be handled as a fatal error?
Stefan, your analysis sounds reasonable. Don't see a hole. The original issue was from me, I think, but I've never seen that variant in real life. So, I am fine with leaving that scenario out.
I second @stefank here. Uncommit can fail, ironically, with an ENOMEM: if the uncommit punches a hole into a committed region, this would cause a new VMA on the kernel side. This may fail if we run against the limit for VMAs. Forgot what it was, some sysconf setting. All of this is Linux-specific, though. I don't think this should be unconditionally a fatal error, since the allocator (whatever it is) can decide to re-commit the region later, and thus "self-heal" itself.
This happens when we hit the /proc/sys/vm/max_map_count limit, and this immediately crashes the JVM.
Is this referring to failures when we hit the max_map_count limit? I'm not convinced that you can recover from that without immediately hitting the same issue somewhere else in the code. Or maybe you are thinking about some of the other reasons for the uncommit to fail?
Hi @stefank, Are you referring to the difficulty in determining the original allocation as being the pre-existing problem? I think that only becomes an issue if we decide to swap the order of NMT booking and the memory release/uncommit (assuming we don't just fail fatally). Since we don't need to readjust currently, if there's a failure we can just leave everything as it is.
My thinking is that if there is a failure you don't know what state the OS left the memory in. So, you don't know whether the memory was in fact unmapped as requested, or if it was left intact, or even something in-between. So, if you don't do the matching NMT bookkeeping there will be a mismatch between the state of the memory and what has been bookkept in NMT.
The above example shows code that assumes that it is OK to fail uncommitting and continue. I'm trying to figure out if that assumption is true. So, what I meant was that I was looking for a concrete example of a failure mode of uncommit that would be an acceptable (safe) failure to continue executing from. That is, a valid failure that doesn't mess up the memory in an unpredictable/unknowable way.
Tonight we tested this PR on AIX and it failed in the gtest with Internal Error (os_aix.cpp:1917), pid=26476938, tid=258. This will happen if a
Thank you for running the tests on AIX. I've excluded the file mapping tests that don't meet that criterion on AIX.
I ran the tests over the weekend again and now they passed.
I think this looks good to me, but please seek feedback from others as well.
I've added a couple of suggestions. None of them are required, but I think they would be nice to do.
src/hotspot/share/runtime/os.cpp
// The scope of NmtVirtualMemoryLocker covers both pd_uncommit_memory and record_virtual_memory_uncommit because
// these operations must happen atomically to avoid races. For example, if Thread_1 were to uncommit a region, but
// before Thread_1 can update NMT accounting, Thread_2 commits overlapping the same region and updates NMT,
// then Thread_1 finishes updating NMT. This would result in NMT perceiving part of the region being uncommitted,
// when it is actually committed. The opposite scenario is not guarded against. pd_commit_memory and
// record_virtual_memory_commit do not happen atomically. We assume that there is some external synchronization
// that prevents a region from being uncommitted before it is finished being committed.
It's not a requirement, but you get kudos from me if you keep comment lines below 80 characters. I typically don't insist on code staying below 80 characters, but comments tend to be nicer if they are.
OK, I'll try to shorten these comments a bit.
Co-authored-by: Stefan Karlsson <stefan.karlsson@oracle.com>
Thank you @stefank for the feedback. I've applied your suggestions. @tstuefe, when you have time, can you please have another look at this? Based on the discussion above, I've reverted the changes to the locking scope in favor of new comments explaining the asymmetrical locking and warning against patterns that lead to races. The new tests that are still relevant are kept.
Okay, thank you for doing this. I am fine with this version.
Thank you everyone for the review!
/integrate
/sponsor
Going to push as commit 44c5aca.
Your commit was automatically rebased without conflicts.
@tstuefe @roberttoyonaga Pushed as commit 44c5aca. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Update:
After some discussion it was decided that it's not necessary to expand the lock scope for reserve/commit. Instead, we are opting to add comments explaining the reasons for locking and warning against the conditions that could lead to races. Some of the new tests can be kept because they are general enough to be useful outside of this context.
Summary:
This PR makes memory operations atomic with NMT accounting.
The problem:
In memory-related functions like os::commit_memory and os::reserve_memory, the OS memory operations are currently done before acquiring the NMT mutex, and the virtual memory accounting is done later in MemTracker, after the lock has been acquired. Doing the memory operations outside of the lock scope can lead to races:
1.1 Thread_1 releases range_A.
1.2 Thread_1 tells NMT "range_A has been released".
2.1 Thread_2 reserves (the now free) range_A.
2.2 Thread_2 tells NMT "range_A is reserved".
Since the sequence (1.1) (1.2) is not atomic, if Thread_2 begins operating after (1.1), we can have (1.1) (2.1) (2.2) (1.2). The OS sees two valid subsequent calls (release range_A, followed by map range_A). But NMT sees "reserve range_A", then "release range_A", and is now out of sync with the OS.
Solution:
Where memory operations such as reserve, commit, or release virtual memory happen, I've expanded the scope of NmtVirtualMemoryLocker to protect both the NMT accounting and the memory operation itself.
Other notes:
I also simplified this pattern, found in many places (sketched approximately below):
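(In the sketch, pd_some_operation stands in for whichever platform call a given site makes, and MemTracker::record_some_operation for the matching NMT record call; the argument lists are illustrative only.)

if (MemTracker::enabled()) {
  // Take the NMT lock only when NMT is enabled, then record the result.
  NmtVirtualMemoryLocker nvml;
  result = pd_some_operation(addr, bytes);
  if (result != nullptr) {
    MemTracker::record_some_operation(result, bytes);
  }
} else {
  result = pd_some_operation(addr, bytes);
}

To:

// NmtVirtualMemoryLocker is now a no-op when NMT is disabled, and the record
// call itself checks MemTracker::enabled() and the nullptr result.
NmtVirtualMemoryLocker nvml;
result = pd_some_operation(addr, bytes);
MemTracker::record_some_operation(result, bytes);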
This is possible because NmtVirtualMemoryLocker now checks MemTracker::enabled(), and MemTracker::record_some_operation already checks MemTracker::enabled() and checks against nullptr. This refactoring previously wasn't possible because ThreadCritical was used before #22745 introduced NmtVirtualMemoryLocker.
I considered moving the locking and NMT accounting down into platform-specific code, e.g. locking around { munmap() + MemTracker::record }. The hope was that this would help reduce the size of the critical section. However, I found that the OS-specific "pd_" functions are already short and to the point, so doing this wasn't reducing the lock scope very much. Instead it just makes the code messier by having to maintain the locking and NMT accounting in each platform-specific implementation.
In many places I've done minor refactoring by relocating calls to MemTracker in order to tighten the locking scope.
In some OS-specific code (such as os::map_memory_to_file), I've replaced calls to os::release_memory with os::pd_release_memory. This is to avoid NmtVirtualMemoryLocker reentrancy (see the reentrancy sketch at the end of the Testing section below).
In a few places (such as VirtualMemoryTracker::add_reserved_region) I have replaced tty with defaultStream::output_stream(). Otherwise NmtVirtualMemory_lock would be acquired out of rank order with tty_lock.
Testing:
One concern, due to the expanded critical section, is reentrancy: NmtVirtualMemoryLocker is a HotSpot mutex and is not reentrant. I've added new tests in test_os.cpp and test_virtualMemoryTracker.cpp that try to exercise any usages of NMT that weren't already exercised by existing tests.
tier1 passes on Linux and Windows. I do not have an AIX machine to test on. Can someone please help run the tests on AIX?
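To illustrate the reentrancy concern (and why the map_memory_to_file error path calls os::pd_release_memory directly), here is a minimal, self-contained model; the names below are placeholders built on std::mutex, not the real HotSpot code:

#include <cstddef>
#include <mutex>

static std::mutex nmt_lock;                              // models NmtVirtualMemoryLocker
static bool pd_release(void*, size_t) { return true; }   // placeholder for os::pd_release_memory
static void nmt_record_release(void*, size_t) {}         // placeholder for the NMT booking

// Models os::release_memory: takes the NMT lock around the OS call and the booking.
bool release_memory(void* addr, size_t bytes) {
  std::lock_guard<std::mutex> guard(nmt_lock);
  bool ok = pd_release(addr, bytes);
  if (ok) nmt_record_release(addr, bytes);
  return ok;
}

// Models an error path that already holds the lock. Calling release_memory()
// here would try to acquire nmt_lock a second time and deadlock, because
// std::mutex (like a HotSpot Mutex) is not reentrant; the cleanup therefore
// calls the unlocked pd_release() directly.
bool mapping_failed_cleanup(void* addr, size_t bytes) {
  std::lock_guard<std::mutex> guard(nmt_lock);
  return pd_release(addr, bytes);                        // not release_memory()
}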
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24084/head:pull/24084
$ git checkout pull/24084
Update a local copy of the PR:
$ git checkout pull/24084
$ git pull https://git.openjdk.org/jdk.git pull/24084/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24084
View PR using the GUI difftool:
$ git pr show -t 24084
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24084.diff
Using Webrev
Link to Webrev Comment