
8304824: NMT should not use ThreadCritical #20852


Closed
roberttoyonaga wants to merge 10 commits into master from the JDK-8304824 branch

Conversation

roberttoyonaga
Contributor

@roberttoyonaga roberttoyonaga commented Sep 4, 2024

Summary

This PR replaces ThreadCritical with a lock specific to NMT. ThreadCritical is a big lock and is unnecessary for the purposes of NMT. I've implemented the new lock with a semaphore so that it can be used early, before VM init. There is also the possibility of adding assertions in places where we expect NMT to have synchronization. I haven't added assertions in many places yet because some code paths, such as the NMT tests, don't lock yet. I think it makes sense to close any gaps in locking in another PR, in which I can also add more assertions.

Testing:

  • hotspot_nmt
  • gtest:VirtualSpace
  • tier1

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8304824: NMT should not use ThreadCritical (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20852/head:pull/20852
$ git checkout pull/20852

Update a local copy of the PR:
$ git checkout pull/20852
$ git pull https://git.openjdk.org/jdk.git pull/20852/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20852

View PR using the GUI difftool:
$ git pr show -t 20852

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20852.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Sep 4, 2024

👋 Welcome back roberttoyonaga! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Sep 4, 2024

@roberttoyonaga This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8304824: NMT should not use ThreadCritical

Reviewed-by: stuefe, dholmes, jsjolen

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 588 new commits pushed to the master branch:

  • d49f210: 8342040: Further improve entry lookup performance for multi-release JARs
  • d2e716e: 8331958: Update PC/SC Lite for Suse Linux to 2.3.0
  • a95374f: 8343101: Rework BasicTest.testTemp test cases
  • 00fe9f7: 8343100: Consolidate EmptyFolderTest and EmptyFolderPackageTest jpackage tests into single java file
  • 9f6d5b4: 8343020: (fs) Add support for SecureDirectoryStream on macOS
  • 1341b81: 8341666: FileInputStream doesn't support readAllBytes() or readNBytes(int) on pseudo devices
  • 52382e2: 8338021: Support new unsigned and saturating vector operators in VectorAPI
  • e659d9d: 8342975: C2: Micro-optimize PhaseIdealLoop::Dominators()
  • 9f6211b: 8341371: CDS cannot load archived heap objects with -XX:+UseSerialGC -XX:-UseCompressedOops
  • 120a935: 8342561: Metaspace for generated reflection classes is no longer needed
  • ... and 578 more: https://git.openjdk.org/jdk/compare/f0ae90f30c346544e87217ef1832d6a350fe1985...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dholmes-ora, @gerard-ziemski, @jdksjolen, @tstuefe) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot changed the title from 8304824 to 8304824: NMT should not use ThreadCritical Sep 4, 2024
@openjdk

openjdk bot commented Sep 4, 2024

@roberttoyonaga To determine the appropriate audience for reviewing this pull request, one or more labels corresponding to different subsystems will normally be applied automatically. However, no automatic labelling rule matches the changes in this pull request. In order to have an "RFR" email sent to the correct mailing list, you will need to add one or more applicable labels manually using the /label pull request command.

Applicable Labels
  • build
  • client
  • compiler
  • core-libs
  • graal
  • hotspot
  • hotspot-compiler
  • hotspot-gc
  • hotspot-jfr
  • hotspot-runtime
  • i18n
  • ide-support
  • javadoc
  • jdk
  • jmx
  • kulla
  • net
  • nio
  • security
  • serviceability
  • shenandoah

@roberttoyonaga
Contributor Author

/label serviceability

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Sep 4, 2024
@openjdk

openjdk bot commented Sep 4, 2024

@roberttoyonaga
The serviceability label was successfully added.

@roberttoyonaga roberttoyonaga force-pushed the JDK-8304824 branch 2 times, most recently from 7ac88d2 to a9731e2 Compare September 4, 2024 23:24
@roberttoyonaga
Contributor Author

The GHA macos-aarch64 / test - Test (tier1) job is failing in runtime/cds/DeterministicDump.java with the message java.lang.RuntimeException: File content different at byte #4, b0 = -126, b1 = -52.
But I don't think it's related to the changes in this PR; the same test is also failing on other recent PRs.

@roberttoyonaga roberttoyonaga marked this pull request as ready for review September 5, 2024 21:25
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 5, 2024
@mlbridge

mlbridge bot commented Sep 5, 2024

Member

@dholmes-ora dholmes-ora left a comment


Okay in principle except for the reentrancy aspect - see below.

A few other minor nits.

Thanks

Comment on lines 39 to 40
Semaphore NmtGuard::_nmt_semaphore(1);
intx volatile NmtGuard::_owner((intx) -1);
Member

Please add explanatory comments.

Comment on lines 32 to 34
#include "runtime/os.hpp"
#include "runtime/atomic.hpp"
#include "runtime/javaThread.hpp"
Member

Please list in alphabetic order.

NmtGuard() {
intx const current = os::current_thread_id();
intx const owner = Atomic::load(&_owner);
assert(current != owner, "Lock is not reentrant");
Member
The reason will be displayed to describe this comment to others. Learn more.

ThreadCritical is reentrant though. We would need to do a lot of testing and/or code analysis to ensure we don't have the possibility of needing a reentrant lock with NMT. The most likely problems would be if we triggered a crash whilst in a locked section of code.

Member
The reason will be displayed to describe this comment to others. Learn more.

We already have a healthy number of tests. We need wide use of assert_locked and need to take a close look at error reporting. Locking in NMT is rather simple.

Error handling: My pragmatic "good enough for this rare scenario" proposal would be:

At the entry of error reporting, if the NMT mutex is locked by the crashing thread, just unlock it manually and try to continue. The chance is high that this will actually work well enough to survive the few mallocs we may do during error reporting.

Reasoning: We use the NMT mutex as we did ThreadCritical: rather widely scoped. The time windows where things are out of sync and reentrant use of the same functionality can bite us are small. Moreover, NMT's two subsystems - malloc tracker and mmap tracker - are mostly independent of each other, and even leaving the mmap tracker in an inconsistent state will not prevent us from doing malloc during error reporting. We don't use mmap during error reporting, just malloc. Finally, NMT's malloc code is comparatively simple and stable; the mmap code is more complex, and we plan to rework parts of it in the near future.

TL;DR I think just manually releasing the lock in error reporting has a high chance of succeeding. Especially if we disable the NMT report in the hs-err file I am not even sure this is necessary, though. We have safeguards against error reporting locking up (step timeouts and secondary crash guards).

Member

Thinking through this more, we can probably shorten my advice to this:

When entering error handling, if NMT lock is held by crashing thread and NMT level is detail, switch back to NMT level summary.

That's it.

Reasoning: we only do mallocs inside hs-err generation. For malloc, we only lock inside NMT if the NMT level is detail. And this should never change: malloc is too hot to justify a lock in summary mode.

The only point where possible crashes or deadlocks can occur in error reporting is when doing the NMT report itself - but chances for this to happen are remote, and we have secondary crash handling and error step timeouts to deal with this if it happens.

Contributor Author

Ok I've added some logic in VMError::report to switch to summary mode (I think this is the right place). However, if we need to do error reporting while the NmtGuard is already held, then I think we might still have reentrancy even in summary mode. After the latest commit that replaces MemoryFileTracker's locker with NmtGuard, MemBaseline::baseline_summary() now acquires NmtGuard too.

Because of this I've also made NmtGuard reentrant. Maybe handling reentrancy would be beneficial in future/unforeseen situations as well. But if this is unnecessary, I don't mind removing it.

Comment on lines 166 to 168
intx const current = os::current_thread_id();
intx const owner = Atomic::load(&_owner);
assert(current == owner, "NMT lock should be acquired in this section.");
Member

Please put the entire body in ifdef DEBUG so we don't execute anything in release builds.

Also the assert should report the values of owner and current when they are not as expected.

@tstuefe
Member

tstuefe commented Sep 6, 2024

I plan to look at this start of next week.

@jdksjolen
Contributor

Thank you Robert.

I think that the MemoryFileTracker's locker should probably also be replaced with this semaphore, as in the future we have plans to have a globally shared NativeCallStackStorage.

@tstuefe
Member

tstuefe commented Sep 6, 2024

Thank you Robert.

I think that the MemoryFileTracker's locker should probably also be replaced with this semaphore, as in the future we have plans to have a globally shared NativeCallStackStorage.

+1

@tstuefe
Member

tstuefe commented Sep 6, 2024

@roberttoyonaga Why don't we use a normal hotspot mutex here?

Comment on lines 165 to 168
static void assert_locked(){
intx const current = os::current_thread_id();
intx const owner = Atomic::load(&_owner);
assert(current == owner, "NMT lock should be acquired in this section.");


How about we try to re-use assert_locked() in constructor and destructor, something like:

    NmtGuard() {
      assert(!assert_locked(), "Lock is not reentrant");
      intx const current = os::current_thread_id();
      _nmt_semaphore.wait();
      Atomic::store(&_owner, current);
    }

    ~NmtGuard() {
      assert(assert_locked(), "NMT lock should be acquired in this section.");
      Atomic::store(&_owner, (intx) -1);
      _nmt_semaphore.signal();
    }

    static bool assert_locked(){
      intx const current = os::current_thread_id();
      intx const owner = Atomic::load(&_owner);
      return (current == owner);
    }

Contributor Author

I think it's a good idea to use it in the destructor. But based on the discussion above, it seems like we shouldn't assert the lock can't be reentered.

@roberttoyonaga
Contributor Author

@roberttoyonaga Why don't we use a normal hotspot mutex here?

Hi @tstuefe. I tried using a regular mutex along with MutexLocker, but it seems like the mutex isn't initialized early enough to be used during VM init. After building I get:

assert(NMT_lock != nullptr) failed: not initialized!

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x178448c]  VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x948  (nmtCommon.hpp:137)
V  [libjvm.so+0x1783aed]  (vmError.cpp:1611)
V  [libjvm.so+0xa8e2d2]  report_vm_status_error(char const*, int, char const*, int, char const*)+0x0  (debug.cpp:193)
V  [libjvm.so+0x8cb317]  NMTUtil::nmt_lock()+0x69  (nmtCommon.hpp:137)
V  [libjvm.so+0x13a0459]  MemTracker::record_virtual_memory_reserve(void*, unsigned long, NativeCallStack const&, MEMFLAGS)+0x36  (memTracker.hpp:128)
V  [libjvm.so+0x139d292]  os::reserve_memory(unsigned long, bool, MEMFLAGS)+0x80  (os.cpp:1877)
V  [libjvm.so+0xa8fc31]  initialize_assert_poison()+0x1f  (debug.cpp:712)
V  [libjvm.so+0x16e5720]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x1be  (threads.cpp:489)
V  [libjvm.so+0xf62555]  JNI_CreateJavaVM_inner(JavaVM_**, void**, void*)+0xbd  (jni.cpp:3582)
V  [libjvm.so+0xf629e3]  JNI_CreateJavaVM+0x32  (jni.cpp:3673)
C  [libjli.so+0x8c04]  InitializeJVM+0x14d  (java.c:1592)
C  [libjli.so+0x5337]  JavaMain+0xe8  (java.c:505)
C  [libjli.so+0xc61e]  ThreadJavaMain+0x27  (java_md.c:653)
C  [libc.so.6+0x97507]  start_thread+0x377

@roberttoyonaga
Contributor Author

Thank you Robert.

I think that the MemoryFileTracker's locker should probably also be replaced with this semaphore, as in the future we have plans to have a globally shared NativeCallStackStorage.

Hi @jdksjolen. No problem! Ok I've replaced it.

@tstuefe
Member

tstuefe commented Sep 11, 2024

@roberttoyonaga Why don't we use a normal hotspot mutex here?

Hi @tstuefe. I tried using a regular mutex along with MutexLocker, but it seems like the mutex isn't initialized early enough to be used during VM init. After building I get:

assert(NMT_lock != nullptr) failed: not initialized!

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x178448c]  VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x948  (nmtCommon.hpp:137)
V  [libjvm.so+0x1783aed]  (vmError.cpp:1611)
V  [libjvm.so+0xa8e2d2]  report_vm_status_error(char const*, int, char const*, int, char const*)+0x0  (debug.cpp:193)
V  [libjvm.so+0x8cb317]  NMTUtil::nmt_lock()+0x69  (nmtCommon.hpp:137)
V  [libjvm.so+0x13a0459]  MemTracker::record_virtual_memory_reserve(void*, unsigned long, NativeCallStack const&, MEMFLAGS)+0x36  (memTracker.hpp:128)
V  [libjvm.so+0x139d292]  os::reserve_memory(unsigned long, bool, MEMFLAGS)+0x80  (os.cpp:1877)
V  [libjvm.so+0xa8fc31]  initialize_assert_poison()+0x1f  (debug.cpp:712)
V  [libjvm.so+0x16e5720]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x1be  (threads.cpp:489)
V  [libjvm.so+0xf62555]  JNI_CreateJavaVM_inner(JavaVM_**, void**, void*)+0xbd  (jni.cpp:3582)
V  [libjvm.so+0xf629e3]  JNI_CreateJavaVM+0x32  (jni.cpp:3673)
C  [libjli.so+0x8c04]  InitializeJVM+0x14d  (java.c:1592)
C  [libjli.so+0x5337]  JavaMain+0xe8  (java.c:505)
C  [libjli.so+0xc61e]  ThreadJavaMain+0x27  (java_md.c:653)
C  [libc.so.6+0x97507]  start_thread+0x377

Hi @roberttoyonaga,

sorry for the sluggish responses, I am swamped atm. Thanks a lot for your work here.

I dislike the self-written lock because of duplication and because Mutex gives us infrastructure benefits we don't have here. E.g. locking monitoring, deadlock detection, and things like assert_lock_strong.

You run into this problem at the initialization phase. I expected that, and here is where things get interesting - here and in deadlock detection.

I see two better solutions:

  1. During initialization, we don't need a lock. We are single-threaded (note that it may be different threads at different times, since different threads may handle CreateJavaVM and the initial invocation, but only one thread will run simultaneously). In any case, before the mutex system is initialized and the Mutexes are created, we don't need to lock.

So, you could use Mutex and MutexLocker but short-circuit locking if the mutex is still zero. This should work and should be lock/unlock balanced too. I believe MutexLocker already handles this case under the hood. You can call it with a nullptr mutex, and it will not lock. So you don't even have to do anything. Just don't assert.

  2. That place where we mmap is a rare (and I think probably the only) case where we mmap before mutex initialization. Pretty sure at least. This is just for poisoning the assertion page. This code could be moved downward a bit, since the assert poisoning is not essential for getting assertions to work, just a nice-to-have for added error info. That info does not really matter here; it's more for assertions in compiled code.

I prefer (1), since then, locking would work as it does in all other places in hotspot.

@roberttoyonaga
Contributor Author

Hi @tstuefe,

I prefer (1), since then, locking would work as it does in all other places in hotspot.

Ok I've switched to using a regular hotspot Mutex instead of my own lock based on a semaphore.

  • I had to decrease the rank of SharedDecoder_lock in order for it to be acquired after NMT_lock inside MemTracker::print_containing_region
  • I needed to add a check for Thread::current_or_null() != nullptr in the constructor/destructor for MutexLockerImpl because, during VM init, NMT_lock seems to be used by detached threads.
  • I'm still a little concerned about possible reentrancy due to the previous comment above so I manually unlock NMT_lock inside error reporting in addition to reducing tracking to summary level.

@roberttoyonaga
Contributor Author

After switching to a Hotspot Mutex, it looks like the windows-x64 / test (hs/tier1 common) GHA is failing because the test release_bad_ranges in test_os.cpp is expecting an assertion and an error message to be printed. However, when this printing happens, tty_lock gets acquired out of rank order with the already held NMT_lock, causing the lock rank assertion fail. One solution would be to lower the rank of tty_lock. I'm not sure that's the best solution because that might cause rank conflicts with other locks (and it makes sense to give locks the highest possible rank to minimize future conflicts).

@tstuefe
Member

tstuefe commented Sep 13, 2024

After switching to a Hotspot Mutex, it looks like the windows-x64 / test (hs/tier1 common) GHA is failing because the test release_bad_ranges in test_os.cpp is expecting an assertion and an error message to be printed. However, when this printing happens, tty_lock gets acquired out of rank order with the already held NMT_lock, causing the lock rank assertion fail. One solution would be to lower the rank of tty_lock. I'm not sure that's the best solution because that might cause rank conflicts with other locks (and it makes sense to give locks the highest possible rank to minimize future conflicts).

What code exactly locks tty_lock?

@tstuefe
Member

tstuefe commented Sep 13, 2024

After switching to a Hotspot Mutex, it looks like the windows-x64 / test (hs/tier1 common) GHA is failing because the test release_bad_ranges in test_os.cpp is expecting an assertion and an error message to be printed. However, when this printing happens, tty_lock gets acquired out of rank order with the already held NMT_lock, causing the lock rank assertion fail. One solution would be to lower the rank of tty_lock. I'm not sure that's the best solution because that might cause rank conflicts with other locks (and it makes sense to give locks the highest possible rank to minimize future conflicts).

What code exactly locks tty_lock?

This is weird. We print the error in this case via print_error_for_unit_test(), which takes care only to use raw fprintf(stderr) for this very reason... therefore I am interested in understanding the lock semantics, to know if deadlock detection found a real problem or if this is just a weird error setup.

Member

@tstuefe tstuefe left a comment


Good!

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 28, 2024
@roberttoyonaga
Contributor Author

Thank you everyone for the review!
/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Oct 28, 2024
@openjdk

openjdk bot commented Oct 28, 2024

@roberttoyonaga
Your change (at version 976e26c) is now ready to be sponsored by a Committer.

@tstuefe
Member

tstuefe commented Oct 29, 2024

/sponsor

@openjdk

openjdk bot commented Oct 29, 2024

Going to push as commit 0abfa3b.
Since your change was applied there have been 593 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 29, 2024
@openjdk openjdk bot closed this Oct 29, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 29, 2024
@openjdk

openjdk bot commented Oct 29, 2024

@tstuefe @roberttoyonaga Pushed as commit 0abfa3b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jdksjolen
Contributor

Hi @tstuefe, @roberttoyonaga

We saw some test failures in non-generational ZGC due to this change. That GC is removed now, so there's no worry there. I did however look at all of our remaining usages of ThreadCritical and saw that there are three left that are related to NMT:

src/hotspot/share/nmt/nmtUsage.cpp
56     ThreadCritical tc;

src/hotspot/share/nmt/mallocTracker.cpp
65     // Use ThreadCritical to make sure that mtChunks don't get deallocated while the
68     ThreadCritical tc;


src/hotspot/share/memory/arena.cpp
178      ThreadCritical tc;  // Free chunks under TC lock so that NMT adjustment is stable.

I am not that familiar with this code. Was leaving these as they are intentional, or have we introduced a potential bug?

Thank you.

@roberttoyonaga
Contributor Author

Hi @jdksjolen, I originally replaced those instances, but then put them back in because I saw that they are related to chunks allocation/deallocation code where it seemed ThreadCritical was used for other reasons.

I'm not certain, but looking at it again, it seems that the ThreadCritical uses in ChunkPool::deallocate_chunk and ChunkPool::prune() are only needed for NMT and are independent of the other ThreadCritical uses in that code.

@roberttoyonaga
Contributor Author

roberttoyonaga commented Oct 30, 2024

Also, those uses of ThreadCritical are not protecting NMT's model of virtual memory. They are only protecting the reporting of malloc info specifically. Although, it may still make sense to replace those instances with NmtVirtualMemory_lock since it is still related to NMT.

What do you think @tstuefe?

@jdksjolen
Contributor

I'm not certain, but looking at it again, it seems that the ThreadCritical uses in ChunkPool::deallocate_chunk and ChunkPool::prune() are only needed for NMT and are independent of the other ThreadCritical uses in that code.

At least for prune, that's needed for the chunk pool itself as it would otherwise be accessed concurrently.

Also, those uses of ThreadCritical are not protecting NMT's model of virtual memory. They are only protecting the reporting of malloc info specifically. Although, it may still make sense to replace those instances with NmtVirtualMemory_lock since it's is still related to NMT.

OK, I understand. That seems reasonable.

@tstuefe
Member

tstuefe commented Oct 30, 2024

We should get rid of this awful value correcting for arenas. I get headaches trying to think about this every time.

To remind everyone including me, some of the stuff is explained here: complex .

About locking around os::free - not sure if we can do this. Are all paths inside os::free lock free? It does call NMT. I think all cases inside NMT that it touches are lock free, but am not sure. Have to think about this a bit.

@tstuefe
Member

tstuefe commented Oct 30, 2024

About locking around os::free - not sure if we can do this. Are all paths inside os::free lock free? It does call NMT. I think all cases inside NMT that touches are lock free, but am not sure. Have to think about this a bit.

Ah okay should be fine. Both mallocs and frees should be lock free in NMT. In summary mode we just modify counters. In detail mode, we add to the malloc site table, but that one is lock free as well.

@roberttoyonaga
Contributor Author

roberttoyonaga commented Oct 30, 2024

Ok so there shouldn't be reentrancy if we use the NmtVirtualMemory_lock around os::free. But we also free in ChunkPool::prune() :

I'm not certain, but looking at it again, it seems that the ThreadCritical uses in ChunkPool::deallocate_chunk and ChunkPool::prune() are only needed for NMT and are independent of the other ThreadCritical uses in that code.

At least for prune, that's needed for the chunk pool itself as it would otherwise be accessed concurrently.

So that means we'd need to have both ThreadCritical and NmtVirtualMemory_lock in that method (if we were to do the other replacements). One to protect the chunks and one to protect the malloc accounting. It might also be good to rename NmtVirtualMemory_lock then too.

@tstuefe
Member

tstuefe commented Oct 31, 2024

So that means we'd need to have both ThreadCritical and NmtVirtualMemory_lock in that method (if we were to do the other replacements). One to protect the chunks and one to protect the malloc accounting. It might also be good to rename NmtVirtualMemory_lock then too.

Please split this code in two parts. First part removes retired chunks from the arena chunk list and adds them to a new temporary list. Second part goes through the temp list and deletes the chunks. First part under ThreadCritical, second part under NmtVirtualMemory_lock.

@tstuefe
Member

tstuefe commented Oct 31, 2024

I had to analyze this again, to understand why we need this locking, since my mind slips.

I updated my findings in https://bugs.openjdk.org/browse/JDK-8325890 . Please see details there.

But I don't think the current locking makes much sense, tbh (even before this patch). Or I don't get it.

We have three counters in play here:
A) the malloc[mtChunk] malloc counter for the mtChunk tag
B) the global malloc counter
C) the arena[xxx] arena counter for the mtXXX tag the arena is tagged with

Upon arena growth, we increment all three. In two steps - first (A) and (B) under os::malloc, then (C) in the arena code.

Upon arena death, we may delete the chunk, in which case we adjust all three counters as well.

But since we don't want (B) to include arena sizes, we have several places where we lazily adjust (B) and (A) from the sum of all arena counters (see JBS issue). These adjustments must be synchronized with arena allocations. But we don't do this. The lock does not include (C). For that, the lock would have to be at a different place, inside arena code. We only lock around (A) and (B).

I am not sure what the point of this whole thing is. I am also not convinced that it always works correctly. This whole "updating counters lazily" business makes the code brittle.

Also, the current locking does not guarantee correctness anyway. There are several places in the coding where we calculate e.g. the sum of all mallocs, but malloc counters are modified (rightly so) out of lock protection.

To make matters even more complex, we have not two locks here but three:

  • ThreadCritical
  • the new NmtVirtualMemory_lock
  • we also have the NMTQuery_lock, which guards NMT reports from jcmd exclusively

Not sure what the point of the latter even is. It certainly does not prevent us from getting reports while underlying data structures are being changed, unless I am overlooking something.

It would be very nice if someone other than me could analyze this.

@jdksjolen
Contributor

I had to analyze this again, to understand why we need this locking, since my mind slips.

I updated my findings in https://bugs.openjdk.org/browse/JDK-8325890 . Please see details there.

But I don't think the current locking makes much sense, tbh (even before this patch). Or I don't get it.

We have three counters in play here: A) the malloc[mtChunk] malloc counter for the mtChunk tag B) the global malloc counter C) the arena[xxx] arena counter for the mtXXX tag the arena is tagged with

Upon arena growth, we increment all three. In two steps - first (A) and (B) under os::malloc, then (C) in the arena code.

Upon arena death, we may delete the chunk, in which case we adjust all three counters as well.

But since we don't want (B) to include arena sizes, we have several places where we lazily adjust (B) and (A) from the sum of all arena counters (see JBS issue). These adjustments must be synchronized with arena allocations. But we don't do this. The lock does not include (C). For that, the lock would have to be at a different place, inside arena code. We only lock around (A) and (B).

I am not sure what the point of this whole thing is. I am also not convinced that it always works correctly. This whole "updating counters lazily" business makes the code brittle.

Also, the current locking does not guarantee correctness anyway. There are several places in the coding where we calculate e.g. the sum of all mallocs, but malloc counters are modified (rightly so) out of lock protection.

To make matters even more complex, we have not two locks here but three:

* ThreadCritical

* the new `NmtVirtualMemory_lock`

* we also have the `NMTQuery_lock`, which guards NMT reports from jcmd exclusively

Not sure what the point of the latter even is. It certainly does not prevent us from getting reports while underlying data structures are being changed, unless I am overlooking something.

It would be very nice if someone other than me could analyze this.

If this is so brittle and complex, then I wonder what you even get out of us doing the ThreadCritical trick. In other words, if we just removed it, would anyone notice and be able to discern what's occurred? Open question, I might do that change and run the tests on it.

@tstuefe
Member

tstuefe commented Oct 31, 2024

If this is so brittle and complex, then I wonder what you even get out of us doing the ThreadCritical trick. In other words, if we just removed it, would anyone notice and be able to discern what's occurred? Open question, I might do that change and run the tests on it.

@jdksjolen good question. I am not sure who introduced that, may have been Afshin as a fix for some NMT test.

@roberttoyonaga
Contributor Author

roberttoyonaga commented Oct 31, 2024

Hi @tstuefe,

But since we don't want (B) to include arena sizes, we have several places where we lazily adjust (B) and (A) from the sum of all arena counters (see JBS issue). These adjustments must be synchronized with arena allocations. But we don't do this. The lock does not include (C). For that, the lock would have to be at a different place, inside arena code

Yes, updating all three counters (A/B/C) is not atomic when growing or destroying an arena. I think the ThreadCritical that's meant to keep the accounting in sync needs to be further up the stack when destroying arenas. What's strange is that there's no ThreadCritical protecting the growing operation at all. Why was there only an attempt to protect the freeing?

I think this is also confusing because it's not consistent whether we count arena sizes as part of malloc. In the NMT report total, arena is counted toward malloc (so we have to subtract mtChunk). But in the reported individual categories, arena is disjoint from malloc.

The solution you mentioned in https://bugs.openjdk.org/browse/JDK-8325890 seems like it would solve this consistency problem while also avoiding the double counting problem with mtChunk. I guess the main problem with this is that the change will break things if people are relying on the reported arena values to not overlap with malloc.

@roberttoyonaga
Contributor Author

The easiest action to take would be to remove the ThreadCritical that attempts to keep NMT accounting in sync in arena.cpp. It doesn't actually cover all cases, so we might as well not bother locking at all, for slightly better performance and a reduced risk of bad lock ordering.

The next best solution would be to make arena size a subset of malloc size, and make mtChunk only responsible for free chunks not yet associated with other NMT categories. Then there would be no need for special adjustments when reporting, and thus no need for the special ThreadCritical locking here. And when calculating total sizes, we can simplify the calculation to summing up all "malloc" in MemoryCounter. No need to make a distinction between when it is/isn't appropriate to add on arena sizes. This would be a breaking change since it changes the meaning of arena, malloc, and mtChunk.

The third and highest effort solution would be to remove the chunk pooling altogether as suggested in https://bugs.openjdk.org/browse/JDK-8325890. This shares the benefits of solution 2, but also requires performance testing and has consequences that extend beyond NMT.

Also, the current locking does not guarantee correctness anyway. There are several places in the coding where we calculate e.g. the sum of all mallocs, but malloc counters are modified (rightly so) out of lock protection.

Yes, still none of the three above solutions solve this. Even with solution 2 or 3 it would be possible to update the arena sizes for each category concurrently with reporting (unless we introduce new locking).

@@ -28,6 +28,7 @@
#include "memory/allocation.hpp"
#include "runtime/flags/flagSetting.hpp"
#include "runtime/mutex.hpp"
#include "runtime/thread.hpp"

@mgronlun mgronlun Nov 1, 2024


This include is not needed because there are no uses that require the definition of Thread.

After pulling, I am getting circularity errors: runtime/thread.hpp -> jfr/support/jfrThreadLocal.hpp, which now needs runtime/mutexLocker.hpp (-> runtime/thread.hpp).

The include of "runtime/thread.hpp" can instead be placed in "nmt/nmtCommon.hpp" which has a use in the NmtVirtualMemoryLocker constructor.

Contributor


This include is not needed because there are no uses that require the definition of Thread.

Right, seems like the forward declaration used to be provided by memory/allocation.hpp. Let's get rid of the include and use a forward declaration of our own instead.

Contributor Author


Should I open another PR to remove it? I can also remove the locking related to NMT in arena.cpp in the same PR as well (as per this comment).

Contributor


I've fixed that particular bug and integrated it already :-).

Labels: integrated (Pull request has been integrated), serviceability (serviceability-dev@openjdk.org)