Fix NativeAOT GC hang on single-core systems by preventing RT signal queue overflow #121772
Co-authored-by: VSadov <8218165+VSadov@users.noreply.github.com>
Pull Request Overview
This PR fixes a GC hang on single-core systems under high CPU load by preventing RT signal queue overflow during thread suspension. The Unix implementation of PalHijack() was repeatedly sending suspension signals to threads that couldn't suspend immediately, causing the signal queue to overflow and preventing signal handlers from completing the suspension process.
Key Changes
- Added early-exit check in `PalHijack()` to avoid sending duplicate suspension signals
- Aligns Unix behavior with existing Windows implementation pattern
- Uses existing `IsActivationPending()` flag to track pending signals
I have run the entire Libraries test suite with Considering that this brings the signal strategy on parity with CoreCLR, I think it is safe to merge.
We should watch for impact on benchmarks after merge, but I expect no change, since there will be no impact on suspension-friendly scenarios, which is the most common case. It would be nice to know if this resolves the reported issue. It should help.
Do we need an equivalent of this too? (`runtime/src/coreclr/vm/threadsuspend.cpp`, line 2216 in f62405c)
Good catch. We need that if we do not already have it.
On single-core systems under high CPU load, NativeAOT GC can hang for hours when suspending threads. The Unix implementation repeatedly sends RT signals to threads that cannot suspend immediately, causing signal queue overflow and handler starvation.
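The queueing behavior behind this overflow can be illustrated with a small, Linux-only sketch. It is not code from the runtime: `CountQueuedInstances` is a hypothetical helper that blocks a signal, sends it to the calling thread several times with `pthread_kill()`, and then counts how many instances were actually queued by draining them with `sigtimedwait()`.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>
#include <time.h>

// Hypothetical helper (not part of the runtime): sends `count` instances of
// `signo` to the calling thread while the signal is blocked, then drains the
// pending instances and returns how many were queued.
int CountQueuedInstances(int signo, int count) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);

    // Block the signal so that every instance stays pending instead of being
    // delivered. This models a thread that cannot react right away.
    int rc = pthread_sigmask(SIG_BLOCK, &set, nullptr);
    assert(rc == 0);
    (void)rc;

    for (int i = 0; i < count; ++i) {
        // pthread_kill() returns an error number (for example EAGAIN) once
        // the kernel can no longer queue the signal.
        if (pthread_kill(pthread_self(), signo) != 0) {
            break;
        }
    }

    // Dequeue pending instances one by one without waiting.
    int drained = 0;
    timespec zero{0, 0};
    while (sigtimedwait(&set, nullptr, &zero) == signo) {
        ++drained;
    }

    pthread_sigmask(SIG_UNBLOCK, &set, nullptr);
    return drained;
}
```

On Linux, calling `CountQueuedInstances(SIGRTMIN, 8)` should report every send as a separate queued instance, while a standard signal such as `SIGUSR1` collapses into a single pending instance. That per-send queueing is why an unbounded resend loop can fill the RT signal queue.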
Changes
- `IsActivationPending()` check in `PalHijack()` before sending suspension signals

The flag is set before sending `pthread_kill()` and cleared in the signal handler after processing, ensuring at most one pending signal per thread.

Fixes #121623
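The pending-flag protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `PalHijack()` change: `ThreadSketch`, `TryHijackSketch`, `ActivationHandlerSketch`, and `RunGuardDemo` are hypothetical names, an atomic boolean stands in for the `IsActivationPending()` flag, and a blocked signal stands in for a thread that cannot suspend immediately.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Hypothetical per-thread state; the runtime's real Thread class differs.
struct ThreadSketch {
    pthread_t handle{};
    std::atomic<bool> activationPending{false};  // stand-in for IsActivationPending()
    std::atomic<int> signalsSent{0};
    std::atomic<int> signalsHandled{0};
};

static ThreadSketch g_thread;  // a single target thread, for brevity

// Signal handler: the suspension work is omitted here. Clearing the flag at
// the end allows a later hijack attempt to send a fresh signal.
static void ActivationHandlerSketch(int) {
    g_thread.signalsHandled.fetch_add(1);
    g_thread.activationPending.store(false);
}

// Returns true if a signal was sent, false if one is still in flight.
bool TryHijackSketch(ThreadSketch& t, int signo) {
    if (t.activationPending.exchange(true)) {
        return false;  // early exit: the previous signal has not been handled
    }
    if (pthread_kill(t.handle, signo) != 0) {
        t.activationPending.store(false);  // send failed, allow a retry
        return false;
    }
    t.signalsSent.fetch_add(1);
    return true;
}

struct GuardDemoResult {
    int sentWhileBlocked;
    int handledAfterUnblock;
    bool resendAccepted;
};

// Drives the guard from one thread: the signal is blocked to model a thread
// that cannot suspend yet, then unblocked so the handler can run.
GuardDemoResult RunGuardDemo(int attempts) {
    const int signo = SIGRTMIN;
    g_thread.handle = pthread_self();
    g_thread.activationPending.store(false);
    g_thread.signalsSent.store(0);
    g_thread.signalsHandled.store(0);

    struct sigaction sa{};
    sa.sa_handler = ActivationHandlerSketch;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(signo, &sa, nullptr);
    assert(rc == 0);

    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    rc = pthread_sigmask(SIG_BLOCK, &set, nullptr);
    assert(rc == 0);
    (void)rc;

    // Repeated attempts, as a suspension retry loop would make them.
    for (int i = 0; i < attempts; ++i) {
        TryHijackSketch(g_thread, signo);
    }
    int sent = g_thread.signalsSent.load();

    // Unblocking lets the single pending signal reach the handler.
    pthread_sigmask(SIG_UNBLOCK, &set, nullptr);
    int handled = g_thread.signalsHandled.load();

    bool resend = TryHijackSketch(g_thread, signo);
    return {sent, handled, resend};
}
```

In this sketch, `RunGuardDemo(1000)` makes a thousand hijack attempts but only the first one reaches `pthread_kill()`; once the handler has run and cleared the flag, the next attempt is accepted again. The check and the set are folded into one atomic `exchange` here for brevity, which is a choice of this sketch rather than something the PR states.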
Original prompt
This section details on the original issue you should resolve
<issue_title>NativeAOT garbage collector hang</issue_title>
<issue_description>### Description
GC hangs for extended periods (several hours) on a single-core system running a NativeAOT build when CPU load is high.
Noticed with a web application that uses gRPC for internal communication between the dotnet application and the rest of the Linux system.
### Reproduction Steps
Run a NativeAOT build on a single CPU core, with high CPU load on that same core.
### Expected behavior
The application runs.
### Actual behavior
The application stops running for long periods, up to several hours.
### Regression?
Not known.
### Known Workarounds
Reduce the CPU load, increase the priority of the dotnet threads, or run the application on multiple cores.
### Configuration
Tested with .NET 9.0.2, 9.0.10 and 10.0
OS: Custom Preempt-RT Linux based on Yocto Scarthgap (kernel v6.1)
Architecture: ARM64
Dotnet application running in docker (aspnet noble-chiseled)
### Other information
If I understood the GC correctly, it tries to stop all other threads before running the actual collection. For the Unix NativeAOT runtime this is implemented using real-time signals: the GC sends the RT signal to every thread that needs to be suspended, and keeps doing so until the threads report that they have been suspended successfully.
AFAICT from looking at gdb, the problem is triggered when one thread cannot be suspended immediately and the thread trying to run the GC keeps filling the RT signal queue. The target thread then keeps running its signal handler over and over again and never completes the actual suspend.
Below is a snippet of the call stack shown by gdb:
Thread 16 (Thread 0xffbebefdf080 (LWP 4657) "WebHMI"):
#0 0x0000ffff8b767558 in ?? () from target:/lib/aarch64-linux-gnu/libc.so.6
#1 0x0000aaaab9a182fc in PalHijack (hThread=, pThreadToHijack=0xffbeee0af7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/unix/PalRedhawkUnix.cpp:1089
#2 0x0000aaaab99d7c78 in ThreadStore::SuspendAllThreads (this=, waitForGCEvent=) at /__w/1/s/src/coreclr/nativeaot/Runtime/threadstore.cpp:264
#3 0x0000aaaab99d27f4 in GCToEEInterface::SuspendEE (reason=) at /__w/1/s/src/coreclr/nativeaot/Runtime/gcenv.ee.cpp:50
#4 0x0000aaaab99e6f50 in WKS::GCHeap::GarbageCollectGeneration (this=, gen=0, reason=reason_alloc_soh) at /__w/1/s/src/coreclr/gc/gc.cpp:51181
#5 0x0000aaaab99e8220 in WKS::gc_heap::trigger_gc_for_alloc (gen_number=0, gen_number@entry=-1090665120, gr=3204301792, msl=msl@entry=0xaaaabcbcc5b8 WKS::gc_heap::more_space_lock_soh, loh_p=false, take_state=) at /__w/1/s/src/coreclr/gc/gc.cpp:18926
#6 0x0000aaaab99e8ea4 in WKS::gc_heap::try_allocate_more_space (acontext=acontext@entry=0xffbebefdf7f0, size=size@entry=48, flags=flags@entry=2, gen_number=0, gen_number@entry=-1090665024) at /__w/1/s/src/coreclr/gc/gc.cpp:19064
#7 0x0000aaaab9a0db78 in WKS::gc_heap::allocate_more_space (acontext=0xffbebefdf7f0, size=48, flags=2, alloc_generation_number=0) at /__w/1/s/src/coreclr/gc/gc.cpp:19564
#8 WKS::gc_heap::allocate (jsize=48, acontext=0xffbebefdf7f0, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:19595
#9 WKS::GCHeap::Alloc (this=, context=0xffbebefdf7f0, size=48, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:50122
#10 0x0000aaaab99d19fc in GcAllocInternal (pEEType=0xaaaabd4482d8, uFlags=2, numElements=0, pThread=0xffbebefdf7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/GCHelpers.cpp:542
#11 0x0000aaaab9a2550c in RhpNewObject () at /__w/1/s/src/coreclr/nativeaot/Runtime/arm64/AllocFast.S:88
#12 0x0000aaaabb0fd00c in ctor (this=...) at //src/coreclr/nativeaot/Common/src/System/Collections/Concurrent/ConcurrentUnifier.cs:69
#13 0x0000aaaaba258e00 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache__CreatePerNameQueryCaches (type=..., ignoreCase=true) at //src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:84
#14 0x0000aaaaba258d24 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache___ctor (this=..., type=...) at //src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:37
#15 0x0000aaaaba1ff314 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__get_Cache (this=...) at /_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.BindingFlags.cs:188
Thread 14 (Thread 0xffbeee0af080 (LWP 4622) "WebHMI"):
#0 ActivationHandler (code=34, siginfo=0xffbeee0ab3f0, context=...