Commit 5735338

Copilot and VSadov authored
Fix NativeAOT GC hang on single-core systems by preventing RT signal queue overflow (#121772)
On single-core systems under high CPU load, NativeAOT GC can hang for hours when suspending threads. The Unix implementation repeatedly sends RT signals to threads that cannot suspend immediately, causing signal queue overflow and handler starvation.

### Changes

- Add `IsActivationPending()` check in `PalHijack()` before sending suspension signals
- Prevents duplicate signals when activation is already pending
- Aligns Unix behavior with existing Windows implementation

```cpp
void PalHijack(Thread* pThreadToHijack)
{
    if (pThreadToHijack->IsActivationPending())
    {
        return;
    }

    pThreadToHijack->SetActivationPending(true);
    // ... send signal
}
```

The flag is set before calling `pthread_kill()` and cleared in the signal handler after processing, ensuring at most one pending signal per thread.

Fixes #121623

<!-- START COPILOT CODING AGENT SUFFIX -->

<details>

<summary>Original prompt</summary>

----

*This section details on the original issue you should resolve*

<issue_title>NativeAOT garbage collector hang</issue_title>
<issue_description>

### Description

GC hangs for extended periods (several hours) on single core system running NativeAOT build when CPU load is high. Noticed with a web application that uses gRPC for internal communication between the dotnet application and the rest of the linux system.

### Reproduction Steps

Run NativeAOT build on single CPU core, with high CPU load on that same core.

### Expected behavior

The application runs.

### Actual behavior

The application stops running for long periods, up to several hours.

### Regression?

Not known.

### Known Workarounds

Reduce the CPU load, increase the priority of the dotnet threads, or run the application on multiple cores.

### Configuration

- Tested with .NET 9.0.2, 9.0.10 and 10.0
- OS: Custom Preempt-RT Linux based on Yocto Scarthgap (kernel v6.1)
- Architecture: ARM64
- Dotnet application running in docker (aspnet noble-chiseled)

### Other information

If I understood the GC correctly, it tries to stop all other threads before running the actual collection. For the UNIX NativeAOT this is implemented using real-time signals, so that the GC sends the RT signal to all threads that need to be suspended, and it does this until the threads report that they have been suspended successfully.

AFAICT looking at gdb, the problem is triggered if one thread cannot be suspended immediately, and the thread trying to run the GC keeps on filling the rt-signal queue. This means that the thread will keep on running its signal handler over and over again and never complete the actual suspend.

Below is a snippet of the callstack shown by gdb:

    Thread 16 (Thread 0xffbebefdf080 (LWP 4657) "WebHMI"):
    #0 0x0000ffff8b767558 in ?? () from target:/lib/aarch64-linux-gnu/libc.so.6
    #1 0x0000aaaab9a182fc in PalHijack (hThread=<optimized out>, pThreadToHijack=0xffbeee0af7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/unix/PalRedhawkUnix.cpp:1089
    #2 0x0000aaaab99d7c78 in ThreadStore::SuspendAllThreads (this=<optimized out>, waitForGCEvent=<optimized out>) at /__w/1/s/src/coreclr/nativeaot/Runtime/threadstore.cpp:264
    #3 0x0000aaaab99d27f4 in GCToEEInterface::SuspendEE (reason=<optimized out>) at /__w/1/s/src/coreclr/nativeaot/Runtime/gcenv.ee.cpp:50
    #4 0x0000aaaab99e6f50 in WKS::GCHeap::GarbageCollectGeneration (this=<optimized out>, gen=0, reason=reason_alloc_soh) at /__w/1/s/src/coreclr/gc/gc.cpp:51181
    #5 0x0000aaaab99e8220 in WKS::gc_heap::trigger_gc_for_alloc (gen_number=0, gen_number@entry=-1090665120, gr=3204301792, msl=msl@entry=0xaaaabcbcc5b8 <WKS::gc_heap::more_space_lock_soh>, loh_p=false, take_state=<optimized out>) at /__w/1/s/src/coreclr/gc/gc.cpp:18926
    #6 0x0000aaaab99e8ea4 in WKS::gc_heap::try_allocate_more_space (acontext=acontext@entry=0xffbebefdf7f0, size=size@entry=48, flags=flags@entry=2, gen_number=0, gen_number@entry=-1090665024) at /__w/1/s/src/coreclr/gc/gc.cpp:19064
    #7 0x0000aaaab9a0db78 in WKS::gc_heap::allocate_more_space (acontext=0xffbebefdf7f0, size=48, flags=2, alloc_generation_number=0) at /__w/1/s/src/coreclr/gc/gc.cpp:19564
    #8 WKS::gc_heap::allocate (jsize=48, acontext=0xffbebefdf7f0, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:19595
    #9 WKS::GCHeap::Alloc (this=<optimized out>, context=0xffbebefdf7f0, size=48, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:50122
    #10 0x0000aaaab99d19fc in GcAllocInternal (pEEType=0xaaaabd4482d8, uFlags=2, numElements=0, pThread=0xffbebefdf7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/GCHelpers.cpp:542
    #11 0x0000aaaab9a2550c in RhpNewObject () at /__w/1/s/src/coreclr/nativeaot/Runtime/arm64/AllocFast.S:88
    #12 0x0000aaaabb0fd00c in ___ctor (this=...) at /_/src/coreclr/nativeaot/Common/src/System/Collections/Concurrent/ConcurrentUnifier.cs:69
    #13 0x0000aaaaba258e00 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache__CreatePerNameQueryCaches (type=..., ignoreCase=true) at /_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:84
    #14 0x0000aaaaba258d24 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache___ctor (this=..., type=...) at /_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:37
    #15 0x0000aaaaba1ff314 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__get_Cache (this=...) at /_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.BindingFlags.cs:188

    Thread 14 (Thread 0xffbeee0af080 (LWP 4622) "WebHMI"):
    #0 ActivationHandler (code=34, siginfo=0xffbeee0ab3f0, context=...

</details>

- Fixes #121623

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: VSadov <8218165+VSadov@users.noreply.github.com>
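The queueing behaviour described in the issue can be observed directly on Linux. The sketch below is not runtime code: `CountQueuedInstances` is a hypothetical helper that blocks a signal, sends it to the calling thread several times with `pthread_kill()`, and then drains the pending instances with `sigtimedwait()`. It assumes a Linux system with real-time signal support and a pending-signal limit larger than the handful of signals sent here.

```cpp
#include <cassert>
#include <ctime>
#include <pthread.h>
#include <signal.h>

// Hypothetical helper: returns how many instances of `signo` were queued
// after sending it `count` times to the calling thread while it is blocked.
int CountQueuedInstances(int signo, int count)
{
    sigset_t set, old;
    sigemptyset(&set);
    sigaddset(&set, signo);

    // Block the signal so every instance stays pending instead of being delivered.
    int rc = pthread_sigmask(SIG_BLOCK, &set, &old);
    assert(rc == 0);
    (void)rc;

    for (int i = 0; i < count; ++i)
    {
        pthread_kill(pthread_self(), signo);
    }

    // Dequeue pending instances one at a time; a zero timeout turns this into a poll.
    int drained = 0;
    timespec zero{0, 0};
    while (sigtimedwait(&set, nullptr, &zero) == signo)
    {
        ++drained;
    }

    // Restore the original signal mask.
    pthread_sigmask(SIG_SETMASK, &old, nullptr);
    return drained;
}
```

Under those assumptions, `CountQueuedInstances(SIGRTMIN, 8)` should report eight queued instances, whereas a standard signal such as `SIGUSR1` collapses to one. Every queued instance means another run of the activation handler, which is consistent with the starved thread in the issue spending its time in the handler rather than completing the suspend.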
1 parent f62405c commit 5735338

2 files changed: +8 -0 lines changed

src/coreclr/nativeaot/Runtime/thread.cpp

Lines changed: 3 additions & 0 deletions
@@ -92,6 +92,9 @@ void Thread::WaitForGC(PInvokeTransitionFrame* pTransitionFrame)
     ClearState(TSF_Redirected);
 #endif //FEATURE_SUSPEND_REDIRECTION
 
+    // make sure this is cleared - in case a signal is lost or somehow we did not act on it
+    SetActivationPending(false);
+
     GCHeapUtilities::GetGCHeap()->WaitUntilGCComplete();
 
     // must be in cooperative mode when checking the trap flag
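
The added comment points at a failure mode that the new early return makes possible: if the pending flag were ever left set, later suspension attempts would skip the thread. A toy model of that reasoning follows. `ToyThread`, `TrySend`, `ParkForGc` and `SecondRoundCanSignal` are hypothetical names, and the atomic `exchange` is a simplification chosen for brevity, not a claim about how the runtime's flag is implemented.

```cpp
#include <atomic>

// Hypothetical stand-in for the runtime thread; only the pending flag is modelled.
struct ToyThread
{
    std::atomic<bool> activationPending{false};
};

// Guarded send: reports true only when a new signal would actually be issued.
bool TrySend(ToyThread& thread)
{
    // exchange() sets the flag and returns its previous value in one step.
    return !thread.activationPending.exchange(true);
}

// Stand-in for the start of Thread::WaitForGC: reset the flag before parking.
void ParkForGc(ToyThread& thread)
{
    thread.activationPending.store(false);
}

// Returns whether a second suspension round can still signal the thread
// when the handler from the first round never cleared the flag.
bool SecondRoundCanSignal(bool clearOnPark)
{
    ToyThread thread;
    TrySend(thread); // round 1: flag set, and we pretend the handler never cleared it

    if (clearOnPark)
    {
        ParkForGc(thread); // the thread reached the GC wait by some other route
    }

    return TrySend(thread); // round 2
}
```

In this model `SecondRoundCanSignal(true)` is true and `SecondRoundCanSignal(false)` is false, which is the motivation suggested by the comment: clearing the flag on the wait path keeps a lost or unprocessed signal from disabling future hijack attempts.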

src/coreclr/nativeaot/Runtime/unix/PalUnix.cpp

Lines changed: 5 additions & 0 deletions
@@ -1088,6 +1088,11 @@ HijackFunc* PalGetHijackTarget(HijackFunc* defaultHijackTarget)
 
 void PalHijack(Thread* pThreadToHijack)
 {
+    if (pThreadToHijack->IsActivationPending())
+    {
+        return;
+    }
+
     pThreadToHijack->SetActivationPending(true);
 
     int status = pthread_kill(pThreadToHijack->GetOSThreadHandle(), INJECT_ACTIVATION_SIGNAL);
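
To see why the early return bounds the number of outstanding signals, here is a small simulation of the retry loop. It is a sketch under stated assumptions, not the runtime's implementation: `FakeThread`, `HijackGuarded`, `HijackUnguarded`, `HandleOne` and `QueueDepthAfterRetries` are hypothetical, and a plain counter stands in for the kernel's signal queue.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the runtime's Thread; only the pending flag is modelled.
struct FakeThread
{
    std::atomic<bool> activationPending{false};
    int queuedSignals = 0; // models signals waiting to be handled

    bool IsActivationPending() const { return activationPending.load(); }
    void SetActivationPending(bool value) { activationPending.store(value); }
};

// Same shape as the patched PalHijack: do nothing while a signal is outstanding.
void HijackGuarded(FakeThread& thread)
{
    if (thread.IsActivationPending())
    {
        return;
    }

    thread.SetActivationPending(true);
    ++thread.queuedSignals; // the real code calls pthread_kill at this point
}

// Pre-patch shape: every call queues another signal.
void HijackUnguarded(FakeThread& thread)
{
    thread.SetActivationPending(true);
    ++thread.queuedSignals;
}

// Models the signal handler finishing one activation and clearing the flag.
void HandleOne(FakeThread& thread)
{
    assert(thread.queuedSignals > 0);
    --thread.queuedSignals;
    thread.SetActivationPending(false);
}

// Queue depth after `attempts` suspension retries during which the target
// thread never gets to run its handler.
int QueueDepthAfterRetries(bool guarded, int attempts)
{
    FakeThread thread;
    for (int i = 0; i < attempts; ++i)
    {
        if (guarded)
        {
            HijackGuarded(thread);
        }
        else
        {
            HijackUnguarded(thread);
        }
    }
    return thread.queuedSignals;
}
```

With the guard, the depth stays at one however many retries happen. Without it, the depth grows with every retry, which matches the overflow described in the commit message. Once `HandleOne` has run, a guarded call can send again.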
