Commit 5735338
Fix NativeAOT GC hang on single-core systems by preventing RT signal queue overflow (#121772)
On single-core systems under high CPU load, NativeAOT GC can hang for
hours when suspending threads. The Unix implementation repeatedly sends
RT signals to threads that cannot suspend immediately, causing signal
queue overflow and handler starvation.
### Changes
- Add `IsActivationPending()` check in `PalHijack()` before sending
suspension signals
- Prevents duplicate signals when activation is already pending
- Aligns Unix behavior with existing Windows implementation
```cpp
void PalHijack(Thread* pThreadToHijack)
{
    if (pThreadToHijack->IsActivationPending())
    {
        return;
    }
    pThreadToHijack->SetActivationPending(true);
    // ... send signal
}
```
The flag is set before sending `pthread_kill()` and cleared in the
signal handler after processing, ensuring at most one pending signal per
thread.
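The "at most one pending signal per thread" invariant can be illustrated with a minimal, signal-free sketch. Everything here is hypothetical and exists only for illustration: `FakeThread`, `TryHijack`, `OnActivationHandled`, and `QueuedAfterRetries` are not runtime APIs, and a plain counter stands in for the kernel's RT signal queue. The sketch also uses a single atomic `exchange` instead of the separate check-and-set shown above, which is an assumption made for the example, not a description of the actual change.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the runtime's Thread; not the real NativeAOT type.
struct FakeThread
{
    std::atomic<bool> activationPending{false};
    int signalsQueued = 0; // stands in for entries in the kernel RT signal queue
};

// Sender side: returns true only if a new "signal" was actually queued.
bool TryHijack(FakeThread& t)
{
    // exchange() sets the flag and reports its previous value in one step.
    if (t.activationPending.exchange(true))
        return false;  // an activation is already pending; do not queue another
    ++t.signalsQueued; // the real code would call pthread_kill() here
    return true;
}

// Receiver side: models the signal handler finishing its work.
void OnActivationHandled(FakeThread& t)
{
    --t.signalsQueued;
    t.activationPending.store(false);
}

// Models the suspension loop retrying `attempts` times before the target
// thread gets a chance to run its handler.
int QueuedAfterRetries(int attempts)
{
    assert(attempts >= 0);
    FakeThread t;
    for (int i = 0; i < attempts; ++i)
        TryHijack(t);
    return t.signalsQueued;
}
```

Under these assumptions, any number of retries leaves at most one entry queued, and a new one can be queued only after the handler has cleared the flag.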
Fixes #121623
<!-- START COPILOT CODING AGENT SUFFIX -->
<details>
<summary>Original prompt</summary>
----
*This section details on the original issue you should resolve*
<issue_title>NativeAOT garbage collector hang</issue_title>
<issue_description>### Description
GC hangs for extended periods (several hours) on single core system
running NativeAOT build when CPU load is high.
Noticed with a web application that uses gRPC for internal communication
between the dotnet application and the rest of the linux system.
### Reproduction Steps
Run NativeAOT build on single CPU core, with high CPU load on that same
core.
### Expected behavior
The application runs.
### Actual behavior
The application stops running for long periods, up to several hours.
### Regression?
Not known.
### Known Workarounds
Reduce the CPU load, increase the priority of the dotnet threads, or run
the application on multiple cores.
### Configuration
Tested with .NET 9.0.2, 9.0.10 and 10.0
OS: Custom Preempt-RT Linux based on Yocto Scarthgap (kernel v6.1)
Architecture: ARM64
Dotnet application running in docker (aspnet noble-chiseled)
### Other information
If I understood the GC correctly, it tries to stop all other threads
before running the actual collection. For the UNIX NativeAOT this is
implemented using real-time signals, so that the GC sends the RT signal
to all threads that need to be suspended, and it does this until the
threads report that they have been suspended successfully.
AFAICT looking at gdb, the problem is triggered if one thread cannot be
suspended immediately, and the thread trying to run the GC keeps on
filling the rt-signal queue. This means that the thread will keep on
running its signal handler over and over again and never complete the
actual suspend.
Below is a snippet of the call stack shown by gdb:
Thread 16 (Thread 0xffbebefdf080 (LWP 4657) "WebHMI"):
#0 0x0000ffff8b767558 in ?? () from
target:/lib/aarch64-linux-gnu/libc.so.6
#1 0x0000aaaab9a182fc in PalHijack (hThread=<optimized
out>, pThreadToHijack=0xffbeee0af7f0) at
/__w/1/s/src/coreclr/nativeaot/Runtime/unix/PalRedhawkUnix.cpp:1089
#2 0x0000aaaab99d7c78 in ThreadStore::SuspendAllThreads
(this=<optimized out>, waitForGCEvent=<optimized out>) at
/__w/1/s/src/coreclr/nativeaot/Runtime/threadstore.cpp:264
#3 0x0000aaaab99d27f4 in GCToEEInterface::SuspendEE
(reason=<optimized out>) at
/__w/1/s/src/coreclr/nativeaot/Runtime/gcenv.ee.cpp:50
#4 0x0000aaaab99e6f50 in
WKS::GCHeap::GarbageCollectGeneration (this=<optimized out>, gen=0,
reason=reason_alloc_soh) at /__w/1/s/src/coreclr/gc/gc.cpp:51181
#5 0x0000aaaab99e8220 in
WKS::gc_heap::trigger_gc_for_alloc (gen_number=0,
gen_number@entry=-1090665120, gr=3204301792,
msl=msl@entry=0xaaaabcbcc5b8 <WKS::gc_heap::more_space_lock_soh>,
loh_p=false, take_state=<optimized out>) at
/__w/1/s/src/coreclr/gc/gc.cpp:18926
#6 0x0000aaaab99e8ea4 in
WKS::gc_heap::try_allocate_more_space
(acontext=acontext@entry=0xffbebefdf7f0, size=size@entry=48,
flags=flags@entry=2, gen_number=0, gen_number@entry=-1090665024) at
/__w/1/s/src/coreclr/gc/gc.cpp:19064
#7 0x0000aaaab9a0db78 in WKS::gc_heap::allocate_more_space
(acontext=0xffbebefdf7f0, size=48, flags=2, alloc_generation_number=0)
at /__w/1/s/src/coreclr/gc/gc.cpp:19564
#8 WKS::gc_heap::allocate (jsize=48,
acontext=0xffbebefdf7f0, flags=2) at
/__w/1/s/src/coreclr/gc/gc.cpp:19595
#9 WKS::GCHeap::Alloc (this=<optimized out>,
context=0xffbebefdf7f0, size=48, flags=2) at
/__w/1/s/src/coreclr/gc/gc.cpp:50122
#10 0x0000aaaab99d19fc in GcAllocInternal
(pEEType=0xaaaabd4482d8, uFlags=2, numElements=0,
pThread=0xffbebefdf7f0) at
/__w/1/s/src/coreclr/nativeaot/Runtime/GCHelpers.cpp:542
#11 0x0000aaaab9a2550c in RhpNewObject () at
/__w/1/s/src/coreclr/nativeaot/Runtime/arm64/AllocFast.S:88
#12 0x0000aaaabb0fd00c in ___ctor (this=...) at
/_/src/coreclr/nativeaot/Common/src/System/Collections/Concurrent/ConcurrentUnifier.cs:69
#13 0x0000aaaaba258e00 in
S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache__CreatePerNameQueryCaches
(type=..., ignoreCase=true) at
/_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:84
#14 0x0000aaaaba258d24 in
S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache___ctor
(this=..., type=...) at
/_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:37
#15 0x0000aaaaba1ff314 in
S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__get_Cache
(this=...) at
/_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.BindingFlags.cs:188
Thread 14 (Thread 0xffbeee0af080 (LWP 4622) "WebHMI"):
#0 ActivationHandler (code=34, siginfo=0xffbeee0ab3f0,
context=...
</details>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: VSadov <8218165+VSadov@users.noreply.github.com>

1 parent f62405c, commit 5735338
2 files changed, +8 -0 lines changed

[Diff content not captured: the first file adds 3 lines (new lines 95-97); the second file adds 5 lines (new lines 1091-1095).]