NativeAOT garbage collector hang #121623

@jkarrenpalo-abb

Description

The GC hangs for extended periods (several hours) on a single-core system running a NativeAOT build when CPU load is high.

Noticed with a web application that uses gRPC for internal communication between the dotnet application and the rest of the Linux system.

Reproduction Steps

Run a NativeAOT build on a single CPU core, with high CPU load on that same core.
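
For anyone trying to approximate these conditions on a multi-core development machine, a minimal sketch follows. It assumes Linux and Python 3, and it only pins processes to one core; it does not recreate the Preempt-RT scheduling or thread priorities of the original system. The helper names (`first_allowed_cpu`, `spawn_pinned`, `run_under_load`) are made up for illustration and are not part of any .NET tooling.

```python
import os
import subprocess
import sys


def first_allowed_cpu():
    """Return the lowest-numbered CPU this process is allowed to run on."""
    return min(os.sched_getaffinity(0))


def spawn_pinned(cmd, cpu, **popen_kwargs):
    """Start cmd as a child process restricted to a single CPU core."""
    def pin():
        # Runs in the child between fork and exec; pid 0 means "this process".
        os.sched_setaffinity(0, {cpu})
    return subprocess.Popen(cmd, preexec_fn=pin, **popen_kwargs)


def run_under_load(app_cmd, cpu=None):
    """Run app_cmd on one core while a busy loop competes for that same core."""
    if cpu is None:
        cpu = first_allowed_cpu()
    # The busy loop stands in for "high CPU load"; real load may differ.
    load = spawn_pinned([sys.executable, "-c", "while True: pass"], cpu)
    try:
        app = spawn_pinned(app_cmd, cpu)
        return app.wait()
    finally:
        load.kill()
        load.wait()
```

Calling `run_under_load(["./MyNativeAotApp"])`, where the path is a placeholder for the published NativeAOT binary, would run the application and the busy loop on the same core. Whether this is enough to trigger the hang is not confirmed here.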

Expected behavior

The application runs.

Actual behavior

The application stops running for long periods, up to several hours.

Regression?

Not known.

Known Workarounds

Reduce the CPU load, increase the priority of the dotnet threads, or run the application on multiple cores.

Configuration

Tested with .NET 9.0.2, 9.0.10 and 10.0
OS: Custom Preempt-RT Linux based on Yocto Scarthgap (kernel v6.1)
Architecture: ARM64
Dotnet application running in Docker (aspnet noble-chiseled)

Other information

If I understood the GC correctly, it tries to stop all other threads before running the actual collection. In NativeAOT on Unix this is implemented using real-time signals: the GC sends the RT signal to every thread that needs to be suspended, and keeps doing so until the threads report that they have been suspended successfully.

AFAICT from looking at gdb, the problem is triggered if one thread cannot be suspended immediately and the thread trying to run the GC keeps on filling the RT signal queue. The target thread then keeps running its signal handler over and over again and never completes the actual suspend.

Below is a snippet of the call stacks shown by gdb:
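
The queueing behaviour this relies on can be checked in isolation. The sketch below is not the runtime's code; it assumes Linux and Python 3 and only shows that repeated sends of a real-time signal are queued individually, while repeated sends of a standard signal collapse into one pending instance. The function name `count_queued` is made up for illustration.

```python
import signal
import threading


def count_queued(sig, sends):
    """Send `sig` to the calling thread `sends` times while it is blocked,
    then count how many instances were actually left pending."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    try:
        me = threading.get_ident()
        for _ in range(sends):
            signal.pthread_kill(me, sig)
        delivered = 0
        # A timeout of 0 polls: returns None once nothing is pending.
        while signal.sigtimedwait({sig}, 0) is not None:
            delivered += 1
        return delivered
    finally:
        # Restore the original signal mask of this thread.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

On Linux, `count_queued(signal.SIGRTMIN, 5)` should report five queued instances and `count_queued(signal.SIGUSR1, 5)` only one. If the suspending thread resends a real-time signal faster than the target can run its handler, each resend adds another handler invocation to the backlog, which matches the behaviour described above.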

Thread 16 (Thread 0xffbebefdf080 (LWP 4657) "WebHMI"):
#0 0x0000ffff8b767558 in ?? () from target:/lib/aarch64-linux-gnu/libc.so.6
#1 0x0000aaaab9a182fc in PalHijack (hThread=, pThreadToHijack=0xffbeee0af7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/unix/PalRedhawkUnix.cpp:1089
#2 0x0000aaaab99d7c78 in ThreadStore::SuspendAllThreads (this=, waitForGCEvent=) at /__w/1/s/src/coreclr/nativeaot/Runtime/threadstore.cpp:264
#3 0x0000aaaab99d27f4 in GCToEEInterface::SuspendEE (reason=) at /__w/1/s/src/coreclr/nativeaot/Runtime/gcenv.ee.cpp:50
#4 0x0000aaaab99e6f50 in WKS::GCHeap::GarbageCollectGeneration (this=, gen=0, reason=reason_alloc_soh) at /__w/1/s/src/coreclr/gc/gc.cpp:51181
#5 0x0000aaaab99e8220 in WKS::gc_heap::trigger_gc_for_alloc (gen_number=0, gen_number@entry=-1090665120, gr=3204301792, msl=msl@entry=0xaaaabcbcc5b8 WKS::gc_heap::more_space_lock_soh, loh_p=false, take_state=) at /__w/1/s/src/coreclr/gc/gc.cpp:18926
#6 0x0000aaaab99e8ea4 in WKS::gc_heap::try_allocate_more_space (acontext=acontext@entry=0xffbebefdf7f0, size=size@entry=48, flags=flags@entry=2, gen_number=0, gen_number@entry=-1090665024) at /__w/1/s/src/coreclr/gc/gc.cpp:19064
#7 0x0000aaaab9a0db78 in WKS::gc_heap::allocate_more_space (acontext=0xffbebefdf7f0, size=48, flags=2, alloc_generation_number=0) at /__w/1/s/src/coreclr/gc/gc.cpp:19564
#8 WKS::gc_heap::allocate (jsize=48, acontext=0xffbebefdf7f0, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:19595
#9 WKS::GCHeap::Alloc (this=, context=0xffbebefdf7f0, size=48, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:50122
#10 0x0000aaaab99d19fc in GcAllocInternal (pEEType=0xaaaabd4482d8, uFlags=2, numElements=0, pThread=0xffbebefdf7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/GCHelpers.cpp:542
#11 0x0000aaaab9a2550c in RhpNewObject () at /__w/1/s/src/coreclr/nativeaot/Runtime/arm64/AllocFast.S:88
#12 0x0000aaaabb0fd00c in ctor (this=...) at /_/src/coreclr/nativeaot/Common/src/System/Collections/Concurrent/ConcurrentUnifier.cs:69
#13 0x0000aaaaba258e00 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache__CreatePerNameQueryCaches (type=..., ignoreCase=true) at /_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:84
#14 0x0000aaaaba258d24 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache___ctor (this=..., type=...) at /_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:37
#15 0x0000aaaaba1ff314 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__get_Cache (this=...) at /_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.BindingFlags.cs:188

Thread 14 (Thread 0xffbeee0af080 (LWP 4622) "WebHMI"):
#0 ActivationHandler (code=34, siginfo=0xffbeee0ab3f0, context=0xffbeee0ab470) at /__w/1/s/src/coreclr/nativeaot/Runtime/unix/PalRedhawkUnix.cpp:1010
#1
#2 0x0000aaaab9fb51e8 in System_Collections_Immutable_System_Collections_Frozen_Hashing__GetHashCodeOrdinal (s=...) at /_/src/libraries/System.Collections.Immutable/src/System/Collections/Frozen/String/Hashing.cs:24
#3 0x0000aaaab9a25be8 in RhpCallFilterFunclet () at /__w/1/s/src/coreclr/nativeaot/Runtime/arm64/ExceptionHandling.S:661
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
