Contributor

Copilot AI commented Nov 19, 2025

On single-core systems under high CPU load, NativeAOT GC can hang for hours when suspending threads. The Unix implementation repeatedly sends RT signals to threads that cannot suspend immediately, causing signal queue overflow and handler starvation.

Changes

  • Add an IsActivationPending() check in PalHijack() before sending suspension signals
  • Prevent duplicate signals when an activation is already pending
  • Align Unix behavior with the existing Windows implementation

void PalHijack(Thread* pThreadToHijack)
{
    if (pThreadToHijack->IsActivationPending())
    {
        return;
    }

    pThreadToHijack->SetActivationPending(true);
    // ... send signal
}

The flag is set before calling pthread_kill() and cleared in the signal handler after processing, ensuring at most one pending signal per thread.
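
The guard can be exercised in isolation. The following is a minimal, Linux-only sketch, not the runtime's code: `ThreadSketch`, `HijackSketch`, `ActivationHandlerSketch`, and `DemoSketch` are hypothetical names, `SIGRTMIN` is an assumed stand-in for the signal the runtime uses, and an atomic `exchange` replaces the separate check and set shown in the description. The demo blocks the signal on the calling thread so the handler cannot run, then requests a hijack 1000 times.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <pthread.h>

// Hypothetical stand-in for the runtime's per-thread state.
struct ThreadSketch
{
    pthread_t handle;
    std::atomic<bool> activationPending{false};
    std::atomic<int> handlerRuns{0};
};

static ThreadSketch* g_target = nullptr; // the sketch tracks a single target
static int g_signalsSent = 0;            // successful pthread_kill() calls

static void ActivationHandlerSketch(int)
{
    // A real handler would attempt the suspension here.
    g_target->handlerRuns.fetch_add(1);
    // Clear the flag last, so that a new signal may be sent afterwards.
    g_target->activationPending.store(false);
}

// Returns true only if a signal was actually sent.
bool HijackSketch(ThreadSketch* t)
{
    // exchange() returns the previous value: true means a signal is
    // already pending, so another one must not be queued.
    if (t->activationPending.exchange(true))
        return false;

    if (pthread_kill(t->handle, SIGRTMIN) != 0)
    {
        t->activationPending.store(false); // nothing was queued
        return false;
    }
    g_signalsSent++;
    return true;
}

// Returns how many times the handler ran for 1000 hijack requests.
int DemoSketch()
{
    static ThreadSketch self;
    self.handle = pthread_self();
    g_target = &self;

    struct sigaction sa{};
    sa.sa_handler = ActivationHandlerSketch;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGRTMIN, &sa, nullptr);
    assert(rc == 0);
    (void)rc;

    // Block the signal so the handler cannot run and clear the flag.
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    for (int i = 0; i < 1000; i++)
        HijackSketch(&self);

    // Unblocking delivers whatever is pending before this call returns.
    pthread_sigmask(SIG_UNBLOCK, &set, nullptr);
    return self.handlerRuns.load();
}
```

Under these assumptions only the first request reaches `pthread_kill()`; the remaining 999 return early, so the handler runs once when the signal is unblocked.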

Fixes #121623

Original prompt

This section details the original issue you should resolve

<issue_title>NativeAOT garbage collector hang</issue_title>
<issue_description>### Description

GC hangs for extended periods (several hours) on a single-core system running a NativeAOT build when CPU load is high.

Noticed with a web application that uses gRPC for internal communication between the dotnet application and the rest of the Linux system.

Reproduction Steps

Run a NativeAOT build on a single CPU core, with high CPU load on that same core.

Expected behavior

The application runs.

Actual behavior

The application stops running for long periods, up to several hours.

Regression?

Not known.

Known Workarounds

Reduce the CPU load, increase the priority of the dotnet threads, or run the application on multiple cores.

Configuration

Tested with .NET 9.0.2, 9.0.10 and 10.0
OS: Custom Preempt-RT Linux based on Yocto Scarthgap (kernel v6.1)
Architecture: ARM64
Dotnet application running in docker (aspnet noble-chiseled)

Other information

If I understand the GC correctly, it tries to stop all other threads before running the actual collection. For NativeAOT on Unix this is implemented using real-time signals: the GC sends the RT signal to every thread that needs to be suspended, and keeps doing so until the threads report that they have been suspended successfully.

AFAICT from looking at gdb, the problem is triggered when one thread cannot be suspended immediately and the thread trying to run the GC keeps filling the RT signal queue. The target thread then runs its signal handler over and over again and never completes the actual suspend.
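
The queueing behavior behind this can be seen in a small Linux-only sketch. It is an illustration under stated assumptions, not code from the runtime: `CountDeliveries` and `CountingHandler` are hypothetical names, and `SIGRTMIN` and `SIGUSR1` are chosen only to contrast a real-time signal with a standard one. The signal is blocked, sent several times to the calling thread, then unblocked, and the handler invocations are counted.

```cpp
#include <cassert>
#include <csignal>
#include <pthread.h>

static volatile sig_atomic_t g_deliveries = 0;

static void CountingHandler(int)
{
    g_deliveries = g_deliveries + 1;
}

// Sends `signo` to the calling thread `sends` times while it is blocked,
// then unblocks it and returns how many times the handler ran.
int CountDeliveries(int signo, int sends)
{
    struct sigaction sa{};
    sa.sa_handler = CountingHandler;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(signo, &sa, nullptr);
    assert(rc == 0);
    (void)rc;

    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    g_deliveries = 0;
    for (int i = 0; i < sends; i++)
        pthread_kill(pthread_self(), signo);

    // Pending instances are delivered once the signal is unblocked.
    pthread_sigmask(SIG_UNBLOCK, &set, nullptr);
    return g_deliveries;
}
```

On Linux, each instance of a real-time signal is queued separately, so `CountDeliveries(SIGRTMIN, 5)` runs the handler five times, while `CountDeliveries(SIGUSR1, 5)` runs it once because repeated instances of a blocked standard signal are merged. A sender that keeps signalling a thread that is slow to respond therefore adds handler work with every send.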

Below is a snippet of the call stack shown by gdb:

Thread 16 (Thread 0xffbebefdf080 (LWP 4657) "WebHMI"):
#0 0x0000ffff8b767558 in ?? () from target:/lib/aarch64-linux-gnu/libc.so.6
#1 0x0000aaaab9a182fc in PalHijack (hThread=, pThreadToHijack=0xffbeee0af7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/unix/PalRedhawkUnix.cpp:1089
#2 0x0000aaaab99d7c78 in ThreadStore::SuspendAllThreads (this=, waitForGCEvent=) at /__w/1/s/src/coreclr/nativeaot/Runtime/threadstore.cpp:264
#3 0x0000aaaab99d27f4 in GCToEEInterface::SuspendEE (reason=) at /__w/1/s/src/coreclr/nativeaot/Runtime/gcenv.ee.cpp:50
#4 0x0000aaaab99e6f50 in WKS::GCHeap::GarbageCollectGeneration (this=, gen=0, reason=reason_alloc_soh) at /__w/1/s/src/coreclr/gc/gc.cpp:51181
#5 0x0000aaaab99e8220 in WKS::gc_heap::trigger_gc_for_alloc (gen_number=0, gen_number@entry=-1090665120, gr=3204301792, msl=msl@entry=0xaaaabcbcc5b8 WKS::gc_heap::more_space_lock_soh, loh_p=false, take_state=) at /__w/1/s/src/coreclr/gc/gc.cpp:18926
#6 0x0000aaaab99e8ea4 in WKS::gc_heap::try_allocate_more_space (acontext=acontext@entry=0xffbebefdf7f0, size=size@entry=48, flags=flags@entry=2, gen_number=0, gen_number@entry=-1090665024) at /__w/1/s/src/coreclr/gc/gc.cpp:19064
#7 0x0000aaaab9a0db78 in WKS::gc_heap::allocate_more_space (acontext=0xffbebefdf7f0, size=48, flags=2, alloc_generation_number=0) at /__w/1/s/src/coreclr/gc/gc.cpp:19564
#8 WKS::gc_heap::allocate (jsize=48, acontext=0xffbebefdf7f0, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:19595
#9 WKS::GCHeap::Alloc (this=, context=0xffbebefdf7f0, size=48, flags=2) at /__w/1/s/src/coreclr/gc/gc.cpp:50122
#10 0x0000aaaab99d19fc in GcAllocInternal (pEEType=0xaaaabd4482d8, uFlags=2, numElements=0, pThread=0xffbebefdf7f0) at /__w/1/s/src/coreclr/nativeaot/Runtime/GCHelpers.cpp:542
#11 0x0000aaaab9a2550c in RhpNewObject () at /__w/1/s/src/coreclr/nativeaot/Runtime/arm64/AllocFast.S:88
#12 0x0000aaaabb0fd00c in ctor (this=...) at //src/coreclr/nativeaot/Common/src/System/Collections/Concurrent/ConcurrentUnifier.cs:69
#13 0x0000aaaaba258e00 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache__CreatePerNameQueryCaches (type=..., ignoreCase=true) at /
/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:84
#14 0x0000aaaaba258d24 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo_TypeComponentsCache___ctor (this=..., type=...) at /
/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.TypeComponentsCache.cs:37
#15 0x0000aaaaba1ff314 in S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__get_Cache (this=...) at
/_/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Reflection/Runtime/TypeInfos/RuntimeTypeInfo.BindingFlags.cs:188

Thread 14 (Thread 0xffbeee0af080 (LWP 4622) "WebHMI"):
#0 ActivationHandler (code=34, siginfo=0xffbeee0ab3f0, context=...



Co-authored-by: VSadov <8218165+VSadov@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix garbage collector hang in NativeAOT builds" to "Fix NativeAOT GC hang on single-core systems by preventing RT signal queue overflow" Nov 19, 2025
Copilot AI requested a review from VSadov November 19, 2025 04:02
Copilot finished work on behalf of VSadov November 19, 2025 04:02
@VSadov VSadov marked this pull request as ready for review November 19, 2025 23:46
@VSadov VSadov requested review from Copilot and jkotas November 19, 2025 23:46
Copilot finished reviewing on behalf of VSadov November 19, 2025 23:48
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a GC hang on single-core systems under high CPU load by preventing RT signal queue overflow during thread suspension. The Unix implementation of PalHijack() was repeatedly sending suspension signals to threads that couldn't suspend immediately, causing the signal queue to overflow and preventing signal handlers from completing the suspension process.

Key Changes

  • Added early-exit check in PalHijack() to avoid sending duplicate suspension signals
  • Aligns Unix behavior with existing Windows implementation pattern
  • Uses existing IsActivationPending() flag to track pending signals

@VSadov
Member

VSadov commented Nov 20, 2025

I have run the entire Libraries test suite with -rt Release -lc Release /p:TestNativeAot=true a few times in a loop on Linux-x64 and on Linux-arm64 (Ubuntu 24.04 noble, in both cases). I did not see any hangs or unexpected failures.

Considering that this brings the signal strategy to parity with CoreCLR, I think it is safe to merge.

@VSadov
Member

VSadov commented Nov 20, 2025

We should watch for impact on benchmarks after merge, but I expect no change, since there will be no impact on suspension-friendly scenarios, which is the most common case.
Maybe there will be some minor improvements. This is mostly a reliability change for unusual cases.

It would be nice to know if this resolves the reported issue. It should help.

@jkotas
Member

jkotas commented Nov 20, 2025

Do we need an equivalent of this too?

// make sure this is cleared - in case a signal is lost or somehow we did not act on it

@GerardSmit
Contributor

Note, Copilot attached the wrong issue (#121738) in the description, so that one will be closed as well when the PR gets merged.

This is the second time this has happened. It also happened in #121411, where Copilot attached the wrong issue number.

@VSadov
Member

VSadov commented Nov 20, 2025

Do we need an equivalent of this too?

// make sure this is cleared - in case a signal is lost or somehow we did not act on it

Good catch. We need that if we do not have it.
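
To illustrate why such a reset matters once the early-exit check is in place, here is a minimal single-threaded sketch. Everything in it is hypothetical (`ThreadSketch`, `HijackSketch`, `BeginSuspensionSketch`), no real signal is sent, and where the runtime would place the reset is an assumption. It models a signal that is lost, so the handler never clears the flag.

```cpp
#include <atomic>

// Hypothetical stand-in for the runtime's per-thread state.
struct ThreadSketch
{
    std::atomic<bool> activationPending{false};
    int signalsSent = 0; // counts simulated sends; no real signal is used
};

// Mirrors the early-exit guard: skip the send while one is pending.
bool HijackSketch(ThreadSketch& t)
{
    if (t.activationPending.load())
        return false;

    t.activationPending.store(true);
    t.signalsSent++; // stands in for pthread_kill()
    return true;
}

// Assumed placement: clear the flag when a new suspension attempt starts,
// in case an earlier signal was lost or never acted on.
void BeginSuspensionSketch(ThreadSketch& t)
{
    t.activationPending.store(false);
}
```

If the simulated handler never runs, every later `HijackSketch` call returns false and the thread is never signalled again. Calling `BeginSuspensionSketch` at the start of the next attempt lets the following call send again.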

@VSadov
Member

VSadov commented Nov 20, 2025

Note, Copilot attached the wrong issue (#121738) in the description, so that one will be closed as well when the PR gets merged.

This is the second time this has happened. It also happened in #121411, where Copilot attached the wrong issue number.

Thanks! Fixed the description.

@VSadov VSadov enabled auto-merge (squash) November 20, 2025 01:59
@VSadov VSadov merged commit 5735338 into main Nov 20, 2025
97 checks passed
@VSadov VSadov deleted the copilot/fix-gc-hang-on-nativeaot branch November 20, 2025 03:27
Development

Successfully merging this pull request may close these issues.

NativeAOT garbage collector hang

4 participants