Skip to content

Conversation

@jkotas
Copy link
Member

@jkotas jkotas commented Aug 28, 2025

TS_GCSuspendRedirected is normally cleared in the slow path of pulsing GC mode (RareDisablePreemptiveGC). The problem is that we may not even get to that slow path if the thread gets rescheduled right after we switch to preemptive mode and gets scheduled back only after the GC completes and resumes execution. This change clears it explicitly.

Fixes #119149

TS_GCSuspendRedirected is normally cleared in the slow path of pulsing GC mode (RareDisablePreemptiveGC). The problem is that we may not even get to that slow path if the thread gets rescheduled right after we switch to preemptive mode and gets scheduled back only after the GC completes and resumes execution. This change clears it explicitly.

Fixes dotnet#119149
@jkotas jkotas requested review from VSadov and Copilot August 28, 2025 02:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a race condition in garbage collection thread suspension by ensuring the TS_GCSuspendRedirected thread state flag is properly cleared after thread redirection completes. The issue occurred when a thread was rescheduled between switching to preemptive mode and the GC completion, preventing the normal cleanup path from executing.

Key Changes

  • Adds explicit clearing of TS_GCSuspendRedirected flag at the end of thread redirection
  • Uses platform-specific conditional compilation to match the feature availability
  • Ensures threads can be redirected again after GC suspension ends

@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Aug 28, 2025
@jkotas jkotas added area-VM-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Aug 28, 2025
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@mangod9
Copy link
Member

mangod9 commented Aug 28, 2025

Is this a regression ?

@jkotas
Copy link
Member Author

jkotas commented Aug 28, 2025

I believe that it is .NET 8.0 -> .NET 9.0 regression introduced by #101782. Before #101782, the redirect flag was cleared more aggressively at https://github.com/dotnet/runtime/pull/101782/files#diff-c2b6a152f03d56992f672d77bd2fd94fc91b2bff527fa281fad9aa44d53e0c90L3497 .

This test has been timing out intermittently for more than a year (#106134 is the first issue opened on it). This timing aligns with .NET 8.0 -> .NET 9.0 regression. It has not been failing often enough to get us to the bottom of it. Multiple issues on this got closed since it appeared to be fixed, only to resurface later again.

This bug affects Win10-era and earlier OSes only that do not have QueueUserAPC2 API.

@jkotas jkotas requested review from janvorli and mangod9 August 28, 2025 04:33
@VSadov
Copy link
Member

VSadov commented Aug 28, 2025

I believe that it is .NET 8.0 -> .NET 9.0 regression introduced by #101782. Before #101782, the redirect flag was cleared more aggressively at https://github.com/dotnet/runtime/pull/101782/files#diff-c2b6a152f03d56992f672d77bd2fd94fc91b2bff527fa281fad9aa44d53e0c90L3497 .

This test has been timing out intermittently for more than a year (#106134 is the first issue opened on it). This timing aligns with .NET 8.0 -> .NET 9.0 regression. It has not been failing often enough to get us to the bottom of it. Multiple issues on this got closed since it appeared to be fixed, only to resurface later again.

This bug affects Win10-era and earlier OSes only that do not have QueueUserAPC2 API.

The assumption was that once a thread is redirected, it has only one way to continue - it is going to go preemptive and then switch to coop and doing that will a) suspend the thread and b) clear the flag. The flaw in that logic is that GC may start as soon as the thread goes preemptive, and if the thread is sufficiently slow/preempted that GC would be over by the time it goes coop, switching to coop may take the fast path, which will not result in (a) and (b).

It would be uncommon to see preemption that is long enough to cause this, unless under considerable stress.

The part that it affects only old configurations definitely made it harder to pinpoint.

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!!

@jkotas jkotas merged commit b73ba2a into dotnet:main Aug 28, 2025
101 checks passed
@jkotas jkotas deleted the issue-119149 branch August 28, 2025 06:05
@jkotas
Copy link
Member Author

jkotas commented Aug 28, 2025

/backport to release/10.0

@github-actions
Copy link
Contributor

Started backporting to release/10.0: https://github.com/dotnet/runtime/actions/runs/17287321792

@mangod9
Copy link
Member

mangod9 commented Aug 28, 2025

could this be also the cause for #110769. @janvorli

@janvorli
Copy link
Member

@mangod9 I don't think it is related. Based on recent investigations of dumps from their runs with checked build of the runtime, the problem always starts with a GC hole.

@jkotas
Copy link
Member Author

jkotas commented Aug 28, 2025

could this be also the cause for #110769

This bug can only ever lead to hangs. It won't ever lead to access violation.

@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JIT/Methodical/tailcall_v4/hijacking/hijacking test hang

4 participants