-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Clear TS_GCSuspendRedirected at the end of thread redirection #119164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
TS_GCSuspendRedirected is normally cleared in the slow path of pulsing GC mode (RareDisablePreemptiveGC). The problem is that we may not even get to that slow path if the thread gets rescheduled right after we switch to preemptive mode and gets scheduled back only after the GC completes and resumes execution. This change clears it explicitly. Fixes dotnet#119149
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a race condition in garbage collection thread suspension by ensuring the TS_GCSuspendRedirected thread state flag is properly cleared after thread redirection completes. The issue occurred when a thread was rescheduled between switching to preemptive mode and the GC completion, preventing the normal cleanup path from executing.
Key Changes
- Adds explicit clearing of
TS_GCSuspendRedirectedflag at the end of thread redirection - Uses platform-specific conditional compilation to match the feature availability
- Ensures threads can be redirected again after GC suspension ends
|
Tagging subscribers to this area: @mangod9 |
|
Is this a regression ? |
|
I believe that it is .NET 8.0 -> .NET 9.0 regression introduced by #101782. Before #101782, the redirect flag was cleared more aggressively at https://github.com/dotnet/runtime/pull/101782/files#diff-c2b6a152f03d56992f672d77bd2fd94fc91b2bff527fa281fad9aa44d53e0c90L3497 . This test has been timing out intermittently for more than a year (#106134 is the first issue opened on it). This timing aligns with .NET 8.0 -> .NET 9.0 regression. It has not been failing often enough to get us to the bottom of it. Multiple issues on this got closed since it appeared to be fixed, only to resurface later again. This bug affects Win10-era and earlier OSes only that do not have QueueUserAPC2 API. |
The assumption was that once a thread is redirected, it has only one way to continue - it is going to go preemptive and then switch to coop and doing that will a) suspend the thread and b) clear the flag. The flaw in that logic is that GC may start as soon as the thread goes preemptive, and if the thread is sufficiently slow/preempted that GC would be over by the time it goes coop, switching to coop may take the fast path, which will not result in (a) and (b). It would be uncommon to see preemption that is long enough to cause this, unless under considerable stress. The part that it affects only old configurations definitely made it harder to pinpoint. |
VSadov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!!
|
/backport to release/10.0 |
|
Started backporting to release/10.0: https://github.com/dotnet/runtime/actions/runs/17287321792 |
|
@mangod9 I don't think it is related. Based on recent investigations of dumps from their runs with checked build of the runtime, the problem always starts with a GC hole. |
This bug can only ever lead to hangs. It won't ever lead to access violation. |
TS_GCSuspendRedirected is normally cleared in the slow path of pulsing GC mode (RareDisablePreemptiveGC). The problem is that we may not even get to that slow path if the thread gets rescheduled right after we switch to preemptive mode and gets scheduled back only after the GC completes and resumes execution. This change clears it explicitly.
Fixes #119149