Skip to content

Conversation

@kouvel
Copy link
Contributor

@kouvel kouvel commented Sep 15, 2021

Port of #59129

Hardened ThreadpoolMgr::GateThreadStart against the possibility that the observed group-local affinity mask contains set bits at positions higher than the total group-local CPU count that was captured during earlier initialization.

This fixes customer-reported crashes which have occurred on multi-group machines with heterogenous CPU counts across groups (but the same crash can probably also occur on single-group VMs if CPUs are hot-added and are then manually added to the process affinity mask).

Customer Impact

When the thread pool is used after being idle for a short duration, the system's logical processor count and the process' affinity have increased since the start of the process, and the process is not configured to use all CPU groups, the process may crash. This can potentially happen on multi-CPU-group systems with heterogenous CPU counts across groups where the process may be moved to a different CPU group with more processors, or in systems where CPUs may be hot-added.

Testing

Verified in the debugger that increases to the affinity mask behave as expected after the change.

Risk

Low. The change may only increase and not decrease the size of an array, up to a max of 64 elements. I have not seen many of these crashes so far in .NET Core.

Hardened ThreadpoolMgr::GateThreadStart against the possibility that the observed group-local affinity mask contains set bits at positions higher than the total group-local CPU count that was captured during earlier initialization.

This fixes customer-reported crashes which have occurred on multi-group machines with heterogenous CPU counts across groups (but the same crash can probably also occur on single-group VMs if CPUs are hot-added and are then manually added to the process affinity mask).
@kouvel kouvel added this to the 6.0.0 milestone Sep 15, 2021
@kouvel kouvel requested a review from ChrisAhna September 15, 2021 01:30
@kouvel kouvel self-assigned this Sep 15, 2021
@ghost
Copy link

ghost commented Sep 15, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Port of #59129

Hardened ThreadpoolMgr::GateThreadStart against the possibility that the observed group-local affinity mask contains set bits at positions higher than the total group-local CPU count that was captured during earlier initialization.

This fixes customer-reported crashes which have occurred on multi-group machines with heterogenous CPU counts across groups (but the same crash can probably also occur on single-group VMs if CPUs are hot-added and are then manually added to the process affinity mask).

Customer Impact

When the thread pool is used after being idle for a short duration, the system's logical processor count and the process' affinity have increased since the start of the process, and the process is not configured to use all CPU groups, the process may crash. This can potentially happen on multi-CPU-group systems with heterogenous CPU counts across groups where the process may be moved to a different CPU group with more processors, or in systems where CPUs may be hot-added.

Testing

Verified in the debugger that increases to the affinity mask behave as expected after the change.

Risk

Low. The change may only increase and not decrease the size of an array, up to a max of 64 elements. I have not seen many of these crashes so far in .NET Core.

Author: kouvel
Assignees: kouvel
Labels:

area-System.Threading

Milestone: 6.0.0

Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

@jeffschwMSFT jeffschwMSFT merged commit 2b8e9aa into dotnet:release/6.0 Sep 15, 2021
@kouvel kouvel deleted the TpGateFix6 branch September 15, 2021 18:49
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants