Skip to content

[8.0] Add a config flag to experiment with some work item prioritization changes in cases involving a lot of sync-over-async #103984

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 9, 2024

Conversation

kouvel
Copy link
Contributor

@kouvel kouvel commented Jun 25, 2024

Port of #103983 to 8.0

Some services that use a lot of sync-over-async were seen to experience stalls due to some priority inversion issues in work items that get queued to the thread pool. For instance, a work item W1 queues another work item W2 to the global queue and blocks waiting for a task to complete, where W2 would need to run in order to complete the task, but W2 is queued behind a number of other work items that operate like W1, and this sometimes leads to long-duration stalls. This change adds an experimental config option that when enabled, enqueues some kinds of work items to a new low-priority global queue that is checked after all other global queues. This was seen to help in some cases. The change is limited to CoreCLR for experimentation.

Customer Impact

Some 1p services that use a lot of sync-over-async are experiencing stalls due to some priority inversion issues in work items that get queued to the thread pool. In some situations, the services stall for a long time until a large number of worker threads are injected, which can take a long time. Due to having to handle existing synchronous APIs using async-only libraries, the sync-over-async is not easy to avoid. Increasing the rate of thread injection through config was not desirable due to other performance issues with high worker thread counts.

Regression?

No

Testing

Validated on a reduced test case provided by a customer.

Risk

Low:

  • The behavioral changes are under a config switch that is disabled by default (and only in CoreCLR)
  • Although the config var is checked on some hot paths (enqueuing/dequeuing TP work items), tiered compilation would remove the checks, and the alternate code paths in the hot sections are minimal in size

However:

  • When enabled, the change may not resolve all of the issues in the actual services. Working around some but not all of the priority inversion issues involved may just shift the issue to another location without resolving the long stall.
  • When enabled, the change may deprioritize some work items that shouldn't be deprioritized
  • The change and the config var as proposed are probably not how we would want it to look long-term. If the change appears fruitful, and with further information, we can consider a better solution for the future.

…ion changes in cases involving a lot of sync-over-async

Some services that use a lot of sync-over-async were seen to experience stalls due to some priority inversion issues in work items that get queued to the thread pool. For instance, a work item W1 queues another work item W2 to the global queue and blocks waiting for a task to complete, where W2 would need to run in order to complete the task, but W2 is queued behind a number of other work items that operate like W1, and this sometimes leads to long-duration stalls. This change adds an experimental config option that when enabled, enqueues some kinds of work items to a new low-priority global queue that is checked after all other global queues. This was seen to help in some cases.
@kouvel kouvel added this to the 8.0.x milestone Jun 25, 2024
@kouvel kouvel requested a review from stephentoub June 25, 2024 18:41
@kouvel kouvel self-assigned this Jun 25, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@carlossanlop
Copy link
Contributor

@kouvel friendly reminder that code complete for the August Release is July 15th. If you want this fix to be included in that release, please merge this PR before that date.

Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. we will take for consideration in 8.0.x

@jeffschwMSFT jeffschwMSFT added the Servicing-consider Issue for next servicing release review label Jul 9, 2024
@leecow leecow added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Jul 9, 2024
@leecow leecow modified the milestones: 8.0.x, 8.0.8 Jul 9, 2024
@kouvel kouvel merged commit 57bd35a into dotnet:release/8.0-staging Jul 9, 2024
172 of 177 checks passed
@kouvel kouvel deleted the TpPriTestConfig8 branch July 9, 2024 19:43
@github-actions github-actions bot locked and limited conversation to collaborators Aug 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Threading Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants