Fix some scaling issues with the global queue in the thread pool #69386

Merged: kouvel merged 3 commits into dotnet:main from kouvel:TpSplit on Jun 7, 2022

Conversation

@kouvel (Contributor) commented May 16, 2022

  • The global concurrent queue does not scale well in some situations on machines with a large number of processors. A lot of contention is seen when dequeuing work on some benchmarks.
  • I initially tried switching to queuing work from thread pool threads more independently, with more efficient work stealing, but that approach looks like it needs more investigation/experimentation. This is a simpler change that seems to work reasonably well for now.
  • Beyond 32 procs, additional concurrent queues are added, one per 16 procs. When a worker thread begins dispatching work, it assigns itself one of those additional queues, aiming to limit each queue to about 16 assigned worker threads.
  • Work items queued by a worker thread go into its assigned queue. The worker thread dequeues from its assigned queue first and only later tries to dequeue from the other queues. Work items queued by non-thread-pool threads continue to go into the global queue.
  • When a worker thread stops dispatching work items, it unassigns itself from the queue and, if it was the last thread assigned to that queue, may transfer the remaining work items out of it.
  • In the observed cases, work items are distributed across the different queues and contention is reduced because only a limited number of threads operate on each queue. A sketch of the scheme appears below.

Fixes #67845
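Below is a minimal, self-contained C# sketch of the scheme described above, for illustration only. It is not the runtime's actual implementation; the type and member names (`AssignableQueue`, `AssignQueue`, `UnassignQueue`, `Enqueue`, `TryDequeue`) and the queue-count formula are assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

internal sealed class AssignableQueue
{
    public readonly ConcurrentQueue<object> Items = new();
    public int AssignedWorkers; // worker threads currently assigned to this queue
}

internal static class ScalableGlobalQueueSketch
{
    private const int ProcsPerQueue = 16;

    // Queue used by non-thread-pool threads (and as a fallback when searching for work).
    private static readonly ConcurrentQueue<object> s_globalQueue = new();

    // Extra queues exist only beyond 32 procs, roughly one per 16 procs
    // (the exact formula here is an assumption, not the runtime's).
    private static readonly AssignableQueue[] s_queues =
        Enumerable.Range(0, Math.Max(0, (Environment.ProcessorCount - 32 + ProcsPerQueue - 1) / ProcsPerQueue))
                  .Select(_ => new AssignableQueue())
                  .ToArray();

    [ThreadStatic]
    private static AssignableQueue? t_assigned;

    // Called when a worker thread begins dispatching work items.
    public static void AssignQueue()
    {
        if (s_queues.Length == 0)
            return;

        // Pick the queue with the fewest assigned workers, aiming for ~16 workers per queue.
        AssignableQueue best = s_queues[0];
        foreach (AssignableQueue q in s_queues)
            if (q.AssignedWorkers < best.AssignedWorkers)
                best = q;

        Interlocked.Increment(ref best.AssignedWorkers);
        t_assigned = best;
    }

    // Called when a worker thread stops dispatching work items.
    public static void UnassignQueue()
    {
        AssignableQueue? q = t_assigned;
        if (q == null)
            return;

        t_assigned = null;
        if (Interlocked.Decrement(ref q.AssignedWorkers) == 0)
        {
            // Last assigned worker: transfer any leftover items so they are not stranded.
            while (q.Items.TryDequeue(out object? item))
                s_globalQueue.Enqueue(item);
        }
    }

    public static void Enqueue(object workItem) =>
        // Worker threads enqueue to their assigned queue; other threads use the global queue.
        (t_assigned?.Items ?? s_globalQueue).Enqueue(workItem);

    public static bool TryDequeue(out object? workItem)
    {
        // Check the assigned queue first, then the global queue, then the remaining queues.
        AssignableQueue? assigned = t_assigned;
        if (assigned != null && assigned.Items.TryDequeue(out workItem))
            return true;
        if (s_globalQueue.TryDequeue(out workItem))
            return true;
        foreach (AssignableQueue q in s_queues)
            if (q != assigned && q.Items.TryDequeue(out workItem))
                return true;

        workItem = null;
        return false;
    }
}
```

In this sketch, a worker would call `AssignQueue()` when it begins dispatching, `Enqueue`/`TryDequeue` while processing work items, and `UnassignQueue()` when it stops dispatching. Capping each queue at roughly 16 assigned workers is what limits the dequeue contention described above, while the fallback scan still lets threads find work queued elsewhere.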

@kouvel kouvel added this to the 7.0.0 milestone May 16, 2022
@kouvel kouvel self-assigned this May 16, 2022
@ghost commented May 16, 2022

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue details: Author: kouvel; Assignees: kouvel; Labels: area-System.Threading; Milestone: 7.0.0

@kouvel (Contributor, Author) commented May 16, 2022

RPS numbers for ASP.NET perf on an 80-proc arm64 Linux machine are below. For this machine's comparison, I included dotnet/aspnetcore#40476 on both the before and after sides, because in some cases the thread pool global queue bottleneck only showed up after that fix.

| RPS | Before | After | Diff |
| --- | --- | --- | --- |
| PlaintextPlatform | 7896486 | 11830164 | 49.8% |
| JsonPlatform | 310950 | 1146701 | 268.8% |
| FortunesPlatform | 334955 | 437976 | 30.8% |
| Plaintext | 9785381 | 10187365 | 4.1% |
| Json | 270047 | 1089592 | 303.5% |
| Fortunes | 244937 | 331054 | 35.2% |

No significant change on a 48-proc x64 Linux machine:

| RPS | Before | After | Diff |
| --- | --- | --- | --- |
| PlaintextPlatform | 12042603 | 12051597 | 0.1% |
| JsonPlatform | 1352148 | 1348159 | -0.3% |
| FortunesPlatform | 470702 | 478676 | 1.7% |
| Plaintext | 7760305 | 7835156 | 1.0% |
| Json | 1053096 | 1059431 | 0.6% |
| Fortunes | 397688 | 400052 | 0.6% |

No significant change on other Citrine machines, where only one queue is used.

@kouvel kouvel requested review from mangod9 and janvorli May 16, 2022 09:58
@kunalspathak (Member)

FYI - @sebastienros @JulieLeeMSFT

@sebastienros (Member)

And these numbers are without solving the same issue in aspnet? In my tests this was still required so I find this very promising.

@JulieLeeMSFT (Member)

> No significant change on a 48-proc x64 Linux machine:

@kouvel is this RPS measurement?

@kouvel (Contributor, Author) commented May 16, 2022

> And these numbers are without solving the same issue in aspnet?

For the Ampere machine, I included your fix from dotnet/aspnetcore#40476 in both the before and after columns, so those numbers include the aspnet fix; in some tests the thread pool scaling issue only showed up after that. For the other machines I didn't include the aspnet fix.

> @kouvel is this RPS measurement?

Yes, these are RPS numbers. I'll check the mean latency on the Ampere machine and add those numbers.

@kouvel (Contributor, Author) commented May 16, 2022

Here are the mean latency numbers for ASP.NET perf on the 80-proc arm64 Linux machine:

| Latency | Before (ms) | After (ms) | Diff (ms) |
| --- | --- | --- | --- |
| PlaintextPlatform | 1.43 | 1.64 | 0.21 |
| JsonPlatform | 1.86 | 0.56 | -1.30 |
| FortunesPlatform | 1.86 | 1.61 | -0.25 |
| Plaintext | 0.33 | 0.31 | -0.02 |
| Json | 0.97 | 0.31 | -0.66 |
| Fortunes | 1.52 | 1.18 | -0.34 |

@kunalspathak (Member)

@kouvel - any update on this PR? Is there anything left?

@kouvel (Contributor, Author) commented Jun 1, 2022

This needs a review. @janvorli or @mangod9, could you please take a look?

@janvorli (Member) left a comment

LGTM with one nit. I'm sorry it took me so long; I'd forgotten about this PR.

@stephentoub (Member)

This maintains the FIFO nature of work items queued by a given thread to the global queue. Technically it loses the FIFO nature of multiple threads queueing to the global queue, right? e.g. if thread 1 queues a work item, then signals to thread 2 which in turn queues a work item, it's possible thread 2's work item could be dequeued before thread 1's, yes? I assume we think the chances of that negatively impacting something are slim.

@kouvel (Contributor, Author) commented Jun 3, 2022

> Technically it loses the FIFO nature of multiple threads queueing to the global queue, right? e.g. if thread 1 queues a work item, then signals to thread 2 which in turn queues a work item, it's possible thread 2's work item could be dequeued before thread 1's, yes? I assume we think the chances of that negatively impacting something are slim.

Yeah, it's a tradeoff. Maintaining a strict global FIFO order for work items comes with scalability issues. Based on my other testing, breaking the FIFO order further (such as with independent queuing of work by threads plus work stealing) currently has some issues with the ordering of certain work items, although that model may work better eventually. This change tries to maintain some FIFO ordering while breaking enough of it to scale better. I don't anticipate many issues from it, since where it applies, the scalability problem seems to be the larger issue. Another potential tradeoff is that on larger machines with low load, some threads may have to spend more CPU time searching more queues for work, resulting in more CPU time for the same work done. I haven't seen that yet, and it should be workable if it becomes an issue.
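As a hypothetical single-threaded illustration of the ordering point discussed here (the queue and item names are made up, and the per-group queues are modeled as plain `ConcurrentQueue` instances), the following shows how cross-thread FIFO order can be lost: item1 is enqueued first but lands in a different per-group queue than item2, so a worker that checks its own assigned queue first dequeues item2 before item1.

```csharp
using System;
using System.Collections.Concurrent;

// Two of the hypothetical per-group queues from the scheme described in this PR.
var queueA = new ConcurrentQueue<string>(); // assigned to worker thread 1
var queueB = new ConcurrentQueue<string>(); // assigned to worker thread 2

queueA.Enqueue("item1 (queued first, by thread 1)");
queueB.Enqueue("item2 (queued second, by thread 2, after thread 1 signaled it)");

// A worker assigned to queueB checks its own queue before the others,
// so the later item can be dequeued (and run) first.
queueB.TryDequeue(out string? first);
queueA.TryDequeue(out string? second);

Console.WriteLine(first);  // item2
Console.WriteLine(second); // item1
```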

@kouvel kouvel merged commit deb4044 into dotnet:main Jun 7, 2022
@kouvel kouvel deleted the TpSplit branch June 7, 2022 15:01
kouvel added a commit to dotnet/diagnostics that referenced this pull request Jun 7, 2022
* Update SOS to show thread pool work items from new queues

- Depends on dotnet/runtime#69386
- The PR above added new queues of work items. This change updates the `ThreadPool -wi` command to also show work items from those new queues.
- Previously, on x86 it looked like the command was reading 8 bytes from a pointer-sized array element and only using the lower 4 bytes. Fixed a few places to read only pointer-size.
@EgorBo (Member) commented Jun 16, 2022

Improvements on Linux-arm64: dotnet/perf-autofiling-issues#6172, dotnet/perf-autofiling-issues#6157

@adamsitnik (Member)

@kouvel, is there any chance you are going to blog about this or do a Platform talk? It looks like a very interesting problem; I would love to hear the whole story behind it.

@kouvel (Contributor, Author) commented Jul 6, 2022

There's some more info about the issue I ran into in dotnet/aspnetcore#41391. There are more things to investigate, including polling for IO on worker threads. It remains to be seen what other changes may help and whether they would work better than the current approach. We can chat if you'd like.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 5, 2022
Linked issue: The thread pool's global queue doesn't scale well on machines with a large processor count (#67845)

9 participants