Fix some scaling issues with the global queue in the thread pool #69386

Merged: kouvel merged 3 commits into dotnet:main from kouvel:TpSplit on Jun 7, 2022

Conversation

@kouvel (Contributor) commented May 16, 2022

  • The global concurrent queue does not scale well in some situations on machines with a large number of processors. A lot of contention is seen when dequeuing work on some benchmarks.
  • I initially tried switching to queuing work from thread pool threads more independently, with more efficient work stealing, but that approach looks like it needs more investigation/experimentation. This is a simpler change that seems to work reasonably well for now.
  • Beyond 32 procs, additional concurrent queues are added, one per 16 procs. When a worker thread begins dispatching work, it assigns itself one of those additional queues, aiming to limit each queue to about 16 assigned worker threads.
  • Work items queued by a worker thread go into its assigned queue. The worker thread dequeues from its assigned queue first and only later tries to dequeue from the other queues. Work items queued by non-thread-pool threads continue to go into the global queue.
  • When a worker thread stops dispatching work items, it unassigns itself from the queue and, if it was the last thread assigned to that queue, may transfer the remaining work items out of it.
  • In the observed cases, work items are distributed across the different queues and contention is reduced because only a limited number of threads operate on each queue. A sketch of the scheme appears below.

Fixes #67845
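Below is a minimal, self-contained C# sketch of the scheme described above, for illustration only. It is not the runtime's actual implementation; the type and member names (`AssignableQueue`, `AssignQueue`, `UnassignQueue`, `Enqueue`, `TryDequeue`) and the queue-count formula are assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

internal sealed class AssignableQueue
{
    public readonly ConcurrentQueue<object> Items = new();
    public int AssignedWorkers; // worker threads currently assigned to this queue
}

internal static class ScalableGlobalQueueSketch
{
    private const int ProcsPerQueue = 16;

    // Queue used by non-thread-pool threads (and as a fallback when searching for work).
    private static readonly ConcurrentQueue<object> s_globalQueue = new();

    // Extra queues exist only beyond 32 procs, roughly one per 16 procs
    // (the exact formula here is an assumption, not the runtime's).
    private static readonly AssignableQueue[] s_queues =
        Enumerable.Range(0, Math.Max(0, (Environment.ProcessorCount - 32 + ProcsPerQueue - 1) / ProcsPerQueue))
                  .Select(_ => new AssignableQueue())
                  .ToArray();

    [ThreadStatic]
    private static AssignableQueue? t_assigned;

    // Called when a worker thread begins dispatching work items.
    public static void AssignQueue()
    {
        if (s_queues.Length == 0)
            return;

        // Pick the queue with the fewest assigned workers, aiming for ~16 workers per queue.
        AssignableQueue best = s_queues[0];
        foreach (AssignableQueue q in s_queues)
            if (q.AssignedWorkers < best.AssignedWorkers)
                best = q;

        Interlocked.Increment(ref best.AssignedWorkers);
        t_assigned = best;
    }

    // Called when a worker thread stops dispatching work items.
    public static void UnassignQueue()
    {
        AssignableQueue? q = t_assigned;
        if (q == null)
            return;

        t_assigned = null;
        if (Interlocked.Decrement(ref q.AssignedWorkers) == 0)
        {
            // Last assigned worker: transfer any leftover items so they are not stranded.
            while (q.Items.TryDequeue(out object? item))
                s_globalQueue.Enqueue(item);
        }
    }

    public static void Enqueue(object workItem) =>
        // Worker threads enqueue to their assigned queue; other threads use the global queue.
        (t_assigned?.Items ?? s_globalQueue).Enqueue(workItem);

    public static bool TryDequeue(out object? workItem)
    {
        // Check the assigned queue first, then the global queue, then the remaining queues.
        AssignableQueue? assigned = t_assigned;
        if (assigned != null && assigned.Items.TryDequeue(out workItem))
            return true;
        if (s_globalQueue.TryDequeue(out workItem))
            return true;
        foreach (AssignableQueue q in s_queues)
            if (q != assigned && q.Items.TryDequeue(out workItem))
                return true;

        workItem = null;
        return false;
    }
}
```

In this sketch, a worker would call `AssignQueue()` when it begins dispatching, `Enqueue`/`TryDequeue` while processing work items, and `UnassignQueue()` when it stops dispatching. Capping each queue at roughly 16 assigned workers is what limits the dequeue contention described above, while the fallback scan still lets threads find work queued elsewhere.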

@kouvel kouvel added this to the 7.0.0 milestone May 16, 2022
@kouvel kouvel self-assigned this May 16, 2022
@ghost commented May 16, 2022

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue details: Author: kouvel; Assignees: kouvel; Labels: area-System.Threading; Milestone: 7.0.0

@kouvel (Contributor, Author) commented May 16, 2022

RPS numbers for ASP.NET perf on an 80-proc arm64 Linux machine are below. For this machine's comparison, I included dotnet/aspnetcore#40476 on both the before and after sides, because in some cases the thread pool global queue bottleneck only showed up after that fix.

| RPS | Before | After | Diff |
| --- | --- | --- | --- |
| PlaintextPlatform | 7896486 | 11830164 | 49.8% |
| JsonPlatform | 310950 | 1146701 | 268.8% |
| FortunesPlatform | 334955 | 437976 | 30.8% |
| Plaintext | 9785381 | 10187365 | 4.1% |
| Json | 270047 | 1089592 | 303.5% |
| Fortunes | 244937 | 331054 | 35.2% |

No significant change on a 48-proc x64 Linux machine:

| RPS | Before | After | Diff |
| --- | --- | --- | --- |
| PlaintextPlatform | 12042603 | 12051597 | 0.1% |
| JsonPlatform | 1352148 | 1348159 | -0.3% |
| FortunesPlatform | 470702 | 478676 | 1.7% |
| Plaintext | 7760305 | 7835156 | 1.0% |
| Json | 1053096 | 1059431 | 0.6% |
| Fortunes | 397688 | 400052 | 0.6% |

No significant change on other Citrine machines, where only one queue is used.

@kouvel kouvel requested review from mangod9 and janvorli May 16, 2022 09:58
@kunalspathak (Member)

FYI - @sebastienros @JulieLeeMSFT

@sebastienros (Member)

And these numbers are without solving the same issue in aspnet? In my tests this was still required so I find this very promising.

@JulieLeeMSFT (Member)

> No significant change on a 48-proc x64 Linux machine:

@kouvel is this RPS measurement?

@kouvel (Contributor, Author) commented May 16, 2022

> And these numbers are without solving the same issue in aspnet?

For the Ampere machine, I included your fix from dotnet/aspnetcore#40476 in both the before and after columns, so those numbers include the aspnet fix; in some tests the thread pool scaling issue only showed up after that. For the other machines I didn't include the aspnet fix.

> @kouvel is this RPS measurement?

Yes, these are RPS numbers. I'll check the mean latency on the Ampere machine and add those numbers.

@kouvel (Contributor, Author) commented May 16, 2022

Here are the mean latency numbers for ASP.NET perf on the 80-proc arm64 Linux machine:

| Latency | Before (ms) | After (ms) | Diff (ms) |
| --- | --- | --- | --- |
| PlaintextPlatform | 1.43 | 1.64 | 0.21 |
| JsonPlatform | 1.86 | 0.56 | -1.30 |
| FortunesPlatform | 1.86 | 1.61 | -0.25 |
| Plaintext | 0.33 | 0.31 | -0.02 |
| Json | 0.97 | 0.31 | -0.66 |
| Fortunes | 1.52 | 1.18 | -0.34 |

@kunalspathak (Member)

@kouvel - any update on this PR? Is there anything left?

@kouvel (Contributor, Author) commented Jun 1, 2022

This needs a review. @janvorli or @mangod9, could you please take a look?

@janvorli (Member) left a comment

LGTM with one nit. I'm sorry it took me so long; I'd forgotten about this PR.

@stephentoub (Member)

This maintains the FIFO nature of work items queued by a given thread to the global queue. Technically it loses the FIFO nature of multiple threads queueing to the global queue, right? e.g. if thread 1 queues a work item, then signals to thread 2 which in turn queues a work item, it's possible thread 2's work item could be dequeued before thread 1's, yes? I assume we think the chances of that negatively impacting something are slim.

@kouvel (Contributor, Author) commented Jun 3, 2022

> Technically it loses the FIFO nature of multiple threads queueing to the global queue, right? e.g. if thread 1 queues a work item, then signals to thread 2 which in turn queues a work item, it's possible thread 2's work item could be dequeued before thread 1's, yes? I assume we think the chances of that negatively impacting something are slim.

Yeah, it's a tradeoff. Maintaining a strict global FIFO order for work items comes with scalability issues. Based on my other testing, breaking the FIFO order further (such as with independent queuing of work by threads plus work stealing) currently has some issues with the ordering of certain work items, although that model may work better eventually. This change tries to maintain some FIFO ordering while breaking enough of it to scale better. I don't anticipate many issues from it, since where it applies, the scalability problem seems to be the larger issue. Another potential tradeoff is that on larger machines with low load, some threads may have to spend more CPU time searching more queues for work, resulting in more CPU time for the same work done. I haven't seen that yet, and it should be workable if it becomes an issue.
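As a hypothetical single-threaded illustration of the ordering point discussed here (the queue and item names are made up, and the per-group queues are modeled as plain `ConcurrentQueue` instances), the following shows how cross-thread FIFO order can be lost: item1 is enqueued first but lands in a different per-group queue than item2, so a worker that checks its own assigned queue first dequeues item2 before item1.

```csharp
using System;
using System.Collections.Concurrent;

// Two of the hypothetical per-group queues from the scheme described in this PR.
var queueA = new ConcurrentQueue<string>(); // assigned to worker thread 1
var queueB = new ConcurrentQueue<string>(); // assigned to worker thread 2

queueA.Enqueue("item1 (queued first, by thread 1)");
queueB.Enqueue("item2 (queued second, by thread 2, after thread 1 signaled it)");

// A worker assigned to queueB checks its own queue before the others,
// so the later item can be dequeued (and run) first.
queueB.TryDequeue(out string? first);
queueA.TryDequeue(out string? second);

Console.WriteLine(first);  // item2
Console.WriteLine(second); // item1
```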

@kouvel kouvel merged commit deb4044 into dotnet:main Jun 7, 2022
@kouvel kouvel deleted the TpSplit branch June 7, 2022 15:01
kouvel added a commit to dotnet/diagnostics that referenced this pull request Jun 7, 2022
* Update SOS to show thread pool work items from new queues

- Depends on dotnet/runtime#69386
- The PR above added new queues of work items. This change updates the `ThreadPool -wi` command to also show work items from those new queues.
- Previously, on x86 it looked like the command was reading 8 bytes from a pointer-sized array element and only using the lower 4 bytes. Fixed a few places to read only pointer-size.
@EgorBo (Member) commented Jun 16, 2022

Improvements on Linux-arm64: dotnet/perf-autofiling-issues#6172, dotnet/perf-autofiling-issues#6157

@adamsitnik (Member)

@kouvel, is there any chance you are going to blog about this or do a Platform talk? It looks like a very interesting problem; I would love to hear the whole story behind it.

@kouvel (Contributor, Author) commented Jul 6, 2022

There's some more info about the issue I ran into in dotnet/aspnetcore#41391. There are more things to investigate, including polling for IO on worker threads. It remains to be seen what other changes may help and whether they would work better than the current approach. We can chat if you'd like.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 5, 2022
Linked issue: The thread pool's global queue doesn't scale well on machines with a large processor count (#67845)

9 participants