
Partition more ConcurrentQueues in Kestrel #42237


Merged
halter73 merged 2 commits into main from halter73/partition-more-concurrentqueues on Jun 17, 2022

Conversation

halter73
Member

  • PinnedMemoryPool uses a ConcurrentQueue just like IOQueue, so we should partition it the same way, which this PR does
  • We used to do the exact same thing per LibuvThread. I thought I did the same thing in SocketConnectionListener, but I guess not.

Ultimately, we can be smarter about how we partition ConcurrentQueues at the Kestrel layer than we can at a lower layer like within ConcurrentQueue itself, because we can ensure each partition is used exclusively by the code servicing sockets assigned to that partition. The partitioned queues can still be accessed by the application from the thread pool, so it's not perfect, but especially when running inline this should avoid issues related to high core counts and NUMA.
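To illustrate the idea, here is a minimal sketch (with hypothetical names, not Kestrel's actual internal types like PinnedBlockMemoryPool or IOQueue): each partition owns its own ConcurrentQueue of free blocks, and a connection only ever rents and returns through the queue of the partition it was assigned to.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch of per-partition block pooling; the real
// implementation lives in Kestrel's memory pool and socket transport.
public sealed class PartitionedBlockPool
{
    private const int BlockSize = 4096;
    private readonly ConcurrentQueue<byte[]>[] _partitions;
    private int _nextPartition;

    public PartitionedBlockPool(int partitionCount)
    {
        _partitions = new ConcurrentQueue<byte[]>[partitionCount];
        for (var i = 0; i < partitionCount; i++)
        {
            _partitions[i] = new ConcurrentQueue<byte[]>();
        }
    }

    // Assign each new connection to a partition round-robin, mirroring
    // how connections are already spread across IOQueues.
    public int GetNextPartition()
        => (int)((uint)Interlocked.Increment(ref _nextPartition) % (uint)_partitions.Length);

    // Rent a block from this partition's queue, falling back to a fresh
    // allocation when the queue is empty.
    public byte[] Rent(int partition)
        => _partitions[partition].TryDequeue(out var block) ? block : new byte[BlockSize];

    // Return the block to the same partition it was rented from.
    public void Return(int partition, byte[] block)
        => _partitions[partition].Enqueue(block);
}
```

Because the sockets in a partition rent and return blocks only through their own queue, contention on any single ConcurrentQueue drops roughly in proportion to the number of partitions.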

@sebastienros Can we benchmark on our high core count machines that are seeing a lot of contention in ConcurrentQueue?

@sebastienros sebastienros added the blog-candidate Consider mentioning this in the release blog post label Jun 17, 2022
@sebastienros
Member

sebastienros commented Jun 17, 2022

I have seen PRs with more value, maybe?
On ARM64 80 cores, Linux

Plaintext Platform

| application           | base-plaintext            | pr-plaintext              |         |
| --------------------- | ------------------------- | ------------------------- | ------- |
| CPU Usage (%)         |                        96 |                        81 | -15.62% |
| Cores usage (%)       |                     7,671 |                     6,464 | -15.73% |
| Working Set (MB)      |                       187 |                       189 |  +1.07% |
| Private Memory (MB)   |                     1,338 |                     1,338 |   0.00% |
| Build Time (ms)       |                     3,841 |                     4,017 |  +4.58% |
| Start Time (ms)       |                       267 |                       273 |  +2.25% |
| Published Size (KB)   |                   105,362 |                   105,362 |   0.00% |
| .NET Core SDK Version | 7.0.100-preview.6.22316.8 | 7.0.100-preview.6.22316.8 |         |


| load                   | base-plaintext | pr-plaintext |          |
| ---------------------- | -------------- | ------------ | -------- |
| CPU Usage (%)          |              7 |           45 | +542.86% |
| Cores usage (%)        |            337 |        2,159 | +540.65% |
| Working Set (MB)       |             38 |           38 |    0.00% |
| Private Memory (MB)    |            370 |          370 |    0.00% |
| Start Time (ms)        |              0 |            0 |          |
| First Request (ms)     |             62 |           75 |  +20.97% |
| Requests/sec           |      2,388,167 |   14,678,357 | +514.63% |
| Requests               |     36,059,808 |  221,609,656 | +514.56% |
| Mean latency (ms)      |           7.20 |         1.19 |  -83.47% |
| Max latency (ms)       |         177.30 |       224.39 |  +26.56% |
| Bad responses          |              0 |            0 |          |
| Socket errors          |              0 |            0 |          |
| Read throughput (MB/s) |         286.97 |     1,761.28 | +513.75% |
| Latency 50th (ms)      |           4.36 |         0.59 |  -86.38% |
| Latency 75th (ms)      |           9.94 |         0.98 |  -90.14% |
| Latency 90th (ms)      |          17.55 |         1.61 |  -90.83% |
| Latency 99th (ms)      |          31.97 |        11.46 |  -64.15% |

Json Platform:

| application           | base-json                 | pr-json                   |         |
| --------------------- | ------------------------- | ------------------------- | ------- |
| CPU Usage (%)         |                        97 |                        65 | -32.99% |
| Cores usage (%)       |                     7,741 |                     5,209 | -32.71% |
| Working Set (MB)      |                       157 |                       158 |  +0.64% |
| Private Memory (MB)   |                     1,307 |                     1,307 |   0.00% |
| Build Time (ms)       |                     3,923 |                     3,827 |  -2.45% |
| Start Time (ms)       |                       266 |                       271 |  +1.88% |
| Published Size (KB)   |                   105,362 |                   105,362 |   0.00% |
| .NET Core SDK Version | 7.0.100-preview.6.22316.8 | 7.0.100-preview.6.22316.8 |         |


| load                   | base-json | pr-json    |          |
| ---------------------- | --------- | ---------- | -------- |
| CPU Usage (%)          |         9 |         34 | +277.78% |
| Cores usage (%)        |       421 |      1,613 | +283.14% |
| Working Set (MB)       |        38 |         38 |    0.00% |
| Private Memory (MB)    |       363 |        363 |    0.00% |
| Start Time (ms)        |         0 |          0 |          |
| First Request (ms)     |        62 |         69 |  +11.29% |
| Requests/sec           |   272,998 |  1,123,904 | +311.69% |
| Requests               | 4,122,123 | 16,970,669 | +311.70% |
| Mean latency (ms)      |      2.00 |       0.59 |  -70.49% |
| Max latency (ms)       |    100.24 |      78.29 |  -21.90% |
| Bad responses          |         0 |          0 |          |
| Socket errors          |         0 |          0 |          |
| Read throughput (MB/s) |     38.01 |     156.49 | +311.71% |
| Latency 50th (ms)      |      1.78 |       0.37 |  -79.27% |
| Latency 75th (ms)      |      2.53 |       0.54 |  -78.77% |
| Latency 90th (ms)      |      3.34 |       1.02 |  -69.46% |
| Latency 99th (ms)      |      6.90 |       4.05 |  -41.30% |

Note that these runs already include the fix for the runtime bottleneck.

@sebastienros
Member

Ran benchmarks on Intel Linux, 28 cores, and it doesn't regress there.

@adityamandaleeka adityamandaleeka added blog-candidate Consider mentioning this in the release blog post and removed blog-candidate Consider mentioning this in the release blog post labels Jun 17, 2022
@ghost

ghost commented Jun 17, 2022

halter73, this change will be considered for inclusion in the blog post for the release it'll ship in. Nice work!

Please ensure that the original comment in this thread contains a clear explanation of what the change does, why it's important (what problem does it solve?), and, if relevant, include things like code samples and/or performance numbers.

This content may not be exactly what goes into the blog post, but it will help the team putting together the announcement.

Thanks!

@adityamandaleeka
Member

Looks like the bot needs an @-sign added 😄

@halter73 halter73 merged commit 36e1456 into main Jun 17, 2022
@halter73 halter73 deleted the halter73/partition-more-concurrentqueues branch June 17, 2022 23:03
@ghost ghost added this to the 7.0-preview6 milestone Jun 17, 2022
@davidfowl davidfowl added the Perf label Aug 26, 2022
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 6, 2023