
Partition more ConcurrentQueues in Kestrel #42237


Merged
halter73 merged 2 commits into main from halter73/partition-more-concurrentqueues on Jun 17, 2022

Conversation

halter73
Member

  • PinnedMemoryPool uses a ConcurrentQueue just like IOQueue, so we should partition it the same way, which this PR does
  • We used to do the exact same thing per LibuvThread. I thought I did the same thing in SocketConnectionListener, but I guess not.

Ultimately, we can be smarter about how we partition ConcurrentQueues at the Kestrel layer than we can at a lower layer like within ConcurrentQueue itself, because we can ensure each partition is used exclusively by the code servicing sockets assigned to that partition. The partitioned queues can still be accessed by the application from the thread pool, so it's not perfect, but especially when running inline this should avoid issues related to high core counts and NUMA.
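To illustrate the idea, here is a minimal sketch (with hypothetical names, not Kestrel's actual internal types like PinnedBlockMemoryPool or IOQueue): each partition owns its own ConcurrentQueue of free blocks, and a connection only ever rents and returns through the queue of the partition it was assigned to.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch of per-partition block pooling; the real
// implementation lives in Kestrel's memory pool and socket transport.
public sealed class PartitionedBlockPool
{
    private const int BlockSize = 4096;
    private readonly ConcurrentQueue<byte[]>[] _partitions;
    private int _nextPartition;

    public PartitionedBlockPool(int partitionCount)
    {
        _partitions = new ConcurrentQueue<byte[]>[partitionCount];
        for (var i = 0; i < partitionCount; i++)
        {
            _partitions[i] = new ConcurrentQueue<byte[]>();
        }
    }

    // Assign each new connection to a partition round-robin, mirroring
    // how connections are already spread across IOQueues.
    public int GetNextPartition()
        => (int)((uint)Interlocked.Increment(ref _nextPartition) % (uint)_partitions.Length);

    // Rent a block from this partition's queue, falling back to a fresh
    // allocation when the queue is empty.
    public byte[] Rent(int partition)
        => _partitions[partition].TryDequeue(out var block) ? block : new byte[BlockSize];

    // Return the block to the same partition it was rented from.
    public void Return(int partition, byte[] block)
        => _partitions[partition].Enqueue(block);
}
```

Because the sockets in a partition rent and return blocks only through their own queue, contention on any single ConcurrentQueue drops roughly in proportion to the number of partitions.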

@sebastienros Can we benchmark on our high core count machines that are seeing a lot of contention in ConcurrentQueue?

@sebastienros sebastienros added the blog-candidate Consider mentioning this in the release blog post label Jun 17, 2022
@sebastienros
Member

sebastienros commented Jun 17, 2022

I have seen PRs with more value, maybe?
On ARM64 80 cores, Linux

Plaintext Platform

| application           | base-plaintext            | pr-plaintext              |         |
| --------------------- | ------------------------- | ------------------------- | ------- |
| CPU Usage (%)         |                        96 |                        81 | -15.62% |
| Cores usage (%)       |                     7,671 |                     6,464 | -15.73% |
| Working Set (MB)      |                       187 |                       189 |  +1.07% |
| Private Memory (MB)   |                     1,338 |                     1,338 |   0.00% |
| Build Time (ms)       |                     3,841 |                     4,017 |  +4.58% |
| Start Time (ms)       |                       267 |                       273 |  +2.25% |
| Published Size (KB)   |                   105,362 |                   105,362 |   0.00% |
| .NET Core SDK Version | 7.0.100-preview.6.22316.8 | 7.0.100-preview.6.22316.8 |         |


| load                   | base-plaintext | pr-plaintext |          |
| ---------------------- | -------------- | ------------ | -------- |
| CPU Usage (%)          |              7 |           45 | +542.86% |
| Cores usage (%)        |            337 |        2,159 | +540.65% |
| Working Set (MB)       |             38 |           38 |    0.00% |
| Private Memory (MB)    |            370 |          370 |    0.00% |
| Start Time (ms)        |              0 |            0 |          |
| First Request (ms)     |             62 |           75 |  +20.97% |
| Requests/sec           |      2,388,167 |   14,678,357 | +514.63% |
| Requests               |     36,059,808 |  221,609,656 | +514.56% |
| Mean latency (ms)      |           7.20 |         1.19 |  -83.47% |
| Max latency (ms)       |         177.30 |       224.39 |  +26.56% |
| Bad responses          |              0 |            0 |          |
| Socket errors          |              0 |            0 |          |
| Read throughput (MB/s) |         286.97 |     1,761.28 | +513.75% |
| Latency 50th (ms)      |           4.36 |         0.59 |  -86.38% |
| Latency 75th (ms)      |           9.94 |         0.98 |  -90.14% |
| Latency 90th (ms)      |          17.55 |         1.61 |  -90.83% |
| Latency 99th (ms)      |          31.97 |        11.46 |  -64.15% |

Json Platform:

| application           | base-json                 | pr-json                   |         |
| --------------------- | ------------------------- | ------------------------- | ------- |
| CPU Usage (%)         |                        97 |                        65 | -32.99% |
| Cores usage (%)       |                     7,741 |                     5,209 | -32.71% |
| Working Set (MB)      |                       157 |                       158 |  +0.64% |
| Private Memory (MB)   |                     1,307 |                     1,307 |   0.00% |
| Build Time (ms)       |                     3,923 |                     3,827 |  -2.45% |
| Start Time (ms)       |                       266 |                       271 |  +1.88% |
| Published Size (KB)   |                   105,362 |                   105,362 |   0.00% |
| .NET Core SDK Version | 7.0.100-preview.6.22316.8 | 7.0.100-preview.6.22316.8 |         |


| load                   | base-json | pr-json    |          |
| ---------------------- | --------- | ---------- | -------- |
| CPU Usage (%)          |         9 |         34 | +277.78% |
| Cores usage (%)        |       421 |      1,613 | +283.14% |
| Working Set (MB)       |        38 |         38 |    0.00% |
| Private Memory (MB)    |       363 |        363 |    0.00% |
| Start Time (ms)        |         0 |          0 |          |
| First Request (ms)     |        62 |         69 |  +11.29% |
| Requests/sec           |   272,998 |  1,123,904 | +311.69% |
| Requests               | 4,122,123 | 16,970,669 | +311.70% |
| Mean latency (ms)      |      2.00 |       0.59 |  -70.49% |
| Max latency (ms)       |    100.24 |      78.29 |  -21.90% |
| Bad responses          |         0 |          0 |          |
| Socket errors          |         0 |          0 |          |
| Read throughput (MB/s) |     38.01 |     156.49 | +311.71% |
| Latency 50th (ms)      |      1.78 |       0.37 |  -79.27% |
| Latency 75th (ms)      |      2.53 |       0.54 |  -78.77% |
| Latency 90th (ms)      |      3.34 |       1.02 |  -69.46% |
| Latency 99th (ms)      |      6.90 |       4.05 |  -41.30% |

Note that these runs already include the fix for the runtime bottleneck.

@sebastienros
Member

Ran benchmarks on Intel Linux, 28 cores, and it doesn't regress there.

@adityamandaleeka adityamandaleeka added blog-candidate Consider mentioning this in the release blog post and removed blog-candidate Consider mentioning this in the release blog post labels Jun 17, 2022
@ghost

ghost commented Jun 17, 2022

halter73, this change will be considered for inclusion in the blog post for the release it'll ship in. Nice work!

Please ensure that the original comment in this thread contains a clear explanation of what the change does, why it's important (what problem does it solve?), and, if relevant, include things like code samples and/or performance numbers.

This content may not be exactly what goes into the blog post, but it will help the team putting together the announcement.

Thanks!

@adityamandaleeka
Member

Looks like the bot needs an @-sign added 😄

@halter73 halter73 merged commit 36e1456 into main Jun 17, 2022
@halter73 halter73 deleted the halter73/partition-more-concurrentqueues branch June 17, 2022 23:03
@ghost ghost added this to the 7.0-preview6 milestone Jun 17, 2022
@davidfowl davidfowl added the Perf label Aug 26, 2022
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 6, 2023