Skip to content

Use PipeWriter to write files #24851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Conversation

davidfowl
Copy link
Member

@davidfowl davidfowl commented Aug 12, 2020

  • Use Async Disposable for FileStream
  • Use PipeWriter in FileResult
  • Clamp buffer between 1K and 64K
  • Changed Kestrel to Start the response before writing to the pipe
  • Use the ArrayPool in the Http1OutputProducer

Successor to #21989 cc @Kahbazi

TL;DR

  • Files < 4K have the most significant RPS win.
  • Most of the other sizes don't have an RPS win but GC stats have improved (16K gen1 being investigated).

After spending a couple of days with this PR I've noticed that it does allocate more for files over certain sizes. After deeper investigation, it seems like we're exhausting the ArrayPool for 16K buffers (which result in the array pool allocating).

image

My hunch is this happens for 2 possible reasons:

  • We're running faster and there's more 16K buffers outstanding.
  • We're holding onto the buffers longer and they get returned when the buffer is written to the socket rather than copied into the server's buffer (at least in the case of kestrel).

It's possible we need to use different strategies to maximize throughput and keep the GC happy.

Performance numbers

128K 32 connections

| application              | baseline128k | Current    |         |
| ------------------------ | ------------ | ---------- | ------- |
| CPU Usage (%)            |           65 |         60 |  -7.69% |
| Raw CPU Usage (%)        |       785.30 |     715.65 |  -8.87% |
| Working Set (MB)         |          176 |        181 |  +2.84% |
| Build Time (ms)          |        3,135 |      3,123 |  -0.38% |
| Start Time (ms)          |          255 |        267 |  +4.71% |
| Published Size (KB)      |      100,011 |    100,011 |   0.00% |
| CPU Usage (%)            |           66 |         60 |  -9.09% |
| Working Set (MB)         |          184 |        187 |  +1.63% |
| GC Heap Size (MB)        |           89 |        107 | +20.22% |
| Gen 0 GC (#/min)         |         1.00 |       1.00 |   0.00% |
| Gen 1 GC (#/min)         |         1.00 |       1.00 |   0.00% |
| Gen 2 GC (#/min)         |         0.00 |       0.00 |         |
| Time in GC (%)           |         0.00 |       0.00 |         |
| Gen 0 Size (B)           |   18,338,840 |  9,827,264 | -46.41% |
| Gen 1 Size (B)           |    1,721,520 |  3,267,944 | +89.83% |
| Gen 2 Size (B)           |    1,685,904 |  3,248,232 | +92.67% |
| LOH Size (B)             |      310,880 |    310,880 |   0.00% |
| POH Size (B)             |    2,360,016 |    262,480 | -88.88% |
| Allocation Rate (B/sec)  |   54,536,488 | 49,154,848 |  -9.87% |
| # of Assemblies Loaded   |           88 |         89 |  +1.14% |
| Exceptions (#/s)         |          125 |        114 |  -8.80% |
| ThreadPool Threads Count |           22 |         22 |   0.00% |
| Lock Contention (#/s)    |          576 |        332 | -42.36% |
| ThreadPool Queue Length  |            1 |          1 |   0.00% |
| ThreadPool Items (#/s)   |      129,027 |     94,099 | -27.07% |
| Active Timers            |            0 |          0 |         |
| IL Jitted (B)            |      209,363 |    279,054 | +33.29% |
| Methods Jitted           |        2,384 |      3,395 | +42.41% |


| load                | baseline128k | Current |         |
| ------------------- | ------------ | ------- | ------- |
| CPU Usage (%)       |           19 |      18 |  -5.26% |
| Raw CPU Usage (%)   |       227.93 |  217.28 |  -4.68% |
| Working Set (MB)    |            6 |       6 |   0.00% |
| Build Time (ms)     |        3,715 |   3,674 |  -1.10% |
| Start Time (ms)     |            0 |       0 |         |
| Published Size (KB) |       76,390 |  76,390 |   0.00% |
| First Request (ms)  |          114 |     165 | +44.74% |
| Requests/sec        |        8,960 |   8,976 |  +0.17% |
| Requests            |      135,129 | 135,181 |  +0.04% |
| Mean latency (ms)   |         3.78 |    3.79 |  +0.26% |
| Max latency (ms)    |        17.08 |   17.21 |  +0.76% |
| Bad responses       |            0 |       0 |         |
| Socket errors       |            0 |       0 |         |
| Latency 50th (ms)   |         3.00 |    3.03 |  +1.00% |
| Latency 75th (ms)   |         5.03 |    4.94 |  -1.79% |
| Latency 90th (ms)   |         7.36 |    7.34 |  -0.27% |
| Latency 99th (ms)   |        11.53 |   11.69 |  +1.39% |

16K 10 connections

| application              | baseline16k10c | Current     |         |
| ------------------------ | -------------- | ----------- | ------- |
| CPU Usage (%)            |             78 |          79 |  +1.28% |
| Raw CPU Usage (%)        |         935.64 |      950.98 |  +1.64% |
| Working Set (MB)         |            206 |         188 |  -8.74% |
| Build Time (ms)          |          3,637 |       3,904 |  +7.34% |
| Start Time (ms)          |            261 |         254 |  -2.68% |
| Published Size (KB)      |         95,046 |     100,011 |  +5.22% |
| CPU Usage (%)            |             78 |          81 |  +3.85% |
| Working Set (MB)         |            216 |         196 |  -9.26% |
| GC Heap Size (MB)        |            104 |          97 |  -6.73% |
| Gen 0 GC (#/min)         |           2.00 |        2.00 |   0.00% |
| Gen 1 GC (#/min)         |           1.00 |        1.00 |   0.00% |
| Gen 2 GC (#/min)         |           0.00 |        0.00 |         |
| Time in GC (%)           |           0.00 |        0.00 |         |
| Gen 0 Size (B)           |     18,509,856 |  17,065,096 |  -7.81% |
| Gen 1 Size (B)           |      1,032,016 |   1,245,440 | +20.68% |
| Gen 2 Size (B)           |      1,000,408 |   1,229,416 | +22.89% |
| LOH Size (B)             |        966,576 |      48,680 | -94.96% |
| Allocation Rate (B/sec)  |    128,537,136 | 126,106,264 |  -1.89% |
| # of Assemblies Loaded   |             92 |          89 |  -3.26% |
| Exceptions (#/s)         |             10 |          12 | +20.00% |
| ThreadPool Threads Count |             29 |          22 | -24.14% |
| Lock Contention (#/s)    |             10 |           7 | -30.00% |
| ThreadPool Queue Length  |              1 |           1 |   0.00% |
| ThreadPool Items (#/s)   |        110,113 |     139,676 | +26.85% |
| Active Timers            |              1 |           0 |         |


| load                | baseline16k10c | Current |         |
| ------------------- | -------------- | ------- | ------- |
| CPU Usage (%)       |             13 |      13 |   0.00% |
| Raw CPU Usage (%)   |         150.92 |  156.62 |  +3.78% |
| Working Set (MB)    |              4 |       4 |   0.00% |
| Build Time (ms)     |          3,690 |   3,708 |  +0.49% |
| Start Time (ms)     |              0 |       0 |         |
| Published Size (KB) |         76,390 |  76,390 |   0.00% |
| First Request (ms)  |            110 |     166 | +50.91% |
| Requests/sec        |         26,737 |  29,590 | +10.67% |
| Requests            |        403,717 | 446,823 | +10.68% |
| Mean latency (ms)   |           0.41 |    0.34 | -16.74% |
| Max latency (ms)    |          17.39 |   10.38 | -40.31% |
| Bad responses       |              0 |       0 |         |
| Socket errors       |              0 |       0 |         |
| Latency 50th (ms)   |           0.36 |    0.33 |  -8.64% |
| Latency 75th (ms)   |           0.40 |    0.37 |  -7.73% |
| Latency 90th (ms)   |           0.44 |    0.41 |  -7.92% |
| Latency 99th (ms)   |           1.58 |    0.68 | -56.77% |

16K (default connections)

| application              | baseline16k | Current     |          |
| ------------------------ | ----------- | ----------- | -------- |
| CPU Usage (%)            |          97 |          98 |   +1.03% |
| Raw CPU Usage (%)        |    1,166.45 |    1,177.12 |   +0.92% |
| Working Set (MB)         |         179 |         198 |  +10.61% |
| Build Time (ms)          |       4,375 |       3,255 |  -25.60% |
| Start Time (ms)          |         263 |         281 |   +6.84% |
| Published Size (KB)      |      95,046 |     100,011 |   +5.22% |
| CPU Usage (%)            |          96 |          98 |   +2.08% |
| Working Set (MB)         |         187 |         203 |   +8.56% |
| GC Heap Size (MB)        |         113 |         108 |   -4.42% |
| Gen 0 GC (#/min)         |        4.00 |        3.00 |  -25.00% |
| Gen 1 GC (#/min)         |        1.00 |        1.00 |    0.00% |
| Gen 2 GC (#/min)         |        0.00 |        0.00 |          |
| Time in GC (%)           |        0.00 |        0.00 |          |
| Gen 0 Size (B)           |  54,688,704 |  54,045,056 |   -1.18% |
| Gen 1 Size (B)           |   2,019,320 |  20,553,216 | +917.83% |
| Gen 2 Size (B)           |   3,699,344 |   5,475,432 |  +48.01% |
| LOH Size (B)             |   5,162,672 |      48,680 |  -99.06% |
| Allocation Rate (B/sec)  | 326,903,080 | 301,148,520 |   -7.88% |
| # of Assemblies Loaded   |          91 |          88 |   -3.30% |
| Exceptions (#/s)         |         945 |         968 |   +2.43% |
| ThreadPool Threads Count |          27 |          27 |    0.00% |
| Lock Contention (#/s)    |          19 |          21 |  +10.53% |
| ThreadPool Queue Length  |         161 |          23 |  -85.71% |
| ThreadPool Items (#/s)   |     264,077 |     274,856 |   +4.08% |
| Active Timers            |           1 |           0 |          |


| load                | baseline16k | Current   |         |
| ------------------- | ----------- | --------- | ------- |
| CPU Usage (%)       |          31 |        31 |   0.00% |
| Raw CPU Usage (%)   |      368.35 |    366.85 |  -0.41% |
| Working Set (MB)    |           7 |         7 |   0.00% |
| Build Time (ms)     |       3,695 |     3,691 |  -0.11% |
| Start Time (ms)     |           0 |         0 |         |
| Published Size (KB) |      76,390 |    76,390 |   0.00% |
| First Request (ms)  |         101 |       167 | +65.35% |
| Requests/sec        |      69,724 |    70,270 |  +0.78% |
| Requests            |   1,052,826 | 1,061,049 |  +0.78% |
| Mean latency (ms)   |        5.76 |      5.94 |  +3.13% |
| Max latency (ms)    |      273.03 |    292.88 |  +7.27% |
| Bad responses       |           0 |         0 |         |
| Socket errors       |           0 |         0 |         |
| Latency 50th (ms)   |        3.07 |      2.73 | -11.07% |
| Latency 75th (ms)   |        7.00 |      7.69 |  +9.86% |
| Latency 90th (ms)   |       13.58 |     14.25 |  +4.93% |
| Latency 99th (ms)   |       33.86 |     34.13 |  +0.80% |

4K

| application              | baseline4k    | Current     |         |
| ------------------------ | ------------- | ----------- | ------- |
| CPU Usage (%)            |            97 |          98 |  +1.03% |
| Raw CPU Usage (%)        |      1,162.82 |    1,179.32 |  +1.42% |
| Working Set (MB)         |           178 |         178 |   0.00% |
| Build Time (ms)          |         3,566 |       3,071 | -13.88% |
| Start Time (ms)          |           270 |         285 |  +5.56% |
| Published Size (KB)      |        95,046 |     100,011 |  +5.22% |
| CPU Usage (%)            |            96 |          97 |  +1.04% |
| Working Set (MB)         |           186 |         185 |  -0.54% |
| GC Heap Size (MB)        |           108 |          89 | -17.59% |
| Gen 0 GC (#/min)         |         20.00 |        5.00 | -75.00% |
| Gen 1 GC (#/min)         |          2.00 |        2.00 |   0.00% |
| Gen 2 GC (#/min)         |          1.00 |        1.00 |   0.00% |
| Time in GC (%)           |          3.00 |        0.00 |         |
| Gen 0 Size (B)           |    70,320,960 |  68,185,432 |  -3.04% |
| Gen 1 Size (B)           |     9,462,840 |   1,753,792 | -81.47% |
| Gen 2 Size (B)           |     4,431,744 |   3,520,248 | -20.57% |
| LOH Size (B)             |     2,277,856 |      48,680 | -97.86% |
| Allocation Rate (B/sec)  | 1,931,641,248 | 436,676,264 | -77.39% |
| # of Assemblies Loaded   |            91 |          88 |  -3.30% |
| Exceptions (#/s)         |           885 |         880 |  -0.56% |
| ThreadPool Threads Count |            26 |          28 |  +7.69% |
| Lock Contention (#/s)    |             9 |           8 | -11.11% |
| ThreadPool Queue Length  |           201 |         158 | -21.39% |
| ThreadPool Items (#/s)   |       289,197 |     217,864 | -24.67% |
| Active Timers            |             1 |           0 |         |


| load                | baseline4k | Current   |         |
| ------------------- | ---------- | --------- | ------- |
| CPU Usage (%)       |         30 |        29 |  -3.33% |
| Raw CPU Usage (%)   |     355.37 |    352.15 |  -0.91% |
| Working Set (MB)    |          7 |         7 |   0.00% |
| Build Time (ms)     |      3,681 |     3,658 |  -0.62% |
| Start Time (ms)     |          0 |         0 |         |
| Published Size (KB) |     76,390 |    76,390 |   0.00% |
| First Request (ms)  |        106 |       167 | +57.55% |
| Requests/sec        |     89,248 |   101,902 | +14.18% |
| Requests            |  1,347,435 | 1,538,587 | +14.19% |
| Mean latency (ms)   |       4.91 |      3.35 | -31.77% |
| Max latency (ms)    |   1,560.00 |    333.90 | -78.60% |
| Bad responses       |          0 |         0 |         |
| Socket errors       |          0 |         0 |         |
| Latency 50th (ms)   |       2.58 |      2.37 |  -8.14% |
| Latency 75th (ms)   |       2.83 |      2.71 |  -4.24% |
| Latency 90th (ms)   |       4.06 |      3.26 | -19.70% |
| Latency 99th (ms)   |      34.38 |     22.63 | -34.18% |

// End of the source stream.
if (read == 0)
{
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you advance here to get the writer in a working state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that only matters if you write something. At least in the Pipe implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought if you call GetMemory, you can't call GetMemory afterwards without advancing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope.

@@ -117,7 +117,7 @@ private static async Task SendFileAsyncCore(HttpResponse response, IFileInfo fil
{
fileContent.Seek(offset, SeekOrigin.Begin);
}
await StreamCopyOperation.CopyToAsync(fileContent, response.Body, count, cancellationToken);
await StreamCopyOperation.CopyToAsync(fileContent, response.BodyWriter, count, cancellationToken);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is this will be good for kestrel but bad for IIS/HTTP.SYS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Http.Sys implements SendFile so this code path is never hit there. I can't recall if IIS implements the feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot all servers now implement SendFile so the only change would be if somebody call the extension method explicitly called with an IFileInfo that doesn't have a physical path 😄

@davidfowl davidfowl force-pushed the davidfowl/static-files-perf branch from bee6d9f to 12e00f8 Compare August 13, 2020 06:37
{
try
{
if (range == null)
{
await StreamCopyOperation.CopyToAsync(fileStream, outputStream, count: null, bufferSize: BufferSize, cancel: context.RequestAborted);
await StreamCopyOperation.CopyToAsync(fileStream, outputPipeWriter, count: null, cancel: context.RequestAborted);
}
else
{
fileStream.Seek(range.From.Value, SeekOrigin.Begin);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does here also need a check for range.From.Value > 0 before seek?
Also does it need a null check for range.From?

Kahbazi and others added 2 commits August 13, 2020 21:53
- Use Async Disposable for FileStream
- Use PipeWriter in FileResult
- Call StartAsync before using the PipeWriter
- Add SendFileAsync to HttpProtocol and call StartAsync in there to make sure we're not buffering more than we need to.
- Use the array pool as a fallback for large buffers
- Max buffer size of 64K
@davidfowl davidfowl force-pushed the davidfowl/static-files-perf branch from 681d2b3 to b7f55e1 Compare August 14, 2020 06:38
@Tratcher
Copy link
Member

How about IIS? It doesn't directly support SendFile either.

=> SendFileFallback.SendFileAsync(ResponseBody, path, offset, count, cancellation);

@jkotalik
Copy link
Contributor

jkotalik commented Aug 14, 2020

How about IIS? It doesn't directly support SendFile either.

Number 1 features in IIS I wish we implemented 😄

@davidfowl
Copy link
Member Author

We should directly support the feature in IIS but it doesn't affect mainstream scenarios for this change.

Copy link
Member

@JamesNK JamesNK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has new public API and should be API reviewed

Never mind, the previous PR was API reviewed.

Debug.Assert(buffer != null);

while (true)
{
// The natural end of the range.
if (bytesRemaining.HasValue && bytesRemaining.GetValueOrDefault() <= 0)
if (bytesRemaining.HasValue && bytesRemaining.Value <= 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change these to Value? Value checks whether the value is null and throws an error if it is. GetValueOrDefault returns whatever the value is, making it faster.

@davidfowl
Copy link
Member Author

I'm not merging this. For buffers > 4K the GC performance is worse. I'm down a rabbit hole in excel analyzing the array pool behavior with pipelines.

The other approach is to use 4K buffers no matter what the size of the file is. This results in more allocations because of the overhead of FileStream.ReadAsync.

@davidfowl davidfowl closed this Aug 17, 2020
@dougbu dougbu deleted the davidfowl/static-files-perf branch August 21, 2021 22:34
@amcasey amcasey added area-middleware Includes: URL rewrite, redirect, response cache/compression, session, and other general middlewares area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 6, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-middleware Includes: URL rewrite, redirect, response cache/compression, session, and other general middlewares area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants