-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Reduce asyncstatemachine cost in readLineAsyncInternal. #62479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @dotnet/area-system-io Issue Detailsnull
|
Before:
After:
|
#62061 <- referenced |
Did I broke the CI for arm? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I think the CI just needs to be restarted, happens often. /cc @adamsitnik |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some minor style-related comments. Otherwise than that, the changes look good, I am really happy to see the improvements! 👍
I have few more ideas that you could try (in a separate PR):
- using
[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
to annotateReadBufferAsync
method (more context: #49903) - using vectorized
IndexOfAny
similarly to what @nietras did in #60463 - optimizing for happy-path scenario where we know that we don't need to fetch any more data by eliminating awaits
@Trapov big thanks for your contribution!
src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs
Outdated
Show resolved
Hide resolved
i++; | ||
} while (i < tmpCharLen); | ||
i = tmpCharLen - tmpCharPos; | ||
sb ??= new StringBuilder(i + 80); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamsitnik I was wondering whether it would be possible to do pooling of the StringBuilder
instance, to avoid this allocation in long line cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nietras it's definitely a good idea! Perhaps we could even replace the private char array ( _charBuffer
) with a StringBuilder
, write the chars directly to it and avoid the need of appending? The only thing I am not sure of is what side effects using StringBuilder.ToString(int startIndex, int length)
might have
Will open the second PR with those suggestions in mind. Feel free to merge this one, I guess? |
@Trapov note that we have a common micro-benchmark for this now with dotnet/performance#2170 FYI: You can compare your changes with main by calling something like:
|
Before:
After:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
else | ||
// Note the following common line feed chars: | ||
// \n - UNIX \r\n - DOS \r - Mac | ||
if (ch is '\r' or '\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This produces identical IL and asm, doesn't it? I would not expect a perf improvement from this. What will be different and might yield a micro-improvement is ch == '\r' | ch == '\n'
, removing a (likely-well-predicted) branch.
return null; | ||
|
||
return await ConsumeBufferedData().ConfigureAwait(false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this make the NoDataInTheBuffer case more expensive? Does the perf test being run stress this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does but it might not be the most used one case, at least that's what I hope for. The benchmarks -> dotnet/performance@529f33c#diff-64fc887dbc1598f2ea871592946f4077593925f80b22cab6f586ea7e1e1b0723
|
||
return await ConsumeBufferedData().ConfigureAwait(false); | ||
} | ||
async Task<string?> ConsumeBufferedData() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a theory you can share for why this improves perf? There are three cases here:
- There's no data in the buffer and we're at EOF.
- There's no data in the buffer but there's remaining data.
- There's data in the buffer.
In all three cases, previously you would be calling a single async method. In all three cases, now you're still calling at least one async method, and in the second case, it's now calling a second and doing more work than before. The third case is just adding an extra layer of indirection. And the first case should only occur once per file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that I could cut the state-machine in half (and eleminate awaits) and switch in two paths without any state-machines where in one part the third case is making less work and in another part the first/second ones (NoDataInBuffer) are doing more work (but they're not the most used ones probably), at least that was my reasoning behind it. But thinking about it again and looking at the perf-results not every case is a winner. Specificaly short lines are the worst.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method | Job | Toolchain | LineLengthRange | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 0, 0] | 535.71 us | 3.930 us | 3.281 us | 534.51 us | 529.92 us | 541.75 us | 1.00 | 0.00 | 187.5000 | 2.1552 | 1,155 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 0, 0] | 512.69 us | 4.709 us | 4.174 us | 512.40 us | 504.80 us | 519.53 us | 0.96 | 0.01 | 187.5000 | 2.0161 | 1,156 KB |
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 0, 1024] | 29.21 us | 0.369 us | 0.345 us | 29.12 us | 28.65 us | 29.84 us | 1.00 | 0.00 | 10.4360 | 0.1160 | 65 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 0, 1024] | 19.96 us | 0.153 us | 0.143 us | 19.95 us | 19.70 us | 20.18 us | 0.68 | 0.01 | 10.4828 | 0.1588 | 65 KB |
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 1, 1] | 320.86 us | 1.957 us | 1.634 us | 321.01 us | 317.07 us | 323.00 us | 1.00 | 0.00 | 125.0000 | 1.3021 | 771 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 1, 1] | 322.20 us | 1.850 us | 1.640 us | 322.16 us | 319.25 us | 325.77 us | 1.00 | 0.01 | 125.0000 | 1.2755 | 772 KB |
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 1, 8] | 169.12 us | 2.478 us | 2.069 us | 169.55 us | 164.25 us | 172.48 us | 1.00 | 0.00 | 55.7796 | 0.6720 | 344 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 1, 8] | 172.79 us | 2.950 us | 2.760 us | 172.04 us | 168.83 us | 177.28 us | 1.02 | 0.02 | 55.7065 | 0.6793 | 344 KB |
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 9, 32] | 58.74 us | 0.469 us | 0.391 us | 58.80 us | 58.05 us | 59.53 us | 1.00 | 0.00 | 17.9228 | 0.2298 | 110 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 9, 32] | 56.41 us | 0.530 us | 0.470 us | 56.53 us | 55.72 us | 57.13 us | 0.96 | 0.01 | 18.0201 | 0.2281 | 110 KB |
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 33, 128] | 35.15 us | 0.317 us | 0.281 us | 35.13 us | 34.73 us | 35.60 us | 1.00 | 0.00 | 9.5291 | - | 59 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 33, 128] | 27.54 us | 0.216 us | 0.192 us | 27.53 us | 27.33 us | 27.93 us | 0.78 | 0.01 | 9.6831 | 0.1100 | 59 KB |
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 129, 1024] | 29.31 us | 0.279 us | 0.233 us | 29.27 us | 28.95 us | 29.81 us | 1.00 | 0.00 | 10.5932 | 0.1177 | 65 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [ 129, 1024] | 20.26 us | 0.105 us | 0.088 us | 20.29 us | 20.05 us | 20.34 us | 0.69 | 0.01 | 10.6041 | - | 65 KB |
ReadLineAsync | Job-VRCDUV | /testhost-main/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [1025, 2048] | 31.22 us | 0.457 us | 0.405 us | 31.31 us | 30.56 us | 31.99 us | 1.00 | 0.00 | 14.9457 | 0.3706 | 92 KB |
ReadLineAsync | Job-UBMMAT | /testhost/net7.0-OSX-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun | [1025, 2048] | 20.79 us | 0.195 us | 0.182 us | 20.75 us | 20.50 us | 21.13 us | 0.67 | 0.01 | 14.9402 | 0.4150 | 92 KB |
@stephentoub dotnet/performance@529f33c#diff-64fc887dbc1598f2ea871592946f4077593925f80b22cab6f586ea7e1e1b0723
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented out of benchmarks the sync method for cleaner results between main and this branch
@Trapov FYI you can filter benchmarks by adding method name to command line .e.g
by using this with |
|
Thanks for your efforts here, @Trapov. I spent some time today looking into this, as I wouldn't have expected this change to make as meaningful a difference as the numbers show: in either case you're still invoking an async method, and the await being removed from one of those async methods isn't even changing the size of that state machine (it's removing an await using an awaiter type that's also used elsewhere in the method), and on top of that the benchmark is using a stream type whose read methods always complete synchronously. That said, I do see the same results on my machine when I pull down your PR. What's interesting, though, is why the improvement happens. Fundamentally, it looks to be due to the code the C# compiler happens to generate for the exact circumstance in play here. You can see this if you update your commit to just add back the There are also other improvements possible in ReadLineAsync that will make a much larger impact, like the ones @adamsitnik mentioned earlier. For example, here's a quick hackjob that a) uses IndexOfAny and b) caches a StringBuilder onto StreamReader:
This shows much larger wins from main to that commit, and then the improvements seen in this PR largely disappear, I believe because enough was perturbed in the C# codegen to make those transient wins go away. I'd like to suggest this PR be closed and we instead focus on making some of those other improvements. |
Closing this one because of how strongly the changes depend on the compiler and open a new one in a few moments. |
No description provided.