-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Replace slow implementations in ASCIIUtility with fast implementations #22516
Conversation
Unit tests are at https://github.com/GrabYourPitchforks/corefx/blob/b83649c050964a3858c9dec2e48392afef83842e/src/System.Runtime/tests/System/Text/AsciiUtilityTests.netcoreapp.cs if you care to look at them. I don't know whether these unit tests belong in coreclr or corefx given the class itself is internal. Given the complexity of these methods I believe it prudent to have unit tests which exercises them directly along with all of their edge cases and boundary conditions, in addition to existing unit tests which exercise |
The experimental /cc @layomia as FYI. |
This replumbing should be included as part of this PR, I think. |
Agreed. Otherwise this PR is adding dead, untested code that duplicates existing functionality. |
@GrabYourPitchforks, I think one example would be: coreclr/src/System.Private.CoreLib/shared/System/Buffers/Text/Utf8Formatter/Utf8Formatter.Float.cs Lines 110 to 114 in 70cf6d4
|
@jkotas Here are the numbers for replacing Calls to (Edit: Remove outdated perf numbers.) |
|
Could you please share links to the code for these experiments? |
@jkotas Sure. Here are the relevant links: https://github.com/GrabYourPitchforks/coreclr/blob/ascii_perf/src/System.Private.CoreLib/shared/System/Text/ASCIIEncoding.cs#L541 shows the implementation of GrabYourPitchforks@f579048 is the change that was made between Experiment A and Experiment B. It was the simplest possible change I could think of to measure the cost of pinning the reference without touching any other logic. |
Methods with pinned locals do not inline today, even with AggressiveInlining. So you have measure a cost of an extra frame on the stack; not the cost of pinning. Also, if you pin around the whole implementation - you will not need to pin inside it anymore, and so you will same a bit there too. This saving is not reflected in your measurement. |
I confirmed through breaking into a debugger that it's inlined into the caller. Perhaps it's a recent change? It looks like much of the cost is stack spilling additional data into the Before: 00007ffe`3a45e570 57 push rdi
00007ffe`3a45e571 56 push rsi
00007ffe`3a45e572 55 push rbp
00007ffe`3a45e573 53 push rbx
00007ffe`3a45e574 4883ec38 sub rsp,38h
00007ffe`3a45e578 33c0 xor eax,eax
00007ffe`3a45e57a 4889442428 mov qword ptr [rsp+28h],rax
00007ffe`3a45e57f 488bd9 mov rbx,rcx
00007ffe`3a45e582 488bea mov rbp,rdx
00007ffe`3a45e585 418bf0 mov esi,r8d
00007ffe`3a45e588 498bf9 mov rdi,r9
00007ffe`3a45e58b 85f6 test esi,esi ; test for charCount == 0 After: 00007ffe`39d9e3a0 57 push rdi
00007ffe`39d9e3a1 56 push rsi
00007ffe`39d9e3a2 55 push rbp
00007ffe`39d9e3a3 53 push rbx
00007ffe`39d9e3a4 4883ec38 sub rsp,38h
00007ffe`39d9e3a8 488bf1 mov rsi,rcx
00007ffe`39d9e3ab 488d7c2420 lea rdi,[rsp+20h]
00007ffe`39d9e3b0 b906000000 mov ecx,6
00007ffe`39d9e3b5 33c0 xor eax,eax
00007ffe`39d9e3b7 f3ab rep stos dword ptr [rdi] ; stores 24 null (0x00) bytes starting at [rdi], perhaps to support fixed?
00007ffe`39d9e3b9 488bce mov rcx,rsi
00007ffe`39d9e3bc 488be9 mov rbp,rcx
00007ffe`39d9e3bf 488bfa mov rdi,rdx
00007ffe`39d9e3c2 418bf0 mov esi,r8d
00007ffe`39d9e3c5 498bd9 mov rbx,r9
00007ffe`39d9e3c8 85f6 test esi,esi ; test for charCount == 0 windbg also shows a call instruction directly from 0:000> t
System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(Byte ByRef, Int32, System.Text.DecoderWorker)+0xffffffff`a0c2e555:
00007ffe`39d9e415 e8eec7ffff call System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte2(Byte ByRef, UInt64) (00007ffe`39d9ac08) 0:000> kv
# Child-SP RetAddr : Args to Child : Call Site
00 000000f6`8cd7c448 00007ffe`39d9e489 : 00000000`00000000 00007ffe`99639b3a 0000f8df`00000002 00007ffe`39ce7348 : System_Private_CoreLib!System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte_Sse2(Byte ByRef, UInt64)
01 000000f6`8cd7c450 00007ffe`39d9e41a : 00000000`00000000 00000000`00000000 000000f6`8cd7c4e8 00007ffe`ebb3e3fd : System_Private_CoreLib!System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte2(Byte ByRef, UInt64)+0x2809
02 000000f6`8cd7c480 00007ffe`39d9e379 : 00002cc3`94d667db 00000000`80000000 000000f6`8cd7c700 00007ffe`995bcb56 : System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(Byte ByRef, Int32, System.Text.DecoderWorker)+0xffffffff`a0c2e55a
03 000000f6`8cd7c4e0 00007ffe`39d9e544 : 000000f6`00000000 00000273`9a058718 00000000`00000000 00000000`00000000 : System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(System.ReadOnlySpan`1<Byte>)+0xffffffff`a0c2e4d9 Your point about the inner fixed statement being unnecessary is well-taken. I'll rewrite this proper without any |
Support for inlining some methods with pinned locals was added in #7788. |
Ah ok, I have seen this in the inliner that made me think that it was not fixed yet:
|
@tannergooding Here are the results from the change you suggested. They look really good! I experimented with some variations on it and settled on GrabYourPitchforks@e619b04 (which I still need to test fully, and which requires further tweaking so as not to take a strict dependency on BMI1). Experiment-A is the original PR; Experiment-C are the changes to use unaligned vectors at beginning + end with no overruns. (Edit: Removed old perf numbers.) @jkotas Still working on the pointers-only implementation as you suggested. |
|
@jkotas Here's the run where the logic was rewritten to use I was able to improve the performance slightly over the previous (Edit: Remove outdated perf numbers.) |
Have you looked where the difference is comming from? I am having hard time reasoning about the delta you have shared. It has new abstraction layer in Encoding, tricky JIT helpers, etc. I do not think any of that is necessary. This should simply pin the span at entrypoint, and the rest should be regular unmanged code that works with unmanaged pointers. It is the most maintainable way to do this. Maybe we will lose a cycle here or there compared to what you got, but that's ok. |
My gut tells me it's the The changed abstraction layer is not the culprit because all Experiments (even the ref based ones) share the exact same logic in AsciiEncoding. |
I'll try your suggestion of moving the pin up to a higher method in the call stack ( |
Yes, it is how I think these should be implemented.
The inefficient zero initialization is a known performance issue in the JIT that we have not fixed yet. I agree that it has more impact here than what one would expect. |
Something similar I saw in dotnet/corefx#34529 (comment) (where my gut telled me the same 😉) |
@jkotas Here are the latest numbers with moving the pinning up. This was a good suggestion! 👍 (Edit: Removed old perf numbers.) Experiment-E: public override unsafe int GetCharCount(ReadOnlySpan<byte> bytes)
{
fixed (byte* pBytes = &MemoryMarshal.GetReference(bytes))
{
return GetCharCount(pBytes, bytes.Length, (DecoderWorker)null);
}
} Experiment-F: public override unsafe int GetCharCount(ReadOnlySpan<byte> bytes)
{
ref byte rBytes = ref MemoryMarshal.GetReference(bytes);
fixed (byte* pBytes = &rBytes)
{
return GetCharCount((byte*)Unsafe.AsPointer(ref rBytes), bytes.Length, (DecoderWorker)null);
}
} Neither of these patterns produces the problematic |
|
I'll run the numbers once more across all methods with the latest iteration - |
Updated numbers. I think I know why
|
@stephentoub @jkotas @layomia I'm about to update the PR with the
The real benefit of (2) and (3) above is that it makes things like UTF-8 fallback logic much simpler, since now all of the infrastructure is already in place to support it. All of the fallback knowledge can be moved out of the There are two behavioral changes between the old logic and the new logic. @tarekgh would be a good person to comment on these changes.
There is already substantial unit test coverage across coreclr / corefx. Since this type is also used heavily by other places in the framework, bugs would likely show up as failures in other unrelated tests. There are some other tests coming online at dotnet/corefx#35331 and my GitHub repo to help validate these behaviors. |
Sorry I was confused it with GetMaxByteCount. so you are right. if GetByteCount not returning the expected right result then I think we should fix it too. |
@jkotas What if we put |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static bool AllBytesInUInt32AreAscii(uint value) | ||
{ | ||
return ((value & 0x80808080u) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why all these extra parens?
goto Finish; | ||
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveOptimization)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your expected effect of this attribute here?
src/System.Private.CoreLib/shared/System/Text/DecoderFallback.cs
Outdated
Show resolved
Hide resolved
I'd prefer to have more compact version available |
@jkotas @stephentoub @tarekgh I'm having difficulty refactoring the code in such a manner that it works with the current I was experimenting with keeping the Another option is to keep the Does any of you have thoughts on the best way to proceed here? I'm weighing large churn with perf improvements for all tested cases vs. small churn with perf regressions for some inputs. Will try to get measurements to you when possible to help inform your opinions. |
Can you do this refactoring first for all encodings - mechanically move the hot paths and slow paths as they exist today into separate methods, without any vectorization, etc. This should result into a incremental performance gains because the hot path will be leaner. And then you do the vectorization work on the refactored code. |
Certainly. I guess in the end there are three distinct steps, as listed below. Would splitting the PRs like this make it easier to review?
|
Yes, that sounds reasonable to me. (I think it may be ok for 1. and 2. to be together; but having them as separate steps would be fine too.) |
@jkotas @tarekgh @layomia In a separate branch I mechanically split This resulted in an expansion of the size of ASCIIEncoding.cs, which is expected because I intentionally wanted to point out duplicate / common code before it got refactored into a common location. Search for the text Before I start work on the other types, is this about what you expected to see? |
Left some comments on the commit. |
See also #23098, which attempts to answer the question "What would it look like to split all of these code paths mechanically?" |
2c80bd8
to
7baf971
Compare
I updated the PR to contain only the |
@stephentoub @jkotas One of you earlier had asked what the perf difference was between the hand-implemented hardware intrinsic version and the normal
|
Now requires only SSE2, not SSSE3
Unit tests for the |
This PR introduces internal workhorse methods for operating on buffers containing ASCII data (as bytes or chars). In particular, these methods provide implementations of:
These workhorse methods will be used internally by the UTF-8 transcoder. However, there is opportunity for other components in the Framework to take direct dependencies on them. For example, I have a separate experimental branch that replumbs the existing
ASCIIEncoding
class to ride on top of these workhorse methods rather than its own existing implementation. Performance numbers from that branch are provided below. (Due to time constraints I only ran the numbers for UTF-16 to ASCII, not the other way around.)(Edit: Removed outdated perf numbers.)
/cc @tannergooding and @ahsonkhan who I believe had some interest in taking advantage of these APIs within CoreLib.
For a description of how this is plumbed through
ASCIIEncoding
, see the comment #22516 (comment).Not in this PR: