Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Replace slow implementations in ASCIIUtility with fast implementations #22516

Merged
merged 3 commits into from
Mar 23, 2019

Conversation

GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Feb 11, 2019

This PR introduces internal workhorse methods for operating on buffers containing ASCII data (as bytes or chars). In particular, these methods provide implementations of:

  • Counting the number of ASCII bytes or chars at the beginning of a sequence; and
  • Transcoding ASCII data to and from UTF-16.

These workhorse methods will be used internally by the UTF-8 transcoder. However, there is opportunity for other components in the Framework to take direct dependencies on them. For example, I have a separate experimental branch that replumbs the existing ASCIIEncoding class to ride on top of these workhorse methods rather than its own existing implementation. Performance numbers from that branch are provided below. (Due to time constraints I only ran the numbers for UTF-16 to ASCII, not the other way around.)

(Edit: Removed outdated perf numbers.)

/cc @tannergooding and @ahsonkhan who I believe had some interest in taking advantage of these APIs within CoreLib.

For a description of how this is plumbed through ASCIIEncoding, see the comment #22516 (comment).

Not in this PR:

  • The UTF-8 transcoding logic which rides on top of this as workhorse.

@GrabYourPitchforks
Copy link
Member Author

Unit tests are at https://github.com/GrabYourPitchforks/corefx/blob/b83649c050964a3858c9dec2e48392afef83842e/src/System.Runtime/tests/System/Text/AsciiUtilityTests.netcoreapp.cs if you care to look at them. I don't know whether these unit tests belong in coreclr or corefx given the class itself is internal.

Given the complexity of these methods I believe it prudent to have unit tests which exercises them directly along with all of their edge cases and boundary conditions, in addition to existing unit tests which exercise ASCIIEncoding and other public APIs.

@GrabYourPitchforks
Copy link
Member Author

The experimental ASCIIEncoding replumbing is at https://github.com/GrabYourPitchforks/coreclr/blob/3de1f9d53e202278eb141403658aae21bac56d8c/src/System.Private.CoreLib/shared/System/Text/ASCIIEncoding.cs. In particular, search that file for the term AsciiUtility and you'll see the workhorse methods being used as implementation.

/cc @layomia as FYI.

@jkotas
Copy link
Member

jkotas commented Feb 11, 2019

The experimental ASCIIEncoding replumbing

This replumbing should be included as part of this PR, I think.

@stephentoub
Copy link
Member

This replumbing should be included as part of this PR, I think.

Agreed. Otherwise this PR is adding dead, untested code that duplicates existing functionality.

@tannergooding
Copy link
Member

who I believe had some interest in taking advantage of these APIs within CoreLib.

@GrabYourPitchforks, I think one example would be:

for (int i = 0; i < utf16Text.Length; i++)
{
Debug.Assert(utf16Text[i] < 128, "A culture-invariant ToString() of a floating point expected to produce ASCII characters only.");
destination[i] = (byte)utf16Text[i];
}

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 11, 2019

@jkotas Here are the numbers for replacing ref byte with byte* everywhere.

Calls to Utf8Encoder.GetCharCount for inputs of various length. In the below numbers, Experiment-A is the PR as originally proposed (ref byte and friends), and Experiment-B is using pointers (byte* everywhere).

(Edit: Remove outdated perf numbers.)

@jkotas
Copy link
Member

jkotas commented Feb 11, 2019

fixed adds an overhead of a few instructions and a stack local. I can believe that this overhead would show up for very small lengths. I have hard time believing that this overhead would cause this much difference for DataLength of 128.

@jkotas
Copy link
Member

jkotas commented Feb 11, 2019

Could you please share links to the code for these experiments?

@GrabYourPitchforks
Copy link
Member Author

@jkotas Sure. Here are the relevant links:

https://github.com/GrabYourPitchforks/coreclr/blob/ascii_perf/src/System.Private.CoreLib/shared/System/Text/ASCIIEncoding.cs#L541 shows the implementation of AsciiEncoding.GetCharCount, using AsciiUtility.GetIndexOfFirstNonAsciiByte as a workhorse implementation.

GrabYourPitchforks@f579048 is the change that was made between Experiment A and Experiment B. It was the simplest possible change I could think of to measure the cost of pinning the reference without touching any other logic.

@jkotas
Copy link
Member

jkotas commented Feb 12, 2019

Methods with pinned locals do not inline today, even with AggressiveInlining. So you have measure a cost of an extra frame on the stack; not the cost of pinning.

Also, if you pin around the whole implementation - you will not need to pin inside it anymore, and so you will same a bit there too. This saving is not reflected in your measurement.

@GrabYourPitchforks
Copy link
Member Author

Methods with pinned locals do not inline today, even with AggressiveInlining. So you have measure a cost of an extra frame on the stack; not the cost of pinning.

I confirmed through breaking into a debugger that it's inlined into the caller. Perhaps it's a recent change? It looks like much of the cost is stack spilling additional data into the ASCIIEncoding.GetChars(...) method. Here's the preamble before and after:

Before:

00007ffe`3a45e570 57              push    rdi
00007ffe`3a45e571 56              push    rsi
00007ffe`3a45e572 55              push    rbp
00007ffe`3a45e573 53              push    rbx
00007ffe`3a45e574 4883ec38        sub     rsp,38h
00007ffe`3a45e578 33c0            xor     eax,eax
00007ffe`3a45e57a 4889442428      mov     qword ptr [rsp+28h],rax
00007ffe`3a45e57f 488bd9          mov     rbx,rcx
00007ffe`3a45e582 488bea          mov     rbp,rdx
00007ffe`3a45e585 418bf0          mov     esi,r8d
00007ffe`3a45e588 498bf9          mov     rdi,r9
00007ffe`3a45e58b 85f6            test    esi,esi  ; test for charCount == 0

After:

00007ffe`39d9e3a0 57              push    rdi
00007ffe`39d9e3a1 56              push    rsi
00007ffe`39d9e3a2 55              push    rbp
00007ffe`39d9e3a3 53              push    rbx
00007ffe`39d9e3a4 4883ec38        sub     rsp,38h
00007ffe`39d9e3a8 488bf1          mov     rsi,rcx
00007ffe`39d9e3ab 488d7c2420      lea     rdi,[rsp+20h]
00007ffe`39d9e3b0 b906000000      mov     ecx,6
00007ffe`39d9e3b5 33c0            xor     eax,eax
00007ffe`39d9e3b7 f3ab            rep stos dword ptr [rdi]  ; stores 24 null (0x00) bytes starting at [rdi], perhaps to support fixed?
00007ffe`39d9e3b9 488bce          mov     rcx,rsi
00007ffe`39d9e3bc 488be9          mov     rbp,rcx
00007ffe`39d9e3bf 488bfa          mov     rdi,rdx
00007ffe`39d9e3c2 418bf0          mov     esi,r8d
00007ffe`39d9e3c5 498bd9          mov     rbx,r9
00007ffe`39d9e3c8 85f6            test    esi,esi  ; test for charCount == 0

windbg also shows a call instruction directly from GetCharCount(...) to GetIndexOfFirstNonAsciiByte2, which means inlining appears to be working. The stack trace also shows this.

0:000> t
System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(Byte ByRef, Int32, System.Text.DecoderWorker)+0xffffffff`a0c2e555:
00007ffe`39d9e415 e8eec7ffff      call    System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte2(Byte ByRef, UInt64) (00007ffe`39d9ac08)
0:000> kv
 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 000000f6`8cd7c448 00007ffe`39d9e489 : 00000000`00000000 00007ffe`99639b3a 0000f8df`00000002 00007ffe`39ce7348 : System_Private_CoreLib!System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte_Sse2(Byte ByRef, UInt64)
01 000000f6`8cd7c450 00007ffe`39d9e41a : 00000000`00000000 00000000`00000000 000000f6`8cd7c4e8 00007ffe`ebb3e3fd : System_Private_CoreLib!System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte2(Byte ByRef, UInt64)+0x2809
02 000000f6`8cd7c480 00007ffe`39d9e379 : 00002cc3`94d667db 00000000`80000000 000000f6`8cd7c700 00007ffe`995bcb56 : System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(Byte ByRef, Int32, System.Text.DecoderWorker)+0xffffffff`a0c2e55a
03 000000f6`8cd7c4e0 00007ffe`39d9e544 : 000000f6`00000000 00000273`9a058718 00000000`00000000 00000000`00000000 : System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(System.ReadOnlySpan`1<Byte>)+0xffffffff`a0c2e4d9

Your point about the inner fixed statement being unnecessary is well-taken. I'll rewrite this proper without any ref at all - no shortcuts :) - and see if the aberration disappears.

@AndyAyersMS
Copy link
Member

Support for inlining some methods with pinned locals was added in #7788.

@jkotas
Copy link
Member

jkotas commented Feb 12, 2019

Ah ok, I have seen this in the inliner that made me think that it was not fixed yet:

            inlineResult->Note(InlineObservation::CALLEE_HAS_PINNED_LOCALS);
            if (inlineResult->IsFailure())

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 12, 2019

@tannergooding Here are the results from the change you suggested. They look really good! I experimented with some variations on it and settled on GrabYourPitchforks@e619b04 (which I still need to test fully, and which requires further tweaking so as not to take a strict dependency on BMI1).

Experiment-A is the original PR; Experiment-C are the changes to use unaligned vectors at beginning + end with no overruns.

(Edit: Removed old perf numbers.)

@jkotas Still working on the pointers-only implementation as you suggested.

@AndyAyersMS
Copy link
Member

NoteFatal is used if the observation always causes the inlines to fail -- Note lets the inline policy decide.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 12, 2019

@jkotas Here's the run where the logic was rewritten to use byte* rather than ref byte everywhere. The code diff is at GrabYourPitchforks@cec62a0.

I was able to improve the performance slightly over the previous byte*-based iteration by tweaking a few code paths and controlling jumps more tightly. The throughput is still not quite as good as the ref byte-based code path, though.

(Edit: Remove outdated perf numbers.)

@jkotas
Copy link
Member

jkotas commented Feb 12, 2019

Have you looked where the difference is comming from?

I am having hard time reasoning about the delta you have shared. It has new abstraction layer in Encoding, tricky JIT helpers, etc. I do not think any of that is necessary. This should simply pin the span at entrypoint, and the rest should be regular unmanged code that works with unmanaged pointers. It is the most maintainable way to do this. Maybe we will lose a cycle here or there compared to what you got, but that's ok.

@GrabYourPitchforks
Copy link
Member Author

My gut tells me it's the rep stos introduced in GetCharCount (see
#22516 (comment)), especially since the remainder of the code tries to minimize the total number of reads, but I don't know if I trust my gut. Beyond that I don't know where to look.

The changed abstraction layer is not the culprit because all Experiments (even the ref based ones) share the exact same logic in AsciiEncoding.

@GrabYourPitchforks
Copy link
Member Author

I'll try your suggestion of moving the pin up to a higher method in the call stack (GetCharCount(ROS<byte>)) to see if this regains some of the perf. Agree with your assessment - not too worried about individual cycles, but these results were unexpected.

@jkotas
Copy link
Member

jkotas commented Feb 12, 2019

I'll try your suggestion of moving the pin up to a higher method in the call stack (GetCharCount(ROS<byte>)

Yes, it is how I think these should be implemented.

these results were unexpected

The inefficient zero initialization is a known performance issue in the JIT that we have not fixed yet. I agree that it has more impact here than what one would expect.

@gfoidl
Copy link
Member

gfoidl commented Feb 12, 2019

My gut tells me it's the rep stos introduced in GetCharCount

Something similar I saw in dotnet/corefx#34529 (comment) (where my gut telled me the same 😉)

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 12, 2019

@jkotas Here are the latest numbers with moving the pinning up. This was a good suggestion! 👍

(Edit: Removed old perf numbers.)

Experiment-E:

public override unsafe int GetCharCount(ReadOnlySpan<byte> bytes)
{
    fixed (byte* pBytes = &MemoryMarshal.GetReference(bytes))
    {
        return GetCharCount(pBytes, bytes.Length, (DecoderWorker)null);
    }
}

Experiment-F:

public override unsafe int GetCharCount(ReadOnlySpan<byte> bytes)
{
    ref byte rBytes = ref MemoryMarshal.GetReference(bytes);
    fixed (byte* pBytes = &rBytes)
    {
        return GetCharCount((byte*)Unsafe.AsPointer(ref rBytes), bytes.Length, (DecoderWorker)null);
    }
}

Neither of these patterns produces the problematic rep stos code gen. Experiment F was just a test to see if we could coerce the runtime into producing even tighter codegen than normal for a typical fixed statement. It was an interesting experiment but I don't think the difference was significant enough to use that particular pattern.

@benaadams
Copy link
Member

rep stos is https://github.com/dotnet/coreclr/issues/13827 as I understand it has a fixed startup cost which gets amortized for longer lengths so ends up faster; whereas for very short lengths its more costly than reg or vector clears.

@GrabYourPitchforks
Copy link
Member Author

I'll run the numbers once more across all methods with the latest iteration - GetByteCount / GetBytes / GetCharCount / GetChars. Then I'll work on getting the ASCIIEncoding changes added to this PR.

@GrabYourPitchforks
Copy link
Member Author

Updated numbers. I think I know why GetBytes / GetChars regressed compared to the original PR, so will fix these up before submitting the ASCIIEncoding changes for review.

Method Toolchain DataLength Mean Error StdDev Median Ratio RatioSD
GetByteCount Experiment-A 0 4.393 us 0.0946 us 0.2729 us 4.459 us 0.55 0.04
GetByteCount Experiment-G 0 5.241 us 0.1049 us 0.2941 us 5.284 us 0.72 0.04
GetByteCount baseline (27411-2) 0 7.394 us 0.1410 us 0.1448 us 7.387 us 1.00 0.00
GetCharCount Experiment-A 0 4.482 us 0.0893 us 0.1923 us 4.480 us 0.39 0.03
GetCharCount Experiment-G 0 5.043 us 0.0991 us 0.1289 us 5.047 us 0.44 0.03
GetCharCount baseline (27411-2) 0 11.434 us 0.2277 us 0.6272 us 11.523 us 1.00 0.00
GetBytes Experiment-A 0 9.108 us 0.1799 us 0.4311 us 9.134 us 0.87 0.07
GetBytes Experiment-G 0 17.727 us 0.3523 us 0.8373 us 17.785 us 1.69 0.12
GetBytes baseline (27411-2) 0 10.296 us 0.2470 us 0.7283 us 10.476 us 1.00 0.00
GetChars Experiment-A 0 8.902 us 0.1761 us 0.4481 us 8.890 us 0.86 0.05
GetChars Experiment-G 0 18.668 us 0.3700 us 0.8794 us 18.704 us 1.78 0.11
GetChars baseline (27411-2) 0 10.465 us 0.2048 us 0.3587 us 10.498 us 1.00 0.00
GetByteCount Experiment-A 1 10.491 us 0.2090 us 0.5685 us 10.557 us 1.12 0.11
GetByteCount Experiment-G 1 9.603 us 0.1915 us 0.2049 us 9.541 us 1.05 0.07
GetByteCount baseline (27411-2) 1 9.417 us 0.2460 us 0.7254 us 9.359 us 1.00 0.00
GetCharCount Experiment-A 1 10.515 us 0.2103 us 0.5502 us 10.669 us 0.84 0.06
GetCharCount Experiment-G 1 9.071 us 0.2381 us 0.7020 us 8.936 us 0.73 0.06
GetCharCount baseline (27411-2) 1 12.380 us 0.2467 us 0.5668 us 12.451 us 1.00 0.00
GetBytes Experiment-A 1 9.525 us 0.1904 us 0.4846 us 9.566 us 0.85 0.07
GetBytes Experiment-G 1 17.101 us 0.3408 us 0.9887 us 17.274 us 1.54 0.14
GetBytes baseline (27411-2) 1 11.167 us 0.2462 us 0.7259 us 11.255 us 1.00 0.00
GetChars Experiment-A 1 8.817 us 0.1757 us 0.3966 us 8.793 us 0.81 0.06
GetChars Experiment-G 1 18.170 us 0.3619 us 0.7791 us 18.216 us 1.67 0.12
GetChars baseline (27411-2) 1 10.931 us 0.2170 us 0.5866 us 10.969 us 1.00 0.00
GetByteCount Experiment-A 11 11.658 us 0.2325 us 0.4802 us 11.806 us 0.46 0.03
GetByteCount Experiment-G 11 10.507 us 0.2093 us 0.4891 us 10.624 us 0.42 0.03
GetByteCount baseline (27411-2) 11 25.084 us 0.4960 us 1.2803 us 25.073 us 1.00 0.00
GetCharCount Experiment-A 11 11.260 us 0.2220 us 0.3323 us 11.187 us 0.55 0.04
GetCharCount Experiment-G 11 9.868 us 0.2179 us 0.6425 us 9.910 us 0.48 0.04
GetCharCount baseline (27411-2) 11 20.541 us 0.4113 us 1.0834 us 20.516 us 1.00 0.00
GetBytes Experiment-A 11 11.758 us 0.2345 us 0.4108 us 11.733 us 0.52 0.03
GetBytes Experiment-G 11 20.895 us 0.2995 us 0.2802 us 20.780 us 0.95 0.06
GetBytes baseline (27411-2) 11 23.117 us 0.4607 us 1.2136 us 23.102 us 1.00 0.00
GetChars Experiment-A 11 10.821 us 0.2164 us 0.5851 us 10.726 us 0.51 0.04
GetChars Experiment-G 11 19.625 us 0.3888 us 1.0037 us 19.708 us 0.93 0.07
GetChars baseline (27411-2) 11 21.236 us 0.4242 us 1.1542 us 21.386 us 1.00 0.00
GetByteCount Experiment-A 64 13.655 us 0.2727 us 0.5509 us 13.778 us 0.18 0.01
GetByteCount Experiment-G 64 12.749 us 0.2544 us 0.7217 us 12.679 us 0.17 0.01
GetByteCount baseline (27411-2) 64 75.522 us 0.0939 us 0.0733 us 75.510 us 1.00 0.00
GetCharCount Experiment-A 64 11.465 us 0.2293 us 0.6001 us 11.316 us 0.19 0.01
GetCharCount Experiment-G 64 9.843 us 0.1955 us 0.5047 us 9.815 us 0.17 0.01
GetCharCount baseline (27411-2) 64 58.303 us 1.1604 us 1.6268 us 58.174 us 1.00 0.00
GetBytes Experiment-A 64 18.319 us 0.3658 us 0.9765 us 18.254 us 0.15 0.01
GetBytes Experiment-G 64 24.700 us 0.4729 us 0.4424 us 24.494 us 0.20 0.01
GetBytes baseline (27411-2) 64 122.697 us 4.5594 us 3.8073 us 121.023 us 1.00 0.00
GetChars Experiment-A 64 16.518 us 0.3304 us 0.9742 us 16.372 us 0.23 0.02
GetChars Experiment-G 64 25.403 us 0.5053 us 1.4085 us 25.188 us 0.36 0.03
GetChars baseline (27411-2) 64 71.245 us 1.4688 us 4.3077 us 71.475 us 1.00 0.00
GetByteCount Experiment-A 128 16.048 us 0.3172 us 0.6186 us 15.765 us 0.12 0.01
GetByteCount Experiment-G 128 16.856 us 0.3399 us 1.0021 us 17.098 us 0.12 0.01
GetByteCount baseline (27411-2) 128 135.799 us 2.7126 us 7.1934 us 130.693 us 1.00 0.00
GetCharCount Experiment-A 128 13.425 us 0.2482 us 0.4900 us 13.496 us 0.14 0.01
GetCharCount Experiment-G 128 11.747 us 0.2309 us 0.3921 us 11.792 us 0.12 0.01
GetCharCount baseline (27411-2) 128 98.865 us 1.9714 us 5.0177 us 98.749 us 1.00 0.00
GetBytes Experiment-A 128 23.947 us 0.4766 us 1.1327 us 23.743 us 0.10 0.01
GetBytes Experiment-G 128 33.240 us 0.7156 us 2.1098 us 32.897 us 0.14 0.01
GetBytes baseline (27411-2) 128 243.074 us 4.9451 us 14.5807 us 240.323 us 1.00 0.00
GetChars Experiment-A 128 20.094 us 0.4014 us 1.0643 us 19.917 us 0.18 0.01
GetChars Experiment-G 128 30.721 us 0.7109 us 2.0849 us 31.001 us 0.28 0.02
GetChars baseline (27411-2) 128 111.117 us 2.3269 us 4.5384 us 109.006 us 1.00 0.00

@GrabYourPitchforks
Copy link
Member Author

@stephentoub @jkotas @layomia I'm about to update the PR with the ASCIIEncoding changes. Here's a brief walkthrough of how things changed:

  1. The GetByteCount / GetCharCount / GetBytes / GetChars methods are now split into two parts. The first part is the happy path where the result can be computed in constant time (for instance, if a replacement fallback is in use and has specific properties we can reason about) or where the input is verified to be all-ASCII so no fallback work needs to take place at all. If it is determined that fallback work must take place, control is given to a more complex method that handles fallback logic.

  2. The EncoderFallbackBuffer / DecoderFallbackBuffer types have new internal helper routines that perform the fallback logic in terms of Rune. This simplifies the call sites considerably, since all of the fallback logic is self-contained within a single method. The caller (in this case, the ASCIIEncoding type) doesn't need to know anything about the internals of how fallback works in the general case. There is no need to call InternalInitialize and no need to invoke the fallback buffer's logic within a loop.

  3. To support (2), new internal virtual methods have been added to the Encoding base class. These are used internally by the fallback mechanism to help convert fallback data into bytes. Currently only ASCIIEncoding needs to override these methods because only ASCIIEncoding invokes the new helper routines in the fallback buffers.

The real benefit of (2) and (3) above is that it makes things like UTF-8 fallback logic much simpler, since now all of the infrastructure is already in place to support it. All of the fallback knowledge can be moved out of the UTF8Encoding class entirely when the time comes to plumb new changes through that type.

There are two behavioral changes between the old logic and the new logic. @tarekgh would be a good person to comment on these changes.

  1. ASCIIEncoding.GetByteCount(..., EncoderNLS) did not correctly check EncoderReplacementFallback.DefaultString for ASCII-only data before short-circuiting the main method logic. This meant that calls to GetByteCount could succeed (and return a wrong answer) even if the corresponding call to GetBytes would throw an exception. This has been corrected in the new logic: if GetByteCount succeeds (or fails), then GetBytes will also succeed (or fail), modulo OOMs and other asynchronous exceptions.

  2. The fallback buffer logic is no longer recursive. Previously, if we were transcoding chars to bytes, it was possible that the fallback buffer could say "please replace this non-ASCII char with this other non-ASCII char before shrinking to ASCII." ASCIIEncoding would then go into a loop of trying to replace the invalid data over and over, eventually aborting the operation after some number of recursive tries. Now, the fallback buffer is no longer queried recursively. If the fallback buffer tries to provide fallback data that still isn't representable in the target encoding, the new helper methods on EncoderFallbackBuffer will abort the operation without recursing.

There is already substantial unit test coverage across coreclr / corefx. Since this type is also used heavily by other places in the framework, bugs would likely show up as failures in other unrelated tests. There are some other tests coming online at dotnet/corefx#35331 and my GitHub repo to help validate these behaviors.

@GrabYourPitchforks GrabYourPitchforks changed the title Add AsciiUtility, with workhorse methods for ASCII transcoding Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding Feb 14, 2019
@tarekgh
Copy link
Member

tarekgh commented Feb 15, 2019

GetByteCount is meant to return the exact count; not estimate

Sorry I was confused it with GetMaxByteCount. so you are right. if GetByteCount not returning the expected right result then I think we should fix it too.

@GrabYourPitchforks
Copy link
Member Author

@jkotas What if we put #if FAST_TRANSCODE around the intrinsic and vectorized code paths? That would give us the option to undefine the constant if we want a more compact (but slower) implementation suitable for mobile.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool AllBytesInUInt32AreAscii(uint value)
{
return ((value & 0x80808080u) == 0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all these extra parens?

goto Finish;
}

[MethodImpl(MethodImplOptions.AggressiveOptimization)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your expected effect of this attribute here?

@marek-safar
Copy link

I'd prefer to have more compact version available

@GrabYourPitchforks
Copy link
Member Author

@jkotas @stephentoub @tarekgh I'm having difficulty refactoring the code in such a manner that it works with the current EncoderNLS and DecoderNLS implementations while preserving all of the same fallback behaviors. The issue is with what Jan identified earlier - the existing EncoderNLS and DecoderNLS types retain extra incomplete data from the previous calls, and it's up to the implementation (in this case, ASCIIEncoding) to flush this data upon entry to the method.

I was experimenting with keeping the ASCIIEncoding implementation as slim as possible by pushing this concern up to the EncoderNLS / DecoderNLS class (since it's pretty much the same logic across all *Encoding types). This keeps ASCIIEncoding very lean and very fast for the common case where no EncoderNLS / DecoderNLS has been provided, but it resulted in the creation of new internal types EncoderNLS2 / DecoderNLS2 to encapsulate the fallback logic which is common to all encodings. This will eventually make all of the other encoding types (UTF8Encoding / UTF16Encoding / UTF32Encoding) simpler if they're replumbed to use the newly created internal types. But it has the effect of making this particular PR significantly larger in scope than what was intended.

Another option is to keep the ASCIIEncoding logic mostly unchanged, instead replacing the hot loops within its GetBytes / GetChars / etc. routines with calls to the new helpers. This would have the effect of regressing performance from baseline for small payloads, since much of the earlier reported gain for small payloads came from minimizing the amount of code in these methods. The performance improvement for large payloads is roughly about the same as reported earlier since the overhead is insignificant at that point.

Does any of you have thoughts on the best way to proceed here? I'm weighing large churn with perf improvements for all tested cases vs. small churn with perf regressions for some inputs. Will try to get measurements to you when possible to help inform your opinions.

@jkotas
Copy link
Member

jkotas commented Feb 22, 2019

it has the effect of making this particular PR significantly larger in scope than what was intended.

Can you do this refactoring first for all encodings - mechanically move the hot paths and slow paths as they exist today into separate methods, without any vectorization, etc. This should result into a incremental performance gains because the hot path will be leaner.

And then you do the vectorization work on the refactored code.

@GrabYourPitchforks
Copy link
Member Author

Can you do this refactoring first for all encodings - mechanically move the hot paths and slow paths as they exist today into separate methods, without any vectorization, etc. This should result into a incremental performance gains because the hot path will be leaner.

And then you do the vectorization work on the refactored code.

Certainly. I guess in the end there are three distinct steps, as listed below. Would splitting the PRs like this make it easier to review?

  1. Split the logic within each individual Encoding into fast path / slow path. We'll find that there's a lot of duplicated logic (mainly coordinating error handling and leftover char draining) between all the different Encoding types.
  2. Move all of the duplicated logic to a common location so that it can be shared across all the Encoding instances.
  3. Replace the individual logic within each Encoding with an optimized version, as was the original plan with this PR.

@jkotas
Copy link
Member

jkotas commented Feb 22, 2019

Yes, that sounds reasonable to me. (I think it may be ok for 1. and 2. to be together; but having them as separate steps would be fine too.)

@GrabYourPitchforks
Copy link
Member Author

@jkotas @tarekgh @layomia In a separate branch I mechanically split ASCIIEncoding into fast-path / slow-path without adding the vectorized ASCIIUtility code. This is essentially the Step (1) we discussed above. See the implementation for details.

This resulted in an expansion of the size of ASCIIEncoding.cs, which is expected because I intentionally wanted to point out duplicate / common code before it got refactored into a common location. Search for the text !! REVIEW !! in the link above to see in particular how this would eventually look once it's refactored.

Before I start work on the other types, is this about what you expected to see?

@jkotas
Copy link
Member

jkotas commented Feb 26, 2019

Left some comments on the commit.

@GrabYourPitchforks GrabYourPitchforks changed the title Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding [WIP] Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding Mar 6, 2019
@GrabYourPitchforks
Copy link
Member Author

See also #23098, which attempts to answer the question "What would it look like to split all of these code paths mechanically?"

@GrabYourPitchforks GrabYourPitchforks changed the title [WIP] Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding Replace slow implementations in ASCIIUtility with fast implementations Mar 12, 2019
@GrabYourPitchforks
Copy link
Member Author

I updated the PR to contain only the ASCIIUtility vectorized changes now that the Encoding core refactoring is complete.

@GrabYourPitchforks
Copy link
Member Author

@stephentoub @jkotas One of you earlier had asked what the perf difference was between the hand-implemented hardware intrinsic version and the normal Vector<T> code path. Answer is around 50% perf difference.

Method Toolchain StringLength Mean Error StdDev Ratio RatioSD
GetBytes with_intrin 256 2.936 us 0.0519 us 0.0486 us 1.00 0.00
GetBytes without_intrin 256 4.523 us 0.0551 us 0.0515 us 1.54 0.03

@GrabYourPitchforks
Copy link
Member Author

Unit tests for the ASCIIUtility changes over at dotnet/corefx#36227.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.