Replace slow implementations in ASCIIUtility with fast implementations #22516

GrabYourPitchforks · 2019-02-11T09:08:31Z

This PR introduces internal workhorse methods for operating on buffers containing ASCII data (as bytes or chars). In particular, these methods provide implementations of:

Counting the number of ASCII bytes or chars at the beginning of a sequence; and
Transcoding ASCII data to and from UTF-16.

These workhorse methods will be used internally by the UTF-8 transcoder. However, there is opportunity for other components in the Framework to take direct dependencies on them. For example, I have a separate experimental branch that replumbs the existing ASCIIEncoding class to ride on top of these workhorse methods rather than its own existing implementation. Performance numbers from that branch are provided below. (Due to time constraints I only ran the numbers for UTF-16 to ASCII, not the other way around.)

(Edit: Removed outdated perf numbers.)

/cc @tannergooding and @ahsonkhan who I believe had some interest in taking advantage of these APIs within CoreLib.

For a description of how this is plumbed through ASCIIEncoding, see the comment #22516 (comment).

Not in this PR:

The UTF-8 transcoding logic which rides on top of this as workhorse.

GrabYourPitchforks · 2019-02-11T09:10:29Z

Unit tests are at https://github.com/GrabYourPitchforks/corefx/blob/b83649c050964a3858c9dec2e48392afef83842e/src/System.Runtime/tests/System/Text/AsciiUtilityTests.netcoreapp.cs if you care to look at them. I don't know whether these unit tests belong in coreclr or corefx given the class itself is internal.

Given the complexity of these methods I believe it prudent to have unit tests which exercises them directly along with all of their edge cases and boundary conditions, in addition to existing unit tests which exercise ASCIIEncoding and other public APIs.

GrabYourPitchforks · 2019-02-11T09:13:01Z

The experimental ASCIIEncoding replumbing is at https://github.com/GrabYourPitchforks/coreclr/blob/3de1f9d53e202278eb141403658aae21bac56d8c/src/System.Private.CoreLib/shared/System/Text/ASCIIEncoding.cs. In particular, search that file for the term AsciiUtility and you'll see the workhorse methods being used as implementation.

/cc @layomia as FYI.

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs

jkotas · 2019-02-11T09:33:12Z

The experimental ASCIIEncoding replumbing

This replumbing should be included as part of this PR, I think.

stephentoub · 2019-02-11T14:55:21Z

This replumbing should be included as part of this PR, I think.

Agreed. Otherwise this PR is adding dead, untested code that duplicates existing functionality.

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs

tannergooding · 2019-02-11T17:01:51Z

who I believe had some interest in taking advantage of these APIs within CoreLib.

@GrabYourPitchforks, I think one example would be:

coreclr/src/System.Private.CoreLib/shared/System/Buffers/Text/Utf8Formatter/Utf8Formatter.Float.cs

Lines 110 to 114 in 70cf6d4

    
           for (int i = 0; i < utf16Text.Length; i++) 
        
           { 
        
               Debug.Assert(utf16Text[i] < 128, "A culture-invariant ToString() of a floating point expected to produce ASCII characters only."); 
        
               destination[i] = (byte)utf16Text[i]; 
        
           }

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs

GrabYourPitchforks · 2019-02-11T23:36:19Z

@jkotas Here are the numbers for replacing ref byte with byte* everywhere.

Calls to Utf8Encoder.GetCharCount for inputs of various length. In the below numbers, Experiment-A is the PR as originally proposed (ref byte and friends), and Experiment-B is using pointers (byte* everywhere).

(Edit: Remove outdated perf numbers.)

jkotas · 2019-02-11T23:47:18Z

fixed adds an overhead of a few instructions and a stack local. I can believe that this overhead would show up for very small lengths. I have hard time believing that this overhead would cause this much difference for DataLength of 128.

jkotas · 2019-02-11T23:47:35Z

Could you please share links to the code for these experiments?

GrabYourPitchforks · 2019-02-12T00:10:47Z

@jkotas Sure. Here are the relevant links:

https://github.com/GrabYourPitchforks/coreclr/blob/ascii_perf/src/System.Private.CoreLib/shared/System/Text/ASCIIEncoding.cs#L541 shows the implementation of AsciiEncoding.GetCharCount, using AsciiUtility.GetIndexOfFirstNonAsciiByte as a workhorse implementation.

GrabYourPitchforks@f579048 is the change that was made between Experiment A and Experiment B. It was the simplest possible change I could think of to measure the cost of pinning the reference without touching any other logic.

jkotas · 2019-02-12T00:53:16Z

Methods with pinned locals do not inline today, even with AggressiveInlining. So you have measure a cost of an extra frame on the stack; not the cost of pinning.

Also, if you pin around the whole implementation - you will not need to pin inside it anymore, and so you will same a bit there too. This saving is not reflected in your measurement.

GrabYourPitchforks · 2019-02-12T01:27:51Z

Methods with pinned locals do not inline today, even with AggressiveInlining. So you have measure a cost of an extra frame on the stack; not the cost of pinning.

I confirmed through breaking into a debugger that it's inlined into the caller. Perhaps it's a recent change? It looks like much of the cost is stack spilling additional data into the ASCIIEncoding.GetChars(...) method. Here's the preamble before and after:

Before:

00007ffe`3a45e570 57              push    rdi
00007ffe`3a45e571 56              push    rsi
00007ffe`3a45e572 55              push    rbp
00007ffe`3a45e573 53              push    rbx
00007ffe`3a45e574 4883ec38        sub     rsp,38h
00007ffe`3a45e578 33c0            xor     eax,eax
00007ffe`3a45e57a 4889442428      mov     qword ptr [rsp+28h],rax
00007ffe`3a45e57f 488bd9          mov     rbx,rcx
00007ffe`3a45e582 488bea          mov     rbp,rdx
00007ffe`3a45e585 418bf0          mov     esi,r8d
00007ffe`3a45e588 498bf9          mov     rdi,r9
00007ffe`3a45e58b 85f6            test    esi,esi  ; test for charCount == 0

After:

00007ffe`39d9e3a0 57              push    rdi
00007ffe`39d9e3a1 56              push    rsi
00007ffe`39d9e3a2 55              push    rbp
00007ffe`39d9e3a3 53              push    rbx
00007ffe`39d9e3a4 4883ec38        sub     rsp,38h
00007ffe`39d9e3a8 488bf1          mov     rsi,rcx
00007ffe`39d9e3ab 488d7c2420      lea     rdi,[rsp+20h]
00007ffe`39d9e3b0 b906000000      mov     ecx,6
00007ffe`39d9e3b5 33c0            xor     eax,eax
00007ffe`39d9e3b7 f3ab            rep stos dword ptr [rdi]  ; stores 24 null (0x00) bytes starting at [rdi], perhaps to support fixed?
00007ffe`39d9e3b9 488bce          mov     rcx,rsi
00007ffe`39d9e3bc 488be9          mov     rbp,rcx
00007ffe`39d9e3bf 488bfa          mov     rdi,rdx
00007ffe`39d9e3c2 418bf0          mov     esi,r8d
00007ffe`39d9e3c5 498bd9          mov     rbx,r9
00007ffe`39d9e3c8 85f6            test    esi,esi  ; test for charCount == 0

windbg also shows a call instruction directly from GetCharCount(...) to GetIndexOfFirstNonAsciiByte2, which means inlining appears to be working. The stack trace also shows this.

0:000> t
System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(Byte ByRef, Int32, System.Text.DecoderWorker)+0xffffffff`a0c2e555:
00007ffe`39d9e415 e8eec7ffff      call    System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte2(Byte ByRef, UInt64) (00007ffe`39d9ac08)

0:000> kv
 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 000000f6`8cd7c448 00007ffe`39d9e489 : 00000000`00000000 00007ffe`99639b3a 0000f8df`00000002 00007ffe`39ce7348 : System_Private_CoreLib!System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte_Sse2(Byte ByRef, UInt64)
01 000000f6`8cd7c450 00007ffe`39d9e41a : 00000000`00000000 00000000`00000000 000000f6`8cd7c4e8 00007ffe`ebb3e3fd : System_Private_CoreLib!System.Text.AsciiUtility.GetIndexOfFirstNonAsciiByte2(Byte ByRef, UInt64)+0x2809
02 000000f6`8cd7c480 00007ffe`39d9e379 : 00002cc3`94d667db 00000000`80000000 000000f6`8cd7c700 00007ffe`995bcb56 : System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(Byte ByRef, Int32, System.Text.DecoderWorker)+0xffffffff`a0c2e55a
03 000000f6`8cd7c4e0 00007ffe`39d9e544 : 000000f6`00000000 00000273`9a058718 00000000`00000000 00000000`00000000 : System_Private_CoreLib!System.Text.ASCIIEncoding.GetCharCount(System.ReadOnlySpan`1<Byte>)+0xffffffff`a0c2e4d9

Your point about the inner fixed statement being unnecessary is well-taken. I'll rewrite this proper without any ref at all - no shortcuts :) - and see if the aberration disappears.

AndyAyersMS · 2019-02-12T02:05:37Z

Support for inlining some methods with pinned locals was added in #7788.

jkotas · 2019-02-12T02:25:15Z

Ah ok, I have seen this in the inliner that made me think that it was not fixed yet:

            inlineResult->Note(InlineObservation::CALLEE_HAS_PINNED_LOCALS);
            if (inlineResult->IsFailure())

GrabYourPitchforks · 2019-02-12T02:33:51Z

@tannergooding Here are the results from the change you suggested. They look really good! I experimented with some variations on it and settled on GrabYourPitchforks@e619b04 (which I still need to test fully, and which requires further tweaking so as not to take a strict dependency on BMI1).

Experiment-A is the original PR; Experiment-C are the changes to use unaligned vectors at beginning + end with no overruns.

(Edit: Removed old perf numbers.)

@jkotas Still working on the pointers-only implementation as you suggested.

AndyAyersMS · 2019-02-12T03:04:37Z

NoteFatal is used if the observation always causes the inlines to fail -- Note lets the inline policy decide.

GrabYourPitchforks · 2019-02-12T07:17:29Z

@jkotas Here's the run where the logic was rewritten to use byte* rather than ref byte everywhere. The code diff is at GrabYourPitchforks@cec62a0.

I was able to improve the performance slightly over the previous byte*-based iteration by tweaking a few code paths and controlling jumps more tightly. The throughput is still not quite as good as the ref byte-based code path, though.

(Edit: Remove outdated perf numbers.)

jkotas · 2019-02-12T12:42:36Z

Have you looked where the difference is comming from?

I am having hard time reasoning about the delta you have shared. It has new abstraction layer in Encoding, tricky JIT helpers, etc. I do not think any of that is necessary. This should simply pin the span at entrypoint, and the rest should be regular unmanged code that works with unmanaged pointers. It is the most maintainable way to do this. Maybe we will lose a cycle here or there compared to what you got, but that's ok.

GrabYourPitchforks · 2019-02-12T17:14:11Z

My gut tells me it's the rep stos introduced in GetCharCount (see
#22516 (comment)), especially since the remainder of the code tries to minimize the total number of reads, but I don't know if I trust my gut. Beyond that I don't know where to look.

The changed abstraction layer is not the culprit because all Experiments (even the ref based ones) share the exact same logic in AsciiEncoding.

GrabYourPitchforks · 2019-02-12T17:33:55Z

I'll try your suggestion of moving the pin up to a higher method in the call stack (GetCharCount(ROS<byte>)) to see if this regains some of the perf. Agree with your assessment - not too worried about individual cycles, but these results were unexpected.

jkotas · 2019-02-12T17:36:35Z

I'll try your suggestion of moving the pin up to a higher method in the call stack (GetCharCount(ROS<byte>)

Yes, it is how I think these should be implemented.

these results were unexpected

The inefficient zero initialization is a known performance issue in the JIT that we have not fixed yet. I agree that it has more impact here than what one would expect.

gfoidl · 2019-02-12T18:12:47Z

My gut tells me it's the rep stos introduced in GetCharCount

Something similar I saw in dotnet/corefx#34529 (comment) (where my gut telled me the same 😉)

GrabYourPitchforks · 2019-02-12T18:54:47Z

@jkotas Here are the latest numbers with moving the pinning up. This was a good suggestion! 👍

(Edit: Removed old perf numbers.)

Experiment-E:

public override unsafe int GetCharCount(ReadOnlySpan<byte> bytes)
{
    fixed (byte* pBytes = &MemoryMarshal.GetReference(bytes))
    {
        return GetCharCount(pBytes, bytes.Length, (DecoderWorker)null);
    }
}

Experiment-F:

public override unsafe int GetCharCount(ReadOnlySpan<byte> bytes)
{
    ref byte rBytes = ref MemoryMarshal.GetReference(bytes);
    fixed (byte* pBytes = &rBytes)
    {
        return GetCharCount((byte*)Unsafe.AsPointer(ref rBytes), bytes.Length, (DecoderWorker)null);
    }
}

Neither of these patterns produces the problematic rep stos code gen. Experiment F was just a test to see if we could coerce the runtime into producing even tighter codegen than normal for a typical fixed statement. It was an interesting experiment but I don't think the difference was significant enough to use that particular pattern.

benaadams · 2019-02-12T19:08:23Z

rep stos is https://github.com/dotnet/coreclr/issues/13827 as I understand it has a fixed startup cost which gets amortized for longer lengths so ends up faster; whereas for very short lengths its more costly than reg or vector clears.

GrabYourPitchforks · 2019-02-13T05:02:56Z

I'll run the numbers once more across all methods with the latest iteration - GetByteCount / GetBytes / GetCharCount / GetChars. Then I'll work on getting the ASCIIEncoding changes added to this PR.

GrabYourPitchforks · 2019-02-13T08:38:18Z

Updated numbers. I think I know why GetBytes / GetChars regressed compared to the original PR, so will fix these up before submitting the ASCIIEncoding changes for review.

Method	Toolchain	DataLength	Mean	Error	StdDev	Median	Ratio	RatioSD
GetByteCount	Experiment-A	0	4.393 us	0.0946 us	0.2729 us	4.459 us	0.55	0.04
GetByteCount	Experiment-G	0	5.241 us	0.1049 us	0.2941 us	5.284 us	0.72	0.04
GetByteCount	baseline (27411-2)	0	7.394 us	0.1410 us	0.1448 us	7.387 us	1.00	0.00

GetCharCount	Experiment-A	0	4.482 us	0.0893 us	0.1923 us	4.480 us	0.39	0.03
GetCharCount	Experiment-G	0	5.043 us	0.0991 us	0.1289 us	5.047 us	0.44	0.03
GetCharCount	baseline (27411-2)	0	11.434 us	0.2277 us	0.6272 us	11.523 us	1.00	0.00

GetBytes	Experiment-A	0	9.108 us	0.1799 us	0.4311 us	9.134 us	0.87	0.07
GetBytes	Experiment-G	0	17.727 us	0.3523 us	0.8373 us	17.785 us	1.69	0.12
GetBytes	baseline (27411-2)	0	10.296 us	0.2470 us	0.7283 us	10.476 us	1.00	0.00

GetChars	Experiment-A	0	8.902 us	0.1761 us	0.4481 us	8.890 us	0.86	0.05
GetChars	Experiment-G	0	18.668 us	0.3700 us	0.8794 us	18.704 us	1.78	0.11
GetChars	baseline (27411-2)	0	10.465 us	0.2048 us	0.3587 us	10.498 us	1.00	0.00

GetByteCount	Experiment-A	1	10.491 us	0.2090 us	0.5685 us	10.557 us	1.12	0.11
GetByteCount	Experiment-G	1	9.603 us	0.1915 us	0.2049 us	9.541 us	1.05	0.07
GetByteCount	baseline (27411-2)	1	9.417 us	0.2460 us	0.7254 us	9.359 us	1.00	0.00

GetCharCount	Experiment-A	1	10.515 us	0.2103 us	0.5502 us	10.669 us	0.84	0.06
GetCharCount	Experiment-G	1	9.071 us	0.2381 us	0.7020 us	8.936 us	0.73	0.06
GetCharCount	baseline (27411-2)	1	12.380 us	0.2467 us	0.5668 us	12.451 us	1.00	0.00

GetBytes	Experiment-A	1	9.525 us	0.1904 us	0.4846 us	9.566 us	0.85	0.07
GetBytes	Experiment-G	1	17.101 us	0.3408 us	0.9887 us	17.274 us	1.54	0.14
GetBytes	baseline (27411-2)	1	11.167 us	0.2462 us	0.7259 us	11.255 us	1.00	0.00

GetChars	Experiment-A	1	8.817 us	0.1757 us	0.3966 us	8.793 us	0.81	0.06
GetChars	Experiment-G	1	18.170 us	0.3619 us	0.7791 us	18.216 us	1.67	0.12
GetChars	baseline (27411-2)	1	10.931 us	0.2170 us	0.5866 us	10.969 us	1.00	0.00

GetByteCount	Experiment-A	11	11.658 us	0.2325 us	0.4802 us	11.806 us	0.46	0.03
GetByteCount	Experiment-G	11	10.507 us	0.2093 us	0.4891 us	10.624 us	0.42	0.03
GetByteCount	baseline (27411-2)	11	25.084 us	0.4960 us	1.2803 us	25.073 us	1.00	0.00

GetCharCount	Experiment-A	11	11.260 us	0.2220 us	0.3323 us	11.187 us	0.55	0.04
GetCharCount	Experiment-G	11	9.868 us	0.2179 us	0.6425 us	9.910 us	0.48	0.04
GetCharCount	baseline (27411-2)	11	20.541 us	0.4113 us	1.0834 us	20.516 us	1.00	0.00

GetBytes	Experiment-A	11	11.758 us	0.2345 us	0.4108 us	11.733 us	0.52	0.03
GetBytes	Experiment-G	11	20.895 us	0.2995 us	0.2802 us	20.780 us	0.95	0.06
GetBytes	baseline (27411-2)	11	23.117 us	0.4607 us	1.2136 us	23.102 us	1.00	0.00

GetChars	Experiment-A	11	10.821 us	0.2164 us	0.5851 us	10.726 us	0.51	0.04
GetChars	Experiment-G	11	19.625 us	0.3888 us	1.0037 us	19.708 us	0.93	0.07
GetChars	baseline (27411-2)	11	21.236 us	0.4242 us	1.1542 us	21.386 us	1.00	0.00

GetByteCount	Experiment-A	64	13.655 us	0.2727 us	0.5509 us	13.778 us	0.18	0.01
GetByteCount	Experiment-G	64	12.749 us	0.2544 us	0.7217 us	12.679 us	0.17	0.01
GetByteCount	baseline (27411-2)	64	75.522 us	0.0939 us	0.0733 us	75.510 us	1.00	0.00

GetCharCount	Experiment-A	64	11.465 us	0.2293 us	0.6001 us	11.316 us	0.19	0.01
GetCharCount	Experiment-G	64	9.843 us	0.1955 us	0.5047 us	9.815 us	0.17	0.01
GetCharCount	baseline (27411-2)	64	58.303 us	1.1604 us	1.6268 us	58.174 us	1.00	0.00

GetBytes	Experiment-A	64	18.319 us	0.3658 us	0.9765 us	18.254 us	0.15	0.01
GetBytes	Experiment-G	64	24.700 us	0.4729 us	0.4424 us	24.494 us	0.20	0.01
GetBytes	baseline (27411-2)	64	122.697 us	4.5594 us	3.8073 us	121.023 us	1.00	0.00

GetChars	Experiment-A	64	16.518 us	0.3304 us	0.9742 us	16.372 us	0.23	0.02
GetChars	Experiment-G	64	25.403 us	0.5053 us	1.4085 us	25.188 us	0.36	0.03
GetChars	baseline (27411-2)	64	71.245 us	1.4688 us	4.3077 us	71.475 us	1.00	0.00

GetByteCount	Experiment-A	128	16.048 us	0.3172 us	0.6186 us	15.765 us	0.12	0.01
GetByteCount	Experiment-G	128	16.856 us	0.3399 us	1.0021 us	17.098 us	0.12	0.01
GetByteCount	baseline (27411-2)	128	135.799 us	2.7126 us	7.1934 us	130.693 us	1.00	0.00

GetCharCount	Experiment-A	128	13.425 us	0.2482 us	0.4900 us	13.496 us	0.14	0.01
GetCharCount	Experiment-G	128	11.747 us	0.2309 us	0.3921 us	11.792 us	0.12	0.01
GetCharCount	baseline (27411-2)	128	98.865 us	1.9714 us	5.0177 us	98.749 us	1.00	0.00

GetBytes	Experiment-A	128	23.947 us	0.4766 us	1.1327 us	23.743 us	0.10	0.01
GetBytes	Experiment-G	128	33.240 us	0.7156 us	2.1098 us	32.897 us	0.14	0.01
GetBytes	baseline (27411-2)	128	243.074 us	4.9451 us	14.5807 us	240.323 us	1.00	0.00

GetChars	Experiment-A	128	20.094 us	0.4014 us	1.0643 us	19.917 us	0.18	0.01
GetChars	Experiment-G	128	30.721 us	0.7109 us	2.0849 us	31.001 us	0.28	0.02
GetChars	baseline (27411-2)	128	111.117 us	2.3269 us	4.5384 us	109.006 us	1.00	0.00

GrabYourPitchforks · 2019-02-14T22:20:39Z

@stephentoub @jkotas @layomia I'm about to update the PR with the ASCIIEncoding changes. Here's a brief walkthrough of how things changed:

The GetByteCount / GetCharCount / GetBytes / GetChars methods are now split into two parts. The first part is the happy path where the result can be computed in constant time (for instance, if a replacement fallback is in use and has specific properties we can reason about) or where the input is verified to be all-ASCII so no fallback work needs to take place at all. If it is determined that fallback work must take place, control is given to a more complex method that handles fallback logic.
The EncoderFallbackBuffer / DecoderFallbackBuffer types have new internal helper routines that perform the fallback logic in terms of Rune. This simplifies the call sites considerably, since all of the fallback logic is self-contained within a single method. The caller (in this case, the ASCIIEncoding type) doesn't need to know anything about the internals of how fallback works in the general case. There is no need to call InternalInitialize and no need to invoke the fallback buffer's logic within a loop.
To support (2), new internal virtual methods have been added to the Encoding base class. These are used internally by the fallback mechanism to help convert fallback data into bytes. Currently only ASCIIEncoding needs to override these methods because only ASCIIEncoding invokes the new helper routines in the fallback buffers.

The real benefit of (2) and (3) above is that it makes things like UTF-8 fallback logic much simpler, since now all of the infrastructure is already in place to support it. All of the fallback knowledge can be moved out of the UTF8Encoding class entirely when the time comes to plumb new changes through that type.

There are two behavioral changes between the old logic and the new logic. @tarekgh would be a good person to comment on these changes.

ASCIIEncoding.GetByteCount(..., EncoderNLS) did not correctly check EncoderReplacementFallback.DefaultString for ASCII-only data before short-circuiting the main method logic. This meant that calls to GetByteCount could succeed (and return a wrong answer) even if the corresponding call to GetBytes would throw an exception. This has been corrected in the new logic: if GetByteCount succeeds (or fails), then GetBytes will also succeed (or fail), modulo OOMs and other asynchronous exceptions.
The fallback buffer logic is no longer recursive. Previously, if we were transcoding chars to bytes, it was possible that the fallback buffer could say "please replace this non-ASCII char with this other non-ASCII char before shrinking to ASCII." ASCIIEncoding would then go into a loop of trying to replace the invalid data over and over, eventually aborting the operation after some number of recursive tries. Now, the fallback buffer is no longer queried recursively. If the fallback buffer tries to provide fallback data that still isn't representable in the target encoding, the new helper methods on EncoderFallbackBuffer will abort the operation without recursing.

There is already substantial unit test coverage across coreclr / corefx. Since this type is also used heavily by other places in the framework, bugs would likely show up as failures in other unrelated tests. There are some other tests coming online at dotnet/corefx#35331 and my GitHub repo to help validate these behaviors.

tarekgh · 2019-02-15T03:04:13Z

GetByteCount is meant to return the exact count; not estimate

Sorry I was confused it with GetMaxByteCount. so you are right. if GetByteCount not returning the expected right result then I think we should fix it too.

GrabYourPitchforks · 2019-02-15T04:48:11Z

@jkotas What if we put #if FAST_TRANSCODE around the intrinsic and vectorized code paths? That would give us the option to undefine the constant if we want a more compact (but slower) implementation suitable for mobile.

src/System.Private.CoreLib/shared/System/Text/ASCIIEncoding.cs

marek-safar · 2019-02-15T07:43:39Z

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static bool AllBytesInUInt32AreAscii(uint value)
+        {
+            return ((value & 0x80808080u) == 0);


Why all these extra parens?

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs

marek-safar · 2019-02-15T07:51:47Z

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs

+            goto Finish;
+        }
+
+        [MethodImpl(MethodImplOptions.AggressiveOptimization)]


What is your expected effect of this attribute here?

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs

src/System.Private.CoreLib/shared/System/Text/DecoderFallback.cs

marek-safar · 2019-02-15T08:06:47Z

I'd prefer to have more compact version available

GrabYourPitchforks · 2019-02-22T06:57:27Z

@jkotas @stephentoub @tarekgh I'm having difficulty refactoring the code in such a manner that it works with the current EncoderNLS and DecoderNLS implementations while preserving all of the same fallback behaviors. The issue is with what Jan identified earlier - the existing EncoderNLS and DecoderNLS types retain extra incomplete data from the previous calls, and it's up to the implementation (in this case, ASCIIEncoding) to flush this data upon entry to the method.

I was experimenting with keeping the ASCIIEncoding implementation as slim as possible by pushing this concern up to the EncoderNLS / DecoderNLS class (since it's pretty much the same logic across all *Encoding types). This keeps ASCIIEncoding very lean and very fast for the common case where no EncoderNLS / DecoderNLS has been provided, but it resulted in the creation of new internal types EncoderNLS2 / DecoderNLS2 to encapsulate the fallback logic which is common to all encodings. This will eventually make all of the other encoding types (UTF8Encoding / UTF16Encoding / UTF32Encoding) simpler if they're replumbed to use the newly created internal types. But it has the effect of making this particular PR significantly larger in scope than what was intended.

Another option is to keep the ASCIIEncoding logic mostly unchanged, instead replacing the hot loops within its GetBytes / GetChars / etc. routines with calls to the new helpers. This would have the effect of regressing performance from baseline for small payloads, since much of the earlier reported gain for small payloads came from minimizing the amount of code in these methods. The performance improvement for large payloads is roughly about the same as reported earlier since the overhead is insignificant at that point.

Does any of you have thoughts on the best way to proceed here? I'm weighing large churn with perf improvements for all tested cases vs. small churn with perf regressions for some inputs. Will try to get measurements to you when possible to help inform your opinions.

jkotas · 2019-02-22T14:33:58Z

it has the effect of making this particular PR significantly larger in scope than what was intended.

Can you do this refactoring first for all encodings - mechanically move the hot paths and slow paths as they exist today into separate methods, without any vectorization, etc. This should result into a incremental performance gains because the hot path will be leaner.

And then you do the vectorization work on the refactored code.

GrabYourPitchforks · 2019-02-22T21:15:35Z

Can you do this refactoring first for all encodings - mechanically move the hot paths and slow paths as they exist today into separate methods, without any vectorization, etc. This should result into a incremental performance gains because the hot path will be leaner.

And then you do the vectorization work on the refactored code.

Certainly. I guess in the end there are three distinct steps, as listed below. Would splitting the PRs like this make it easier to review?

Split the logic within each individual Encoding into fast path / slow path. We'll find that there's a lot of duplicated logic (mainly coordinating error handling and leftover char draining) between all the different Encoding types.
Move all of the duplicated logic to a common location so that it can be shared across all the Encoding instances.
Replace the individual logic within each Encoding with an optimized version, as was the original plan with this PR.

jkotas · 2019-02-22T21:20:06Z

Yes, that sounds reasonable to me. (I think it may be ok for 1. and 2. to be together; but having them as separate steps would be fine too.)

GrabYourPitchforks · 2019-02-26T00:56:34Z

@jkotas @tarekgh @layomia In a separate branch I mechanically split ASCIIEncoding into fast-path / slow-path without adding the vectorized ASCIIUtility code. This is essentially the Step (1) we discussed above. See the implementation for details.

This resulted in an expansion of the size of ASCIIEncoding.cs, which is expected because I intentionally wanted to point out duplicate / common code before it got refactored into a common location. Search for the text !! REVIEW !! in the link above to see in particular how this would eventually look once it's refactored.

Before I start work on the other types, is this about what you expected to see?

jkotas · 2019-02-26T02:15:29Z

Left some comments on the commit.

GrabYourPitchforks · 2019-03-07T07:50:37Z

See also #23098, which attempts to answer the question "What would it look like to split all of these code paths mechanically?"

GrabYourPitchforks · 2019-03-12T21:44:53Z

I updated the PR to contain only the ASCIIUtility vectorized changes now that the Encoding core refactoring is complete.

src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.cs

GrabYourPitchforks · 2019-03-21T22:34:15Z

@stephentoub @jkotas One of you earlier had asked what the perf difference was between the hand-implemented hardware intrinsic version and the normal Vector<T> code path. Answer is around 50% perf difference.

Method	Toolchain	StringLength	Mean	Error	StdDev	Ratio	RatioSD
GetBytes	with_intrin	256	2.936 us	0.0519 us	0.0486 us	1.00	0.00
GetBytes	without_intrin	256	4.523 us	0.0551 us	0.0515 us	1.54	0.03

Now requires only SSE2, not SSSE3

GrabYourPitchforks · 2019-03-22T01:47:42Z

Unit tests for the ASCIIUtility changes over at dotnet/corefx#36227.

dotnet/coreclr#22516) Commit migrated from dotnet/coreclr@125e7c9

jkotas reviewed Feb 11, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs Outdated Show resolved Hide resolved

Tornhoof reviewed Feb 11, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs Outdated Show resolved Hide resolved

tannergooding reviewed Feb 11, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/AsciiUtility.cs Outdated Show resolved Hide resolved

GrabYourPitchforks changed the title ~~Add AsciiUtility, with workhorse methods for ASCII transcoding~~ Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding Feb 14, 2019

marek-safar reviewed Feb 15, 2019

View reviewed changes

GrabYourPitchforks changed the title ~~Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding~~ [WIP] Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding Mar 6, 2019

GrabYourPitchforks mentioned this pull request Mar 7, 2019

Refactor Encoding to split fast-path and fallback logic #23098

Merged

GrabYourPitchforks changed the title ~~[WIP] Add internal high-performance ASCII transcoding methods, plumb them through ASCIIEncoding~~ Replace slow implementations in ASCIIUtility with fast implementations Mar 12, 2019

Replace slow implementations in ASCIIUtility with fast implementations

7baf971

GrabYourPitchforks force-pushed the ascii_2 branch from 2c80bd8 to 7baf971 Compare March 12, 2019 21:41

EgorBo reviewed Mar 12, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.cs Show resolved Hide resolved

benaadams reviewed Mar 12, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.cs Outdated Show resolved Hide resolved

GrabYourPitchforks mentioned this pull request Mar 13, 2019

Add OperationStatus-based UTF8 transcoding APIs #23219

Merged

gfoidl reviewed Mar 13, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.cs Show resolved Hide resolved

src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.cs Show resolved Hide resolved

PR feedback; further optimize hardware intrinsic code paths

12284f4

Now requires only SSE2, not SSSE3

GrabYourPitchforks mentioned this pull request Mar 22, 2019

Add ASCIIUtility tests dotnet/corefx#36227

Merged

Fix DEBUG build break

bee9ed4

GrabYourPitchforks merged commit 125e7c9 into dotnet:master Mar 23, 2019

GrabYourPitchforks deleted the ascii_2 branch April 12, 2019 15:12

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022

Replace slow implementations in ASCIIUtility with fast implementations (

1d70d9f

dotnet/coreclr#22516) Commit migrated from dotnet/coreclr@125e7c9

tannergooding mentioned this pull request Apr 17, 2023

Ascii and Latin1 APIs can return wrong results for misaligned buffers dotnet/runtime#84877

Closed

Replace slow implementations in ASCIIUtility with fast implementations #22516

Replace slow implementations in ASCIIUtility with fast implementations #22516

Uh oh!

Conversation

GrabYourPitchforks commented Feb 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GrabYourPitchforks commented Feb 11, 2019

Uh oh!

GrabYourPitchforks commented Feb 11, 2019

Uh oh!

Uh oh!

jkotas commented Feb 11, 2019

Uh oh!

stephentoub commented Feb 11, 2019

Uh oh!

Uh oh!

tannergooding commented Feb 11, 2019

Uh oh!

Uh oh!

GrabYourPitchforks commented Feb 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkotas commented Feb 11, 2019

Uh oh!

jkotas commented Feb 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GrabYourPitchforks commented Feb 12, 2019

Uh oh!

jkotas commented Feb 12, 2019

Uh oh!

GrabYourPitchforks commented Feb 12, 2019

Uh oh!

AndyAyersMS commented Feb 12, 2019

Uh oh!

jkotas commented Feb 12, 2019

Uh oh!

GrabYourPitchforks commented Feb 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndyAyersMS commented Feb 12, 2019

Uh oh!

GrabYourPitchforks commented Feb 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkotas commented Feb 12, 2019

Uh oh!

GrabYourPitchforks commented Feb 12, 2019

Uh oh!

GrabYourPitchforks commented Feb 12, 2019

Uh oh!

jkotas commented Feb 12, 2019

Uh oh!

gfoidl commented Feb 12, 2019

Uh oh!

GrabYourPitchforks commented Feb 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benaadams commented Feb 12, 2019

Uh oh!

GrabYourPitchforks commented Feb 13, 2019

Uh oh!

GrabYourPitchforks commented Feb 13, 2019

Uh oh!

GrabYourPitchforks commented Feb 14, 2019

Uh oh!

tarekgh commented Feb 15, 2019

Uh oh!

GrabYourPitchforks commented Feb 15, 2019

Uh oh!

Uh oh!

Uh oh!

marek-safar Feb 15, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

marek-safar Feb 15, 2019

Choose a reason for hiding this comment

Uh oh!

GrabYourPitchforks commented Feb 11, 2019 •

edited

Loading

GrabYourPitchforks commented Feb 11, 2019 •

edited

Loading

jkotas commented Feb 11, 2019 •

edited

Loading

GrabYourPitchforks commented Feb 12, 2019 •

edited

Loading

GrabYourPitchforks commented Feb 12, 2019 •

edited

Loading

GrabYourPitchforks commented Feb 12, 2019 •

edited

Loading