Skip to content

Conversation

@Sergio0694
Copy link
Contributor

@Sergio0694 Sergio0694 commented May 7, 2021

Overview

This PR adds a fast path to System.HexConverter.IsHexChar(int) on 64 bit systems, which has:

  • No branches (down from 2 conditional + 1 unconditional)
  • Smaller codegen (66 bytes ---> 35 bytes)
  • No memory accesses (so the speed is no longer affected by cache state)

The change is a specialized version of what I used in BitHelper.HasLookupFlag in the Microsoft.Toolkit.HighPerformance package, and just uses bit trickery to make the code branchless and read the lookup value from a constant value and not from memory.

Codegen diff

Before (click to expand):
; Method IsHexCharFast.HexConverter2:IsHexChar_OG(int):bool
G_M3768_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25

G_M3768_IG02:
       cmp      ecx, 256
       jge      SHORT G_M3768_IG04
						;; bbWeight=1    PerfScore 1.25

G_M3768_IG03:
       cmp      ecx, 256
       jae      SHORT G_M3768_IG07
       movsxd   rax, ecx
       mov      rdx, 0xD1FFAB1E
       movzx    rax, byte  ptr [rax+rdx]
       jmp      SHORT G_M3768_IG05
						;; bbWeight=0.50 PerfScore 2.88

G_M3768_IG04:
       mov      eax, 255
						;; bbWeight=0.50 PerfScore 0.12

G_M3768_IG05:
       cmp      eax, 255
       setne    al
       movzx    rax, al
						;; bbWeight=1    PerfScore 1.50

G_M3768_IG06:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25

G_M3768_IG07:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; bbWeight=0    PerfScore 0.00
; Total bytes of code: 66
After (click to expand):
; Method IsHexCharFast.HexConverter2:IsHexChar(int):bool
G_M2063_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M2063_IG02:
       add      ecx, -48
       lea      rax, [rcx-64]
       mov      rdx, 0xD1FFAB1E
       shl      rdx, cl
       and      rax, rdx
       jl       SHORT G_M2063_IG05
						;; bbWeight=1    PerfScore 4.25

G_M2063_IG03:
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 0.12

G_M2063_IG04:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

G_M2063_IG05:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.12

G_M2063_IG06:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code: 34

Additionally, the new version can also be JITted to just a constant, if the input is a constant.
That is, if you call IsHexChar with an input constant like 'A', you get this JIT diff:

Before (click to expand):
; Method IsHexCharFast.HexConverter2:Check_Constant_OG():bool
G_M36347_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M36347_IG02:
       mov      rax, 0xD1FFAB1E
       movzx    rax, byte  ptr [rax]
       cmp      eax, 255
       setne    al
       movzx    rax, al
						;; bbWeight=1    PerfScore 3.75

G_M36347_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 25
After (click to expand):
; Method IsHexCharFast.HexConverter2:Check_Constant_NEW():bool
G_M50831_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M50831_IG02:
       mov      eax, 1
						;; bbWeight=1    PerfScore 0.25

G_M50831_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 6

Benchmark

I've put together a small test benchmark which you can find here.

Method Mean Error StdDev Ratio Code Size
IsHexChar_OG_Random 9.230 us 0.0508 us 0.0475 us 1.00 72 B
IsHexChar_NEW_Random 4.406 us 0.0331 us 0.0309 us 0.48 62 B
IsHexChar_OG_AlwaysTrue 2.934 us 0.0183 us 0.0172 us 1.00 72 B
IsHexChar_NEW_AlwaysTrue 2.947 us 0.0199 us 0.0186 us 1.00 62 B

The new version is about 2x faster over random data in this test. When the input is always valid and the branch predictor can be more effective in the original implementation, this test shows the new version is still on par, but a couple notes:

  • This benchmark is not representative of real world data since the method is just constantly being invoked in a loop, so the original version will just do cache hits every single time, which gives it an advantage here.
  • Even not taking this into consideration, the new version is just generally not dependent on input data and always produces consistent performance in all cases (which I would argue is better even with performance being the same in this test)

JIT diff

Currently work in progress, I'm not having luck with the runtime tooling today... 😄
Opened the PR in the meantime to have the CI run on it at least.

Add a branchless fast path on 64 bit systems that doesn't do memory accesses either
@ghost
Copy link

ghost commented May 7, 2021

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

public static bool IsHexChar(int c)
public static unsafe bool IsHexChar(int c)
{
if (sizeof(IntPtr) == 8)
Copy link
Member

@EgorBo EgorBo May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since corelib is always arch specific I guess it's better (e.g. for JIT) to use #if TARGET_64BIT here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the Common folder and is pulled in by a bunch of different libs, some of which only build AnyCPU. If we wanted an ifdef for the target platform, we'd also need to ifdef CORELIB. (See ValueStringBuilder.cs for some examples.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GrabYourPitchforks ah, good point!

shift = unchecked((int)((35465847073801215UL >> i) & 1)),
and = shift & mask;
byte result = unchecked((byte)and);
bool valid = *(bool*)&result;
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We considered similar tricks to this a while back and those tricks were rejected. See dotnet/coreclr#16138 for one example.
The tl;dr was "somebody's probably branching on the result of this method, so it's better if the final statement is an actual comparison that can be folded into the caller's branch condition."

The implication would be that these lines become return (and & 1) != 0;, or return (and & 1) != 0 ? true : false; if we're trying to work around #4207.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I was thinking the last return expression could be changed to that due to callers branching 🙂
Question though: given the input is guaranteed to be either 1 or 0 here due to the previous logic, we can just test and != 0 here, right? As in, we can skip that & 1 since that wouldn't affect the actual semantics in this case, no?

So at the end of the day:

return and != 0 ? true : false;

Should work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that should work!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should read ['0', '0' + 64).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, aren't we checking that c is also >= 0 here?
From your original snippet on Discord:

ulong mask = i - 64; // c is negative IFF '0' <= c < '0' + 64; else c is non-negative

Am I mixing things up here? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If c is within [0, '0'), bit 31 will be set, but bit 63 (the bit we care about for masking purposes) will not be set.
Bit 63 only gets set if c is within ['0', '0' + 64).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, yeah. Fixed, thanks! 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unsafe keyword; no longer needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below, done in 78bc559532402d2d4dd1a2e9a85b9e9efd4240a5.

Copy link
Member

@GrabYourPitchforks GrabYourPitchforks May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer IntPtr.Size rather than sizeof(IntPtr) here. illink can see this as a proper method call to IntPtr.get_Size and will perform substitution. See, e.g., https://github.com/dotnet/runtime/blob/74cafe71de68fac46cd4956da98387bce9e1043c/src/libraries/System.Private.CoreLib/src/ILLink/ILLink.Substitutions.64bit.xml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, TIL. Done in 78bc559532402d2d4dd1a2e9a85b9e9efd4240a5 🙂

Copy link
Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Appreciate the perf numbers in the issue as well.

@ghost
Copy link

ghost commented May 7, 2021

Tagging subscribers to this area: @tannergooding
See info in area-owners.md if you want to be subscribed.

Issue Details

Overview

This PR adds a fast path to System.HexConverter.IsHexChar(int) on 64 bit systems, which has:

  • No branches (down from 2 conditional + 1 unconditional)
  • Smaller codegen (66 bytes ---> 35 bytes)
  • No memory accesses (so the speed is no longer affected by cache state)

The change is a specialized version of what I used in BitHelper.HasLookupFlag in the Microsoft.Toolkit.HighPerformance package, and just uses bit trickery to make the code branchless and read the lookup value from a constant value and not from memory.

Codegen diff

Before (click to expand):
; Method IsHexCharFast.HexConverter2:IsHexChar_OG(int):bool
G_M3768_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25

G_M3768_IG02:
       cmp      ecx, 256
       jge      SHORT G_M3768_IG04
						;; bbWeight=1    PerfScore 1.25

G_M3768_IG03:
       cmp      ecx, 256
       jae      SHORT G_M3768_IG07
       movsxd   rax, ecx
       mov      rdx, 0xD1FFAB1E
       movzx    rax, byte  ptr [rax+rdx]
       jmp      SHORT G_M3768_IG05
						;; bbWeight=0.50 PerfScore 2.88

G_M3768_IG04:
       mov      eax, 255
						;; bbWeight=0.50 PerfScore 0.12

G_M3768_IG05:
       cmp      eax, 255
       setne    al
       movzx    rax, al
						;; bbWeight=1    PerfScore 1.50

G_M3768_IG06:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25

G_M3768_IG07:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; bbWeight=0    PerfScore 0.00
; Total bytes of code: 66
After (click to expand):
; Method IsHexCharFast.HexConverter2:IsHexChar(int):bool
G_M2063_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M2063_IG02:
       add      ecx, -48
       lea      rax, [rcx-64]
       mov      rdx, 0xD1FFAB1E
       shl      rdx, cl
       and      rax, rdx
       jl       SHORT G_M2063_IG05
						;; bbWeight=1    PerfScore 4.25

G_M2063_IG03:
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 0.12

G_M2063_IG04:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

G_M2063_IG05:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.12

G_M2063_IG06:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code: 34

Additionally, the new version can also be JITted to just a constant, if the input is a constant.
That is, if you call IsHexChar with an input constant like 'A', you get this JIT diff:

Before (click to expand):
; Method IsHexCharFast.HexConverter2:Check_Constant_OG():bool
G_M36347_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M36347_IG02:
       mov      rax, 0xD1FFAB1E
       movzx    rax, byte  ptr [rax]
       cmp      eax, 255
       setne    al
       movzx    rax, al
						;; bbWeight=1    PerfScore 3.75

G_M36347_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 25
After (click to expand):
; Method IsHexCharFast.HexConverter2:Check_Constant_NEW():bool
G_M50831_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M50831_IG02:
       mov      eax, 1
						;; bbWeight=1    PerfScore 0.25

G_M50831_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 6

Benchmark

I've put together a small test benchmark which you can find here.

Method Mean Error StdDev Ratio Code Size
IsHexChar_OG_Random 9.230 us 0.0508 us 0.0475 us 1.00 72 B
IsHexChar_NEW_Random 4.406 us 0.0331 us 0.0309 us 0.48 62 B
IsHexChar_OG_AlwaysTrue 2.934 us 0.0183 us 0.0172 us 1.00 72 B
IsHexChar_NEW_AlwaysTrue 2.947 us 0.0199 us 0.0186 us 1.00 62 B

The new version is about 2x faster over random data in this test. When the input is always valid and the branch predictor can be more effective in the original implementation, this test shows the new version is still on par, but a couple notes:

  • This benchmark is not representative of real world data since the method is just constantly being invoked in a loop, so the original version will just do cache hits every single time, which gives it an advantage here.
  • Even not taking this into consideration, the new version is just generally not dependent on input data and always produces consistent performance in all cases (which I would argue is better even with performance being the same in this test)

JIT diff

Currently work in progress, I'm not having luck with the runtime tooling today... 😄
Opened the PR in the meantime to have the CI run on it at least.

Author: Sergio0694
Assignees: -
Labels:

area-System.Runtime

Milestone: -

@GrabYourPitchforks
Copy link
Member

@EgorBo @tannergooding any other thoughts on this? Thinking of merging it in Tuesday, which gives a little while longer for comments.

@GrabYourPitchforks GrabYourPitchforks merged commit 0ee10e9 into dotnet:main May 11, 2021
@Sergio0694 Sergio0694 deleted the is-hex-digit-opt branch May 11, 2021 19:16
@danmoseley
Copy link
Member

Nice contribution @Sergio0694 thank you.

@karelz karelz added this to the 6.0.0 milestone May 20, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 19, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants