Optimize System.HexConverter.IsHexChar on 64 bits #52470

Sergio0694 · 2021-05-07T20:38:08Z

Overview

This PR adds a fast path to System.HexConverter.IsHexChar(int) on 64-bit systems, which has:

No branches (down from 2 conditional + 1 unconditional)
Smaller codegen (66 bytes ---> 34 bytes)
No memory accesses (so the speed is no longer affected by cache state)

The change is a specialized version of what I used in BitHelper.HasLookupFlag in the Microsoft.Toolkit.HighPerformance package. It uses bit trickery to make the code branchless and to read the lookup value from a constant rather than from memory.
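To make the bit trick concrete, here is a minimal C# sketch of the idea, not necessarily the exact code in this PR. The class name `HexCheckSketch`, the constant name `HexBits`, and the `IsHexCharReference` helper are hypothetical and exist only for illustration. The constant is derived by hand in the comments, under the assumption that the lookup bits are stored from the most significant bit downwards with an offset of '0'.

```csharp
using System;

public static class HexCheckSketch
{
    // Hypothetical constant: bit (63 - i) is set for every i = ch - '0'
    // where ch is one of "0123456789ABCDEFabcdef":
    //   '0'..'9' -> i = 0..9   -> bits 63..54 -> 0xFFC0000000000000
    //   'A'..'F' -> i = 17..22 -> bits 46..41 -> 0x00007E0000000000
    //   'a'..'f' -> i = 49..54 -> bits 14..9  -> 0x0000000000007E00
    private const ulong HexBits = 0xFFC07E0000007E00UL;

    // Lookup with no table in memory: the "table" is the 64-bit constant.
    public static bool IsHexChar(int c)
    {
        unchecked
        {
            // Offset by '0'; negative inputs wrap to large values that are
            // zero-extended, so the upper 32 bits of i are always 0.
            ulong i = (uint)(c - '0');

            // Move the candidate bit into the sign position. C# masks the
            // shift count of a 64-bit shift to its low 6 bits.
            ulong shifted = HexBits << (int)i;

            // i - 64 wraps around (sign bit set) only when i < 64, which
            // rejects out-of-range inputs whose shift count wrapped.
            ulong inRange = i - 64;

            return (long)(shifted & inRange) < 0;
        }
    }

    // Straightforward range checks, used only to sanity-check the sketch.
    public static bool IsHexCharReference(int c) =>
        (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f');
}
```

For example, with 'A' (65) the offset is i = 17, so the shift moves bit 46 of the constant (which is set) into bit 63, and i - 64 wraps to a value with the sign bit set, so the result is true. With 'p' (112) the offset is i = 64, the shift count wraps to 0 and leaves the sign bit of the constant set, but i - 64 is 0, so the mask clears it and the result is false.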

Codegen diff

Before:

; Method IsHexCharFast.HexConverter2:IsHexChar_OG(int):bool
G_M3768_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25

G_M3768_IG02:
       cmp      ecx, 256
       jge      SHORT G_M3768_IG04
						;; bbWeight=1    PerfScore 1.25

G_M3768_IG03:
       cmp      ecx, 256
       jae      SHORT G_M3768_IG07
       movsxd   rax, ecx
       mov      rdx, 0xD1FFAB1E
       movzx    rax, byte  ptr [rax+rdx]
       jmp      SHORT G_M3768_IG05
						;; bbWeight=0.50 PerfScore 2.88

G_M3768_IG04:
       mov      eax, 255
						;; bbWeight=0.50 PerfScore 0.12

G_M3768_IG05:
       cmp      eax, 255
       setne    al
       movzx    rax, al
						;; bbWeight=1    PerfScore 1.50

G_M3768_IG06:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25

G_M3768_IG07:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; bbWeight=0    PerfScore 0.00
; Total bytes of code: 66

After:

; Method IsHexCharFast.HexConverter2:IsHexChar(int):bool
G_M2063_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M2063_IG02:
       add      ecx, -48
       lea      rax, [rcx-64]
       mov      rdx, 0xD1FFAB1E
       shl      rdx, cl
       and      rax, rdx
       jl       SHORT G_M2063_IG05
						;; bbWeight=1    PerfScore 4.25

G_M2063_IG03:
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 0.12

G_M2063_IG04:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

G_M2063_IG05:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.12

G_M2063_IG06:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code: 34

Additionally, the new version can also be JITted to just a constant, if the input is a constant.
That is, if you call IsHexChar with an input constant like 'A', you get this JIT diff:

Before:

; Method IsHexCharFast.HexConverter2:Check_Constant_OG():bool
G_M36347_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M36347_IG02:
       mov      rax, 0xD1FFAB1E
       movzx    rax, byte  ptr [rax]
       cmp      eax, 255
       setne    al
       movzx    rax, al
						;; bbWeight=1    PerfScore 3.75

G_M36347_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 25

After:

; Method IsHexCharFast.HexConverter2:Check_Constant_NEW():bool
G_M50831_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M50831_IG02:
       mov      eax, 1
						;; bbWeight=1    PerfScore 0.25

G_M50831_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 6

Benchmark

I've put together a small test benchmark which you can find here.

Method	Mean	Error	StdDev	Ratio	Code Size
IsHexChar_OG_Random	9.230 us	0.0508 us	0.0475 us	1.00	72 B
IsHexChar_NEW_Random	4.406 us	0.0331 us	0.0309 us	0.48	62 B

IsHexChar_OG_AlwaysTrue	2.934 us	0.0183 us	0.0172 us	1.00	72 B
IsHexChar_NEW_AlwaysTrue	2.947 us	0.0199 us	0.0186 us	1.00	62 B

The new version is about 2x faster on random data in this test. When the input is always valid, the branch predictor is more effective for the original implementation, and the new version is on par with it. A couple of notes:

This benchmark is not representative of real-world data: the method is invoked in a tight loop, so the original version hits the cache every time, which gives it an advantage here.
Even setting that aside, the new version's performance does not depend on the input data and stays consistent in all cases (which I would argue is better even when the measured performance is the same, as in this test).

JIT diff

Currently work in progress, I'm not having luck with the runtime tooling today... 😄
Opened the PR in the meantime to have the CI run on it at least.

Add a branchless fast path on 64 bit systems that doesn't do memory accesses either

ghost · 2021-05-07T20:38:12Z

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

EgorBo · 2021-05-07T20:43:50Z