Vectorize codegen for left shift operator on byte vector #108336

alexcovington · 2024-09-27T16:58:44Z

The left shift operator for Vector128<byte>/Vector256<byte>/Vector512<byte> does not vectorize. This PR vectorizes the left shift operator for these types on x64.

For example, this C# code will now vectorize:

public Vector128<Byte> LeftShiftByte(Vector128<Byte> v, int shiftAmount)
{
    return v << shiftAmount;
}

Codegen

Before

; Vector128LeftShiftRepro.VectorBenchmarks.LeftShiftByte(System.Runtime.Intrinsics.Vector128`1<Byte>, Int32)
       sub       rsp,38
       mov       rax,[r8]
       mov       [rsp+28],rax
       movzx     eax,byte ptr [rsp+28]
       and       r9d,7
       shlx      eax,eax,r9d
       mov       [rsp+30],al
       movzx     eax,byte ptr [rsp+29]
       shlx      eax,eax,r9d
       mov       [rsp+31],al
       movzx     eax,byte ptr [rsp+2A]
       shlx      eax,eax,r9d
       mov       [rsp+32],al
       movzx     eax,byte ptr [rsp+2B]
       shlx      eax,eax,r9d
       mov       [rsp+33],al
       movzx     eax,byte ptr [rsp+2C]
       shlx      eax,eax,r9d
       mov       [rsp+34],al
       movzx     eax,byte ptr [rsp+2D]
       shlx      eax,eax,r9d
       mov       [rsp+35],al
       movzx     eax,byte ptr [rsp+2E]
       shlx      eax,eax,r9d
       mov       [rsp+36],al
       movzx     eax,byte ptr [rsp+2F]
       shlx      eax,eax,r9d
       mov       [rsp+37],al
       mov       rax,[rsp+30]
       mov       rcx,[r8+8]
       mov       [rsp+18],rcx
       movzx     ecx,byte ptr [rsp+18]
       shlx      ecx,ecx,r9d
       mov       [rsp+20],cl
       movzx     ecx,byte ptr [rsp+19]
       shlx      ecx,ecx,r9d
       mov       [rsp+21],cl
       movzx     ecx,byte ptr [rsp+1A]
       shlx      ecx,ecx,r9d
       mov       [rsp+22],cl
       movzx     ecx,byte ptr [rsp+1B]
       shlx      ecx,ecx,r9d
       mov       [rsp+23],cl
       movzx     ecx,byte ptr [rsp+1C]
       shlx      ecx,ecx,r9d
       mov       [rsp+24],cl
       movzx     ecx,byte ptr [rsp+1D]
       shlx      ecx,ecx,r9d
       mov       [rsp+25],cl
       movzx     ecx,byte ptr [rsp+1E]
       shlx      ecx,ecx,r9d
       mov       [rsp+26],cl
       movzx     ecx,byte ptr [rsp+1F]
       shlx      ecx,ecx,r9d
       mov       [rsp+27],cl
       mov       rcx,[rsp+20]
       mov       [rsp],rax
       mov       [rsp+8],rcx
       vmovaps   xmm0,[rsp]
       vmovups   [rdx],xmm0
       mov       rax,rdx
       add       rsp,38
       ret
; Total bytes of code 285

After

; Vector128LeftShiftRepro.VectorBenchmarks.LeftShiftByte(System.Runtime.Intrinsics.Vector128`1<Byte>, Int32)
       vmovups   xmm0,[r8]
       and       r9d,7
       vmovd     xmm1,r9d
       vpsllw    xmm0,xmm0,xmm1
       mov       eax,0FF
       shlx      eax,eax,r9d
       vpbroadcastb xmm1,eax
       vpand     xmm0,xmm1,xmm0
       vmovups   [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 46

I see performance improvements in some of the System.Numerics.Tensors microbenchmarks on my x64 machine:

Baseline

| Method    | Job        | Toolchain                                                                             | BufferLength | Mean        | Error     | StdDev    | Median      | Min         | Max         | Ratio | Allocated | Alloc Ratio |
|---------- |----------- |-------------------------------------------------------------------------------------- |------------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|----------:|------------:|
| ShiftLeft | Job-UPQSMH | \runtime-base\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 128          |   304.26 ns |  0.937 ns |  0.876 ns |   304.17 ns |   303.00 ns |   305.91 ns |  1.00 |         - |          NA |
| ShiftLeft | Job-UPQSMH | \runtime-base\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 3079         | 3,972.14 ns | 27.019 ns | 25.274 ns | 3,966.06 ns | 3,941.09 ns | 4,010.73 ns | 1.000 |         - |          NA |

Diff

| Method    | Job        | Toolchain                                                                             | BufferLength | Mean        | Error     | StdDev    | Median      | Min         | Max         | Ratio | Allocated | Alloc Ratio |
|---------- |----------- |-------------------------------------------------------------------------------------- |------------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|----------:|------------:|
| ShiftLeft | Job-ERWYNR | \runtime-diff\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 128          |    31.29 ns |  0.174 ns |  0.163 ns |    31.25 ns |    31.00 ns |    31.57 ns |  0.10 |         - |          NA |
| ShiftLeft | Job-ERWYNR | \runtime-diff\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 3079         |    31.36 ns |  0.269 ns |  0.252 ns |    31.34 ns |    30.86 ns |    31.75 ns | 0.008 |         - |          NA |

dotnet-policy-service · 2024-09-27T16:59:12Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

src/coreclr/jit/hwintrinsicxarch.cpp

…vington/runtime into vector128-byte-left-shift

src/coreclr/jit/gentree.cpp

…t calculation

src/coreclr/jit/gentree.cpp

tannergooding · 2024-10-01T21:01:47Z

src/coreclr/jit/gentree.cpp

        {
            // We don't have actual instructions for shifting bytes, so we'll emulate them
            // by shifting 32-bit values and masking off the bits that should be zeroed.

            assert(varTypeIsByte(simdBaseType));

            intrinsic =
-                GenTreeHWIntrinsic::GetHWIntrinsicIdForBinOp(this, op, op1, op2ForLookup, TYP_INT, simdSize, false);
+                GenTreeHWIntrinsic::GetHWIntrinsicIdForBinOp(this, op, op1, op2ForLookup,
+                                                             op == GT_RSZ ? TYP_INT : TYP_SHORT, simdSize, false);


Does this need to be TYP_SHORT for left shift?

pslld is also available for SSE2 and should achieve the same thing, correct? We typically prefer int where it is possible since instructions operating on 32-bit elements tend to have better overall throughput and optimization opportunities.

You're also right, it doesn't need to be TYP_SHORT. I hadn't originally considered that it could be TYP_INT as well.

I updated the type here to be TYP_INT.

tannergooding

LGTM, just one question about whether we actually need to use TYP_SHORT.

Thanks for getting this up and completed!

tannergooding · 2024-10-02T17:08:55Z

CC. @dotnet/jit-contrib for secondary review.

Simple PR from our friends at AMD

EgorBo · 2024-10-14T16:18:46Z

/azp run Fuzzlyn

azure-pipelines · 2024-10-14T16:18:56Z

Azure Pipelines successfully started running 1 pipeline(s).

tannergooding · 2024-10-15T17:16:34Z

Fuzzlyn failures are unrelated.

tannergooding · 2024-10-15T18:19:41Z

PFX failure is also unrelated, was fixed by f059869

tannergooding · 2024-10-16T10:57:38Z

Thanks for the contribution @alexcovington

Alex Covington (Advanced Micro Devices added 3 commits September 9, 2024 15:02

Generate vectorized left shift for Vector128<Byte>

af1bbcb

Use lower ISA instead

3f6a4de

Expand for larger size vector widths

d65af60

ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 27, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 27, 2024

Merge branch 'main' into vector128-byte-left-shift

e9f7717

tannergooding reviewed Sep 27, 2024

View reviewed changes

src/coreclr/jit/hwintrinsicxarch.cpp Outdated Show resolved Hide resolved

Alex Covington (Advanced Micro Devices added 2 commits September 27, 2024 10:24

jit format

113d5bf

Merge branch 'vector128-byte-left-shift' of https://github.com/alexco…

039677f

…vington/runtime into vector128-byte-left-shift

build-analysis bot mentioned this pull request Sep 27, 2024

restarted. Azure DevOps can't recover from restarts. dotnet/dnceng#3879

Open

3 tasks

Alex Covington (Advanced Micro Devices added 3 commits September 27, 2024 13:33

Move transformation to gtNewSimdBinOp

f071fb2

Remove conditional, use assert similar to GT_RSZ case

9085483

Fix missing semicolon

a169289

alexcovington marked this pull request as ready for review September 30, 2024 15:56

Merge branch 'main' into vector128-byte-left-shift

da2783f

tannergooding reviewed Sep 30, 2024

View reviewed changes