Skip to content

Vectorize codegen for left shift operator on byte vector #108336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Oct 16, 2024

Conversation

alexcovington
Copy link
Contributor

The left shift operator for Vector128<byte>/Vector256<byte>/Vector512<byte> does not vectorize. This PR vectorizes the left shift operator for these types on x64.

For example, this C# code will now vectorize:

public Vector128<Byte> LeftShiftByte(Vector128<Byte> v, int shiftAmount)
{
    return v << shiftAmount;
}
Codegen

Before

; Vector128LeftShiftRepro.VectorBenchmarks.LeftShiftByte(System.Runtime.Intrinsics.Vector128`1<Byte>, Int32)
       sub       rsp,38
       mov       rax,[r8]
       mov       [rsp+28],rax
       movzx     eax,byte ptr [rsp+28]
       and       r9d,7
       shlx      eax,eax,r9d
       mov       [rsp+30],al
       movzx     eax,byte ptr [rsp+29]
       shlx      eax,eax,r9d
       mov       [rsp+31],al
       movzx     eax,byte ptr [rsp+2A]
       shlx      eax,eax,r9d
       mov       [rsp+32],al
       movzx     eax,byte ptr [rsp+2B]
       shlx      eax,eax,r9d
       mov       [rsp+33],al
       movzx     eax,byte ptr [rsp+2C]
       shlx      eax,eax,r9d
       mov       [rsp+34],al
       movzx     eax,byte ptr [rsp+2D]
       shlx      eax,eax,r9d
       mov       [rsp+35],al
       movzx     eax,byte ptr [rsp+2E]
       shlx      eax,eax,r9d
       mov       [rsp+36],al
       movzx     eax,byte ptr [rsp+2F]
       shlx      eax,eax,r9d
       mov       [rsp+37],al
       mov       rax,[rsp+30]
       mov       rcx,[r8+8]
       mov       [rsp+18],rcx
       movzx     ecx,byte ptr [rsp+18]
       shlx      ecx,ecx,r9d
       mov       [rsp+20],cl
       movzx     ecx,byte ptr [rsp+19]
       shlx      ecx,ecx,r9d
       mov       [rsp+21],cl
       movzx     ecx,byte ptr [rsp+1A]
       shlx      ecx,ecx,r9d
       mov       [rsp+22],cl
       movzx     ecx,byte ptr [rsp+1B]
       shlx      ecx,ecx,r9d
       mov       [rsp+23],cl
       movzx     ecx,byte ptr [rsp+1C]
       shlx      ecx,ecx,r9d
       mov       [rsp+24],cl
       movzx     ecx,byte ptr [rsp+1D]
       shlx      ecx,ecx,r9d
       mov       [rsp+25],cl
       movzx     ecx,byte ptr [rsp+1E]
       shlx      ecx,ecx,r9d
       mov       [rsp+26],cl
       movzx     ecx,byte ptr [rsp+1F]
       shlx      ecx,ecx,r9d
       mov       [rsp+27],cl
       mov       rcx,[rsp+20]
       mov       [rsp],rax
       mov       [rsp+8],rcx
       vmovaps   xmm0,[rsp]
       vmovups   [rdx],xmm0
       mov       rax,rdx
       add       rsp,38
       ret
; Total bytes of code 285

After

; Vector128LeftShiftRepro.VectorBenchmarks.LeftShiftByte(System.Runtime.Intrinsics.Vector128`1<Byte>, Int32)
       vmovups   xmm0,[r8]
       and       r9d,7
       vmovd     xmm1,r9d
       vpsllw    xmm0,xmm0,xmm1
       mov       eax,0FF
       shlx      eax,eax,r9d
       vpbroadcastb xmm1,eax
       vpand     xmm0,xmm1,xmm0
       vmovups   [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 46

I see performance improvements in some of the System.Numerics.Tensors microbenchmarks on my x64 machine:

Baseline

| Method    | Job        | Toolchain                                                                             | BufferLength | Mean        | Error     | StdDev    | Median      | Min         | Max         | Ratio | Allocated | Alloc Ratio |
|---------- |----------- |-------------------------------------------------------------------------------------- |------------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|----------:|------------:|
| ShiftLeft | Job-UPQSMH | \runtime-base\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 128          |   304.26 ns |  0.937 ns |  0.876 ns |   304.17 ns |   303.00 ns |   305.91 ns |  1.00 |         - |          NA |
| ShiftLeft | Job-UPQSMH | \runtime-base\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 3079         | 3,972.14 ns | 27.019 ns | 25.274 ns | 3,966.06 ns | 3,941.09 ns | 4,010.73 ns | 1.000 |         - |          NA |

Diff

| Method    | Job        | Toolchain                                                                             | BufferLength | Mean        | Error     | StdDev    | Median      | Min         | Max         | Ratio | Allocated | Alloc Ratio |
|---------- |----------- |-------------------------------------------------------------------------------------- |------------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|----------:|------------:|
| ShiftLeft | Job-ERWYNR | \runtime-diff\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 128          |    31.29 ns |  0.174 ns |  0.163 ns |    31.25 ns |    31.00 ns |    31.57 ns |  0.10 |         - |          NA |
| ShiftLeft | Job-ERWYNR | \runtime-diff\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 3079         |    31.36 ns |  0.269 ns |  0.252 ns |    31.34 ns |    30.86 ns |    31.75 ns | 0.008 |         - |          NA |

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 27, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 27, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Alex Covington (Advanced Micro Devices added 2 commits September 27, 2024 10:24
@alexcovington alexcovington marked this pull request as ready for review September 30, 2024 15:56
{
// We don't have actual instructions for shifting bytes, so we'll emulate them
// by shifting 32-bit values and masking off the bits that should be zeroed.

assert(varTypeIsByte(simdBaseType));

intrinsic =
GenTreeHWIntrinsic::GetHWIntrinsicIdForBinOp(this, op, op1, op2ForLookup, TYP_INT, simdSize, false);
GenTreeHWIntrinsic::GetHWIntrinsicIdForBinOp(this, op, op1, op2ForLookup,
op == GT_RSZ ? TYP_INT : TYP_SHORT, simdSize, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be TYP_SHORT for left shift?

pslld is also available for SSE2 and should achieve the same thing, correct? We typically prefer int where it is possible since instructions operating on 32-bit elements tend to have better overall throughput and optimization opportunities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're also right, it doesn't need to be TYP_SHORT. I hadn't originally considered that it could be TYP_INT as well.

I updated the type here to be TYP_INT.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one question about whether we actually need to use TYP_SHORT.

Thanks for getting this up and completed!

@tannergooding
Copy link
Member

CC. @dotnet/jit-contrib for secondary review.

Simple PR from our friends at AMD

@EgorBo
Copy link
Member

EgorBo commented Oct 14, 2024

/azp run Fuzzlyn

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding
Copy link
Member

Fuzzlyn failures are unrelated.

@tannergooding
Copy link
Member

PFX failure is also unrelated, was fixed by f059869

@tannergooding tannergooding merged commit 790a607 into dotnet:main Oct 16, 2024
115 checks passed
@tannergooding
Copy link
Member

Thanks for the contribution @alexcovington

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants