Vector{128,256} operations that use MmShuffle fall back to method call #2121

@gfoidl

Prerequisites

  • I have written a descriptive issue title
  • I have verified that I am running the latest version of ImageSharp
  • I have verified that the problem exists in both DEBUG and RELEASE mode
  • I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

Current main branch

Other ImageSharp packages and versions

none

Environment (Operating system, version and so on)

all .NET supported

.NET Framework version

all

Description

While working on #1762 I noticed that methods that use

[MethodImpl(InliningOptions.ShortMethod)]
public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0)
    => (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);

don't emit platform intrinsics; instead they fall back to a method call, because the JIT doesn't see the value as a constant.
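For reference, here is a minimal sketch of the arithmetic behind the control byte. The method body is copied from the snippet above; the `ShuffleDemo` class and `Main` are only scaffolding for illustration. The first call is the one quoted later in this issue. The second is an assumed argument order that happens to produce the `4E` immediate seen in the disassembly.

```csharp
using System;

public static class ShuffleDemo
{
    // Packs four 2-bit lane selectors into one control byte,
    // with p3 in the two highest bits and p0 in the two lowest.
    public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0)
        => (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);

    public static void Main()
    {
        // (2 << 6) | (3 << 4) | (0 << 2) | 1 = 128 + 48 + 0 + 1 = 177 = 0xB1
        Console.WriteLine($"0x{MmShuffle(2, 3, 0, 1):X2}");

        // (1 << 6) | (0 << 4) | (3 << 2) | 2 = 64 + 0 + 12 + 2 = 78 = 0x4E
        Console.WriteLine($"0x{MmShuffle(1, 0, 3, 2):X2}");
    }
}
```

Both results fit in a single byte, which is what the shuffle instructions expect as their immediate operand.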

E.g., after inlining the vector constants, the code generated for Vp8Encoding.FTransformPass1SSE2 looks like this:

       push      rdi
       push      rsi
       push      rbx
       sub       rsp,60
       vzeroupper
       xor       eax,eax
       mov       [rsp+50],rax
       mov       [rsp+58],rax
       mov       rsi,rdx
       mov       rdi,r8
       mov       rbx,r9
       vmovupd   xmm0,[rcx]
       lea       rcx,[rsp+40]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovupd   xmm0,[rsi]
       lea       rcx,[rsp+50]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovapd   xmm0,[rsp+40]
       vpunpcklqdq xmm0,xmm0,[rsp+50]
       vmovapd   xmm1,[rsp+40]
       vpunpckhqdq xmm1,xmm1,[rsp+50]
       vpaddw    xmm2,xmm0,xmm1
       vpmaddwd  xmm3,xmm2,[7FF7D3640A20]
       vpmaddwd  xmm2,xmm2,[7FF7D3640A30]
       vpsubw    xmm0,xmm0,xmm1
       vpmaddwd  xmm1,xmm0,[7FF7D3640A40]
       vpaddd    xmm1,xmm1,[7FF7D3640A50]
       vpsrad    xmm1,xmm1,9
       vpmaddwd  xmm0,xmm0,[7FF7D3640A60]
       vpaddd    xmm0,xmm0,[7FF7D3640A70]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm1,xmm0
       vpackssdw xmm1,xmm3,xmm2
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [rdi],xmm0
       mov       rcx,rbx
       vmovapd   [rsp+20],xmm1
       lea       rdx,[rsp+20]
       mov       r8d,4E
       call      System.Runtime.Intrinsics.X86.Sse2.Shuffle(System.Runtime.Intrinsics.Vector128`1<Int32>, Byte)
       nop
       add       rsp,60
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 245

If the constant is given as a literal instead of via SimdUtils.Shuffle.MmShuffle, the code boils down to:

       vzeroupper
       vpshufhw  xmm0,[rdx],0B1
       vpshufhw  xmm1,[rcx],0B1
       vpunpcklqdq xmm2,xmm1,xmm0
       vpunpckhqdq xmm0,xmm1,xmm0
       vpaddw    xmm1,xmm2,xmm0
       vpmaddwd  xmm3,xmm1,[7FF7D365DAC0]
       vpmaddwd  xmm1,xmm1,[7FF7D365DAD0]
       vpsubw    xmm0,xmm2,xmm0
       vpmaddwd  xmm2,xmm0,[7FF7D365DAE0]
       vpaddd    xmm2,xmm2,[7FF7D365DAF0]
       vpsrad    xmm2,xmm2,9
       vpmaddwd  xmm0,xmm0,[7FF7D365DB00]
       vpaddd    xmm0,xmm0,[7FF7D365DB10]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm2,xmm0
       vpackssdw xmm1,xmm3,xmm1
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [r8],xmm0
       vpshufd   xmm0,xmm1,4E
       vmovupd   [r9],xmm0
       ret
; Total bytes of code 127

This is de facto a JIT limitation, recorded in dotnet/runtime#9989 and dotnet/runtime#38003, and also noticed in #1517 (comment).

This is quite a difference in codegen (245 vs. 127 bytes of code, with three method calls in the first listing), so it should show up in perf too. I propose writing the shuffle literals explicitly, e.g.:

-Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, SimdUtils.Shuffle.MmShuffle(2, 3, 0, 1));   // or any similar pattern
+Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, 0xB1);  // MmShuffle(2, 3, 0, 1)
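A possible variation, sketched here only as an idea: keep the selectors readable by declaring the control bytes as `const` fields. The C# compiler folds a constant expression at compile time, so the call site still passes a literal. The class and field names below are hypothetical and not part of ImageSharp.

```csharp
using System;

public static class ShuffleConstants
{
    // Hypothetical names, for illustration only.
    // Each expression is a C# constant expression, so the compiler
    // embeds the resulting byte directly wherever the field is used.
    public const byte Shuffle2301 = (2 << 6) | (3 << 4) | (0 << 2) | 1; // 0xB1
    public const byte Shuffle1032 = (1 << 6) | (0 << 4) | (3 << 2) | 2; // 0x4E

    public static void Main()
    {
        Console.WriteLine($"0x{Shuffle2301:X2}");
        Console.WriteLine($"0x{Shuffle1032:X2}");
    }
}
```

A call site would then read `Sse2.ShuffleHigh(row01, ShuffleConstants.Shuffle2301)`. Whether that is preferable to a plain literal with a comment is a matter of taste.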

If OK I'd like to tackle this in one shot with #1762 (as I'm touching these pieces anyway).
Edit: I was too eager, the PR is out...

Steps to Reproduce

Look at the disassembly of any method that uses SimdUtils.Shuffle.MmShuffle.

Images

No response
