Prerequisites
- I have written a descriptive issue title
- I have verified that I am running the latest version of ImageSharp
- I have verified if the problem exists in both DEBUG and RELEASE mode
- I have searched open and closed issues to ensure it has not already been reported
ImageSharp version
Current main branch
Other ImageSharp packages and versions
none
Environment (Operating system, version and so on)
all .NET supported
.NET Framework version
all
Description
While working on #1762 I recognized that methods that use SimdUtils.Shuffle.MmShuffle end up with less than ideal code-gen, because the JIT does not treat the computed shuffle control as a constant.
ImageSharp/src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs
Lines 236 to 238 in c661ab1
[MethodImpl(InliningOptions.ShortMethod)]
public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0)
    => (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);
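To make the connection between the helper and the immediates in the disassembly concrete, here is a small sketch that evaluates the same bit packing by hand. The class name `ShuffleMaskDemo` is made up for illustration, and the arguments `(1, 0, 3, 2)` are only inferred from the `4E` immediate in the listing; they are not taken from the ImageSharp source.

```csharp
using System;

public static class ShuffleMaskDemo
{
    // Same bit layout as the MmShuffle helper quoted above:
    // four 2-bit lane selectors packed into one byte, p3 in the top bits.
    public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0)
        => (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);

    public static void Main()
    {
        // (2 << 6) | (3 << 4) | (0 << 2) | 1 = 128 + 48 + 0 + 1 = 177 = 0xB1
        Console.WriteLine($"0x{MmShuffle(2, 3, 0, 1):X2}");

        // (1 << 6) | (0 << 4) | (3 << 2) | 2 = 64 + 0 + 12 + 2 = 78 = 0x4E
        // (assumed arguments, chosen because they reproduce the 4E immediate)
        Console.WriteLine($"0x{MmShuffle(1, 0, 3, 2):X2}");
    }
}
```

These are the two values that appear as `0B1` and `4E` in the disassembly, so the helper itself computes the right mask; the problem is only how the result reaches the intrinsic.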
E.g. Vp8Encoding.FTransformPass1SSE2, after inlining the vector constants, looks like this:
push rdi
push rsi
push rbx
sub rsp,60
vzeroupper
xor eax,eax
mov [rsp+50],rax
mov [rsp+58],rax
mov rsi,rdx
mov rdi,r8
mov rbx,r9
vmovupd xmm0,[rcx]
lea rcx,[rsp+40]
vmovapd [rsp+30],xmm0
lea rdx,[rsp+30]
mov r8d,0B1
call System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
vmovupd xmm0,[rsi]
lea rcx,[rsp+50]
vmovapd [rsp+30],xmm0
lea rdx,[rsp+30]
mov r8d,0B1
call System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
vmovapd xmm0,[rsp+40]
vpunpcklqdq xmm0,xmm0,[rsp+50]
vmovapd xmm1,[rsp+40]
vpunpckhqdq xmm1,xmm1,[rsp+50]
vpaddw xmm2,xmm0,xmm1
vpmaddwd xmm3,xmm2,[7FF7D3640A20]
vpmaddwd xmm2,xmm2,[7FF7D3640A30]
vpsubw xmm0,xmm0,xmm1
vpmaddwd xmm1,xmm0,[7FF7D3640A40]
vpaddd xmm1,xmm1,[7FF7D3640A50]
vpsrad xmm1,xmm1,9
vpmaddwd xmm0,xmm0,[7FF7D3640A60]
vpaddd xmm0,xmm0,[7FF7D3640A70]
vpsrad xmm0,xmm0,9
vpackssdw xmm0,xmm1,xmm0
vpackssdw xmm1,xmm3,xmm2
vpunpcklwd xmm2,xmm1,xmm0
vpunpckhwd xmm0,xmm1,xmm0
vpunpckhdq xmm1,xmm2,xmm0
vpunpckldq xmm0,xmm2,xmm0
vmovupd [rdi],xmm0
mov rcx,rbx
vmovapd [rsp+20],xmm1
lea rdx,[rsp+20]
mov r8d,4E
call System.Runtime.Intrinsics.X86.Sse2.Shuffle(System.Runtime.Intrinsics.Vector128`1<Int32>, Byte)
nop
add rsp,60
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 245
If, instead of SimdUtils.Shuffle.MmShuffle, the constant is given as a literal, then the code boils down to:
vzeroupper
vpshufhw xmm0,[rdx],0B1
vpshufhw xmm1,[rcx],0B1
vpunpcklqdq xmm2,xmm1,xmm0
vpunpckhqdq xmm0,xmm1,xmm0
vpaddw xmm1,xmm2,xmm0
vpmaddwd xmm3,xmm1,[7FF7D365DAC0]
vpmaddwd xmm1,xmm1,[7FF7D365DAD0]
vpsubw xmm0,xmm2,xmm0
vpmaddwd xmm2,xmm0,[7FF7D365DAE0]
vpaddd xmm2,xmm2,[7FF7D365DAF0]
vpsrad xmm2,xmm2,9
vpmaddwd xmm0,xmm0,[7FF7D365DB00]
vpaddd xmm0,xmm0,[7FF7D365DB10]
vpsrad xmm0,xmm0,9
vpackssdw xmm0,xmm2,xmm0
vpackssdw xmm1,xmm3,xmm1
vpunpcklwd xmm2,xmm1,xmm0
vpunpckhwd xmm0,xmm1,xmm0
vpunpckhdq xmm1,xmm2,xmm0
vpunpckldq xmm0,xmm2,xmm0
vmovupd [r8],xmm0
vpshufd xmm0,xmm1,4E
vmovupd [r9],xmm0
ret
; Total bytes of code 127
This is de facto a JIT limitation, recorded in dotnet/runtime#9989 and dotnet/runtime#38003, and also noticed in #1517 (comment).
As this is quite a difference in code-gen, and hence should show up in perf too, I propose typing the shuffle literals explicitly. E.g.
-Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, SimdUtils.Shuffle.MmShuffle(2, 3, 0, 1)); // or any similar pattern
+Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, 0xB1); // MmShuffle(2, 3, 0, 1)
If OK, I'd like to tackle this in one shot with #1762 (as I'm touching these pieces anyway).
Edit: I was too eager, the PR is out...
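One possible variation on the literal approach, sketched here purely as an assumption and not as what the PR does, is to keep the mask readable through `const` fields. C# evaluates constant expressions at compile time, so the call site receives the same immediate as a hand-written literal. The names `ShuffleConstants`, `Shuffle2301`, `Shuffle1032` and `SwapHighPairs` are hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ShuffleConstants
{
    // Hypothetical named masks. The C# compiler folds these expressions,
    // so the IL at each use site loads a plain constant.
    public const byte Shuffle2301 = (2 << 6) | (3 << 4) | (0 << 2) | 1; // 0xB1
    public const byte Shuffle1032 = (1 << 6) | (0 << 4) | (3 << 2) | 2; // 0x4E

    // Same call shape as the proposed change, with a named constant
    // in place of the 0xB1 literal.
    public static Vector128<short> SwapHighPairs(Vector128<short> row)
        => Sse2.ShuffleHigh(row, Shuffle2301);

    public static void Main()
    {
        if (!Sse2.IsSupported)
        {
            Console.WriteLine("SSE2 is not available on this machine.");
            return;
        }

        Vector128<short> row = Vector128.Create((short)0, 1, 2, 3, 4, 5, 6, 7);

        // The lower four lanes are left as they are; the upper four are
        // permuted according to the mask.
        Console.WriteLine(SwapHighPairs(row));
    }
}
```

Whether this yields exactly the same code-gen as the literal would still need to be confirmed by looking at the disassembly, as described under Steps to Reproduce.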
Steps to Reproduce
Look at the disassembly of any method that uses SimdUtils.Shuffle.MmShuffle.
Images
No response