Closed
Labels
area-System.Runtime.Intrinsics, good first issue, help wanted, tenet-performance
Description
From #44111 (comment)
There are 3 patterns we currently use across the BCL for const vectors:
//
// Case 1: Plain Vector.Create
public Vector128<byte> Case1(Vector128<byte> vec)
{
Vector128<byte> mask = Vector128.Create(
0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
return Ssse3.Shuffle(vec, mask);
}
//
// Case 1.1: Plain Vector.Create as argument of some SIMD instruction directly
// Should be the same codegen as for Case1 ^ (spoiler: it's not. Forward Substitution? see #4655)
public Vector128<byte> Case1_1(Vector128<byte> vec)
{
return Ssse3.Shuffle(vec,
Vector128.Create( // used without "mask" local as in Case1 ^
0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF));
}
//
// Case 2: static readonly Vector
private static readonly Vector128<byte> s_mask = Vector128.Create(
0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
public Vector128<byte> Case2(Vector128<byte> vec)
{
// we also often save it to a local first (e.g. before loops)
return Ssse3.Shuffle(vec, s_mask);
}
//
// Case 3: Roslyn's hack
private static ReadOnlySpan<byte> Mask => new byte[] {
0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF };
public Vector128<byte> Case3(Vector128<byte> vec)
{
return Ssse3.Shuffle(vec, Unsafe.ReadUnaligned<Vector128<byte>>(
ref MemoryMarshal.GetReference(Mask)));
}
Here is the current codegen for these cases:
; Method Case1
G_M46269_IG01:
vzeroupper
G_M46269_IG02:
vmovupd xmm0, xmmword ptr [reloc @RWD00] ; loaded from the data section, OK
vmovupd xmm1, xmmword ptr [r8]
vpshufb xmm0, xmm1, xmm0
vmovupd xmmword ptr [rdx], xmm0
mov rax, rdx
G_M46269_IG03:
ret
RWD00 dq FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh ; <-- it's here
; Total bytes of code: 29
; Method Case1_1
G_M7091_IG01:
vzeroupper
G_M7091_IG02:
vmovupd xmm0, xmmword ptr [r8]
vpshufb xmm0, xmm0, xmmword ptr [reloc @RWD00] ; loaded as part of vpshufb without
; additional registers from the data section - PERFECT!!
vmovupd xmmword ptr [rdx], xmm0
mov rax, rdx
G_M7091_IG03:
ret
RWD00 dq FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh ; <-- it's here
; Total bytes of code: 25
; Method Case2
G_M31870_IG01:
push rsi
sub rsp, 48
vzeroupper
mov rsi, rdx
G_M31870_IG02:
vmovupd xmm0, xmmword ptr [r8]
vmovupd xmmword ptr [rsp+20H], xmm0
mov rcx, 0xD1FFAB1E
mov edx, 2
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE ; static initialization (in some cases can be eliminated by jit but still...)
mov rax, 0xD1FFAB1E ; <- additional mov
mov rax, gword ptr [rax]
vmovupd xmm0, xmmword ptr [rsp+20H]
vpshufb xmm0, xmm0, xmmword ptr [rax+8]
vmovupd xmmword ptr [rsi], xmm0
mov rax, rsi
G_M31870_IG03:
add rsp, 48
pop rsi
ret
; Total bytes of code: 80
; Method Case3
G_M23615_IG01:
vzeroupper
G_M23615_IG02:
mov rax, 0xD1FFAB1E ; <- additional mov
; kind of makes sense if the same mask is used from different methods
; but the C# code looks a bit ugly
vmovupd xmm0, xmmword ptr [rax]
vmovupd xmm1, xmmword ptr [r8]
vpshufb xmm0, xmm1, xmm0
vmovupd xmmword ptr [rdx], xmm0
mov rax, rdx
G_M23615_IG03:
ret
; Total bytes of code: 35
The first case used to be avoided because of codegen issues, but those appear to be resolved: the JIT now stores such vectors in the data section, performs value numbering for SIMD values (including constant vectors), applies CSE, and so on (#31834?). As a result we have accumulated many static readonly fields, which we can now revisit and convert to the Case 1 / Case 1.1 style where possible, perhaps even when that means duplicating the constant in several methods.
Places to revise:
- Base64Encoder.cs
- Base64Decoder.cs
- Sse2Helper.cs
- Ssse3Helper.cs
- BitArray.cs
- Maybe there are more
- Scan other repos: aspnetcore, ML.NET, etc
Known limitations for Case1:
- See Case1.1 comment
- The JIT doesn't hoist Vector.Create out of loop bodies yet (would be nice to have)
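Until the JIT hoists such constants itself, the loop limitation can be worked around by creating the vector once before the loop. The following is only a minimal sketch under that assumption: `HoistedMaskSketch` and `ShuffleBlocks` are hypothetical names, not BCL code, and the mask is the one used in the cases above.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HoistedMaskSketch
{
    // Hypothetical helper, not taken from the BCL: shuffles every full
    // 16-byte block of `source` into `destination` using the issue's mask.
    // Trailing bytes that do not fill a block are left untouched.
    public static void ShuffleBlocks(ReadOnlySpan<byte> source, Span<byte> destination)
    {
        if (!Ssse3.IsSupported)
            throw new PlatformNotSupportedException("This sketch requires SSSE3.");
        if (destination.Length < source.Length)
            throw new ArgumentException("Destination is too short.", nameof(destination));

        // Case1-style constant, created once before the loop because the JIT
        // is described above as not hoisting Vector.Create out of loop bodies.
        Vector128<byte> mask = Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);

        ref byte src = ref MemoryMarshal.GetReference(source);
        ref byte dst = ref MemoryMarshal.GetReference(destination);

        for (int i = 0; i + 16 <= source.Length; i += 16)
        {
            // Unaligned load, shuffle with the hoisted mask, unaligned store.
            Vector128<byte> block =
                Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.Add(ref src, i));
            Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, i), Ssse3.Shuffle(block, mask));
        }
    }
}
```

With this mask, bytes 2, 6, 10 and 14 of each output block receive the first four source bytes of that block, and every other byte is zeroed, because `pshufb` clears lanes whose mask byte has its high bit set.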
/cc @stephentoub @GrabYourPitchforks @benaadams @tannergooding