Revise how constant SIMD vectors are defined in BCL #44115

@EgorBo

From #44111 (comment)

There are 3 patterns we currently use across the BCL for const vectors:

    //
    // Case 1: Plain Vector.Create
    public Vector128<byte> Case1(Vector128<byte> vec)
    {
        Vector128<byte> mask = Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);

        return Ssse3.Shuffle(vec, mask);
    }


    //
    // Case 1.1: Plain Vector.Create as argument of some SIMD instruction directly
    // Should be the same codegen as for Case1 ^ (spoiler: it's not. Forward Substitution? see #4655)
    public Vector128<byte> Case1_1(Vector128<byte> vec)
    {
        return Ssse3.Shuffle(vec,
                    Vector128.Create( // used without "mask" local as in Case1 ^
                        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
                        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF));
    }


    //
    // Case 2: static readonly Vector
    private static readonly Vector128<byte> s_mask = Vector128.Create(
        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
    public Vector128<byte> Case2(Vector128<byte> vec)
    {
        // we also often save it to a local first (e.g. before loops)
        return Ssse3.Shuffle(vec, s_mask);
    }


    //
    // Case 3: Roslyn's hack
    private static ReadOnlySpan<byte> Mask => new byte[] {
        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF };
    public Vector128<byte> Case3(Vector128<byte> vec)
    {
        return Ssse3.Shuffle(vec, Unsafe.ReadUnaligned<Vector128<byte>>(
            ref MemoryMarshal.GetReference(Mask)));
    }

Here is the current codegen for these cases:

; Method Case1
G_M46269_IG01:
       vzeroupper 
G_M46269_IG02:
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]        ; loaded from the data section, OK
       vmovupd  xmm1, xmmword ptr [r8]
       vpshufb  xmm0, xmm1, xmm0
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M46269_IG03:
       ret      
RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here
; Total bytes of code: 29



; Method Case1_1
G_M7091_IG01:
       vzeroupper 
G_M7091_IG02:
       vmovupd  xmm0, xmmword ptr [r8]
       vpshufb  xmm0, xmm0, xmmword ptr [reloc @RWD00]  ; loaded from the data section as part of vpshufb,
                                                        ; no additional register needed - PERFECT!!
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M7091_IG03:
       ret      
RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here
; Total bytes of code: 25



; Method Case2
G_M31870_IG01:
       push     rsi
       sub      rsp, 48
       vzeroupper 
       mov      rsi, rdx
G_M31870_IG02:
       vmovupd  xmm0, xmmword ptr [r8]
       vmovupd  xmmword ptr [rsp+20H], xmm0
       mov      rcx, 0xD1FFAB1E
       mov      edx, 2
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE  ; static initialization (in some cases can be eliminated by jit but still...)
       mov      rax, 0xD1FFAB1E                          ; <- additional mov
       mov      rax, gword ptr [rax]
       vmovupd  xmm0, xmmword ptr [rsp+20H]
       vpshufb  xmm0, xmm0, xmmword ptr [rax+8]
       vmovupd  xmmword ptr [rsi], xmm0
       mov      rax, rsi
G_M31870_IG03:
       add      rsp, 48
       pop      rsi
       ret      
; Total bytes of code: 80



; Method Case3
G_M23615_IG01:
       vzeroupper 
G_M23615_IG02:
       mov      rax, 0xD1FFAB1E                          ; <- additional mov
                                                         ; kind of makes sense if the same mask is used from different methods
                                                         ; but the C# code looks a bit ugly
       vmovupd  xmm0, xmmword ptr [rax]
       vmovupd  xmm1, xmmword ptr [r8]
       vpshufb  xmm0, xmm1, xmm0
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M23615_IG03:
       ret      
; Total bytes of code: 35

Case 1 used to be avoided because of codegen issues, but those appear to be resolved: the JIT now stores such vectors in the data section, does value numbering for SIMD values (including constant vectors), does CSE, etc. (#31834?). As a result we still have a lot of static readonly fields that we can revise and convert to the Case 1 / Case 1.1 style where possible, maybe even where that means duplicating the constant in several methods.

Places to revise:

  1. Base64Encoder.cs
  2. Base64Decoder.cs
  3. Sse2Helper.cs
  4. Ssse3Helper.cs
  5. BitArray.cs
  6. Maybe there are more
  7. Scan other repos: aspnetcore, ML.NET, etc

Known limitations for Case1:

  • See Case1.1 comment
  • JIT doesn't hoist Vector.Create out of loop bodies yet (would be nice to have)
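
Until the JIT hoists the constant itself, the loop limitation can be worked around by hand: create the vector in a local before the loop, the same way the Case 2 comment suggests for static fields. Below is a minimal sketch of that pattern, assuming SSSE3 hardware. `HoistedMaskSketch` and `ShuffleBlocks` are hypothetical names used for illustration only, not existing BCL code, and the codegen for this shape has not been measured here.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HoistedMaskSketch
{
    // Hypothetical helper: applies the shuffle mask from Case 1 to every
    // 16-byte block in place.
    public static void ShuffleBlocks(Span<Vector128<byte>> blocks)
    {
        if (!Ssse3.IsSupported)
            throw new PlatformNotSupportedException("SSSE3 is required.");

        // Created once, outside the loop, because the JIT does not hoist
        // Vector128.Create out of the loop body on its own.
        Vector128<byte> mask = Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);

        for (int i = 0; i < blocks.Length; i++)
        {
            blocks[i] = Ssse3.Shuffle(blocks[i], mask);
        }
    }
}
```

This keeps the constant in Case 1 form (no static field, no static-initialization check) while still paying for the load only once per call rather than once per iteration.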

/cc @stephentoub @GrabYourPitchforks @benaadams @tannergooding
