tannergooding
Member

Without this flag, constants aren't propagated into Shuffle when they have two or more uses, which results in significantly worse codegen.

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 11, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Member

@EgorBo EgorBo left a comment

LGTM, assuming no other intrinsics need this flag too.

@tannergooding
Member Author

tannergooding commented Jan 11, 2025

The example diff is as follows, which takes us from 380 to 59 bytes of codegen:

public static Vector4 Cross(Vector4 vector1, Vector4 vector2)
{
    Vector128<float> v1 = vector1.AsVector128();
    Vector128<float> v2 = vector2.AsVector128();

    Vector128<int> shuffleYZXW = Vector128.Create(1, 2, 0, 3);
    Vector128<int> shuffleZXYW = Vector128.Create(2, 0, 1, 3);

    Vector128<float> m1 = Vector128.Shuffle(v1, shuffleYZXW) *
        Vector128.Shuffle(v2, shuffleZXYW);
    Vector128<float> m2 = Vector128.Shuffle(v1, shuffleZXYW) *
        Vector128.Shuffle(v2, shuffleYZXW);
    m2 = m2.WithElement(3, 0);

    return (m1 - m2).AsVector4();
}
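
For readers following the lane arithmetic, here is a scalar sketch of what the two shuffle/multiply pairs above compute. The `CrossReference` class and `CrossScalar` method are hypothetical names used only for illustration and are not part of this PR. Note that with the code exactly as written, the W lane ends up as the product of the two W inputs, because only `m2` has its W element zeroed.

```csharp
using System;
using System.Numerics;

static class CrossReference
{
    // Scalar equivalent of the vectorized Cross above (illustrative only):
    //   m1 = v1.YZXW * v2.ZXYW
    //   m2 = v1.ZXYW * v2.YZXW, with the W lane of m2 forced to 0
    //   result = m1 - m2
    public static Vector4 CrossScalar(Vector4 a, Vector4 b) => new Vector4(
        a.Y * b.Z - a.Z * b.Y,   // X lane
        a.Z * b.X - a.X * b.Z,   // Y lane
        a.X * b.Y - a.Y * b.X,   // Z lane
        a.W * b.W);              // W lane: m1.W - 0
}
```

For example, `CrossScalar(new Vector4(1, 0, 0, 2), new Vector4(0, 1, 0, 3))` yields `<0, 0, 1, 6>`: the usual X cross Y = Z in the first three lanes, and 2 * 3 in W.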

Before

; Method Program:Cross(System.Numerics.Vector4,System.Numerics.Vector4):System.Numerics.Vector4 (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 320
       vmovaps  xmmword ptr [rsp+0x130], xmm6
       vmovaps  xmmword ptr [rsp+0x120], xmm7
       vmovaps  xmmword ptr [rsp+0x110], xmm8
       vmovaps  xmmword ptr [rsp+0x100], xmm9
       mov      rbx, rcx

G_M000_IG02:                ;; offset=0x002F
       vmovups  xmm6, xmmword ptr [r8]
       vmovups  xmm7, xmmword ptr [reloc @RWD00]
       vmovups  xmm8, xmmword ptr [reloc @RWD16]
       vmovups  xmm9, xmmword ptr [rdx]
       vmovaps  xmmword ptr [rsp+0xC0], xmm9
       vmovaps  xmmword ptr [rsp+0xB0], xmm7
       lea      rdx, [rsp+0xC0]
       lea      r8, [rsp+0xB0]
       lea      rcx, [rsp+0xD0]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovaps  xmm0, xmmword ptr [rsp+0xD0]
       vmovups  xmmword ptr [rsp+0xF0], xmm0
       vmovaps  xmmword ptr [rsp+0x90], xmm6
       vmovaps  xmmword ptr [rsp+0x80], xmm8
       lea      rdx, [rsp+0x90]
       lea      r8, [rsp+0x80]
       lea      rcx, [rsp+0xA0]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovups  xmm0, xmmword ptr [rsp+0xF0]
       vmulps   xmm0, xmm0, xmmword ptr [rsp+0xA0]
       vmovups  xmmword ptr [rsp+0xF0], xmm0
       vmovaps  xmmword ptr [rsp+0x60], xmm9
       vmovaps  xmmword ptr [rsp+0x50], xmm8
       lea      rdx, [rsp+0x60]
       lea      r8, [rsp+0x50]
       lea      rcx, [rsp+0x70]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovaps  xmm0, xmmword ptr [rsp+0x70]
       vmovups  xmmword ptr [rsp+0xE0], xmm0
       vmovaps  xmmword ptr [rsp+0x30], xmm6
       vmovaps  xmmword ptr [rsp+0x20], xmm7
       lea      rdx, [rsp+0x30]
       lea      r8, [rsp+0x20]
       lea      rcx, [rsp+0x40]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovups  xmm0, xmmword ptr [rsp+0xE0]
       vmulps   xmm0, xmm0, xmmword ptr [rsp+0x40]
       vinsertps xmm0, xmm0, xmm0, 56
       vmovups  xmm1, xmmword ptr [rsp+0xF0]
       vsubps   xmm0, xmm1, xmm0
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx

G_M000_IG03:                ;; offset=0x014F
       vmovaps  xmm6, xmmword ptr [rsp+0x130]
       vmovaps  xmm7, xmmword ptr [rsp+0x120]
       vmovaps  xmm8, xmmword ptr [rsp+0x110]
       vmovaps  xmm9, xmmword ptr [rsp+0x100]
       add      rsp, 320
       pop      rbx
       ret      
RWD00  	dq	0000000200000001h, 0000000300000000h
RWD16  	dq	0000000000000002h, 0000000300000001h
; Total bytes of code: 380

After

; Method Program:Cross(System.Numerics.Vector4,System.Numerics.Vector4):System.Numerics.Vector4 (FullOpts)
G_M35725_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M35725_IG02:  ;; offset=0x0000
       vmovups  xmm0, xmmword ptr [r8]
       vmovups  xmm1, xmmword ptr [rdx]
       vpermilps xmm2, xmm1, -55
       vpermilps xmm3, xmm0, -46
       vmulps   xmm2, xmm3, xmm2
       vpermilps xmm1, xmm1, -46
       vpermilps xmm0, xmm0, -55
       vmulps   xmm0, xmm0, xmm1
       vinsertps xmm0, xmm0, xmm0, 56
       vsubps   xmm0, xmm2, xmm0
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=58 bbWeight=1 PerfScore 24.25

G_M35725_IG03:  ;; offset=0x003A
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 59
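
The `-55` and `-46` immediates in the "After" listing appear to be the two constant index vectors folded into the 8-bit control of `vpermilps`, printed as signed bytes. A minimal sketch of that packing follows, assuming the standard encoding where result lane i takes its 2-bit selector from bits 2i and 2i+1 of the immediate. `ShuffleImmediate` and `Encode` are hypothetical helper names, not JIT code.

```csharp
using System;

static class ShuffleImmediate
{
    // Packs four lane indices (each 0..3) into an 8-bit shuffle control.
    // Lane 0's selector occupies bits 1:0, lane 1's bits 3:2, and so on.
    public static byte Encode(int e0, int e1, int e2, int e3)
    {
        // Any bit outside the low two (including the sign bit) means an
        // index is out of range for a 4-lane shuffle.
        if (((e0 | e1 | e2 | e3) & ~3) != 0)
            throw new ArgumentException("Each lane index must be in the range 0..3.");

        return (byte)(e0 | (e1 << 2) | (e2 << 4) | (e3 << 6));
    }

    // The disassembly prints the immediate as a signed byte.
    public static sbyte EncodeSigned(int e0, int e1, int e2, int e3)
        => unchecked((sbyte)Encode(e0, e1, e2, e3));
}
```

Under that assumption, `Encode(1, 2, 0, 3)` gives `0xC9`, which is `-55` as a signed byte, and `Encode(2, 0, 1, 3)` gives `0xD2`, which is `-46`. These match the `Vector128.Create(1, 2, 0, 3)` and `Vector128.Create(2, 0, 1, 3)` constants in the source.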

@tannergooding tannergooding marked this pull request as ready for review January 11, 2025 12:12
@tannergooding tannergooding merged commit 82ab892 into dotnet:main Jan 11, 2025
114 of 116 checks passed
@tannergooding tannergooding deleted the fix-shuffle-prop branch January 11, 2025 16:37
grendello added a commit to grendello/runtime that referenced this pull request Jan 13, 2025
* main:
  JIT: Model GT_RETURN kills with contained operand (dotnet#111230)
  Update dependencies from https://github.com/dotnet/runtime-assets build 20250110.2 (dotnet#111290)
  [NativeAOT/ARM64] Generate frames compatible with Apple compact unwinding (dotnet#107766)
  Cleanup unused JIT stubs in vm (dotnet#111237)
  Ensure that Shuffle is marked as HW_Flag_CanBenefitFromConstantProp (dotnet#111303)
  Fix CMP0173 policy warning with cmake 3.31 (dotnet#110522)
  [RISC-V] Fix HostActivation.Tests unknown-rid (dotnet#110687)
  Fix accidentally duplicated global-build-step.yml in runtime-official.yml (dotnet#111302)
  JIT: run extra SPMI queries for arrays (dotnet#111293)
  Split the Runtime Shared Framework project and combine legs in the official build (dotnet#111136)
  Do not ignore `MemoryMarshal.TryWrite` result (dotnet#108661)
  Update dependencies from https://github.com/dotnet/emsdk build 20250109.1 (dotnet#111263)
  Clean up in Number.Formatting.cs (dotnet#110955)
@github-actions github-actions bot locked and limited conversation to collaborators Feb 11, 2025