Ensure that Shuffle is marked as HW_Flag_CanBenefitFromConstantProp #111303

Merged: 1 commit, Jan 11, 2025

tannergooding

Without this flag, constants aren't propagated into Shuffle if the constant has two or more uses, which results in significantly worse codegen.

ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jan 11, 2025

EgorBo left a comment

LGTM, assuming no other intrinsics need this flag too.

tannergooding commented Jan 11, 2025

An example diff follows; it takes us from 380 to 59 bytes of codegen:

public static Vector4 Cross(Vector4 vector1, Vector4 vector2)
{
    Vector128<float> v1 = vector1.AsVector128();
    Vector128<float> v2 = vector2.AsVector128();

    Vector128<int> shuffleYZXW = Vector128.Create(1, 2, 0, 3);
    Vector128<int> shuffleZXYW = Vector128.Create(2, 0, 1, 3);

    Vector128<float> m1 = Vector128.Shuffle(v1, shuffleYZXW) *
        Vector128.Shuffle(v2, shuffleZXYW);
    Vector128<float> m2 = Vector128.Shuffle(v1, shuffleZXYW) *
        Vector128.Shuffle(v2, shuffleYZXW);
    m2 = m2.WithElement(3, 0);

    return (m1 - m2).AsVector4();
}
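
For readers less used to the shuffle idiom, the method above should be element-wise equivalent to the scalar sketch below. The names `CrossReference` and `CrossScalar` are hypothetical and used only for illustration; they are not part of the PR. Note that the W lane comes out as `w1 * w2`, since only the W element of `m2` is zeroed before the subtraction.

```csharp
using System.Numerics;

public static class CrossReference
{
    // Hypothetical scalar equivalent of the shuffle-based Cross above.
    // m1 = (y1*z2, z1*x2, x1*y2, w1*w2)
    // m2 = (z1*y2, x1*z2, y1*x2, 0)      // W zeroed by WithElement(3, 0)
    public static Vector4 CrossScalar(Vector4 a, Vector4 b)
        => new Vector4(
            a.Y * b.Z - a.Z * b.Y,
            a.Z * b.X - a.X * b.Z,
            a.X * b.Y - a.Y * b.X,
            a.W * b.W);
}
```

The X, Y and Z lanes are the usual 3D cross product; the two index constants are each used twice, which is the multi-use pattern the PR description refers to.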

Before

; Method Program:Cross(System.Numerics.Vector4,System.Numerics.Vector4):System.Numerics.Vector4 (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 320
       vmovaps  xmmword ptr [rsp+0x130], xmm6
       vmovaps  xmmword ptr [rsp+0x120], xmm7
       vmovaps  xmmword ptr [rsp+0x110], xmm8
       vmovaps  xmmword ptr [rsp+0x100], xmm9
       mov      rbx, rcx

G_M000_IG02:                ;; offset=0x002F
       vmovups  xmm6, xmmword ptr [r8]
       vmovups  xmm7, xmmword ptr [reloc @RWD00]
       vmovups  xmm8, xmmword ptr [reloc @RWD16]
       vmovups  xmm9, xmmword ptr [rdx]
       vmovaps  xmmword ptr [rsp+0xC0], xmm9
       vmovaps  xmmword ptr [rsp+0xB0], xmm7
       lea      rdx, [rsp+0xC0]
       lea      r8, [rsp+0xB0]
       lea      rcx, [rsp+0xD0]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovaps  xmm0, xmmword ptr [rsp+0xD0]
       vmovups  xmmword ptr [rsp+0xF0], xmm0
       vmovaps  xmmword ptr [rsp+0x90], xmm6
       vmovaps  xmmword ptr [rsp+0x80], xmm8
       lea      rdx, [rsp+0x90]
       lea      r8, [rsp+0x80]
       lea      rcx, [rsp+0xA0]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovups  xmm0, xmmword ptr [rsp+0xF0]
       vmulps   xmm0, xmm0, xmmword ptr [rsp+0xA0]
       vmovups  xmmword ptr [rsp+0xF0], xmm0
       vmovaps  xmmword ptr [rsp+0x60], xmm9
       vmovaps  xmmword ptr [rsp+0x50], xmm8
       lea      rdx, [rsp+0x60]
       lea      r8, [rsp+0x50]
       lea      rcx, [rsp+0x70]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovaps  xmm0, xmmword ptr [rsp+0x70]
       vmovups  xmmword ptr [rsp+0xE0], xmm0
       vmovaps  xmmword ptr [rsp+0x30], xmm6
       vmovaps  xmmword ptr [rsp+0x20], xmm7
       lea      rdx, [rsp+0x30]
       lea      r8, [rsp+0x20]
       lea      rcx, [rsp+0x40]
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       vmovups  xmm0, xmmword ptr [rsp+0xE0]
       vmulps   xmm0, xmm0, xmmword ptr [rsp+0x40]
       vinsertps xmm0, xmm0, xmm0, 56
       vmovups  xmm1, xmmword ptr [rsp+0xF0]
       vsubps   xmm0, xmm1, xmm0
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx

G_M000_IG03:                ;; offset=0x014F
       vmovaps  xmm6, xmmword ptr [rsp+0x130]
       vmovaps  xmm7, xmmword ptr [rsp+0x120]
       vmovaps  xmm8, xmmword ptr [rsp+0x110]
       vmovaps  xmm9, xmmword ptr [rsp+0x100]
       add      rsp, 320
       pop      rbx
       ret      
RWD00  	dq	0000000200000001h, 0000000300000000h
RWD16  	dq	0000000000000002h, 0000000300000001h
; Total bytes of code: 380

After

; Method Program:Cross(System.Numerics.Vector4,System.Numerics.Vector4):System.Numerics.Vector4 (FullOpts)
G_M35725_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M35725_IG02:  ;; offset=0x0000
       vmovups  xmm0, xmmword ptr [r8]
       vmovups  xmm1, xmmword ptr [rdx]
       vpermilps xmm2, xmm1, -55
       vpermilps xmm3, xmm0, -46
       vmulps   xmm2, xmm3, xmm2
       vpermilps xmm1, xmm1, -46
       vpermilps xmm0, xmm0, -55
       vmulps   xmm0, xmm0, xmm1
       vinsertps xmm0, xmm0, xmm0, 56
       vsubps   xmm0, xmm2, xmm0
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=58 bbWeight=1 PerfScore 24.25

G_M35725_IG03:  ;; offset=0x003A
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 59
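
The `vpermilps` immediates in the "After" listing (-55 and -46) line up with the constant index vectors `(1, 2, 0, 3)` and `(2, 0, 1, 3)`. As a rough sketch, assuming the usual two-bits-per-element control byte with element 0 in the low bits, the packing can be reproduced as below. `PermuteControl` is a hypothetical helper written for this illustration, not a JIT or BCL API.

```csharp
using System;

public static class ShuffleImmediate
{
    // Hypothetical helper: packs four element indices (each 0..3) into an
    // 8-bit permute control, element 0 in the low two bits.
    public static byte PermuteControl(int e0, int e1, int e2, int e3)
        => (byte)((e0 & 3) | ((e1 & 3) << 2) | ((e2 & 3) << 4) | ((e3 & 3) << 6));

    public static void Main()
    {
        byte yzxw = PermuteControl(1, 2, 0, 3); // 0xC9
        byte zxyw = PermuteControl(2, 0, 1, 3); // 0xD2

        // The disassembly prints the control byte as a signed value.
        Console.WriteLine($"YZXW: 0x{yzxw:X2} ({unchecked((sbyte)yzxw)})"); // -55
        Console.WriteLine($"ZXYW: 0x{zxyw:X2} ({unchecked((sbyte)zxyw)})"); // -46
    }
}
```

This packing is only possible when the index vector is a known constant at the call site, which is why propagating the constant into each `Shuffle` use matters for the resulting codegen.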

tannergooding marked this pull request as ready for review January 11, 2025 12:12
tannergooding merged commit 82ab892 into dotnet:main Jan 11, 2025
114 of 116 checks passed
tannergooding deleted the fix-shuffle-prop branch January 11, 2025 16:37
github-actions bot locked and limited conversation to collaborators Feb 11, 2025