Skip to content

JIT: Support bitwise field extractions from parameter registers #112740

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

jakobbotsch
Copy link
Member

Recent work now allows us to finally add support for the backend to extract fields out of parameters without spilling them to stack. Previously this was only supported when the fields mapped cleanly to registers.

A win-x64 example:

static int Foo(int? foo)
{
    return foo.HasValue ? foo.Value : 0;
}
 ; Method Program:Foo(System.Nullable`1[int]):int (FullOpts)
 G_M19236_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=0.50 PerfScore 0.50
+						;; size=0 bbWeight=0.50 PerfScore 0.00

-G_M19236_IG02:  ;; offset=0x0005
-       movzx    rcx, cl
-       xor      eax, eax
-       test     ecx, ecx
-       cmovne   eax, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=0.50 PerfScore 1.38
+G_M19236_IG02:  ;; offset=0x0000
+       movzx    rax, cl
+       shr      rcx, 32
+       xor      edx, edx
+       test     eax, eax
+       mov      eax, edx
+       cmovne   eax, ecx
+						;; size=16 bbWeight=0.50 PerfScore 0.88

-G_M19236_IG03:  ;; offset=0x0011
+G_M19236_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=0.50 PerfScore 0.50
-; Total bytes of code: 18
-
+; Total bytes of code: 17

Another win-x64 example:

static float Sum(PointF p)
{
    return p.X + p.Y;
}
 ; Method Program:Sum(System.Drawing.PointF):float (FullOpts)
 G_M48891_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=1 PerfScore 1.00
+						;; size=0 bbWeight=1 PerfScore 0.00

-G_M48891_IG02:  ;; offset=0x0005
-       vmovss   xmm0, dword ptr [rsp+0x08]
-       vaddss   xmm0, xmm0, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=1 PerfScore 8.00
+G_M48891_IG02:  ;; offset=0x0000
+       vmovd    xmm0, ecx
+       shr      rcx, 32
+       vmovd    xmm1, ecx
+       vaddss   xmm0, xmm0, xmm1
+						;; size=16 bbWeight=1 PerfScore 7.50

-G_M48891_IG03:  ;; offset=0x0011
+G_M48891_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=1 PerfScore 1.00
-; Total bytes of code: 18
+; Total bytes of code: 17

An arm64 example:

static bool Test(Memory<int> mem)
{
    return mem.Length > 10;
}
 ; Method Program:Test(System.Memory`1[int]):ubyte (FullOpts)
 G_M53448_IG01:  ;; offset=0x0000
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            stp     x0, x1, [fp, #0x10]	// [V00 arg0], [V00 arg0+0x08]
-						;; size=12 bbWeight=1 PerfScore 2.50
+						;; size=8 bbWeight=1 PerfScore 1.50

-G_M53448_IG02:  ;; offset=0x000C
-            ldr     w0, [fp, #0x1C]	// [V00 arg0+0x0c]
+G_M53448_IG02:  ;; offset=0x0008
+            lsr     x0, x1, #32
             cmp     w0, #10
             cset    x0, gt
-						;; size=12 bbWeight=1 PerfScore 3.00
+						;; size=12 bbWeight=1 PerfScore 2.00

-G_M53448_IG03:  ;; offset=0x0018
-            ldp     fp, lr, [sp], #0x20
+G_M53448_IG03:  ;; offset=0x0014
+            ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 32
+; Total bytes of code: 28

Float -> float extractions that do not map cleanly is still not supported, but should be doable (via vector register extractions). Float -> int extractions are not supported, but I'm not sure we see these.

This is often not a code size improvement, but typically a perfscore improvement. Also this seems to have some bad interactions with call arguments since they do not yet support something similar, but hopefully that can be improved separately.

This should fix a number of issues that I need to go find.

Recent work now allows us to finally add support for the backend to
extract fields out of parameters without spilling them to stack.
Previously this was only supported when the fields mapped cleanly to
registers.

A win-x64 example:
```csharp
static int Foo(int? foo)
{
    return foo.HasValue ? foo.Value : 0;
}
```
```diff
 ; Method Program:Foo(System.Nullable`1[int]):int (FullOpts)
 G_M19236_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=0.50 PerfScore 0.50
+						;; size=0 bbWeight=0.50 PerfScore 0.00

-G_M19236_IG02:  ;; offset=0x0005
-       movzx    rcx, cl
-       xor      eax, eax
-       test     ecx, ecx
-       cmovne   eax, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=0.50 PerfScore 1.38
+G_M19236_IG02:  ;; offset=0x0000
+       movzx    rax, cl
+       shr      rcx, 32
+       xor      edx, edx
+       test     eax, eax
+       mov      eax, edx
+       cmovne   eax, ecx
+						;; size=16 bbWeight=0.50 PerfScore 0.88

-G_M19236_IG03:  ;; offset=0x0011
+G_M19236_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=0.50 PerfScore 0.50
-; Total bytes of code: 18
-
+; Total bytes of code: 17
```

Another win-x64 example:
```csharp
static float Sum(PointF p)
{
    return p.X + p.Y;
}
```

```diff
 ; Method Program:Sum(System.Drawing.PointF):float (FullOpts)
 G_M48891_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=1 PerfScore 1.00
+						;; size=0 bbWeight=1 PerfScore 0.00

-G_M48891_IG02:  ;; offset=0x0005
-       vmovss   xmm0, dword ptr [rsp+0x08]
-       vaddss   xmm0, xmm0, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=1 PerfScore 8.00
+G_M48891_IG02:  ;; offset=0x0000
+       vmovd    xmm0, ecx
+       shr      rcx, 32
+       vmovd    xmm1, ecx
+       vaddss   xmm0, xmm0, xmm1
+						;; size=16 bbWeight=1 PerfScore 7.50

-G_M48891_IG03:  ;; offset=0x0011
+G_M48891_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=1 PerfScore 1.00
-; Total bytes of code: 18
+; Total bytes of code: 17
```

An arm64 example:
```csharp
static bool Test(Memory<int> mem)
{
    return mem.Length > 10;
}
```

```diff
 ; Method Program:Test(System.Memory`1[int]):ubyte (FullOpts)
 G_M53448_IG01:  ;; offset=0x0000
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            stp     x0, x1, [fp, #0x10]	// [V00 arg0], [V00 arg0+0x08]
-						;; size=12 bbWeight=1 PerfScore 2.50
+						;; size=8 bbWeight=1 PerfScore 1.50

-G_M53448_IG02:  ;; offset=0x000C
-            ldr     w0, [fp, #0x1C]	// [V00 arg0+0x0c]
+G_M53448_IG02:  ;; offset=0x0008
+            lsr     x0, x1, dotnet#32
             cmp     w0, dotnet#10
             cset    x0, gt
-						;; size=12 bbWeight=1 PerfScore 3.00
+						;; size=12 bbWeight=1 PerfScore 2.00

-G_M53448_IG03:  ;; offset=0x0018
-            ldp     fp, lr, [sp], #0x20
+G_M53448_IG03:  ;; offset=0x0014
+            ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 32
+; Total bytes of code: 28
```
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 20, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr superpmi-replay, runtime-coreclr superpmi-diffs

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

Comment on lines +8037 to +8042
// SIMD12 extractions are not supported as they would require the upper
// field to be zeroed on extraction from a SIMD16 argument.
if (fld->TypeIs(TYP_SIMD12))
{
continue;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the case. The upper field is zeroed by operations where it's relevant, it's not expected to be 0.

Notably you can guarantee zeroing pretty easily, however. Either via shuffle (if its elements 1/2/3) or via insert (if its element 0/1/2)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the case. The upper field is zeroed by operations where it's relevant, it's not expected to be 0.

Is this comment wrong then?

if (node->TypeGet() == TYP_SIMD12)
{
// Assumption 1:
// RyuJit backend depends on the assumption that on 64-Bit targets Vector3 size is rounded off
// to TARGET_POINTER_SIZE and hence Vector3 locals on stack can be treated as TYP_SIMD16 for
// reading and writing purposes.
//
// Assumption 2:
// RyuJit backend is making another implicit assumption that Vector3 type args when passed in
// registers or on stack, the upper most 4-bytes will be zero.
//
// For P/Invoke return and Reverse P/Invoke argument passing, native compiler doesn't guarantee
// that upper 4-bytes of a Vector3 type struct is zero initialized and hence assumption 2 is
// invalid.
//
// RyuJIT x64 Windows: arguments are treated as passed by ref and hence read/written just 12
// bytes. In case of Vector3 returns, Caller allocates a zero initialized Vector3 local and
// passes it retBuf arg and Callee method writes only 12 bytes to retBuf. For this reason,
// there is no need to clear upper 4-bytes of Vector3 type args.
//
// RyuJIT x64 Unix: arguments are treated as passed by value and read/writen as if TYP_SIMD16.
// Vector3 return values are returned two return registers and Caller assembles them into a
// single xmm reg. Hence RyuJIT explicitly generates code to clears upper 4-bytes of Vector3
// type args in prolog and Vector3 type return value of a call
//
// RyuJIT x86 Windows: all non-param Vector3 local vars are allocated as 16 bytes. Vector3 arguments
// are pushed as 12 bytes. For return values, a 16-byte local is allocated and the address passed
// as a return buffer pointer. The callee doesn't write the high 4 bytes, and we don't need to clear
// it either.

Copy link
Member

@tannergooding tannergooding Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that looks incorrect. We can't really make the assumption that the last element is zero due to various zero-cost conversions or other operations (it'd be very expensive to have to explicitly zero after every operation).

We do try and zero the upper element when it's trivial to do as part of another existing operation, as that can avoid delays caused by handling NaN or subnormal values, but all the APIs that operate across lanes (like Sum or DotProduct) have to explicitly ensure the element is zero for correctness prior to handling the input.

// TODO-CQ: If it is already remapped, we can reuse the value from
// the remapping.
if (comp->FindParameterRegisterLocalMappingByRegister(segment.GetRegister()) == nullptr)
// TODO-CQ: Float -> !float extractions are not supported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be gtNewSimdGetElementNode(op, index), no?

or in the case that it's known to be index == 0, it can just be bitcast?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with float->float extractions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will add it in a follow-up.

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr superpmi-diffs

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

jakobbotsch added a commit that referenced this pull request Mar 8, 2025
Based on the new `FIELD_LIST` support for returns this PR adds support for the
JIT to combine smaller fields via bitwise operations when returned, instead of
spilling these to stack.

win-x64 examples:
```csharp
static int? Test()
{
    return Environment.TickCount;
}
```

```diff
        call     System.Environment:get_TickCount():int
-       mov      dword ptr [rsp+0x24], eax
-       mov      byte  ptr [rsp+0x20], 1
-       mov      rax, qword ptr [rsp+0x20]
-						;; size=19 bbWeight=1 PerfScore 4.00
+       mov      eax, eax
+       shl      rax, 32
+       or       rax, 1
+						;; size=15 bbWeight=1 PerfScore 2.00
```
(the `mov eax, eax` is unnecessary, but not that simple to get rid of)

 ```csharp
static (int x, float y) Test(int x, float y)
{
    return (x, y);
}
```

```diff
-       mov      dword ptr [rsp], ecx
-       vmovss   dword ptr [rsp+0x04], xmm1
-       mov      rax, qword ptr [rsp]
+       vmovd    eax, xmm1
+       shl      rax, 32
+       mov      ecx, ecx
+       or       rax, rcx
 						;; size=13 bbWeight=1 PerfScore 3.00
```

An arm64 example:
```csharp
static Memory<int> ToMemory(int[] arr)
{
    return arr.AsMemory();
}
```

```diff
 G_M45070_IG01:  ;; offset=0x0000
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            str     xzr, [fp, #0x10]	// [V03 tmp2]
-						;; size=12 bbWeight=1 PerfScore 2.50
-G_M45070_IG02:  ;; offset=0x000C
+						;; size=8 bbWeight=1 PerfScore 1.50
+G_M45070_IG02:  ;; offset=0x0008
             cbz     x0, G_M45070_IG06
 						;; size=4 bbWeight=1 PerfScore 1.00
-G_M45070_IG03:  ;; offset=0x0010
-            str     x0, [fp, #0x10]	// [V07 tmp6]
-            str     wzr, [fp, #0x18]	// [V08 tmp7]
-            ldr     x0, [fp, #0x10]	// [V07 tmp6]
-            ldr     w0, [x0, #0x08]
-            str     w0, [fp, #0x1C]	// [V09 tmp8]
-						;; size=20 bbWeight=0.80 PerfScore 6.40
-G_M45070_IG04:  ;; offset=0x0024
-            ldp     x0, x1, [fp, #0x10]	// [V03 tmp2], [V03 tmp2+0x08]
-						;; size=4 bbWeight=1 PerfScore 3.00
-G_M45070_IG05:  ;; offset=0x0028
-            ldp     fp, lr, [sp], #0x20
+G_M45070_IG03:  ;; offset=0x000C
+            ldr     w1, [x0, #0x08]
+						;; size=4 bbWeight=0.80 PerfScore 2.40
+G_M45070_IG04:  ;; offset=0x0010
+            mov     w1, w1
+            mov     x2, xzr
+            orr     x1, x2, x1,  LSL #32
+						;; size=12 bbWeight=1 PerfScore 2.00
+G_M45070_IG05:  ;; offset=0x001C
+            ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-G_M45070_IG06:  ;; offset=0x0030
-            str     xzr, [fp, #0x10]	// [V07 tmp6]
-            str     xzr, [fp, #0x18]
+G_M45070_IG06:  ;; offset=0x0024
+            mov     x0, xzr
+            mov     w1, wzr
             b       G_M45070_IG04
-						;; size=12 bbWeight=0.20 PerfScore 0.60
+						;; size=12 bbWeight=0.20 PerfScore 0.40
```
(sneak peek -- this codegen requires some supplementary changes, and there's
additional opportunities here)

This is the return counterpart to #112740. That PR has a bunch of regressions
that makes it look like we need to support returns/call arguments first, before
we try to support parameters.

There's a few follow-ups here:
- Support for float->float insertions (when a float value needs to be returned
  as the 1st, 2nd, .... field of a SIMD register)
- Support for coalescing memory loads, particularly because the fields of the
  `FIELD_LIST` come from a promoted struct that ended up DNER. In those cases we
  should be able to recombine the fields back to a single large field, instead
  of combining them with bitwise operations.
- Support for constant folding the bitwise insertions. This requires some more
  constant folding support in lowering.
- The JIT has lots of (now outdated) restrictions based around multi-reg returns
  that get in the way. Lifting these should improve things considerably.
Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@jakobbotsch
Copy link
Member Author

This work is still planned, but I need to add support for assembling call arguments first, to avoid various regressions here. Will keep it closed for now.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 29, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants