-
Notifications
You must be signed in to change notification settings - Fork 5k
JIT: Support bitwise field extractions from parameter registers #112740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Recent work now allows us to finally add support for the backend to extract fields out of parameters without spilling them to stack. Previously this was only supported when the fields mapped cleanly to registers. A win-x64 example: ```csharp static int Foo(int? foo) { return foo.HasValue ? foo.Value : 0; } ``` ```diff ; Method Program:Foo(System.Nullable`1[int]):int (FullOpts) G_M19236_IG01: ;; offset=0x0000 - mov qword ptr [rsp+0x08], rcx - ;; size=5 bbWeight=0.50 PerfScore 0.50 + ;; size=0 bbWeight=0.50 PerfScore 0.00 -G_M19236_IG02: ;; offset=0x0005 - movzx rcx, cl - xor eax, eax - test ecx, ecx - cmovne eax, dword ptr [rsp+0x0C] - ;; size=12 bbWeight=0.50 PerfScore 1.38 +G_M19236_IG02: ;; offset=0x0000 + movzx rax, cl + shr rcx, 32 + xor edx, edx + test eax, eax + mov eax, edx + cmovne eax, ecx + ;; size=16 bbWeight=0.50 PerfScore 0.88 -G_M19236_IG03: ;; offset=0x0011 +G_M19236_IG03: ;; offset=0x0010 ret ;; size=1 bbWeight=0.50 PerfScore 0.50 -; Total bytes of code: 18 - +; Total bytes of code: 17 ``` Another win-x64 example: ```csharp static float Sum(PointF p) { return p.X + p.Y; } ``` ```diff ; Method Program:Sum(System.Drawing.PointF):float (FullOpts) G_M48891_IG01: ;; offset=0x0000 - mov qword ptr [rsp+0x08], rcx - ;; size=5 bbWeight=1 PerfScore 1.00 + ;; size=0 bbWeight=1 PerfScore 0.00 -G_M48891_IG02: ;; offset=0x0005 - vmovss xmm0, dword ptr [rsp+0x08] - vaddss xmm0, xmm0, dword ptr [rsp+0x0C] - ;; size=12 bbWeight=1 PerfScore 8.00 +G_M48891_IG02: ;; offset=0x0000 + vmovd xmm0, ecx + shr rcx, 32 + vmovd xmm1, ecx + vaddss xmm0, xmm0, xmm1 + ;; size=16 bbWeight=1 PerfScore 7.50 -G_M48891_IG03: ;; offset=0x0011 +G_M48891_IG03: ;; offset=0x0010 ret ;; size=1 bbWeight=1 PerfScore 1.00 -; Total bytes of code: 18 +; Total bytes of code: 17 ``` An arm64 example: ```csharp static bool Test(Memory<int> mem) { return mem.Length > 10; } ``` ```diff ; Method Program:Test(System.Memory`1[int]):ubyte (FullOpts) G_M53448_IG01: ;; offset=0x0000 - stp fp, lr, [sp, #-0x20]! + stp fp, lr, [sp, #-0x10]! mov fp, sp - stp x0, x1, [fp, #0x10] // [V00 arg0], [V00 arg0+0x08] - ;; size=12 bbWeight=1 PerfScore 2.50 + ;; size=8 bbWeight=1 PerfScore 1.50 -G_M53448_IG02: ;; offset=0x000C - ldr w0, [fp, #0x1C] // [V00 arg0+0x0c] +G_M53448_IG02: ;; offset=0x0008 + lsr x0, x1, dotnet#32 cmp w0, dotnet#10 cset x0, gt - ;; size=12 bbWeight=1 PerfScore 3.00 + ;; size=12 bbWeight=1 PerfScore 2.00 -G_M53448_IG03: ;; offset=0x0018 - ldp fp, lr, [sp], #0x20 +G_M53448_IG03: ;; offset=0x0014 + ldp fp, lr, [sp], #0x10 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 -; Total bytes of code: 32 +; Total bytes of code: 28 ```
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
/azp run runtime-coreclr superpmi-replay, runtime-coreclr superpmi-diffs |
Azure Pipelines successfully started running 2 pipeline(s). |
// SIMD12 extractions are not supported as they would require the upper | ||
// field to be zeroed on extraction from a SIMD16 argument. | ||
if (fld->TypeIs(TYP_SIMD12)) | ||
{ | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the case. The upper field is zeroed by operations where it's relevant, it's not expected to be 0.
Notably you can guarantee zeroing pretty easily, however. Either via shuffle (if its elements 1/2/3) or via insert (if its element 0/1/2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the case. The upper field is zeroed by operations where it's relevant, it's not expected to be 0.
Is this comment wrong then?
runtime/src/coreclr/jit/lower.cpp
Lines 7725 to 7753 in 5f97ddf
if (node->TypeGet() == TYP_SIMD12) | |
{ | |
// Assumption 1: | |
// RyuJit backend depends on the assumption that on 64-Bit targets Vector3 size is rounded off | |
// to TARGET_POINTER_SIZE and hence Vector3 locals on stack can be treated as TYP_SIMD16 for | |
// reading and writing purposes. | |
// | |
// Assumption 2: | |
// RyuJit backend is making another implicit assumption that Vector3 type args when passed in | |
// registers or on stack, the upper most 4-bytes will be zero. | |
// | |
// For P/Invoke return and Reverse P/Invoke argument passing, native compiler doesn't guarantee | |
// that upper 4-bytes of a Vector3 type struct is zero initialized and hence assumption 2 is | |
// invalid. | |
// | |
// RyuJIT x64 Windows: arguments are treated as passed by ref and hence read/written just 12 | |
// bytes. In case of Vector3 returns, Caller allocates a zero initialized Vector3 local and | |
// passes it retBuf arg and Callee method writes only 12 bytes to retBuf. For this reason, | |
// there is no need to clear upper 4-bytes of Vector3 type args. | |
// | |
// RyuJIT x64 Unix: arguments are treated as passed by value and read/writen as if TYP_SIMD16. | |
// Vector3 return values are returned two return registers and Caller assembles them into a | |
// single xmm reg. Hence RyuJIT explicitly generates code to clears upper 4-bytes of Vector3 | |
// type args in prolog and Vector3 type return value of a call | |
// | |
// RyuJIT x86 Windows: all non-param Vector3 local vars are allocated as 16 bytes. Vector3 arguments | |
// are pushed as 12 bytes. For return values, a 16-byte local is allocated and the address passed | |
// as a return buffer pointer. The callee doesn't write the high 4 bytes, and we don't need to clear | |
// it either. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that looks incorrect. We can't really make the assumption that the last element is zero due to various zero-cost conversions or other operations (it'd be very expensive to have to explicitly zero after every operation).
We do try and zero the upper element when it's trivial to do as part of another existing operation, as that can avoid delays caused by handling NaN or subnormal values, but all the APIs that operate across lanes (like Sum
or DotProduct
) have to explicitly ensure the element is zero for correctness prior to handling the input.
// TODO-CQ: If it is already remapped, we can reuse the value from | ||
// the remapping. | ||
if (comp->FindParameterRegisterLocalMappingByRegister(segment.GetRegister()) == nullptr) | ||
// TODO-CQ: Float -> !float extractions are not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be gtNewSimdGetElementNode(op, index)
, no?
or in the case that it's known to be index == 0
, it can just be bitcast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with float->float extractions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will add it in a follow-up.
/azp run runtime-coreclr superpmi-diffs |
Azure Pipelines successfully started running 1 pipeline(s). |
Based on the new `FIELD_LIST` support for returns this PR adds support for the JIT to combine smaller fields via bitwise operations when returned, instead of spilling these to stack. win-x64 examples: ```csharp static int? Test() { return Environment.TickCount; } ``` ```diff call System.Environment:get_TickCount():int - mov dword ptr [rsp+0x24], eax - mov byte ptr [rsp+0x20], 1 - mov rax, qword ptr [rsp+0x20] - ;; size=19 bbWeight=1 PerfScore 4.00 + mov eax, eax + shl rax, 32 + or rax, 1 + ;; size=15 bbWeight=1 PerfScore 2.00 ``` (the `mov eax, eax` is unnecessary, but not that simple to get rid of) ```csharp static (int x, float y) Test(int x, float y) { return (x, y); } ``` ```diff - mov dword ptr [rsp], ecx - vmovss dword ptr [rsp+0x04], xmm1 - mov rax, qword ptr [rsp] + vmovd eax, xmm1 + shl rax, 32 + mov ecx, ecx + or rax, rcx ;; size=13 bbWeight=1 PerfScore 3.00 ``` An arm64 example: ```csharp static Memory<int> ToMemory(int[] arr) { return arr.AsMemory(); } ``` ```diff G_M45070_IG01: ;; offset=0x0000 - stp fp, lr, [sp, #-0x20]! + stp fp, lr, [sp, #-0x10]! mov fp, sp - str xzr, [fp, #0x10] // [V03 tmp2] - ;; size=12 bbWeight=1 PerfScore 2.50 -G_M45070_IG02: ;; offset=0x000C + ;; size=8 bbWeight=1 PerfScore 1.50 +G_M45070_IG02: ;; offset=0x0008 cbz x0, G_M45070_IG06 ;; size=4 bbWeight=1 PerfScore 1.00 -G_M45070_IG03: ;; offset=0x0010 - str x0, [fp, #0x10] // [V07 tmp6] - str wzr, [fp, #0x18] // [V08 tmp7] - ldr x0, [fp, #0x10] // [V07 tmp6] - ldr w0, [x0, #0x08] - str w0, [fp, #0x1C] // [V09 tmp8] - ;; size=20 bbWeight=0.80 PerfScore 6.40 -G_M45070_IG04: ;; offset=0x0024 - ldp x0, x1, [fp, #0x10] // [V03 tmp2], [V03 tmp2+0x08] - ;; size=4 bbWeight=1 PerfScore 3.00 -G_M45070_IG05: ;; offset=0x0028 - ldp fp, lr, [sp], #0x20 +G_M45070_IG03: ;; offset=0x000C + ldr w1, [x0, #0x08] + ;; size=4 bbWeight=0.80 PerfScore 2.40 +G_M45070_IG04: ;; offset=0x0010 + mov w1, w1 + mov x2, xzr + orr x1, x2, x1, LSL #32 + ;; size=12 bbWeight=1 PerfScore 2.00 +G_M45070_IG05: ;; offset=0x001C + ldp fp, lr, [sp], #0x10 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 -G_M45070_IG06: ;; offset=0x0030 - str xzr, [fp, #0x10] // [V07 tmp6] - str xzr, [fp, #0x18] +G_M45070_IG06: ;; offset=0x0024 + mov x0, xzr + mov w1, wzr b G_M45070_IG04 - ;; size=12 bbWeight=0.20 PerfScore 0.60 + ;; size=12 bbWeight=0.20 PerfScore 0.40 ``` (sneak peek -- this codegen requires some supplementary changes, and there's additional opportunities here) This is the return counterpart to #112740. That PR has a bunch of regressions that makes it look like we need to support returns/call arguments first, before we try to support parameters. There's a few follow-ups here: - Support for float->float insertions (when a float value needs to be returned as the 1st, 2nd, .... field of a SIMD register) - Support for coalescing memory loads, particularly because the fields of the `FIELD_LIST` come from a promoted struct that ended up DNER. In those cases we should be able to recombine the fields back to a single large field, instead of combining them with bitwise operations. - Support for constant folding the bitwise insertions. This requires some more constant folding support in lowering. - The JIT has lots of (now outdated) restrictions based around multi-reg returns that get in the way. Lifting these should improve things considerably.
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it. |
This work is still planned, but I need to add support for assembling call arguments first, to avoid various regressions here. Will keep it closed for now. |
Recent work now allows us to finally add support for the backend to extract fields out of parameters without spilling them to stack. Previously this was only supported when the fields mapped cleanly to registers.
A win-x64 example:
Another win-x64 example:
An arm64 example:
Float -> float extractions that do not map cleanly is still not supported, but should be doable (via vector register extractions). Float -> int extractions are not supported, but I'm not sure we see these.
This is often not a code size improvement, but typically a perfscore improvement. Also this seems to have some bad interactions with call arguments since they do not yet support something similar, but hopefully that can be improved separately.
This should fix a number of issues that I need to go find.