JIT: Accelerate more casts on x86 #116805

saucecontrol · 2025-06-19T04:03:46Z

This is the follow-up to #114597. It addresses the invalid removal of intermediate double casts in long->double->float chains and does more cleanup/normalization in fgMorphExpandCast, removing more of the platform-specific logic.

It also adds some more acceleration to x86. With this, all integral<->floating casts are accelerated with AVX-512, and all 32-bit integral<->floating casts are accelerated with the baseline ISAs.

Example AVX-512 implementation of floating->long/ulong casts:

 G_M38981_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
-       sub      esp, 8
-       vzeroupper 
-						;; size=6 bbWeight=1 PerfScore 1.25
+						;; size=0 bbWeight=1 PerfScore 0.00
 G_M38981_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
-       vmovsd   xmm0, qword ptr [esp+0x0C]
-       sub      esp, 8
-       ; npt arg push 0
-       ; npt arg push 1
-       vmovsd   qword ptr [esp], xmm0
-       call     CORINFO_HELP_DBL2LNG
-       ; gcr arg pop 2
-						;; size=19 bbWeight=1 PerfScore 6.25
+       vmovsd   xmm0, qword ptr [esp+0x04]
+       vcmpsd   k1, xmm0, xmm0, 7
+       vcvttpd2qq xmm1 {k1}{z}, xmm0
+       vcmpsd   k1, xmm0, qword ptr [@RWD00], 29
+       vpcmpeqd xmm0, xmm0, xmm0
+       vpsrlq   xmm1 {k1}, xmm0, 1
+       vmovd    eax, xmm1
+       vpextrd  edx, xmm1, 1
+						;; size=51 bbWeight=1 PerfScore 18.50
 G_M38981_IG03:        ; bbWeight=1, epilog, nogc, extend
-       add      esp, 8
        ret      8
-						;; size=6 bbWeight=1 PerfScore 2.25
+						;; size=3 bbWeight=1 PerfScore 2.00
+RWD00  	dq	43E0000000000000h

Example SSE2/SSE4.1 implementation of uint->floating casts:

 G_M24329_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}
 						;; size=3 bbWeight=1 PerfScore 0.25
 G_M24329_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000002 {ecx}, byref
        ; byrRegs +[ecx]
-       push     0
-       ; npt arg push 0
-       push     dword ptr [ecx]
-       ; npt arg push 1
-       call     [CORINFO_HELP_LNG2DBL]
-       ; byrRegs -[ecx]
-       ; gcr arg pop 2
-       fstp     qword ptr [esp]
-       movsd    xmm0, qword ptr [esp]
+       mov      eax, dword ptr [ecx]
+       xorps    xmm0, xmm0
+       cvtsi2sd xmm0, eax
+       movaps   xmm1, xmm0
+       addsd    xmm1, qword ptr [reloc @RWD00]
+       movaps   xmm2, xmm0
+       blendvpd xmm0, xmm1
+       movsd    qword ptr [esp], xmm0
        fld      qword ptr [esp]
-						;; size=21 bbWeight=1 PerfScore 10.00
+						;; size=36 bbWeight=1 PerfScore 15.83
 G_M24329_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      esp, 8
        ret      
 						;; size=4 bbWeight=1 PerfScore 1.25
+RWD00  	dq	41F0000000000000h

Full diffs

dotnet-policy-service · 2025-06-19T04:04:33Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

saucecontrol · 2025-06-19T23:16:57Z

cc @tannergooding

saucecontrol · 2025-06-20T01:37:57Z

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

+
+                    baseReg = internalRegisters.GetSingle(node);
+                    GetEmitter()->emitIns_R_C(INS_lea, emitTypeSize(TYP_I_IMPL), baseReg, hnd, 0, INS_OPTS_NONE);
+                }


This handles the case of GetElement with const vector and non-const index, which is used in the SSE2 implementation of uint->floating casts. If the const vector isn't contained, it gets loaded, spilled to stack, and loaded again from there. With this, it's a lea+load.

Codegen for the cast goes like this:

xorps xmm0, xmm0 cvtsi2sd xmm0, ecx shr ecx, 31 lea eax, [@RWD00] movsd xmm1, qword ptr [eax+8*ecx] addsd xmm0, xmm1 movsd qword ptr [esp], xmm0

Should a similar optimization be added to the WithElement path? We're generally trying to keep those two "in sync" since they have many similar considerations between them.

I'm not sure it would be useful, because tbh, I don't imagine even the GetElement pattern is likely to appear naturally in managed code. If I were writing the same algorithm in C#, I'd probably use a ROS for the const values, which already yields optimal code in net10.0 now that we can eliminate the bounds checks.

For example, the first method here is the C# equivalent of the code this PR creates in lowering, while the second is what I'd write if creating a ROS were easy enough: https://godbolt.org/z/nKcsYzM8v

Can you think of something similar someone would do with WithElement?

A lot of people using intrinsics aren't going and looking at the exact disassembly. They're doing more naive patterns with types like Vector4 and not necessarily following "best" practices for SIMD.

Some of that will be addressed long term via analyzers, but in general it is something I've seen frequently in code devs write, particularly for beginning game devs and so it is impactful to many real world scenarios.

I'd recommend extracting this to its own PR as well, just like the bug fix.

While it is related, it makes it easier to track the changes and help isolate what is causing what in terms of disasm changes and other scenarios (I've done some similar splits recently where I did 4-5 incremental PRs to help isolate the different changes being made)

saucecontrol · 2025-06-20T01:40:25Z

src/coreclr/jit/morph.cpp

+    // This pattern most often appears as CAST(float <- CAST(double <- float)),
+    // which is reduced to CAST(float <- float) and handled in codegen as an optional mov.
+    else if ((srcType == TYP_DOUBLE) && (dstType == TYP_FLOAT) && oper->OperIs(GT_CAST) &&
+             !varTypeIsLong(oper->AsCast()->CastOp()))


This is the bug fix -- just needed a check for long types.

It would probably be beneficial to extract the bug fix to its own PR. That is less controversial to get in as it is a correctness fix.

The full PR here is likely to land, but I imagine it will have more consideration/discussion due to the throughput hit. It may even need some more iteration, such as only doing the decomposition for full-opts and not for min-opts

saucecontrol · 2025-06-20T01:42:06Z

src/coreclr/jit/lowerxarch.cpp

-            castRange.InsertAtEnd(select);
-
-            // The original cast becomes a no-op, because its input is already the correct type.
-            tree->AsCast()->CastOp() = select;


Removing the cast rather than re-using it resulted in some size improvements on x64 since the redundant cast wasn't always eliminated.

Is this something we can remove earlier in the IR? The earlier we can remove redundant or unnecessary casts, the better overall IR we tend to have and the better throughput (plus more optimizations tend to kick in)

tannergooding · 2025-06-23T14:55:14Z

src/coreclr/jit/codegenxarch.cpp

-    }
+    // Unless AVX10.2 saturating conversion instructions are available, these
+    // casts should have been lowered to a sequence of HWIntrinsic nodes.
+    assert(compiler->compIsaSupportedDebugOnly(InstructionSet_AVX10v2));


Do we even "need" this method anymore? Why not just have lowering always introduce the GenTreeHWIntrinsic representing the conversion instead?

Isn't that more expensive?

I don't mind getting rid of this, but genFloatToIntCast is part of the shared codegen logic, so it would mean more #ifdefs in the shared code.

Shouldn't be more expensive and generally we have the better handling for intrinsics.

The general consideration is that we're duplicating logic here for the general xarch emission, while we always have SIMD enabled for xarch currently so the amount of ifdefs should be near zero.

I can look at that in a followup if you don't mind. I wanted to do one more PR in this area, to move the x64 SSE2 ulong->floating fallback logic from CodeGen::genIntToFloatCast to lowering and change it to a branchless implementation.

src/coreclr/jit/decomposelongs.cpp

saucecontrol · 2025-07-08T20:20:00Z

The codegen for AVX-512 floating->long regressed after merging in main (likely something with #116983). I'll give it another look.

saucecontrol · 2025-07-10T06:35:20Z

#117485 resolved the regression. Diffs look good again.

tannergooding · 2025-07-10T17:02:13Z

src/coreclr/jit/decomposelongs.cpp

-        !(tree->OperIs(GT_CAST) && varTypeIsLong(tree->AsCast()->CastOp()) && varTypeIsFloating(tree)))
+    // On x86, long->floating casts are implemented in DecomposeCast.
+    bool isLongToFloatingCast =
+        (tree->OperIs(GT_CAST) && varTypeIsLong(tree->AsCast()->CastOp()) && varTypeIsFloating(tree));


This is pulled up and is potentially causing some TP hit due to additional checks happening for trees that are TYP_LONG

Doesn't look like the check was otherwise changed, so I'd default to just adding the comment and keeping the if/check the same otherwise

tannergooding · 2025-07-10T17:03:59Z

src/coreclr/jit/decomposelongs.cpp

+        CorInfoType srcBaseType = CORINFO_TYPE_UNDEF;
+        CorInfoType dstBaseType = CORINFO_TYPE_UNDEF;
+
+        if (varTypeIsFloating(srcType))


Do we know if the compiler is CSEing this check with the above check as expected?

-- Asking since manually caching might be a way to win some throughput back.

Not for sure, but I generally assume C++ compilers will handle 'obvious' ones like this. It should be noted, the throughput hit to x86 directly correlates with the number of casts that are now inlined.

i.e. the only significant throughput hit is on the coreclr_tests collection

which is also the one that had the most casts in it

👍. The biggest concern is the TP hit to minopts. It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

I think #117512 will reduce the hit a bit.

It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

This is interesting, because the same argument could apply to the complicated saturating logic that we have for x64 as well. #97529 introduced a similar throughput regression, and although it was done for correctness instead of perf, the throughput hit could have been avoided by using the helper in minopts there too.

the throughput hit could have been avoided by using the helper in minopts there too.

AFAIR, the JIT throughput hit there ended up being very minimal (and often an improvement). It was the perf score and code output size that regressed, which was expected.

If there was a significant perf score hit to minopts, then yes the same would apply here and it would likely be beneficial to ensure that is doing the "better" thing as well.

tannergooding · 2025-07-10T17:06:23Z

src/coreclr/jit/flowgraph.cpp

+{
+    if (varTypeIsFloating(fromType))
+    {
+        return (varTypeIsIntegral(toType) && overflow)


Do we need the varTypeIsIntegral check? We should never have a float->float cast that is marked for overflow, right?

tannergooding · 2025-07-10T17:10:14Z

src/coreclr/jit/flowgraph.cpp

+#if defined(TARGET_X86)
+               || (varTypeIsLong(toType) && !compOpportunisticallyDependsOn(InstructionSet_AVX512))
+#elif !defined(TARGET_64BIT)
+               || varTypeIsLong(toType)
+#endif


nit: I find the current logic a lot harder to follow, due to the single return statement spanning across multiple ifdefs. I think something like the following, although longer, makes it much clearer as to what is happening:

if (varTypeIsFloating(fromType)) { if (overflow) { assert(varTypeIsIntegral(toType)); return true; } #if !defined(TARGET_64BIT) if (varTypeIsLong) { #if defined(TARGET_X86) if (compOpportunisticallyDependsOn(InstructionSet_AVX512)) { return false; } #endif return true; } #endif return false; }

github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 19, 2025

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 19, 2025

build-analysis bot mentioned this pull request Jun 19, 2025

browser-wasm Windows build error #116746

Closed

accelerate more casts on x86

91116c7

saucecontrol force-pushed the lng2flt2 branch from c153a51 to 91116c7 Compare June 19, 2025 21:17

formatting

f65daa7

saucecontrol marked this pull request as ready for review June 19, 2025 23:16

saucecontrol commented Jun 20, 2025

View reviewed changes

saucecontrol added 2 commits June 21, 2025 10:17

add method header

f23c534

Merge remote-tracking branch 'upstream/main' into lng2flt2

3af2fd6