JIT: Accelerate more casts on x86 #116805

Open · wants to merge 10 commits into main

Conversation

saucecontrol (Member) commented Jun 19, 2025

This is the follow-up to #114597. It addresses the invalid removal of intermediate double casts in long->double->float chains and does more cleanup/normalization in fgMorphExpandCast, removing more of the platform-specific logic.

It also adds some more acceleration to x86. With this, all integral<->floating casts are accelerated with AVX-512, and all 32-bit integral<->floating casts are accelerated with the baseline ISAs.

Example AVX-512 implementation of floating->long/ulong casts:

 G_M38981_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
-       sub      esp, 8
-       vzeroupper 
-						;; size=6 bbWeight=1 PerfScore 1.25
+						;; size=0 bbWeight=1 PerfScore 0.00
 G_M38981_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
-       vmovsd   xmm0, qword ptr [esp+0x0C]
-       sub      esp, 8
-       ; npt arg push 0
-       ; npt arg push 1
-       vmovsd   qword ptr [esp], xmm0
-       call     CORINFO_HELP_DBL2LNG
-       ; gcr arg pop 2
-						;; size=19 bbWeight=1 PerfScore 6.25
+       vmovsd   xmm0, qword ptr [esp+0x04]
+       vcmpsd   k1, xmm0, xmm0, 7
+       vcvttpd2qq xmm1 {k1}{z}, xmm0
+       vcmpsd   k1, xmm0, qword ptr [@RWD00], 29
+       vpcmpeqd xmm0, xmm0, xmm0
+       vpsrlq   xmm1 {k1}, xmm0, 1
+       vmovd    eax, xmm1
+       vpextrd  edx, xmm1, 1
+						;; size=51 bbWeight=1 PerfScore 18.50
 G_M38981_IG03:        ; bbWeight=1, epilog, nogc, extend
-       add      esp, 8
        ret      8
-						;; size=6 bbWeight=1 PerfScore 2.25
+						;; size=3 bbWeight=1 PerfScore 2.00
+RWD00  	dq	43E0000000000000h
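
For reference, a rough scalar sketch in C# (illustrative only; the method name is made up) of what the masked sequence above computes for a double->long cast:

static long DoubleToInt64Saturating(double d)
{
    if (double.IsNaN(d))
        return 0;                        // vcmpsd + {k1}{z} masking zeroes the result for NaN
    if (d >= 9223372036854775808.0)      // 2^63, the @RWD00 constant (43E0000000000000h)
        return long.MaxValue;            // vpsrlq of all-ones by 1 produces 0x7FFF...FFFF
    if (d < -9223372036854775808.0)
        return long.MinValue;            // vcvttpd2qq already yields 0x8000...0000 here
    return (long)d;                      // in-range values truncate toward zero
}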

Example SSE2/SSE4.1 implementation of uint->floating casts:

 G_M24329_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}
 						;; size=3 bbWeight=1 PerfScore 0.25
 G_M24329_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000002 {ecx}, byref
        ; byrRegs +[ecx]
-       push     0
-       ; npt arg push 0
-       push     dword ptr [ecx]
-       ; npt arg push 1
-       call     [CORINFO_HELP_LNG2DBL]
-       ; byrRegs -[ecx]
-       ; gcr arg pop 2
-       fstp     qword ptr [esp]
-       movsd    xmm0, qword ptr [esp]
+       mov      eax, dword ptr [ecx]
+       xorps    xmm0, xmm0
+       cvtsi2sd xmm0, eax
+       movaps   xmm1, xmm0
+       addsd    xmm1, qword ptr [reloc @RWD00]
+       movaps   xmm2, xmm0
+       blendvpd xmm0, xmm1
+       movsd    qword ptr [esp], xmm0
        fld      qword ptr [esp]
-						;; size=21 bbWeight=1 PerfScore 10.00
+						;; size=36 bbWeight=1 PerfScore 15.83
 G_M24329_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      esp, 8
        ret      
 						;; size=4 bbWeight=1 PerfScore 1.25
+RWD00  	dq	41F0000000000000h
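
Likewise, a rough C# sketch (illustrative only; the method name is made up) of the branchless trick in that sequence:

static double UInt32ToDouble(uint value)
{
    double asSigned  = (int)value;                 // cvtsi2sd converts the bits as a signed int
    double corrected = asSigned + 4294967296.0;    // + 2^32, the @RWD00 constant (41F0000000000000h)
    return (int)value < 0 ? corrected : asSigned;  // blendvpd selects the corrected value when the sign bit was set
}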

Full diffs

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 19, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 19, 2025
Contributor:
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

saucecontrol marked this pull request as ready for review June 19, 2025 23:16
saucecontrol (Member Author):
cc @tannergooding


baseReg = internalRegisters.GetSingle(node);
GetEmitter()->emitIns_R_C(INS_lea, emitTypeSize(TYP_I_IMPL), baseReg, hnd, 0, INS_OPTS_NONE);
}
saucecontrol (Member Author) commented Jun 20, 2025:
This handles the case of GetElement with a const vector and non-const index, which is used in the SSE2 implementation of uint->floating casts. If the const vector isn't contained, it gets loaded, spilled to the stack, and loaded again from there. With this change, it's just a lea + load.

Codegen for the cast goes like this:

       xorps    xmm0, xmm0
       cvtsi2sd xmm0, ecx
       shr      ecx, 31
       lea      eax, [@RWD00]
       movsd    xmm1, qword ptr [eax+8*ecx]
       addsd    xmm0, xmm1
       movsd    qword ptr [esp], xmm0

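For illustration, a rough C# shape of the pattern being described (names and exact intrinsic usage are my guesses, not code from the PR):

using System.Runtime.Intrinsics;

static double UInt32ToDoubleViaGetElement(uint value)
{
    // Constant vector { 0.0, 2^32 }; the non-const index is the source's sign bit.
    Vector128<double> correction = Vector128.Create(0.0, 4294967296.0);
    double d = (int)value;                                  // cvtsi2sd as signed
    return d + correction.GetElement((int)(value >> 31));   // shr + lea @RWD00 + movsd [eax+8*ecx]
}
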
Member:
Should a similar optimization be added to the WithElement path? We're generally trying to keep those two "in sync" since they have many similar considerations between them.

saucecontrol (Member Author):
I'm not sure it would be useful because, to be honest, I don't imagine even the GetElement pattern appearing naturally in managed code. If I were writing the same algorithm in C#, I'd probably use a ROS for the const values, which already yields optimal code in net10.0 now that we can eliminate the bounds checks.

For example, the first method here is the C# equivalent of the code this PR creates in lowering, while the second is what I'd write if creating a ROS were easy enough: https://godbolt.org/z/nKcsYzM8v
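As a guess at what the linked example shows (the actual code is at the godbolt link; names here are illustrative), the ROS-based variant would look roughly like this:

static ReadOnlySpan<double> Correction => [0.0, 4294967296.0]; // { 0, 2^32 }

static double UInt32ToDouble(uint value)
{
    double d = (int)value;                     // convert as signed
    return d + Correction[(int)(value >> 31)]; // sign bit indexes the constant table
}
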

Can you think of something similar someone would do with WithElement?

Member:
A lot of people using intrinsics aren't going and looking at the exact disassembly. They're writing more naive patterns with types like Vector4 and not necessarily following "best" practices for SIMD.

Some of that will be addressed long term via analyzers, but in general it is something I've seen frequently in code devs write, particularly beginning game devs, so it is impactful to many real-world scenarios.

Member:
I'd recommend extracting this to its own PR as well, just like the bug fix.

While it is related, splitting it makes it easier to track the changes and helps isolate what is causing what in terms of disasm changes and other scenarios. (I've done some similar splits recently where I did 4-5 incremental PRs to help isolate the different changes being made.)

// This pattern most often appears as CAST(float <- CAST(double <- float)),
// which is reduced to CAST(float <- float) and handled in codegen as an optional mov.
else if ((srcType == TYP_DOUBLE) && (dstType == TYP_FLOAT) && oper->OperIs(GT_CAST) &&
!varTypeIsLong(oper->AsCast()->CastOp()))
saucecontrol (Member Author):
This is the bug fix -- just needed a check for long types.
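
A constructed illustration (not taken from the PR) of why the long case can't be collapsed -- double rounding changes the result:

long x = (1L << 62) + (1L << 38) + 1;

float direct  = (float)x;          // rounds directly up to 2^62 + 2^39
float chained = (float)(double)x;  // first rounds to 2^62 + 2^38 as double, then ties-to-even down to 2^62

Console.WriteLine(direct == chained); // False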

Member:
It would probably be beneficial to extract the bug fix to its own PR. That is less controversial to get in, as it is a correctness fix.

The full PR here is likely to land, but I imagine it will have more consideration/discussion due to the throughput hit. It may even need some more iteration, such as only doing the decomposition for full-opts and not for min-opts.

castRange.InsertAtEnd(select);

// The original cast becomes a no-op, because its input is already the correct type.
tree->AsCast()->CastOp() = select;
saucecontrol (Member Author):
Removing the cast rather than re-using it resulted in some size improvements on x64 since the redundant cast wasn't always eliminated.

Member:
Is this something we can remove earlier in the IR? The earlier we can remove redundant or unnecessary casts, the better overall IR we tend to have and the better the throughput (plus more optimizations tend to kick in).

}
// Unless AVX10.2 saturating conversion instructions are available, these
// casts should have been lowered to a sequence of HWIntrinsic nodes.
assert(compiler->compIsaSupportedDebugOnly(InstructionSet_AVX10v2));
Member:
Do we even "need" this method anymore? Why not just have lowering always introduce the GenTreeHWIntrinsic representing the conversion instead?

saucecontrol (Member Author):
Isn't that more expensive?

I don't mind getting rid of this, but genFloatToIntCast is part of the shared codegen logic, so it would mean more #ifdefs in the shared code.

Member:
It shouldn't be more expensive, and we generally have better handling for intrinsics.

The general consideration is that we're duplicating logic here for the general xarch emission path, and we currently always have SIMD enabled for xarch, so the number of #ifdefs should be near zero.

saucecontrol (Member Author):
I can look at that in a follow-up if you don't mind. I wanted to do one more PR in this area, to move the x64 SSE2 ulong->floating fallback logic from CodeGen::genIntToFloatCast to lowering and change it to a branchless implementation.

saucecontrol (Member Author):
The codegen for AVX-512 floating->long regressed after merging in main (likely something with #116983). I'll give it another look.

saucecontrol (Member Author):
#117485 resolved the regression. Diffs look good again.

!(tree->OperIs(GT_CAST) && varTypeIsLong(tree->AsCast()->CastOp()) && varTypeIsFloating(tree)))
// On x86, long->floating casts are implemented in DecomposeCast.
bool isLongToFloatingCast =
(tree->OperIs(GT_CAST) && varTypeIsLong(tree->AsCast()->CastOp()) && varTypeIsFloating(tree));
Member:
This is pulled up and is potentially causing some TP hit due to additional checks happening for trees that are TYP_LONG.

It doesn't look like the check was otherwise changed, so I'd default to just adding the comment and keeping the if/check the same otherwise.

CorInfoType srcBaseType = CORINFO_TYPE_UNDEF;
CorInfoType dstBaseType = CORINFO_TYPE_UNDEF;

if (varTypeIsFloating(srcType))
Member:
Do we know if the compiler is CSEing this check with the above check as expected?

Asking since manually caching the result might be a way to win some throughput back.

saucecontrol (Member Author):
Not for sure, but I generally assume C++ compilers will handle 'obvious' ones like this. It should be noted that the throughput hit to x86 directly correlates with the number of casts that are now inlined.

i.e. the only significant throughput hit is on the coreclr_tests collection

[screenshot: throughput diff]

which is also the one that had the most casts in it

[screenshot: cast counts by collection]

Member:
👍. The biggest concern is the TP hit to minopts. It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

saucecontrol (Member Author):
I think #117512 will reduce the hit a bit.

It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

This is interesting, because the same argument could apply to the complicated saturating logic that we have for x64 as well. #97529 introduced a similar throughput regression, and although it was done for correctness instead of perf, the throughput hit could have been avoided by using the helper in minopts there too.

Member:
the throughput hit could have been avoided by using the helper in minopts there too.

AFAIR, the JIT throughput hit there ended up being very minimal (and often an improvement). It was the perf score and code output size that regressed, which was expected.

If there was a significant perf score hit to minopts, then yes the same would apply here and it would likely be beneficial to ensure that is doing the "better" thing as well.

{
if (varTypeIsFloating(fromType))
{
return (varTypeIsIntegral(toType) && overflow)
Member:
Do we need the varTypeIsIntegral check? We should never have a float->float cast that is marked for overflow, right?

Comment on lines +1289 to +1293
#if defined(TARGET_X86)
|| (varTypeIsLong(toType) && !compOpportunisticallyDependsOn(InstructionSet_AVX512))
#elif !defined(TARGET_64BIT)
|| varTypeIsLong(toType)
#endif
Member:
nit: I find the current logic a lot harder to follow, due to the single return statement spanning across multiple ifdefs. I think something like the following, although longer, makes it much clearer as to what is happening:

if (varTypeIsFloating(fromType))
{
    if (overflow)
    {
        assert(varTypeIsIntegral(toType));
        return true;
    }

#if !defined(TARGET_64BIT)
    if (varTypeIsLong(toType))
    {
#if defined(TARGET_X86)
        if (compOpportunisticallyDependsOn(InstructionSet_AVX512))
        {
            return false;
        }
#endif

        return true;
    }
#endif

    return false;
}
