JIT: Accelerate more casts on x86 #116805
base: main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
    baseReg = internalRegisters.GetSingle(node);
    GetEmitter()->emitIns_R_C(INS_lea, emitTypeSize(TYP_I_IMPL), baseReg, hnd, 0, INS_OPTS_NONE);
}
This handles the case of GetElement with const vector and non-const index, which is used in the SSE2 implementation of uint->floating casts. If the const vector isn't contained, it gets loaded, spilled to stack, and loaded again from there. With this, it's a lea+load.
Codegen for the cast goes like this:
xorps xmm0, xmm0
cvtsi2sd xmm0, ecx
shr ecx, 31
lea eax, [@RWD00]
movsd xmm1, qword ptr [eax+8*ecx]
addsd xmm0, xmm1
movsd qword ptr [esp], xmm0
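To make the mapping concrete, here is a rough managed-code sketch of what that sequence computes (illustrative only; the method name and constants are mine, not code from the PR): a constant vector indexed by the non-constant sign bit.

```csharp
using System.Runtime.Intrinsics;

static double UIntToDouble(uint value)
{
    // Constant correction table: element 0 adds nothing, element 1 adds 2^32.
    Vector128<double> correction = Vector128.Create(0.0, 4294967296.0);

    // cvtsi2sd converts as signed, so inputs with the sign bit set come out 2^32 too low.
    double asSigned = (int)value;

    // GetElement with a non-const index on a const vector is the lea + movsd [eax+8*ecx] above.
    return asSigned + correction.GetElement((int)(value >> 31));
}
```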
Should a similar optimization be added to the WithElement path? We're generally trying to keep those two "in sync" since they have many similar considerations between them.
I'm not sure it would be useful because, honestly, I don't imagine even the GetElement pattern is likely to appear naturally in managed code. If I were writing the same algorithm in C#, I'd probably use a ROS for the const values, which already yields optimal code in net10.0 now that we can eliminate the bounds checks.
For example, the first method here is the C# equivalent of the code this PR creates in lowering, while the second is what I'd write if creating a ROS were easy enough: https://godbolt.org/z/nKcsYzM8v
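A sketch of the ROS version described above (an illustration of the idea, not the actual code behind the godbolt link) might look like:

```csharp
static double UIntToDouble(uint value)
{
    // Constant data exposed as a ReadOnlySpan becomes static data, and the bounds
    // check can be eliminated since the index is only ever 0 or 1.
    ReadOnlySpan<double> correction = [0.0, 4294967296.0];

    return (double)(int)value + correction[(int)(value >> 31)];
}
```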
Can you think of something similar someone would do with WithElement?
A lot of people using intrinsics aren't going and looking at the exact disassembly. They're doing more naive patterns with types like Vector4 and not necessarily following "best" practices for SIMD.
Some of that will be addressed long term via analyzers, but in general it is something I've seen frequently in code that devs write, particularly beginning game devs, so it is impactful to many real-world scenarios.
I'd recommend extracting this to its own PR as well, just like the bug fix.
While it is related, splitting it makes it easier to track the changes and helps isolate what is causing what in terms of disasm changes and other scenarios (I've done some similar splits recently, where I did 4-5 incremental PRs to help isolate the different changes being made).
// This pattern most often appears as CAST(float <- CAST(double <- float)),
// which is reduced to CAST(float <- float) and handled in codegen as an optional mov.
else if ((srcType == TYP_DOUBLE) && (dstType == TYP_FLOAT) && oper->OperIs(GT_CAST) &&
         !varTypeIsLong(oper->AsCast()->CastOp()))
This is the bug fix -- just needed a check for long types.
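To illustrate why the long check is needed (a hypothetical repro, not a test from the PR): for some long values, rounding through double and then to float differs from a single rounding to float, so the intermediate cast in a long->double->float chain is observable and can't be dropped.

```csharp
// Double rounding: long -> double -> float vs. long -> float.
long v = (1L << 60) + (1L << 36) + 1;

float direct = (float)v;            // rounds once:  2^60 + 2^37
float viaDouble = (float)(double)v; // rounds twice: 2^60

Console.WriteLine(direct == viaDouble); // False
```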
It would probably be beneficial to extract the bug fix to its own PR. That is less controversial to get in as it is a correctness fix.
The full PR here is likely to land, but I imagine it will have more consideration/discussion due to the throughput hit. It may even need some more iteration, such as only doing the decomposition for full-opts and not for min-opts.
castRange.InsertAtEnd(select);

// The original cast becomes a no-op, because its input is already the correct type. | ||
tree->AsCast()->CastOp() = select; |
Removing the cast rather than re-using it resulted in some size improvements on x64 since the redundant cast wasn't always eliminated.
Is this something we can remove earlier in the IR? The earlier we can remove redundant or unnecessary casts, the better overall IR we tend to have and the better throughput (plus more optimizations tend to kick in)
}
// Unless AVX10.2 saturating conversion instructions are available, these
// casts should have been lowered to a sequence of HWIntrinsic nodes.
assert(compiler->compIsaSupportedDebugOnly(InstructionSet_AVX10v2));
Do we even "need" this method anymore? Why not just have lowering always introduce the GenTreeHWIntrinsic representing the conversion instead?
Isn't that more expensive?
I don't mind getting rid of this, but genFloatToIntCast is part of the shared codegen logic, so it would mean more #ifdefs in the shared code.
It shouldn't be more expensive, and we generally have better handling for intrinsics.
The general consideration is that we're duplicating logic here for the general xarch emission, while we always have SIMD enabled for xarch currently, so the number of #ifdefs should be near zero.
I can look at that in a followup if you don't mind. I wanted to do one more PR in this area, to move the x64 SSE2 ulong->floating fallback logic from CodeGen::genIntToFloatCast to lowering and change it to a branchless implementation.
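One possible branchless shape (an assumption about the approach, not necessarily the planned lowering) is to split the value into 32-bit halves, convert each half exactly, and recombine so that only the final add rounds:

```csharp
// Illustrative branchless ulong -> double conversion.
static double ULongToDouble(ulong value)
{
    double hi = (uint)(value >> 32); // upper 32 bits, exact as double
    double lo = (uint)value;         // lower 32 bits, exact as double

    // hi * 2^32 is exact (at most 32 significant bits), so the only rounding
    // happens in the final add, which produces the correctly rounded result.
    return hi * 4294967296.0 + lo;
}
```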
The codegen for AVX-512 floating->long regressed after merging in main (likely something with #116983). I'll give it another look.
!(tree->OperIs(GT_CAST) && varTypeIsLong(tree->AsCast()->CastOp()) && varTypeIsFloating(tree)))
// On x86, long->floating casts are implemented in DecomposeCast.
bool isLongToFloatingCast =
    (tree->OperIs(GT_CAST) && varTypeIsLong(tree->AsCast()->CastOp()) && varTypeIsFloating(tree));
This is pulled up and is potentially causing some TP hit due to additional checks happening for trees that are TYP_LONG. Doesn't look like the check was otherwise changed, so I'd default to just adding the comment and keeping the if/check the same otherwise.
CorInfoType srcBaseType = CORINFO_TYPE_UNDEF;
CorInfoType dstBaseType = CORINFO_TYPE_UNDEF;

if (varTypeIsFloating(srcType)) |
Do we know if the compiler is CSEing this check with the above check as expected?
-- Asking since manually caching might be a way to win some throughput back.
Not for sure, but I generally assume C++ compilers will handle 'obvious' ones like this. It should be noted that the throughput hit to x86 directly correlates with the number of casts that are now inlined.
i.e. the only significant throughput hit is on the coreclr_tests collection, which is also the one that had the most casts in it.
👍. The biggest concern is the TP hit to minopts. It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.
I think #117512 will reduce the hit a bit.
> It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.
This is interesting, because the same argument could apply to the complicated saturating logic that we have for x64 as well. #97529 introduced a similar throughput regression, and although it was done for correctness instead of perf, the throughput hit could have been avoided by using the helper in minopts there too.
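For context, the saturating behavior in question (as currently defined for .NET's floating->integral casts, to the best of my recollection; the exact outputs below are not from this PR) clamps out-of-range values and maps NaN to zero:

```csharp
// Saturating double -> int conversion semantics on recent .NET versions.
Console.WriteLine((int)1e20);       // int.MaxValue
Console.WriteLine((int)(-1e20));    // int.MinValue
Console.WriteLine((int)double.NaN); // 0
```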
> the throughput hit could have been avoided by using the helper in minopts there too.
AFAIR, the JIT throughput hit there ended up being very minimal (and often an improvement). It was the perf score and code output size that regressed, which was expected.
If there was a significant perf score hit to minopts, then yes the same would apply here and it would likely be beneficial to ensure that is doing the "better" thing as well.
{
    if (varTypeIsFloating(fromType))
    {
        return (varTypeIsIntegral(toType) && overflow)
Do we need the varTypeIsIntegral check? We should never have a float->float cast that is marked for overflow, right?
#if defined(TARGET_X86)
            || (varTypeIsLong(toType) && !compOpportunisticallyDependsOn(InstructionSet_AVX512))
#elif !defined(TARGET_64BIT)
            || varTypeIsLong(toType)
#endif
nit: I find the current logic a lot harder to follow, due to the single return statement spanning across multiple ifdefs. I think something like the following, although longer, makes it much clearer as to what is happening:
if (varTypeIsFloating(fromType))
{
    if (overflow)
    {
        assert(varTypeIsIntegral(toType));
        return true;
    }
#if !defined(TARGET_64BIT)
    if (varTypeIsLong(toType))
    {
#if defined(TARGET_X86)
        if (compOpportunisticallyDependsOn(InstructionSet_AVX512))
        {
            return false;
        }
#endif
        return true;
    }
#endif
    return false;
}
This is the follow-up to #114597. It addresses the invalid removal of intermediate double casts in long->double->float chains and does more cleanup/normalization in fgMorphExpandCast, removing more of the platform-specific logic. It also adds some more acceleration to x86. With this, all integral<->floating casts are accelerated with AVX-512, and all 32-bit integral<->floating casts are accelerated with the baseline ISAs.
Example AVX-512 implementation of floating->long/ulong casts:
Example SSE2/SSE4.1 implementation of uint->floating casts:
Full diffs