JIT: Speed up floating to integer casts on x86/x64 #114410
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Some benchmark numbers. Benchmark code here.

Intel Skylake x64 (benchmark table)

AMD Zen 5 x64 (benchmark table)

Signed casts show improvement with the new SSE2 code. The very small regression shown on unsigned is from swapping …

AMD Zen 5 x64 SSE2-only (benchmark table)

This shows the perf improvement for the worst-case scenario of baseline ISAs only, which currently results in all casts going through helpers.

Intel Skylake x86 (benchmark table)

AMD Zen 5 x86 (benchmark table)

Casts to long/ulong can be accelerated with AVX-512 as well. (Waiting until #113930 lands to do this, due to conflicts.)
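For readers unfamiliar with the semantics involved: since #97529, these casts saturate rather than producing undefined results. A minimal scalar sketch of that behavior (my summary of the semantics, not code from this PR), for double -> int32:

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Saturating semantics: NaN converts to 0, and out-of-range values
// clamp to the target type's min/max instead of wrapping.
int32_t SaturatingCastToInt32(double value)
{
    if (std::isnan(value))
        return 0;
    if (value <= static_cast<double>(std::numeric_limits<int32_t>::min()))
        return std::numeric_limits<int32_t>::min();
    if (value >= static_cast<double>(std::numeric_limits<int32_t>::max()))
        return std::numeric_limits<int32_t>::max();
    return static_cast<int32_t>(value);
}
```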
This is ready for review.
cc @tannergooding @dotnet/jit-contrib
    nextNode = LowerCast(node);
    if (nextNode != nullptr)
    {
        return nextNode;
    }
This is reverting a change from #97529. The new implementation always preserves the original cast node and does all IR manipulation ahead of it.
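To illustrate the invariant (a toy model over a generic node list, not RyuJIT code): helper IR is linked in ahead of the cast, and the cast node itself is never replaced, so iteration can safely continue from it.

```cpp
#include <iterator>
#include <list>
#include <string>

using IR = std::list<std::string>;

// Expands a cast by inserting helper nodes *before* it; the iterator
// `cast` stays valid because the node it points at is untouched.
void LowerCastInPlace(IR& ir, IR::iterator cast)
{
    ir.insert(cast, "compare-against-bound"); // e.g. a saturation check
    ir.insert(cast, "select-clamped-value");  // e.g. pick clamped or raw input
}

int main()
{
    IR ir = {"load-float", "cast-to-int", "store-int"};
    LowerCastInPlace(ir, std::next(ir.begin()));
    // ir is now: load-float, compare-against-bound,
    //            select-clamped-value, cast-to-int, store-int
    return 0;
}
```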
    // GT_CAST(float/double, sbyte)  = GT_CAST(GT_CAST(float/double, int32), sbyte)
    // GT_CAST(float/double, int16)  = GT_CAST(GT_CAST(float/double, int32), int16)
    // GT_CAST(float/double, uint16) = GT_CAST(GT_CAST(float/double, int32), uint16)
    //
This comment was copied from xarch, where lowering used to handle the intermediate int cast. That never applied here, and it no longer applies on xarch. I've removed the notes from all the headers and instead added comments on the asserts that check those assumptions.
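If I'm reading the old comment right, the decomposition it described amounts to the following (a sketch reusing the saturating helper from the earlier snippet):

```cpp
#include <cstdint>

// GT_CAST(double, sbyte) behaves like GT_CAST(GT_CAST(double, int32), sbyte):
// a saturating cast to int32, then an ordinary integer narrowing that keeps
// the low bits.
int8_t CastDoubleToSByte(double value)
{
    int32_t viaInt32 = SaturatingCastToInt32(value); // defined in the earlier sketch
    return static_cast<int8_t>(viaInt32);
}
```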
    // converted it to float -> double -> long conversion.
    assert((dstType != TYP_LONG) || (srcType != TYP_FLOAT));

    // If we don't have AVX10v2 saturating conversion instructions for
    // floating->integral, we have to handle the saturation logic here.
This implementation is a complete rewrite, so it's best read top to bottom rather than compared against the current code.
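For a sense of what the pre-AVX10v2 saturation handling involves, here is a minimal scalar sketch with SSE2 intrinsics (an illustration of the technique, not the PR's actual codegen): cvttsd2si returns the sentinel 0x80000000 ("integer indefinite") for NaN and out-of-range inputs, so a fixup only needs to run when that sentinel appears.

```cpp
#include <emmintrin.h>
#include <cstdint>

int32_t SaturatingCvtDoubleToInt32(double value)
{
    // cvttsd2si: truncating conversion; NaN and overflow produce INT32_MIN.
    int32_t raw = _mm_cvttsd_si32(_mm_set_sd(value));
    if (raw != INT32_MIN)
        return raw;          // in range: hardware result is already correct
    if (value != value)
        return 0;            // NaN saturates to zero
    return (value > 0.0) ? INT32_MAX : INT32_MIN; // clamp by sign
}
```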
@dotnet/intel

@khushal1996 You implemented the original code; it would be useful for you to comment on this change.

/azp run runtime-coreclr libraries-jitstress, runtime-coreclr outerloop, runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-avx512, Fuzzlyn

Azure Pipelines successfully started running 5 pipeline(s).

@saucecontrol I am running the benchmarks on Ice Lake to verify the change and will then review the code, but so far the change seems to improve perf.

Looks like the additional test runs didn't find anything new.

@khushal1996 Any findings yet?

Overall the change looks good. I ran some tests on ICX, and it looks like the ICX benchmarks …
@@ -990,14 +976,14 @@ void Lowering::LowerCast(GenTree* tree)
     maxIntegralValue = comp->gtNewIconNode(static_cast<ssize_t>(UINT32_MAX));
     if (srcType == TYP_FLOAT)
     {
-        maxFloatSimdVal->f32[0] = static_cast<float>(UINT32_MAX);
+        maxFloatSimdVal->f32[0] = 4294967296.0f;
Should we be using global constants for these numbers?
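For what it's worth, the two expressions denote the same float: UINT32_MAX (4294967295) isn't representable in float, and the conversion rounds up to 2^32, so the new literal just spells out the bound that was already being used. A quick check:

```cpp
#include <cassert>
#include <cstdint>

int main()
{
    // 4294967295 rounds to the nearest representable float: 4294967296.0f (2^32).
    assert(static_cast<float>(UINT32_MAX) == 4294967296.0f);
    return 0;
}
```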
Thanks for the extra numbers @khushal1996!
Yeah, since the intrinsic expansion for vector conversion happens in the JIT front end, the table load can get hoisted out of the loop in benchmarks like this one, which helps quite a bit. There are a couple of improvements we can make to the vector convert codegen, but I'll do them in another PR.
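A toy illustration of that hoisting effect (an assumed shape, not actual JIT output): once the conversion's constants are ordinary loop-invariant values visible to the optimizer, they are materialized once outside the loop instead of once per element.

```cpp
#include <emmintrin.h>
#include <cstddef>

void ScaleAll(const float* src, float* dst, size_t count)
{
    // Loop-invariant vector constant: materialized once, up front,
    // rather than being reloaded on every iteration.
    const __m128 scale = _mm_set1_ps(2.0f);
    for (size_t i = 0; i + 4 <= count; i += 4)
    {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, scale));
    }
}
```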
This replaces the saturating floating-point to integer cast logic for pre-AVX10v2 hardware (introduced in #97529) with higher-performance versions.
Diffs