
JIT: Speed up floating to integer casts on x86/x64 #114410

Open: wants to merge 4 commits into main

Conversation


saucecontrol (Member) commented Apr 8, 2025

This replaces the saturating floating-point to integer cast logic for pre-AVX10v2 hardware (introduced in #97529) with higher-performance versions. A sketch of the saturation semantics these casts preserve follows the list below.

  • Replaces the AVX-512 signed integer saturation with SSE/SSE2 sequences that are both faster and compatible with more hardware.
  • Removes the SSE4.1 fallbacks, which were also slower than the new SSE/SSE2 implementations.
  • Replaces float->double->long casts with direct float->long casts on x64.
  • Replaces helper calls for float/double->ulong casts with inline sequences on pre-AVX-512 x64.
  • Replaces helper calls for float/double->int/uint casts with inline sequences on x86.
  • Adds AVX-512-accelerated uint casts on x86 (long and ulong to come in a follow-up PR).
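For reference, here is a minimal C# sketch of the saturating semantics these casts preserve (an illustration only, not the JIT's actual expansion): NaN converts to 0, and out-of-range values clamp to the destination type's minimum or maximum.

```csharp
// Scalar reference behavior for a saturating float -> int cast; the new
// SSE/SSE2 sequences produce this result inline, without a helper call.
static int SaturatingFloatToInt32(float value)
{
    if (float.IsNaN(value))
        return 0;                // NaN saturates to zero
    if (value <= int.MinValue)
        return int.MinValue;     // clamp the lower bound
    if (value >= int.MaxValue)
        return int.MaxValue;     // clamp the upper bound
    return (int)value;           // in range: ordinary truncation
}
```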

Diffs

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Apr 8, 2025
dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Apr 8, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


saucecontrol (Member, Author) commented Apr 8, 2025

Some benchmark numbers. Benchmark code here.

Intel Skylake x64

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5608/22H2/2022Update)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.202
  [Host]     : .NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2
  Job-JPGMZI : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-DMYZSI : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
| Method         | Toolchain             | Mean        | Error     | Ratio | Code Size |
|----------------|-----------------------|------------:|----------:|------:|----------:|
| FloatToInt32   | \x64_Main\corerun.exe | 1,433.0 ns  | 8.29 ns   | 1.00  | 95 B |
| FloatToInt32   | \x64_PR\corerun.exe   | 747.8 ns    | 2.96 ns   | 0.52  | 64 B |
| FloatToUInt32  | \x64_Main\corerun.exe | 5,887.2 ns  | 50.63 ns  | 1.00  | 95 B |
| FloatToUInt32  | \x64_PR\corerun.exe   | 813.6 ns    | 13.85 ns  | 0.14  | 64 B |
| FloatToInt64   | \x64_Main\corerun.exe | 5,928.0 ns  | 24.94 ns  | 1.00  | 93 B |
| FloatToInt64   | \x64_PR\corerun.exe   | 822.5 ns    | 3.31 ns   | 0.14  | 69 B |
| FloatToUInt64  | \x64_Main\corerun.exe | 62,241.6 ns | 662.87 ns | 1.00  | 58 B |
| FloatToUInt64  | \x64_PR\corerun.exe   | 1,408.5 ns  | 6.11 ns   | 0.02  | 97 B |
| DoubleToInt32  | \x64_Main\corerun.exe | 1,430.7 ns  | 6.54 ns   | 1.00  | 92 B |
| DoubleToInt32  | \x64_PR\corerun.exe   | 746.1 ns    | 1.78 ns   | 0.52  | 64 B |
| DoubleToUInt32 | \x64_Main\corerun.exe | 1,392.7 ns  | 4.48 ns   | 1.00  | 95 B |
| DoubleToUInt32 | \x64_PR\corerun.exe   | 744.6 ns    | 3.05 ns   | 0.53  | 64 B |
| DoubleToInt64  | \x64_Main\corerun.exe | 1,429.5 ns  | 5.04 ns   | 1.00  | 93 B |
| DoubleToInt64  | \x64_PR\corerun.exe   | 753.3 ns    | 13.95 ns  | 0.53  | 69 B |
| DoubleToUInt64 | \x64_Main\corerun.exe | 63,930.8 ns | 155.39 ns | 1.00  | 58 B |
| DoubleToUInt64 | \x64_PR\corerun.exe   | 1,245.8 ns  | 3.33 ns   | 0.02  | 97 B |

AMD Zen 5 x64

Signed casts show improvement with the new SSE2 code.

The very small regression shown on unsigned is from swapping vfixupimms[sd] for xorps+maxs[sd] (sketched below); the new sequence is smaller code and saves a memory load for the fixup table, so it should be an overall perf win.
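Roughly, the xorps+maxs[sd] sequence clamps negative inputs and NaN to zero before the conversion instruction runs. A minimal sketch of the idea using the managed SSE intrinsics (illustrative only, not the exact sequence the JIT emits):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class UnsignedLowerBoundSketch
{
    // maxss against a zeroed register: MAXSS returns its second operand when
    // either input is NaN, so max(value, 0) maps both negatives and NaN to 0.
    // The upper-bound clamp and the conversion itself are handled separately.
    static Vector128<float> ClampToZero(Vector128<float> value) =>
        Sse.MaxScalar(value, Vector128<float>.Zero);
}
```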

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Unknown processor
.NET SDK 9.0.200
  [Host]     : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-LZXYVJ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-VKKTCQ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method         | Toolchain             | Mean       | Error    | Ratio | Code Size |
|----------------|-----------------------|-----------:|---------:|------:|----------:|
| FloatToInt32   | \x64_Main\corerun.exe | 455.9 ns   | 1.49 ns  | 1.00  | 80 B |
| FloatToInt32   | \x64_PR\corerun.exe   | 356.8 ns   | 1.38 ns  | 0.78  | 64 B |
| FloatToUInt32  | \x64_Main\corerun.exe | 225.0 ns   | 0.98 ns  | 1.00  | 65 B |
| FloatToUInt32  | \x64_PR\corerun.exe   | 231.2 ns   | 3.32 ns  | 1.03  | 62 B |
| FloatToInt64   | \x64_Main\corerun.exe | 3,817.1 ns | 12.37 ns | 1.00  | 81 B |
| FloatToInt64   | \x64_PR\corerun.exe   | 357.2 ns   | 1.48 ns  | 0.09  | 69 B |
| FloatToUInt64  | \x64_Main\corerun.exe | 224.4 ns   | 1.36 ns  | 1.00  | 65 B |
| FloatToUInt64  | \x64_PR\corerun.exe   | 232.0 ns   | 2.02 ns  | 1.03  | 62 B |
| DoubleToInt32  | \x64_Main\corerun.exe | 455.0 ns   | 1.06 ns  | 1.00  | 80 B |
| DoubleToInt32  | \x64_PR\corerun.exe   | 353.6 ns   | 2.08 ns  | 0.78  | 64 B |
| DoubleToUInt32 | \x64_Main\corerun.exe | 224.1 ns   | 1.42 ns  | 1.00  | 65 B |
| DoubleToUInt32 | \x64_PR\corerun.exe   | 230.2 ns   | 1.64 ns  | 1.03  | 62 B |
| DoubleToInt64  | \x64_Main\corerun.exe | 455.1 ns   | 1.64 ns  | 1.00  | 81 B |
| DoubleToInt64  | \x64_PR\corerun.exe   | 356.3 ns   | 1.26 ns  | 0.78  | 69 B |
| DoubleToUInt64 | \x64_Main\corerun.exe | 224.2 ns   | 0.88 ns  | 1.00  | 65 B |
| DoubleToUInt64 | \x64_PR\corerun.exe   | 231.3 ns   | 1.31 ns  | 1.03  | 62 B |

And More...

AMD Zen 5 x64 SSE2-only

This shows the perf improvement for the worst-case scenario of baseline ISAs only, which currently results in all casts going through helpers.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Unknown processor
.NET SDK 9.0.200
  [Host]     : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT SSE2
  Job-REPAKK : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT SSE2
  Job-TGKOOP : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT SSE2
| Method         | Toolchain             | Mean        | Error     | Ratio | Code Size |
|----------------|-----------------------|------------:|----------:|------:|----------:|
| FloatToInt32   | \x64_Main\corerun.exe | 34,066.4 ns | 51.09 ns  | 1.00  | 53 B |
| FloatToInt32   | \x64_PR\corerun.exe   | 406.6 ns    | 0.28 ns   | 0.01  | 66 B |
| FloatToUInt32  | \x64_Main\corerun.exe | 37,660.0 ns | 75.95 ns  | 1.00  | 53 B |
| FloatToUInt32  | \x64_PR\corerun.exe   | 406.6 ns    | 0.51 ns   | 0.01  | 65 B |
| FloatToInt64   | \x64_Main\corerun.exe | 37,929.7 ns | 123.98 ns | 1.00  | 55 B |
| FloatToInt64   | \x64_PR\corerun.exe   | 406.8 ns    | 0.33 ns   | 0.01  | 70 B |
| FloatToUInt64  | \x64_Main\corerun.exe | 38,531.0 ns | 117.54 ns | 1.00  | 55 B |
| FloatToUInt64  | \x64_PR\corerun.exe   | 672.9 ns    | 2.55 ns   | 0.02  | 100 B |
| DoubleToInt32  | \x64_Main\corerun.exe | 33,941.7 ns | 60.67 ns  | 1.00  | 53 B |
| DoubleToInt32  | \x64_PR\corerun.exe   | 407.0 ns    | 0.60 ns   | 0.01  | 68 B |
| DoubleToUInt32 | \x64_Main\corerun.exe | 37,075.2 ns | 76.08 ns  | 1.00  | 53 B |
| DoubleToUInt32 | \x64_PR\corerun.exe   | 409.6 ns    | 0.40 ns   | 0.01  | 66 B |
| DoubleToInt64  | \x64_Main\corerun.exe | 34,529.2 ns | 60.82 ns  | 1.00  | 55 B |
| DoubleToInt64  | \x64_PR\corerun.exe   | 406.8 ns    | 0.61 ns   | 0.01  | 72 B |
| DoubleToUInt64 | \x64_Main\corerun.exe | 41,012.6 ns | 162.60 ns | 1.00  | 55 B |
| DoubleToUInt64 | \x64_PR\corerun.exe   | 673.9 ns    | 0.98 ns   | 0.02  | 101 B |

Intel Skylake x86

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5608/22H2/2022Update)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.202
  [Host]     : .NET 9.0.3 (9.0.325.11113), X86 RyuJIT AVX2
  Job-IFOWKZ : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX2
  Job-MTSNPU : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX2
| Method         | Toolchain             | Mean         | Error     | Ratio | Code Size |
|----------------|-----------------------|-------------:|----------:|------:|----------:|
| FloatToInt32   | \x86_Main\corerun.exe | 95,745.2 ns  | 507.47 ns | 1.000 | 56 B |
| FloatToInt32   | \x86_PR\corerun.exe   | 745.2 ns     | 2.60 ns   | 0.008 | 65 B |
| FloatToUInt32  | \x86_Main\corerun.exe | 101,492.8 ns | 233.97 ns | 1.000 | 56 B |
| FloatToUInt32  | \x86_PR\corerun.exe   | 996.4 ns     | 6.03 ns   | 0.010 | 86 B |
| FloatToInt64   | \x86_Main\corerun.exe | 101,038.6 ns | 567.50 ns | 1.00  | 77 B |
| FloatToInt64   | \x86_PR\corerun.exe   | 100,588.1 ns | 250.48 ns | 1.00  | 77 B |
| FloatToUInt64  | \x86_Main\corerun.exe | 100,698.7 ns | 470.29 ns | 1.00  | 77 B |
| FloatToUInt64  | \x86_PR\corerun.exe   | 100,957.7 ns | 301.72 ns | 1.00  | 77 B |
| DoubleToInt32  | \x86_Main\corerun.exe | 93,668.0 ns  | 170.48 ns | 1.000 | 56 B |
| DoubleToInt32  | \x86_PR\corerun.exe   | 748.6 ns     | 4.83 ns   | 0.008 | 65 B |
| DoubleToUInt32 | \x86_Main\corerun.exe | 101,885.3 ns | 315.39 ns | 1.00  | 56 B |
| DoubleToUInt32 | \x86_PR\corerun.exe   | 1,400.2 ns   | 12.42 ns  | 0.01  | 92 B |
| DoubleToInt64  | \x86_Main\corerun.exe | 101,416.3 ns | 330.06 ns | 1.00  | 77 B |
| DoubleToInt64  | \x86_PR\corerun.exe   | 101,028.3 ns | 364.28 ns | 1.00  | 77 B |
| DoubleToUInt64 | \x86_Main\corerun.exe | 100,405.6 ns | 306.54 ns | 1.00  | 77 B |
| DoubleToUInt64 | \x86_PR\corerun.exe   | 100,552.7 ns | 409.43 ns | 1.00  | 77 B |
AMD Zen 5 x86

Casts to long/ulong can be accelerated with AVX-512 as well. (Waiting until #113930 lands to do this, due to conflicts.)

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Unknown processor
.NET SDK 9.0.200
  [Host]     : .NET 9.0.4 (9.0.425.16305), X86 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-MVRGFJ : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-TKZXLB : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method         | Toolchain             | Mean        | Error     | Ratio | Code Size |
|----------------|-----------------------|------------:|----------:|------:|----------:|
| FloatToInt32   | \x86_Main\corerun.exe | 46,022.5 ns | 98.03 ns  | 1.000 | 56 B |
| FloatToInt32   | \x86_PR\corerun.exe   | 359.1 ns    | 1.30 ns   | 0.008 | 65 B |
| FloatToUInt32  | \x86_Main\corerun.exe | 46,534.7 ns | 50.23 ns  | 1.000 | 56 B |
| FloatToUInt32  | \x86_PR\corerun.exe   | 223.9 ns    | 2.19 ns   | 0.005 | 64 B |
| FloatToInt64   | \x86_Main\corerun.exe | 47,206.6 ns | 48.96 ns  | 1.00  | 77 B |
| FloatToInt64   | \x86_PR\corerun.exe   | 47,028.1 ns | 44.93 ns  | 1.00  | 77 B |
| FloatToUInt64  | \x86_Main\corerun.exe | 52,660.9 ns | 83.77 ns  | 1.00  | 77 B |
| FloatToUInt64  | \x86_PR\corerun.exe   | 51,163.5 ns | 138.14 ns | 0.97  | 77 B |
| DoubleToInt32  | \x86_Main\corerun.exe | 45,949.0 ns | 87.72 ns  | 1.000 | 56 B |
| DoubleToInt32  | \x86_PR\corerun.exe   | 355.9 ns    | 1.03 ns   | 0.008 | 65 B |
| DoubleToUInt32 | \x86_Main\corerun.exe | 46,553.0 ns | 138.63 ns | 1.000 | 56 B |
| DoubleToUInt32 | \x86_PR\corerun.exe   | 225.0 ns    | 1.81 ns   | 0.005 | 64 B |
| DoubleToInt64  | \x86_Main\corerun.exe | 46,140.4 ns | 54.25 ns  | 1.00  | 77 B |
| DoubleToInt64  | \x86_PR\corerun.exe   | 47,139.3 ns | 172.54 ns | 1.02  | 77 B |
| DoubleToUInt64 | \x86_Main\corerun.exe | 51,542.3 ns | 90.22 ns  | 1.00  | 77 B |
| DoubleToUInt64 | \x86_PR\corerun.exe   | 51,415.1 ns | 102.06 ns | 1.00  | 77 B |

saucecontrol marked this pull request as ready for review April 10, 2025 03:16
saucecontrol (Member, Author) left a comment

This is ready for review.
cc @tannergooding @dotnet/jit-contrib

Comment on lines -553 to -557
nextNode = LowerCast(node);
if (nextNode != nullptr)
{
return nextNode;
}
saucecontrol (Member, Author):

This is reverting a change from #97529. The new implementation always preserves the original cast node and does all IR manipulation ahead of it.

// GT_CAST(float/double, sbyte) = GT_CAST(GT_CAST(float/double, int32), sbyte)
// GT_CAST(float/double, int16) = GT_CAST(GT_CAST(double/double, int32), int16)
// GT_CAST(float/double, uint16) = GT_CAST(GT_CAST(double/double, int32), uint16)
//
saucecontrol (Member, Author):

This comment was copied from xarch, where lowering used to handle the intermediate int cast. That never applied here, and it no longer applies on xarch. I've removed the notes from all the headers and commented the asserts that check the assumptions instead.

// converted it to float -> double -> long conversion.
assert((dstType != TYP_LONG) || (srcType != TYP_FLOAT));
// If we don't have AVX10v2 saturating conversion instructions for
// floating->integral, we have to handle the saturation logic here.
saucecontrol (Member, Author):

This implementation is a complete rewrite, so it's best read top to bottom rather than compared against the current code.

@BruceForstall
Member

@dotnet/intel

@BruceForstall
Member

@khushal1996 You implemented the original code; it would be useful for you to comment on this change.

@BruceForstall
Member

/azp run runtime-coreclr libraries-jitstress, runtime-coreclr outerloop, runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-avx512, Fuzzlyn


Azure Pipelines successfully started running 5 pipeline(s).

@khushal1996
Member

@saucecontrol I am running the benchmark on IceLake to verify the change and then review the changes but so far, the change seems to improve perf.

@BruceForstall
Member

Looks like the additional test runs didn't find anything new.

@BruceForstall
Member

> @saucecontrol I am running the benchmark on IceLake to verify the change and then review the changes but so far, the change seems to improve perf.

@khushal1996 Any findings yet?

@khushal1996
Member

> @saucecontrol I am running the benchmark on IceLake to verify the change and then review the changes but so far, the change seems to improve perf.

> @khushal1996 Any findings yet?

Overall the change looks good. I ran some tests on ICX, and it looks like vpfixup slows things down on scalar conversions compared to packed conversions. I also tried the same changes on packed conversions, which showed the packed conversions performing faster with vpfixup.

ICX benchmarks (screenshots attached in the original comment):

  • Avx512 (image)
  • Avx2 (image)
  • SSE2 (image)

@@ -990,14 +976,14 @@ void Lowering::LowerCast(GenTree* tree)
    maxIntegralValue = comp->gtNewIconNode(static_cast<ssize_t>(UINT32_MAX));
    if (srcType == TYP_FLOAT)
    {
-       maxFloatSimdVal->f32[0] = static_cast<float>(UINT32_MAX);
+       maxFloatSimdVal->f32[0] = 4294967296.0f;
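One note on the literal: UINT32_MAX (4294967295) is not exactly representable as a float, and static_cast<float>(UINT32_MAX) rounds up to 4294967296.0f (2^32), so the old and new expressions produce the same value; the literal just makes the effective bound explicit. A quick C# check of the same rounding:

```csharp
using System;

// uint.MaxValue is 4294967295; the nearest float is 2^32, so the conversion
// rounds up and compares equal to the explicit literal used in the new code.
Console.WriteLine((float)uint.MaxValue == 4294967296.0f); // True
```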

Should we be using global constants for these numbers?

@saucecontrol
Member Author

Thanks for the extra numbers @khushal1996!

> Also tried the same changes on packed conversions which showed packed conversions performing faster with vpfixup.

Yeah, since the intrinsic expansion for vector conversion happens in the JIT front end, the table load can get hoisted out of the loop in benchmarks like this one, which helps quite a bit.

There are a couple of improvements we can make to the vector convert codegen, but I'll do them in another PR.
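To make the hoisting point concrete, here is a rough sketch of the two loop shapes being compared (hypothetical code, not the actual benchmark): the packed conversion is expanded in the JIT front end, so loop-invariant constants used by its expansion are visible to hoisting, while the scalar cast is expanded later, in lowering, after that opportunity has passed.

```csharp
using System.Runtime.Intrinsics;

static class ConvertLoops
{
    // Scalar path: each (int)src[i] is expanded in lowering, so any constants
    // the expansion needs are loaded inside the loop body.
    static void ConvertScalar(float[] src, int[] dst)
    {
        for (int i = 0; i < src.Length; i++)
            dst[i] = (int)src[i];
    }

    // Packed path: Vector128.ConvertToInt32 is expanded in the JIT front end,
    // so loop-invariant constants used by the expansion can be hoisted.
    static void ConvertPacked(float[] src, int[] dst)
    {
        int i = 0;
        for (; i + Vector128<float>.Count <= src.Length; i += Vector128<float>.Count)
        {
            Vector128<float> v = Vector128.LoadUnsafe(ref src[i]);
            Vector128.ConvertToInt32(v).StoreUnsafe(ref dst[i]);
        }
        for (; i < src.Length; i++) // scalar tail
            dst[i] = (int)src[i];
    }
}
```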
