
JIT: Speed up floating to integer casts on x86/x64 #114410

Open: wants to merge 4 commits into main

Conversation


saucecontrol (Member) commented Apr 8, 2025

This replaces the saturating floating-point to integer cast logic for pre-AVX10v2 hardware (introduced in #97529) with higher-performance versions. A sketch of the saturation semantics these casts preserve follows the list below.

  • Replaces the AVX-512 signed integer saturation with SSE/SSE2 sequences that are both faster and compatible with more hardware.
  • Removes the SSE4.1 fallbacks, which were also slower than the new SSE/SSE2 implementations.
  • Replaces float->double->long casts with direct float->long casts on x64.
  • Replaces helper calls for float/double->ulong casts with inline sequences on pre-AVX-512 x64.
  • Replaces helper calls for float/double->int/uint casts with inline sequences on x86.
  • Adds AVX-512-accelerated uint casts on x86 (long and ulong to come in a follow-up PR).
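For reference, here is a minimal C# sketch of the saturating semantics these casts preserve (an illustration only, not the JIT's actual expansion): NaN converts to 0, and out-of-range values clamp to the destination type's minimum or maximum.

```csharp
// Scalar reference behavior for a saturating float -> int cast; the new
// SSE/SSE2 sequences produce this result inline, without a helper call.
static int SaturatingFloatToInt32(float value)
{
    if (float.IsNaN(value))
        return 0;                // NaN saturates to zero
    if (value <= int.MinValue)
        return int.MinValue;     // clamp the lower bound
    if (value >= int.MaxValue)
        return int.MaxValue;     // clamp the upper bound
    return (int)value;           // in range: ordinary truncation
}
```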

Diffs

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Apr 8, 2025
dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Apr 8, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


saucecontrol (Member, Author) commented Apr 8, 2025

Some benchmark numbers. Benchmark code here.

Intel Skylake x64

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5608/22H2/2022Update)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.202
  [Host]     : .NET 9.0.3 (9.0.325.11113), X64 RyuJIT AVX2
  Job-JPGMZI : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-DMYZSI : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
| Method         | Toolchain             | Mean        | Error     | Ratio | Code Size |
|----------------|-----------------------|------------:|----------:|------:|----------:|
| FloatToInt32   | \x64_Main\corerun.exe | 1,433.0 ns  | 8.29 ns   | 1.00  | 95 B |
| FloatToInt32   | \x64_PR\corerun.exe   | 747.8 ns    | 2.96 ns   | 0.52  | 64 B |
| FloatToUInt32  | \x64_Main\corerun.exe | 5,887.2 ns  | 50.63 ns  | 1.00  | 95 B |
| FloatToUInt32  | \x64_PR\corerun.exe   | 813.6 ns    | 13.85 ns  | 0.14  | 64 B |
| FloatToInt64   | \x64_Main\corerun.exe | 5,928.0 ns  | 24.94 ns  | 1.00  | 93 B |
| FloatToInt64   | \x64_PR\corerun.exe   | 822.5 ns    | 3.31 ns   | 0.14  | 69 B |
| FloatToUInt64  | \x64_Main\corerun.exe | 62,241.6 ns | 662.87 ns | 1.00  | 58 B |
| FloatToUInt64  | \x64_PR\corerun.exe   | 1,408.5 ns  | 6.11 ns   | 0.02  | 97 B |
| DoubleToInt32  | \x64_Main\corerun.exe | 1,430.7 ns  | 6.54 ns   | 1.00  | 92 B |
| DoubleToInt32  | \x64_PR\corerun.exe   | 746.1 ns    | 1.78 ns   | 0.52  | 64 B |
| DoubleToUInt32 | \x64_Main\corerun.exe | 1,392.7 ns  | 4.48 ns   | 1.00  | 95 B |
| DoubleToUInt32 | \x64_PR\corerun.exe   | 744.6 ns    | 3.05 ns   | 0.53  | 64 B |
| DoubleToInt64  | \x64_Main\corerun.exe | 1,429.5 ns  | 5.04 ns   | 1.00  | 93 B |
| DoubleToInt64  | \x64_PR\corerun.exe   | 753.3 ns    | 13.95 ns  | 0.53  | 69 B |
| DoubleToUInt64 | \x64_Main\corerun.exe | 63,930.8 ns | 155.39 ns | 1.00  | 58 B |
| DoubleToUInt64 | \x64_PR\corerun.exe   | 1,245.8 ns  | 3.33 ns   | 0.02  | 97 B |

AMD Zen 5 x64

Signed casts show improvement with the new SSE2 code.

The very small regression shown on unsigned is from swapping vfixupimms[sd] for xorps+maxs[sd] (sketched below); the new sequence is smaller code and saves a memory load for the fixup table, so it should be an overall perf win.
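Roughly, the xorps+maxs[sd] sequence clamps negative inputs and NaN to zero before the conversion instruction runs. A minimal sketch of the idea using the managed SSE intrinsics (illustrative only, not the exact sequence the JIT emits):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class UnsignedLowerBoundSketch
{
    // maxss against a zeroed register: MAXSS returns its second operand when
    // either input is NaN, so max(value, 0) maps both negatives and NaN to 0.
    // The upper-bound clamp and the conversion itself are handled separately.
    static Vector128<float> ClampToZero(Vector128<float> value) =>
        Sse.MaxScalar(value, Vector128<float>.Zero);
}
```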

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Unknown processor
.NET SDK 9.0.200
  [Host]     : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-LZXYVJ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-VKKTCQ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method         | Toolchain             | Mean       | Error    | Ratio | Code Size |
|----------------|-----------------------|-----------:|---------:|------:|----------:|
| FloatToInt32   | \x64_Main\corerun.exe | 455.9 ns   | 1.49 ns  | 1.00  | 80 B |
| FloatToInt32   | \x64_PR\corerun.exe   | 356.8 ns   | 1.38 ns  | 0.78  | 64 B |
| FloatToUInt32  | \x64_Main\corerun.exe | 225.0 ns   | 0.98 ns  | 1.00  | 65 B |
| FloatToUInt32  | \x64_PR\corerun.exe   | 231.2 ns   | 3.32 ns  | 1.03  | 62 B |
| FloatToInt64   | \x64_Main\corerun.exe | 3,817.1 ns | 12.37 ns | 1.00  | 81 B |
| FloatToInt64   | \x64_PR\corerun.exe   | 357.2 ns   | 1.48 ns  | 0.09  | 69 B |
| FloatToUInt64  | \x64_Main\corerun.exe | 224.4 ns   | 1.36 ns  | 1.00  | 65 B |
| FloatToUInt64  | \x64_PR\corerun.exe   | 232.0 ns   | 2.02 ns  | 1.03  | 62 B |
| DoubleToInt32  | \x64_Main\corerun.exe | 455.0 ns   | 1.06 ns  | 1.00  | 80 B |
| DoubleToInt32  | \x64_PR\corerun.exe   | 353.6 ns   | 2.08 ns  | 0.78  | 64 B |
| DoubleToUInt32 | \x64_Main\corerun.exe | 224.1 ns   | 1.42 ns  | 1.00  | 65 B |
| DoubleToUInt32 | \x64_PR\corerun.exe   | 230.2 ns   | 1.64 ns  | 1.03  | 62 B |
| DoubleToInt64  | \x64_Main\corerun.exe | 455.1 ns   | 1.64 ns  | 1.00  | 81 B |
| DoubleToInt64  | \x64_PR\corerun.exe   | 356.3 ns   | 1.26 ns  | 0.78  | 69 B |
| DoubleToUInt64 | \x64_Main\corerun.exe | 224.2 ns   | 0.88 ns  | 1.00  | 65 B |
| DoubleToUInt64 | \x64_PR\corerun.exe   | 231.3 ns   | 1.31 ns  | 1.03  | 62 B |

And More...

AMD Zen 5 x64 SSE2-only

This shows the perf improvement for the worst-case scenario of baseline ISAs only, which currently results in all casts going through helpers.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Unknown processor
.NET SDK 9.0.200
  [Host]     : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT SSE2
  Job-REPAKK : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT SSE2
  Job-TGKOOP : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT SSE2
| Method         | Toolchain             | Mean        | Error     | Ratio | Code Size |
|----------------|-----------------------|------------:|----------:|------:|----------:|
| FloatToInt32   | \x64_Main\corerun.exe | 34,066.4 ns | 51.09 ns  | 1.00  | 53 B |
| FloatToInt32   | \x64_PR\corerun.exe   | 406.6 ns    | 0.28 ns   | 0.01  | 66 B |
| FloatToUInt32  | \x64_Main\corerun.exe | 37,660.0 ns | 75.95 ns  | 1.00  | 53 B |
| FloatToUInt32  | \x64_PR\corerun.exe   | 406.6 ns    | 0.51 ns   | 0.01  | 65 B |
| FloatToInt64   | \x64_Main\corerun.exe | 37,929.7 ns | 123.98 ns | 1.00  | 55 B |
| FloatToInt64   | \x64_PR\corerun.exe   | 406.8 ns    | 0.33 ns   | 0.01  | 70 B |
| FloatToUInt64  | \x64_Main\corerun.exe | 38,531.0 ns | 117.54 ns | 1.00  | 55 B |
| FloatToUInt64  | \x64_PR\corerun.exe   | 672.9 ns    | 2.55 ns   | 0.02  | 100 B |
| DoubleToInt32  | \x64_Main\corerun.exe | 33,941.7 ns | 60.67 ns  | 1.00  | 53 B |
| DoubleToInt32  | \x64_PR\corerun.exe   | 407.0 ns    | 0.60 ns   | 0.01  | 68 B |
| DoubleToUInt32 | \x64_Main\corerun.exe | 37,075.2 ns | 76.08 ns  | 1.00  | 53 B |
| DoubleToUInt32 | \x64_PR\corerun.exe   | 409.6 ns    | 0.40 ns   | 0.01  | 66 B |
| DoubleToInt64  | \x64_Main\corerun.exe | 34,529.2 ns | 60.82 ns  | 1.00  | 55 B |
| DoubleToInt64  | \x64_PR\corerun.exe   | 406.8 ns    | 0.61 ns   | 0.01  | 72 B |
| DoubleToUInt64 | \x64_Main\corerun.exe | 41,012.6 ns | 162.60 ns | 1.00  | 55 B |
| DoubleToUInt64 | \x64_PR\corerun.exe   | 673.9 ns    | 0.98 ns   | 0.02  | 101 B |

Intel Skylake x86

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5608/22H2/2022Update)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.202
  [Host]     : .NET 9.0.3 (9.0.325.11113), X86 RyuJIT AVX2
  Job-IFOWKZ : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX2
  Job-MTSNPU : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX2
| Method         | Toolchain             | Mean         | Error     | Ratio | Code Size |
|----------------|-----------------------|-------------:|----------:|------:|----------:|
| FloatToInt32   | \x86_Main\corerun.exe | 95,745.2 ns  | 507.47 ns | 1.000 | 56 B |
| FloatToInt32   | \x86_PR\corerun.exe   | 745.2 ns     | 2.60 ns   | 0.008 | 65 B |
| FloatToUInt32  | \x86_Main\corerun.exe | 101,492.8 ns | 233.97 ns | 1.000 | 56 B |
| FloatToUInt32  | \x86_PR\corerun.exe   | 996.4 ns     | 6.03 ns   | 0.010 | 86 B |
| FloatToInt64   | \x86_Main\corerun.exe | 101,038.6 ns | 567.50 ns | 1.00  | 77 B |
| FloatToInt64   | \x86_PR\corerun.exe   | 100,588.1 ns | 250.48 ns | 1.00  | 77 B |
| FloatToUInt64  | \x86_Main\corerun.exe | 100,698.7 ns | 470.29 ns | 1.00  | 77 B |
| FloatToUInt64  | \x86_PR\corerun.exe   | 100,957.7 ns | 301.72 ns | 1.00  | 77 B |
| DoubleToInt32  | \x86_Main\corerun.exe | 93,668.0 ns  | 170.48 ns | 1.000 | 56 B |
| DoubleToInt32  | \x86_PR\corerun.exe   | 748.6 ns     | 4.83 ns   | 0.008 | 65 B |
| DoubleToUInt32 | \x86_Main\corerun.exe | 101,885.3 ns | 315.39 ns | 1.00  | 56 B |
| DoubleToUInt32 | \x86_PR\corerun.exe   | 1,400.2 ns   | 12.42 ns  | 0.01  | 92 B |
| DoubleToInt64  | \x86_Main\corerun.exe | 101,416.3 ns | 330.06 ns | 1.00  | 77 B |
| DoubleToInt64  | \x86_PR\corerun.exe   | 101,028.3 ns | 364.28 ns | 1.00  | 77 B |
| DoubleToUInt64 | \x86_Main\corerun.exe | 100,405.6 ns | 306.54 ns | 1.00  | 77 B |
| DoubleToUInt64 | \x86_PR\corerun.exe   | 100,552.7 ns | 409.43 ns | 1.00  | 77 B |
AMD Zen 5 x86

Casts to long/ulong can be accelerated with AVX-512 as well. (Waiting until #113930 lands to do this, due to conflicts.)

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Unknown processor
.NET SDK 9.0.200
  [Host]     : .NET 9.0.4 (9.0.425.16305), X86 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-MVRGFJ : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-TKZXLB : .NET 10.0.0 (42.42.42.42424), X86 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method         | Toolchain             | Mean        | Error     | Ratio | Code Size |
|----------------|-----------------------|------------:|----------:|------:|----------:|
| FloatToInt32   | \x86_Main\corerun.exe | 46,022.5 ns | 98.03 ns  | 1.000 | 56 B |
| FloatToInt32   | \x86_PR\corerun.exe   | 359.1 ns    | 1.30 ns   | 0.008 | 65 B |
| FloatToUInt32  | \x86_Main\corerun.exe | 46,534.7 ns | 50.23 ns  | 1.000 | 56 B |
| FloatToUInt32  | \x86_PR\corerun.exe   | 223.9 ns    | 2.19 ns   | 0.005 | 64 B |
| FloatToInt64   | \x86_Main\corerun.exe | 47,206.6 ns | 48.96 ns  | 1.00  | 77 B |
| FloatToInt64   | \x86_PR\corerun.exe   | 47,028.1 ns | 44.93 ns  | 1.00  | 77 B |
| FloatToUInt64  | \x86_Main\corerun.exe | 52,660.9 ns | 83.77 ns  | 1.00  | 77 B |
| FloatToUInt64  | \x86_PR\corerun.exe   | 51,163.5 ns | 138.14 ns | 0.97  | 77 B |
| DoubleToInt32  | \x86_Main\corerun.exe | 45,949.0 ns | 87.72 ns  | 1.000 | 56 B |
| DoubleToInt32  | \x86_PR\corerun.exe   | 355.9 ns    | 1.03 ns   | 0.008 | 65 B |
| DoubleToUInt32 | \x86_Main\corerun.exe | 46,553.0 ns | 138.63 ns | 1.000 | 56 B |
| DoubleToUInt32 | \x86_PR\corerun.exe   | 225.0 ns    | 1.81 ns   | 0.005 | 64 B |
| DoubleToInt64  | \x86_Main\corerun.exe | 46,140.4 ns | 54.25 ns  | 1.00  | 77 B |
| DoubleToInt64  | \x86_PR\corerun.exe   | 47,139.3 ns | 172.54 ns | 1.02  | 77 B |
| DoubleToUInt64 | \x86_Main\corerun.exe | 51,542.3 ns | 90.22 ns  | 1.00  | 77 B |
| DoubleToUInt64 | \x86_PR\corerun.exe   | 51,415.1 ns | 102.06 ns | 1.00  | 77 B |

saucecontrol marked this pull request as ready for review April 10, 2025 03:16
saucecontrol (Member, Author) left a comment

This is ready for review.
cc @tannergooding @dotnet/jit-contrib

Comment on lines -553 to -557
nextNode = LowerCast(node);
if (nextNode != nullptr)
{
return nextNode;
}
saucecontrol (Member, Author):

This is reverting a change from #97529. The new implementation always preserves the original cast node and does all IR manipulation ahead of it.

// GT_CAST(float/double, sbyte) = GT_CAST(GT_CAST(float/double, int32), sbyte)
// GT_CAST(float/double, int16) = GT_CAST(GT_CAST(double/double, int32), int16)
// GT_CAST(float/double, uint16) = GT_CAST(GT_CAST(double/double, int32), uint16)
//
saucecontrol (Member, Author):

This comment was copied from xarch, where lowering used to handle the intermediate int cast. That never applied here, and it no longer applies on xarch. I've removed the notes from all the headers and commented the asserts that check the assumptions instead.

// converted it to float -> double -> long conversion.
assert((dstType != TYP_LONG) || (srcType != TYP_FLOAT));
// If we don't have AVX10v2 saturating conversion instructions for
// floating->integral, we have to handle the saturation logic here.
saucecontrol (Member, Author):

This implementation is a complete rewrite, so it's best read top to bottom rather than compared against the current code.

@BruceForstall
Member

@dotnet/intel

@BruceForstall
Member

@khushal1996 You implemented the original code; it would be useful for you to comment on this change.

@BruceForstall
Member

/azp run runtime-coreclr libraries-jitstress, runtime-coreclr outerloop, runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-avx512, Fuzzlyn


Azure Pipelines successfully started running 5 pipeline(s).

@khushal1996
Member

@saucecontrol I am running the benchmark on IceLake to verify the change and then review the changes but so far, the change seems to improve perf.

@BruceForstall
Member

Looks like the additional test runs didn't find anything new.

@BruceForstall
Member

> @saucecontrol I am running the benchmark on IceLake to verify the change and then review the changes but so far, the change seems to improve perf.

@khushal1996 Any findings yet?

@khushal1996
Member

> @saucecontrol I am running the benchmark on IceLake to verify the change and then review the changes but so far, the change seems to improve perf.

> @khushal1996 Any findings yet?

Overall the change looks good. I ran some tests on ICX, and it looks like vpfixup slows things down on scalar conversions compared to packed conversions. I also tried the same changes on packed conversions, which showed the packed conversions performing faster with vpfixup.

ICX benchmarks (screenshots attached in the original comment):

  • Avx512 (image)
  • Avx2 (image)
  • SSE2 (image)

@@ -990,14 +976,14 @@ void Lowering::LowerCast(GenTree* tree)
    maxIntegralValue = comp->gtNewIconNode(static_cast<ssize_t>(UINT32_MAX));
    if (srcType == TYP_FLOAT)
    {
-       maxFloatSimdVal->f32[0] = static_cast<float>(UINT32_MAX);
+       maxFloatSimdVal->f32[0] = 4294967296.0f;
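One note on the literal: UINT32_MAX (4294967295) is not exactly representable as a float, and static_cast<float>(UINT32_MAX) rounds up to 4294967296.0f (2^32), so the old and new expressions produce the same value; the literal just makes the effective bound explicit. A quick C# check of the same rounding:

```csharp
using System;

// uint.MaxValue is 4294967295; the nearest float is 2^32, so the conversion
// rounds up and compares equal to the explicit literal used in the new code.
Console.WriteLine((float)uint.MaxValue == 4294967296.0f); // True
```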

Should we be using global constants for these numbers?

@saucecontrol
Member Author

Thanks for the extra numbers @khushal1996!

> Also tried the same changes on packed conversions which showed packed conversions performing faster with vpfixup.

Yeah, since the intrinsic expansion for vector conversion happens in the JIT front end, the table load can get hoisted out of the loop in benchmarks like this one, which helps quite a bit.

There are a couple of improvements we can make to the vector convert codegen, but I'll do them in another PR.
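To make the hoisting point concrete, here is a rough sketch of the two loop shapes being compared (hypothetical code, not the actual benchmark): the packed conversion is expanded in the JIT front end, so loop-invariant constants used by its expansion are visible to hoisting, while the scalar cast is expanded later, in lowering, after that opportunity has passed.

```csharp
using System.Runtime.Intrinsics;

static class ConvertLoops
{
    // Scalar path: each (int)src[i] is expanded in lowering, so any constants
    // the expansion needs are loaded inside the loop body.
    static void ConvertScalar(float[] src, int[] dst)
    {
        for (int i = 0; i < src.Length; i++)
            dst[i] = (int)src[i];
    }

    // Packed path: Vector128.ConvertToInt32 is expanded in the JIT front end,
    // so loop-invariant constants used by the expansion can be hoisted.
    static void ConvertPacked(float[] src, int[] dst)
    {
        int i = 0;
        for (; i + Vector128<float>.Count <= src.Length; i += Vector128<float>.Count)
        {
            Vector128<float> v = Vector128.LoadUnsafe(ref src[i]);
            Vector128.ConvertToInt32(v).StoreUnsafe(ref dst[i]);
        }
        for (; i < src.Length; i++) // scalar tail
            dst[i] = (int)src[i];
    }
}
```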
