Poor register allocation with hardware intrinsic (x86) #37216

@ebfortin

Description

I'm porting an algorithm from scalar double arithmetic to SIMD using the Hardware Intrinsics. After some testing I concluded that the SIMD version performs worse. It may be that I'm just not good at using SIMD instructions. However, looking at the asm produced by the JIT, I think there may be a problem.

Configuration

.NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT

Regression?

Data

Look at one example:

                 var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | - xl - yl)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 00007ffc`ca6fdca4 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
 00007ffc`ca6fdca9 c5f17c4dc0      vhaddpd xmm1,xmm1,xmmword ptr [rbp-40h]
 00007ffc`ca6fdcae c5f9294db0      vmovapd xmmword ptr [rbp-50h],xmm1
                 var v03 = Avx.Subtract(v00, v02);
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 00007ffc`ca6fdcb3 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
 00007ffc`ca6fdcb8 c5f15c4db0      vsubpd  xmm1,xmm1,xmmword ptr [rbp-50h]
 00007ffc`ca6fdcbd c5f9294da0      vmovapd xmmword ptr [rbp-60h],xmm1

This comes from the disassembly produced by BenchmarkDotNet.

Also benchmark results:

| Method | EnvironmentVariables | Mean | Error | StdDev | Median | Max |
|---|---|---:|---:|---:|---:|---:|
| AdditionDouble | COMPlus_EnableHWIntrinsic=0 | 1.112 ns | 0.0658 ns | 0.1187 ns | 1.100 ns | 1.400 ns |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=0 | 104.808 ns | 2.8832 ns | 6.0183 ns | 102.300 ns | 124.500 ns |
| AdditionDouble | COMPlus_EnableHWIntrinsic=1 | 1.065 ns | 0.0645 ns | 0.0985 ns | 1.100 ns | 1.200 ns |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=1 | 196.530 ns | 9.3874 ns | 26.4772 ns | 178.950 ns | 268.100 ns |

Analysis

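If you look closely, each intrinsic call seems to be compiled in isolation, with its own register allocation, instead of registers being allocated across the whole method. In the listing above, every statement loads its operands from the stack (`[rbp-30h]`, `[rbp-40h]`, `[rbp-50h]`) and stores its result back to the stack, which means far more memory loads and stores than seem necessary. There are plenty of registers to use besides xmm1.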
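To look at the pattern above apart from the rest of the benchmark, the two calls can be isolated in a small program. This is only a sketch: the method name `HAddThenSub` and the input values are hypothetical, and it assumes the original operands were `Vector128<double>` (the disassembly uses xmm-width operands), so it calls `Sse3.HorizontalAdd` and `Sse2.Subtract` directly.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Repro
{
    // Hypothetical helper that mirrors the two statements from the listing.
    // NoInlining keeps it as a separate method so it is easy to find in a disassembly.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector128<double> HAddThenSub(Vector128<double> v00, Vector128<double> v01)
    {
        // Lane 0 = v00[0] + v00[1], lane 1 = v01[0] + v01[1]
        var v02 = Sse3.HorizontalAdd(v00, v01);
        // Element-wise v00 - v02
        var v03 = Sse2.Subtract(v00, v02);
        return v03;
    }

    public static void Main()
    {
        if (!Sse3.IsSupported)
        {
            Console.WriteLine("SSE3 is not supported on this machine.");
            return;
        }

        // Arbitrary sample inputs, not taken from the original benchmark.
        var a = Vector128.Create(1.0, 2.0);
        var b = Vector128.Create(3.0, 4.0);

        var r = HAddThenSub(a, b);
        Console.WriteLine(r.GetElement(0)); // 1 - (1 + 2) = -2
        Console.WriteLine(r.GetElement(1)); // 2 - (3 + 4) = -5
    }
}
```

Comparing the code generated for `HAddThenSub` with the listing above would show whether the stack loads and stores also appear when the two calls are compiled on their own.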

The documentation on Hardware Intrinsics states that, for part of the compilation pipeline, intrinsics are represented as method calls. Maybe they are treated as methods for a bit too long, so that each "method" gets register allocation only within its own local context.

category:cq
theme:register-allocator
skill-level:expert
cost:medium

Labels

area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
tenet-performance: Performance related issue
