Description
I'm porting an algorithm from scalar double arithmetic to SIMD using the hardware intrinsics. After some testing, I concluded that the SIMD version performs worse. It may be that I'm just not good at using SIMD instructions. However, looking at the asm produced by the JIT, I think there may be a problem.
Configuration
.NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
Regression?
Data
Look at one example:
var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | - xl - yl)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffc`ca6fdca4 c5f9284dd0 vmovapd xmm1,xmmword ptr [rbp-30h]
00007ffc`ca6fdca9 c5f17c4dc0 vhaddpd xmm1,xmm1,xmmword ptr [rbp-40h]
00007ffc`ca6fdcae c5f9294db0 vmovapd xmmword ptr [rbp-50h],xmm1
var v03 = Avx.Subtract(v00, v02);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffc`ca6fdcb3 c5f9284dd0 vmovapd xmm1,xmmword ptr [rbp-30h]
00007ffc`ca6fdcb8 c5f15c4db0 vsubpd xmm1,xmm1,xmmword ptr [rbp-50h]
00007ffc`ca6fdcbd c5f9294da0 vmovapd xmmword ptr [rbp-60h],xmm1
This comes from the disassembly produced by BenchmarkDotNet.
Benchmark results:
| Method | EnvironmentVariables | Mean | Error | StdDev | Median | Max |
|---|---|---|---|---|---|---|
| AdditionDouble | COMPlus_EnableHWIntrinsic=0 | 1.112 ns | 0.0658 ns | 0.1187 ns | 1.100 ns | 1.400 ns |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=0 | 104.808 ns | 2.8832 ns | 6.0183 ns | 102.300 ns | 124.500 ns |
| AdditionDouble | COMPlus_EnableHWIntrinsic=1 | 1.065 ns | 0.0645 ns | 0.0985 ns | 1.100 ns | 1.200 ns |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=1 | 196.530 ns | 9.3874 ns | 26.4772 ns | 178.950 ns | 268.100 ns |
Analysis
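If you look closely, each statement seems to be compiled in isolation, with its own register allocation, instead of registers being allocated across the whole method. For example, v02 is stored to [rbp-50h] and then immediately read back from memory by the next vsubpd, and v00 is reloaded from [rbp-30h] for both statements. This means a lot more memory loads and stores than seem necessary. There are plenty of registers to use besides xmm1.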
The documentation on hardware intrinsics states that, for part of the compilation pipeline, intrinsics are represented as method calls. Maybe they are treated as methods for a bit too long, so that each "method" gets register allocation only within its own local context.
category:cq
theme:register-allocator
skill-level:expert
cost:medium