Poor register allocation with hardware intrinsic (x86) #37216

### Description

I'm porting an algorithm from scalar double arithmetic to SIMD using the hardware intrinsics. After some testing I concluded that the SIMD version performs worse. It may be that I'm just not good at using SIMD instructions. However, looking at the asm produced by the JIT, I think there may be a problem.



### Configuration

.NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT

### Regression?



### Data

Look at one example:

```
                 var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | - xl - yl)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 00007ffc`ca6fdca4 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
 00007ffc`ca6fdca9 c5f17c4dc0      vhaddpd xmm1,xmm1,xmmword ptr [rbp-40h]
 00007ffc`ca6fdcae c5f9294db0      vmovapd xmmword ptr [rbp-50h],xmm1
                 var v03 = Avx.Subtract(v00, v02);
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 00007ffc`ca6fdcb3 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
 00007ffc`ca6fdcb8 c5f15c4db0      vsubpd  xmm1,xmm1,xmmword ptr [rbp-50h]
 00007ffc`ca6fdcbd c5f9294da0      vmovapd xmmword ptr [rbp-60h],xmm1
```

This comes from the BenchmarkDotNet disassembly.
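For reference, the two statements in the listing can be isolated in a small sketch like the one below. It assumes `v00` and `v01` are `Vector128<double>` (inferred from the `xmm` registers in the disassembly). The class and method names are hypothetical, and the sketch calls `Sse3.HorizontalAdd` and `Sse2.Subtract` directly, which, as far as I can tell, is what the `Avx.`-qualified calls resolve to for 128-bit operands.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical repro helper; not the original benchmark code.
public static class HaddSketch
{
    public static Vector128<double> HaddThenSubtract(Vector128<double> v00, Vector128<double> v01)
    {
        if (!Sse3.IsSupported)
            throw new PlatformNotSupportedException("SSE3 is required for HorizontalAdd.");

        // v02 = (v00[0] + v00[1], v01[0] + v01[1])
        Vector128<double> v02 = Sse3.HorizontalAdd(v00, v01);

        // v03 = v00 - v02, element by element
        Vector128<double> v03 = Sse2.Subtract(v00, v02);
        return v03;
    }
}
```

Calling `HaddSketch.HaddThenSubtract(Vector128.Create(1.0, 2.0), Vector128.Create(10.0, 20.0))` exercises the same two intrinsics as the listing, with both intermediate values held in locals.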

Also benchmark results:

|          Method |        EnvironmentVariables |       Mean |     Error |     StdDev |     Median |        Max |
|---------------- |---------------------------- |-----------:|----------:|-----------:|-----------:|-----------:|
|  AdditionDouble | COMPlus_EnableHWIntrinsic=0 |   1.112 ns | 0.0658 ns |  0.1187 ns |   1.100 ns |   1.400 ns |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=0 | 104.808 ns | 2.8832 ns |  6.0183 ns | 102.300 ns | 124.500 ns |
|  AdditionDouble | COMPlus_EnableHWIntrinsic=1 |   1.065 ns | 0.0645 ns |  0.0985 ns |   1.100 ns |   1.200 ns |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=1 | 196.530 ns | 9.3874 ns | 26.4772 ns | 178.950 ns | 268.100 ns |

### Analysis

If you look closely, each instruction seems to be handled in isolation, with its own register allocation, instead of allocation being global to the method. This means a LOT more memory loads and stores than seem necessary. There are plenty of registers to use besides xmm1...

The documentation on hardware intrinsics states that, for part of the compilation, intrinsics are represented in the tree as method calls. Maybe they are treated as methods for a bit too long, so each "method" gets register allocation only within its own local context.
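One thing that may be worth ruling out alongside the hypothesis above: unoptimized JIT output typically keeps every local in its stack slot and reloads it for each statement, which resembles the listing. This is only a guess at a possible cause, not something the disassembly establishes. A minimal sketch for checking whether an assembly was built with the JIT optimizer disabled follows; the class and method names are hypothetical, while `DebuggableAttribute.IsJITOptimizerDisabled` is the standard reflection property.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

// Hypothetical diagnostic helper; not part of the original report.
public static class JitOptimizationCheck
{
    // Returns true when the assembly carries a DebuggableAttribute that
    // asks the JIT not to optimize (typical of Debug builds).
    public static bool IsJitOptimizerDisabled(Assembly assembly)
    {
        DebuggableAttribute attr = assembly.GetCustomAttribute<DebuggableAttribute>();
        return attr != null && attr.IsJITOptimizerDisabled;
    }
}
```

Passing the assembly that contains the benchmarked method, for example `typeof(SomeBenchmarkClass).Assembly` with a placeholder type name, reports whether that code was built with optimizations turned off.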


category:cq
theme:register-allocator
skill-level:expert
cost:medium
