
Vectorized code in loops: performance problems due to excessive stack <-> vector register moves #85359

Closed
@MadProbe

Description

I am writing an implementation of the adler32 checksum in C#, and it is nearly 2.5 times slower than the equivalent Rust implementation, simd-adler32.
benchmark repository with source code
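
For context, the vectorized path boils down to an AVX2 loop of roughly the following shape. This is only an illustrative sketch reconstructed from the disassembly shown below, not the exact code from the repository, and the class/method names are mine:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Adler32Sketch
{
    // Processes whole 32-byte blocks; the horizontal reduction and the
    // mod-65521 folding of the accumulators are left to the caller.
    public static void AccumulateBlocks(byte* p, byte* end,
        out Vector256<ulong> byteSums,    // feeds the "a" part of adler32
        out Vector256<int> weightedSums,  // feeds the "b" part of adler32
        out Vector256<ulong> blockCarry)  // per-block carry of "a" into "b"
    {
        // Per-lane weights 32..1, consumed by vpmaddubsw for the weighted sum.
        Vector256<sbyte> weights = Vector256.Create(
            (sbyte)32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17,
            16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);
        Vector256<short> ones = Vector256.Create((short)1);

        Vector256<ulong> vA = Vector256<ulong>.Zero;
        Vector256<int> vB = Vector256<int>.Zero;
        Vector256<ulong> vCarry = Vector256<ulong>.Zero;

        while (p < end)
        {
            Vector256<byte> block = Avx.LoadVector256(p);

            // "b" grows by 32 * (running "a") per block; track that separately.
            vCarry = Avx2.Add(vCarry, vA);

            // Sum of the 32 bytes -> 64-bit lanes (vpsadbw against zero).
            vA = Avx2.Add(vA, Avx2.SumAbsoluteDifferences(block, Vector256<byte>.Zero).AsUInt64());

            // bytes * weights -> adjacent 16-bit sums -> 32-bit sums
            // (vpmaddubsw followed by vpmaddwd).
            Vector256<short> prod = Avx2.MultiplyAddAdjacent(block, weights);
            vB = Avx2.Add(vB, Avx2.MultiplyAddAdjacent(prod, ones));

            p += 32;
        }

        byteSums = vA;
        weightedSums = vB;
        blockCarry = vCarry;
    }
}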

Configuration

See the benchmark results below.

Regression?

Also reproducible on .NET 7.0.

Data

BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19043.2251/21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.3.23178.7
  [Host]     : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2


|     Method |         data |       Mean |     Error |    StdDev | Ratio | Code Size |
|----------- |------------- |-----------:|----------:|----------:|------:|----------:|
| Vectorized | Byte[102400] |   8.366 us | 0.0691 us | 0.0646 us |  0.02 |     902 B |
|      Naive | Byte[102400] | 412.368 us | 2.9653 us | 2.7738 us |  1.00 |     215 B |

Benchmark of the Rust SIMD adler32 implementation (AVX2 path), measured with criterion.rs:

avx2/random-100k        time:   [3.5483 us 3.5684 us 3.5922 us]
                        thrpt:  [25.926 GiB/s 26.099 GiB/s 26.247 GiB/s]
--- some stuff is skipped
avx2/random-100k        time:   [3.5438 us 3.5655 us 3.5892 us]
                        thrpt:  [25.948 GiB/s 26.121 GiB/s 26.280 GiB/s]
                 change:
                        time:   [-1.5116% +0.2112% +2.0511%] (p = 0.82 > 0.05)
                        thrpt:  [-2.0099% -0.2107% +1.5348%]
                        No change in performance detected.

Analysis

Comparing the disassembled code of the Rust implementation and of my C# implementation, the main problem lies in the computation loop: it performs far too many memory <-> vector register moves, and it could be much more efficient if more YMM registers were used:

My C# code disassembly:

--- omitted for brevity
M00_L01:
       vmovups   ymm1,[rcx]
       vmovups   ymm2,[r10]
       vpaddq    ymm2,ymm2,[r8]
       vmovups   [r10],ymm2
       vpmaddubsw ymm2,ymm1,ymm0
       vpmaddwd  ymm2,ymm2,[7FFBA0BC7820]
       vpaddd    ymm2,ymm2,[r9]
       vmovups   [r9],ymm2
       vxorps    ymm2,ymm2,ymm2
       vpsadbw   ymm1,ymm1,ymm2
       vpaddq    ymm1,ymm1,[r8]
       vmovups   [r8],ymm1
       add       rcx,20
       cmp       rcx,r11
       jb        short M00_L01
--- omitted for brevity

And the Rust code's loop:
source: https://github.com/mcountryman/simd-adler32/blob/main/src/imp/avx2.rs#L123-L132

--- Decompiled using Ghidra
                             LAB_14002ada0                                   XREF[1]:     14002adf1(j)  
       14002ada0 c5 fe 6f        VMOVDQU    YMM6,ymmword ptr [RBX + RAX*0x1 + 0x1580]
                 b4 03 80 
                 15 00 00
       14002ada9 c5 dd fe db     VPADDD     YMM3,YMM4,YMM3
       14002adad c5 cd f6 f8     VPSADBW    YMM7,YMM6,YMM0
       14002adb1 c5 dd fe e7     VPADDD     YMM4,YMM4,YMM7
       14002adb5 c4 e2 4d        VPMADDUBSW YMM6,YMM6,YMM1
                 04 f1
       14002adba c5 cd f5 f2     VPMADDWD   YMM6,YMM6,YMM2
       14002adbe c5 cd fe ed     VPADDD     YMM5,YMM6,YMM5
       14002adc2 48 85 c0        TEST       RAX,RAX
       14002adc5 0f 84 25        JZ         LAB_14002acf0
                 ff ff ff
       14002adcb c5 fe 6f        VMOVDQU    YMM6,ymmword ptr [RBX + RAX*0x1 + 0x15a0]
                 b4 03 a0 
                 15 00 00
       14002add4 c5 dd fe db     VPADDD     YMM3,YMM4,YMM3
       14002add8 c5 cd f6 f8     VPSADBW    YMM7,YMM6,YMM0
       14002addc c5 dd fe e7     VPADDD     YMM4,YMM4,YMM7
       14002ade0 c4 e2 4d        VPMADDUBSW YMM6,YMM6,YMM1
                 04 f1
       14002ade5 c5 cd f5 f2     VPMADDWD   YMM6,YMM6,YMM2
       14002ade9 c5 cd fe ed     VPADDD     YMM5,YMM6,YMM5
       14002aded 48 83 c0 40     ADD        RAX,0x40
       14002adf1 eb ad           JMP        LAB_14002ada0

The Rust loop is unrolled once (two 32-byte blocks per iteration), but you can see that it performs almost no memory moves, which is why it is roughly 2.5x faster than my C# implementation.

To optimize this case, I think the JIT should, during its optimization phase, check whether the code ever manipulates a vector local's stack storage directly; if it does not, the JIT should keep such vector variables in YMM registers, especially inside loops, and only spill them back to the stack when necessary.
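
For illustration, the codegen I am asking for is essentially what you get when the accumulators are plain, non-address-exposed locals for the whole loop. Here is a minimal sketch of that manual shape, assuming the accumulators were originally reached through refs or pointers; the helper name and its parameters are mine, not from the benchmark code:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Adler32Workaround
{
    public static void AccumulateEnregistered(byte* p, byte* end,
        ref Vector256<ulong> a, ref Vector256<int> b, ref Vector256<ulong> carry,
        Vector256<sbyte> weights, Vector256<short> ones)
    {
        // Copy the ref-reachable accumulators into plain locals so they can stay
        // in YMM registers inside the loop instead of round-tripping through
        // memory on every iteration.
        Vector256<ulong> vA = a;
        Vector256<int> vB = b;
        Vector256<ulong> vCarry = carry;

        while (p < end)
        {
            Vector256<byte> block = Avx.LoadVector256(p);
            vCarry = Avx2.Add(vCarry, vA);
            vA = Avx2.Add(vA, Avx2.SumAbsoluteDifferences(block, Vector256<byte>.Zero).AsUInt64());
            vB = Avx2.Add(vB, Avx2.MultiplyAddAdjacent(Avx2.MultiplyAddAdjacent(block, weights), ones));
            p += 32;
        }

        // Write the accumulators back exactly once, after the loop.
        a = vA;
        b = vB;
        carry = vCarry;
    }
}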

Labels

Priority:2 (Work that is important, but not critical for the release)
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
tenet-performance (Performance related issue)
