Description
I am writing an implementation of the Adler-32 checksum in C#, and I am seeing a significant performance gap compared with an equivalent implementation written in Rust (simd-adler32): my version is nearly 2.5 times slower.
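For reference, Adler-32 (RFC 1950) keeps two running sums: `s1` is 1 plus the sum of all bytes, `s2` is the running sum of `s1` after each byte, both modulo 65521, and the checksum is `(s2 << 16) | s1`. A minimal scalar sketch (written in Rust to match the simd-adler32 side; this is not the benchmarked code):

```rust
const MOD: u32 = 65521;

/// Byte-at-a-time Adler-32 as defined in RFC 1950.
fn adler32(data: &[u8]) -> u32 {
    let mut s1: u32 = 1;
    let mut s2: u32 = 0;
    for &b in data {
        s1 = (s1 + b as u32) % MOD;
        s2 = (s2 + s1) % MOD;
    }
    (s2 << 16) | s1
}

fn main() {
    // Well-known test vector: Adler-32 of "Wikipedia" is 0x11E60398.
    assert_eq!(adler32(b"Wikipedia"), 0x11E6_0398);
    println!("{:#010x}", adler32(b"Wikipedia"));
}
```

The per-byte `%` is what the SIMD versions avoid by deferring the modulo across whole blocks.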
benchmark repository with source code
Configuration
See benchmark results
Regression?
Data
BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19043.2251/21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.3.23178.7
[Host] : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2
| Method | data | Mean | Error | StdDev | Ratio | Code Size |
|----------- |------------- |-----------:|----------:|----------:|------:|----------:|
| Vectorized | Byte[102400] | 8.366 us | 0.0691 us | 0.0646 us | 0.02 | 902 B |
| Naive | Byte[102400] | 412.368 us | 2.9653 us | 2.7738 us | 1.00 | 215 B |
For comparison, the AVX2 path of the Rust SIMD Adler-32 implementation, benchmarked with criterion.rs:
avx2/random-100k time: [3.5483 us 3.5684 us 3.5922 us]
thrpt: [25.926 GiB/s 26.099 GiB/s 26.247 GiB/s]
--- some stuff is skipped
avx2/random-100k time: [3.5438 us 3.5655 us 3.5892 us]
thrpt: [25.948 GiB/s 26.121 GiB/s 26.280 GiB/s]
change:
time: [-1.5116% +0.2112% +2.0511%] (p = 0.82 > 0.05)
thrpt: [-2.0099% -0.2107% +1.5348%]
No change in performance detected.
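Converting both results to the same units makes the gap concrete. Assuming criterion's `random-100k` buffer is 100,000 bytes (which reproduces the reported ~26.1 GiB/s), while the BenchmarkDotNet buffer is 102,400 bytes, the difference works out to roughly 2.3x:

```rust
/// Throughput in GiB/s for `bytes` processed in `secs` seconds.
fn gib_per_s(bytes: f64, secs: f64) -> f64 {
    bytes / secs / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // C# Vectorized: 102,400 bytes in a mean of 8.366 us (BenchmarkDotNet).
    let csharp = gib_per_s(102_400.0, 8.366e-6);
    // Rust avx2/random-100k: assumed 100,000 bytes in a mean of 3.5684 us
    // (criterion); this matches the reported 26.099 GiB/s.
    let rust = gib_per_s(100_000.0, 3.5684e-6);
    println!("C#: {csharp:.1} GiB/s, Rust: {rust:.1} GiB/s, gap: {:.2}x", rust / csharp);
}
```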
Analysis
Comparing the disassembly of the Rust code and my C# implementation, the main problem lies in the computation loop: there are too many memory <-> vector register moves, and the loop could be much more efficient if more YMM registers were used.
My C# code disassembly:
--- omitted for brevity
M00_L01:
vmovups ymm1,[rcx]
vmovups ymm2,[r10]
vpaddq ymm2,ymm2,[r8]
vmovups [r10],ymm2
vpmaddubsw ymm2,ymm1,ymm0
vpmaddwd ymm2,ymm2,[7FFBA0BC7820]
vpaddd ymm2,ymm2,[r9]
vmovups [r9],ymm2
vxorps ymm2,ymm2,ymm2
vpsadbw ymm1,ymm1,ymm2
vpaddq ymm1,ymm1,[r8]
vmovups [r8],ymm1
add rcx,20
cmp rcx,r11
jb short M00_L01
--- omitted for brevity
And the Rust code's loop:
source: https://github.com/mcountryman/simd-adler32/blob/main/src/imp/avx2.rs#L123-L132
--- Decompiled using Ghidra
LAB_14002ada0:                                 ; XREF[1]: 14002adf1(j)
14002ada0  VMOVDQU    YMM6, ymmword ptr [RBX + RAX*0x1 + 0x1580]
14002ada9  VPADDD     YMM3, YMM4, YMM3
14002adad  VPSADBW    YMM7, YMM6, YMM0
14002adb1  VPADDD     YMM4, YMM4, YMM7
14002adb5  VPMADDUBSW YMM6, YMM6, YMM1
14002adba  VPMADDWD   YMM6, YMM6, YMM2
14002adbe  VPADDD     YMM5, YMM6, YMM5
14002adc2  TEST       RAX, RAX
14002adc5  JZ         LAB_14002acf0
14002adcb  VMOVDQU    YMM6, ymmword ptr [RBX + RAX*0x1 + 0x15a0]
14002add4  VPADDD     YMM3, YMM4, YMM3
14002add8  VPSADBW    YMM7, YMM6, YMM0
14002addc  VPADDD     YMM4, YMM4, YMM7
14002ade0  VPMADDUBSW YMM6, YMM6, YMM1
14002ade5  VPMADDWD   YMM6, YMM6, YMM2
14002ade9  VPADDD     YMM5, YMM6, YMM5
14002aded  ADD        RAX, 0x40
14002adf1  JMP        LAB_14002ada0
The Rust loop is unrolled once (two 32-byte chunks per pass), but you can see that it performs almost no memory moves, which is why it is roughly 2.5x faster than my C# implementation.
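For context on what both loops compute: they use the standard blocked reformulation of Adler-32. For a block of n bytes b_0..b_{n-1}, the per-byte recurrence collapses to s1' = s1 + Σ b_j and s2' = s2 + n*s1 + Σ (n - j)*b_j. In the disassembly, `vpsadbw` produces the plain byte sum and `vpmaddubsw`/`vpmaddwd` with a [32, 31, …, 1] weight vector produce the weighted sum (the n*s1 term is accumulated separately). A scalar sketch of that identity for one 32-byte chunk (illustrative only, not the benchmarked code):

```rust
const MOD: u32 = 65521;

/// Blocked update for one 32-byte chunk, mirroring the AVX2 reduction:
/// s1' = s1 + sum(b), s2' = s2 + 32*s1 + sum((32 - j) * b[j]).
fn adler_block32(s1: u32, s2: u32, block: &[u8; 32]) -> (u32, u32) {
    let byte_sum: u32 = block.iter().map(|&b| b as u32).sum(); // vpsadbw
    let weighted: u32 = block
        .iter()
        .enumerate()
        .map(|(j, &b)| (32 - j as u32) * b as u32) // vpmaddubsw + vpmaddwd
        .sum();
    ((s1 + byte_sum) % MOD, (s2 + 32 * s1 + weighted) % MOD)
}

/// Byte-at-a-time reference for comparison.
fn adler_scalar(s1: u32, s2: u32, data: &[u8]) -> (u32, u32) {
    data.iter().fold((s1, s2), |(s1, s2), &b| {
        let s1 = (s1 + b as u32) % MOD;
        (s1, (s2 + s1) % MOD)
    })
}

fn main() {
    let block: [u8; 32] = core::array::from_fn(|i| (i * 7 + 3) as u8);
    assert_eq!(adler_block32(1, 0, &block), adler_scalar(1, 0, &block));
    println!("block reduction matches scalar");
}
```

The point of the blocked form is that all three partial sums fit in vector accumulators, so the whole inner loop can live in YMM registers, which is exactly what the Rust codegen achieves and my C# codegen does not.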
To optimize this case, I think the JIT should, during its optimization phase, detect when stack-allocated vector data is never manipulated directly through memory and, if so, promote those vector variables into YMM registers (especially inside loops), spilling them back to the stack only when necessary.