Description
I've been having a go at porting the XXH3 hash algorithm, including SSE and AVX versions. My current AVX2 code for the hot loop is here: https://github.com/Zhentar/xxHash3.NET/blob/ee6a626e87f2a829ec786690d4dfa560d876dda7/xxHash3/xxHash3_AVX2.cs#L103
So far I've gotten it up to 36 GB/s, against ~40 GB/s for the clang-compiled native version.
One piece of the hot loop as generated by clang looks like this:
vmovdqu ymm3, ymmword ptr [rax-360h]
vpaddd ymm4, ymm3, cs:ymmword_40BDC0
vpshufd ymm6, ymm4, 31h
vpmuludq ymm4, ymm6, ymm4
vpaddq ymm3, ymm5, ymm3
vpaddq ymm0, ymm0, ymm3
While my version looks like this:
vmovupd ymm8,ymmword ptr [r10+88h]
vmovupd ymm9,ymmword ptr [r11+360h]
vpaddd ymm8,ymm9,ymm8
vpshufd ymm10,ymm8,31h
vpmuludq ymm8,ymm8,ymm10
vpaddq ymm1,ymm9,ymm1
vpaddq ymm1,ymm8,ymm1
Or, if I arrange the code such that folding kicks in (by uncommenting the in modifier on the key argument of ProcessStripePiece_AVX2), this:
lea r14,[rax+20h]
vmovupd ymm4,ymmword ptr [rbp+100h]
vpaddd ymm5,ymm4,ymmword ptr [r14]
vpshufd ymm6,ymm5,31h
vpmuludq ymm5,ymm5,ymm6
vpaddq ymm0,ymm4,ymm0
vpaddq ymm0,ymm5,ymm0
However, the folded version performs worse, because the lea competes with the add/shuffle/multiply instructions for an integer ALU port instead of a load port.
Is there any way to get an immediate address folded into the vpaddd, instead of a displacement calculated at execution time? I've tried a static readonly field, but that still resulted in an lea displacement calculation.
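For readers who don't want to decode the listings: all three compute the same per-stripe step, and a scalar model of a single 64-bit lane may make it easier to follow. This is only a sketch in C, not the code from the linked repository. The function name accumulate_lane and its parameters are hypothetical, and it assumes the usual reading of the instructions (vpaddd adds 32-bit halves independently, vpshufd 31h moves each lane's high half into the low position, vpmuludq does a 32x32 -> 64 multiply).

```c
#include <stdint.h>

/* Hypothetical scalar model of ONE 64-bit lane of the stripe step.
   The real code processes four such lanes per ymm register.
   data and key each hold two packed 32-bit values. */
static uint64_t accumulate_lane(uint64_t acc, uint64_t data, uint64_t key)
{
    /* vpaddd: two independent 32-bit adds, no carry between halves */
    uint32_t lo = (uint32_t)data + (uint32_t)key;
    uint32_t hi = (uint32_t)(data >> 32) + (uint32_t)(key >> 32);

    /* vpshufd 31h + vpmuludq: multiply the low half by the high half,
       keeping the full 64-bit product */
    uint64_t product = (uint64_t)lo * (uint64_t)hi;

    /* the two vpaddq: add the raw data, then the product */
    acc += data;
    acc += product;
    return acc;
}
```

The only memory operand that differs between the listings is the key: clang folds it into vpaddd as a constant address, while the JIT either loads it into a register first or computes its address with lea.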
category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium