
HWIntrinsics: Load folding to immediate address? #12308


Description

@Zhentar

I've been having a go at porting the XXH3 hash algorithm, including the SSE & AVX versions. My current AVX2 code for the hot loop is here: https://github.com/Zhentar/xxHash3.NET/blob/ee6a626e87f2a829ec786690d4dfa560d876dda7/xxHash3/xxHash3_AVX2.cs#L103

So far I've gotten it up to 36 GB/s, against the clang-compiled native version's ~40 GB/s.
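For reference, the piece discussed below corresponds roughly to the following intrinsics sequence (a simplified sketch; the method name comes from the linked file, but the exact signature and variable names here are approximations). The in modifier on the key parameter is shown commented out, which is the arrangement the second listing below comes from:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Simplified sketch of one stripe piece (signature approximate).
static void ProcessStripePiece_AVX2(ref Vector256<ulong> acc,
                                    Vector256<uint> data,
                                    /*in*/ Vector256<uint> key)
{
    Vector256<uint> dataKey = Avx2.Add(data, key);              // vpaddd
    Vector256<uint> swapped = Avx2.Shuffle(dataKey, 0x31);      // vpshufd, imm8 = 31h
    Vector256<ulong> product = Avx2.Multiply(dataKey, swapped); // vpmuludq (32x32 -> 64-bit)
    acc = Avx2.Add(acc, data.AsUInt64());                       // vpaddq
    acc = Avx2.Add(acc, product);                               // vpaddq
}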

One sub-piece, as compiled by clang, looks like this:

vmovdqu ymm3, ymmword ptr [rax-360h]
vpaddd  ymm4, ymm3, cs:ymmword_40BDC0
vpshufd ymm6, ymm4, 31h
vpmuludq ymm4, ymm6, ymm4
vpaddq  ymm3, ymm5, ymm3
vpaddq  ymm0, ymm0, ymm3

While my version looks like this:

vmovupd ymm8,ymmword ptr [r10+88h]
vmovupd ymm9,ymmword ptr [r11+360h]
vpaddd  ymm8,ymm9,ymm8
vpshufd ymm10,ymm8,31h
vpmuludq ymm8,ymm8,ymm10
vpaddq  ymm1,ymm9,ymm1
vpaddq  ymm1,ymm8,ymm1

Or, if I arrange the code so that load folding kicks in (uncommenting the in modifier on the ProcessStripePiece_AVX2 key argument, as sketched after the listing below), this:

lea     r14,[rax+20h]
vmovupd ymm4,ymmword ptr [rbp+100h]
vpaddd  ymm5,ymm4,ymmword ptr [r14]
vpshufd ymm6,ymm5,31h
vpmuludq ymm5,ymm5,ymm6
vpaddq  ymm0,ymm4,ymm0
vpaddq  ymm0,ymm5,ymm0
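Concretely, the only change relative to the sketch above is the key parameter (still approximate):

// With the in modifier uncommented, the key stays addressable in memory,
// so the JIT can fold the key load into vpaddd instead of emitting a separate vmovupd.
static void ProcessStripePiece_AVX2(ref Vector256<ulong> acc,
                                    Vector256<uint> data,
                                    in Vector256<uint> key)
{
    // same add/shuffle/multiply/add sequence as in the sketch above
}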

However, the folded version performs worse, because the lea competes with the add/shuf/mul instructions for an integer ALU port, rather than the address calculation being handled on a load port.

Is there any way to get an immediate address folded into the vpaddd, instead of a displacement calculated at execution time? I've tried a static readonly field (roughly as sketched below), but that still resulted in an lea displacement calculation.
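For reference, the static readonly attempt looked roughly like this (a sketch; the field values are placeholders, not the real key material):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class KeyedAdd
{
    // Placeholder values, not the actual key constants.
    private static readonly Vector256<uint> Key0 =
        Vector256.Create(1u, 2u, 3u, 4u, 5u, 6u, 7u, 8u);

    static Vector256<uint> AddKey(Vector256<uint> data)
    {
        // Hoped for: vpaddd ymm, ymm, ymmword ptr [<immediate address>]
        // Observed:  an lea/displacement calculation for the field address, then a register-relative load
        return Avx2.Add(data, Key0);
    }
}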

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium


Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization, tenet-performance (Performance related issue)
