Codegen weirdness for sum of count_ones over an array #101060

Open
@alion02

(Issue loosely owned by @wesleywiser and @pnkfelix, who are monitoring llvm/llvm-project#57476.)

Original Description below

pub fn f(arr: [u64; 2]) -> u32 {
    arr.into_iter().map(u64::count_ones).sum()
}

Before 1.62.0, this code correctly compiled to two popcounts and an addition on a modern x86-64 target.

example::f:
        popcnt  rcx, qword ptr [rdi]
        popcnt  rax, qword ptr [rdi + 8]
        add     eax, ecx
        ret

Since 1.62.0 (up to latest nightly), the codegen is... baffling at best.

.LCPI0_0:
        .zero   16,15
.LCPI0_1:
        .byte   0
        .byte   1
        .byte   1
        .byte   2
        .byte   1
        .byte   2
        .byte   2
        .byte   3
        .byte   1
        .byte   2
        .byte   2
        .byte   3
        .byte   2
        .byte   3
        .byte   3
        .byte   4
example::f:
        sub     rsp, 40
        vmovups xmm0, xmmword ptr [rdi]
        vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0]
        vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1]
        vmovaps xmmword ptr [rsp], xmm0
        vmovdqa xmm0, xmmword ptr [rsp]
        vpand   xmm2, xmm0, xmm1
        vpsrlw  xmm0, xmm0, 4
        vpand   xmm0, xmm0, xmm1
        vpshufb xmm2, xmm3, xmm2
        vpxor   xmm1, xmm1, xmm1
        vpshufb xmm0, xmm3, xmm0
        vpaddb  xmm0, xmm0, xmm2
        vpsadbw xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 170
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        add     rsp, 40
        ret

The assembly for the original function is now a terribly misguided autovectorization. And, just to make sure (even though it's pretty obvious), I did run a benchmark - the autovectorized function is ~8x slower on my Zen 2 system.
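For readers unfamiliar with the vectorized sequence, here is a rough scalar model of what it appears to compute: a per-nibble table lookup (the 16 bytes at `.LCPI0_1`), with the `0x0f` mask from `.LCPI0_0`, followed by a horizontal sum. The function name `f_nibble_lut` is made up for illustration, and the mapping of instructions to steps in the comments is my reading of the assembly, not something confirmed by LLVM.

```rust
/// Per-nibble popcount table, matching the 16 bytes at `.LCPI0_1`.
const NIBBLE_POPCNT: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

/// Scalar model of the vectorized code (hypothetical name, illustration only).
pub fn f_nibble_lut(arr: [u64; 2]) -> u32 {
    let mut total = 0u32;
    for word in arr {
        for byte in word.to_le_bytes() {
            // Low nibble: roughly the first `vpand` with the 0x0f mask.
            let lo = (byte & 0x0f) as usize;
            // High nibble: roughly `vpsrlw ..., 4` followed by the second `vpand`.
            let hi = (byte >> 4) as usize;
            // Two table lookups (`vpshufb`) and a byte add (`vpaddb`).
            total += u32::from(NIBBLE_POPCNT[lo] + NIBBLE_POPCNT[hi]);
        }
    }
    // The running total stands in for the `vpsadbw` / `vpaddd` horizontal sum.
    total
}
```

This gives the same result as two `popcnt` instructions and an `add`, which is why the vectorized form buys nothing when only two 64-bit words are involved.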

Calling that function from a different function brings back normal assembly. -Cno-vectorize-slp does nothing. I don't know exactly what -Cno-vectorize-loops does, but it's not good.
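A minimal sketch of the "different function" setup, assuming a plain passthrough caller. The name `g` is hypothetical; the exact caller used for the observation above is not shown in this issue. `IntoIterator::into_iter(arr)` is written out explicitly so the snippet iterates by value on any edition; on edition 2021 it is the same call as `arr.into_iter()`.

```rust
pub fn f(arr: [u64; 2]) -> u32 {
    // By-value array iteration, as in the original report.
    IntoIterator::into_iter(arr).map(u64::count_ones).sum()
}

// Hypothetical passthrough caller: it only forwards its argument to `f`.
pub fn g(arr: [u64; 2]) -> u32 {
    f(arr)
}
```

Compiling both in one crate and comparing the assembly of `f` and `g` should show whether the two differ on a given compiler version.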

If you change the length of the array to 4, both functions (the original and its caller) get autovectorized. At this length, -Cno-vectorize-slp does fix the second function. Adding -Cno-vectorize-loops causes the passthrough function to generate the worst assembly.

Changing into_iter to iter fixes length 2, but doesn't fix length 4.
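For reference, a sketch of the `iter` variants at both lengths. The names `f_iter2` and `f_iter4` are made up here; only the switch from `into_iter` to `iter` comes from the description above.

```rust
// Length 2 with `iter`: reported above as fixed.
pub fn f_iter2(arr: [u64; 2]) -> u32 {
    arr.iter().map(|x| x.count_ones()).sum()
}

// Length 4 with `iter`: reported above as still autovectorized.
pub fn f_iter4(arr: [u64; 4]) -> u32 {
    arr.iter().map(|x| x.count_ones()).sum()
}
```

Both functions return the same values as the `into_iter` versions; only the generated code is expected to differ.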

I could go on, but in short it's a whole mess.

I found a workaround that consistently works for all lengths: iter and -Cno-vectorize-slp.
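A sketch of that workaround written once for any length, assuming a const-generic helper (the name `popcount_sum` is hypothetical). The build line in the comment is also an assumption: `-C no-vectorize-slp` is the flag mentioned above, but the optimization level and any target-cpu setting used for the original measurements are not stated in this issue, and I have not checked the codegen of this generic form.

```rust
// Assumed build line (adjust for your setup):
//   rustc -O -C no-vectorize-slp lib.rs
pub fn popcount_sum<const N: usize>(arr: [u64; N]) -> u32 {
    // `iter` rather than `into_iter`, per the workaround described above.
    arr.iter().map(|x| x.count_ones()).sum()
}
```

Usage is the same for every length, for example `popcount_sum([a, b])` or `popcount_sum([a, b, c, d])`.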

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged +A-array +A-codegen +A-iterators +A-LLVM +A-simd +I-slow +O-x86_64 +perf-regression

    Labels

    A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
    A-autovectorization: Area: Autovectorization, which can impact perf or code size
    A-codegen: Area: Code generation
    C-bug: Category: This is a bug.
    I-slow: Issue: Problems and improvements with respect to performance of generated code.
    O-x86_64: Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64)
    P-high: High priority
    T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.
    WG-llvm: Working group: LLVM backend code generation
    regression-from-stable-to-stable: Performance or correctness regression from one stable version to another.
