Codegen weirdness for `sum` of `count_ones` over an array

(Issue loosely owned by @wesleywiser and @pnkfelix, who are monitoring https://github.com/llvm/llvm-project/issues/57476)

### Original Description below

```rust
pub fn f(arr: [u64; 2]) -> u32 {
    arr.into_iter().map(u64::count_ones).sum()
}
```

Before 1.62.0, this code correctly compiled to two popcounts and an addition on a modern x86-64 target.

```asm
example::f:
        popcnt  rcx, qword ptr [rdi]
        popcnt  rax, qword ptr [rdi + 8]
        add     eax, ecx
        ret
```

Since 1.62.0 (up to latest nightly), the codegen is... [baffling at best.](https://godbolt.org/z/G9fa4Y8T7)

```asm
.LCPI0_0:
        .zero   16,15
.LCPI0_1:
        .byte   0
        .byte   1
        .byte   1
        .byte   2
        .byte   1
        .byte   2
        .byte   2
        .byte   3
        .byte   1
        .byte   2
        .byte   2
        .byte   3
        .byte   2
        .byte   3
        .byte   3
        .byte   4
example::f:
        sub     rsp, 40
        vmovups xmm0, xmmword ptr [rdi]
        vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0]
        vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1]
        vmovaps xmmword ptr [rsp], xmm0
        vmovdqa xmm0, xmmword ptr [rsp]
        vpand   xmm2, xmm0, xmm1
        vpsrlw  xmm0, xmm0, 4
        vpand   xmm0, xmm0, xmm1
        vpshufb xmm2, xmm3, xmm2
        vpxor   xmm1, xmm1, xmm1
        vpshufb xmm0, xmm3, xmm0
        vpaddb  xmm0, xmm0, xmm2
        vpsadbw xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 170
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        add     rsp, 40
        ret
```

The assembly for the original function is now a terribly misguided autovectorization. And, just to make sure (even though it's pretty obvious), I did run a benchmark - the autovectorized function is ~8x slower on my Zen 2 system.

Calling that function from a different function brings back normal assembly. `-Cno-vectorize-slp` does nothing. I don't know exactly what `-Cno-vectorize-loops` does, but it's not good.
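For readers who want to reproduce the "called from a different function" case, here is a minimal sketch. The wrapper name `g` is hypothetical and not part of the original report. `f` is written as `IntoIterator::into_iter(arr)`, which is what `arr.into_iter()` resolves to in the 2021 edition, so that it iterates by value on any edition.

```rust
// The function from the report. `IntoIterator::into_iter(arr)` is the
// by-value array iterator that `arr.into_iter()` resolves to in edition 2021.
pub fn f(arr: [u64; 2]) -> u32 {
    IntoIterator::into_iter(arr).map(u64::count_ones).sum()
}

// Hypothetical passthrough caller (the name `g` is an assumption).
// Compare the assembly emitted for `g` against the assembly for `f`.
pub fn g(arr: [u64; 2]) -> u32 {
    f(arr)
}
```

Both functions return the same value. Only the generated assembly is expected to differ, and that may vary with compiler version and flags.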

If you change the length of the array to 4, both functions (the original and the passthrough caller) get autovectorized. `-Cno-vectorize-slp` now fixes the second function. Adding `-Cno-vectorize-loops` causes the passthrough function to generate the worst assembly.

Changing `into_iter` to `iter` fixes length 2, but doesn't fix length 4.

I could go on, but in short it's a whole mess.

I found a workaround that consistently works for all lengths: `iter` and `-Cno-vectorize-slp`.
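A sketch of the source half of that workaround, for lengths 2 and 4. The names `f_iter2` and `f_iter4` are hypothetical. The other half is the codegen flag, passed on the command line as `-C no-vectorize-slp` (or `-Cno-vectorize-slp`, as written above) alongside whatever optimization and target flags you already use. I have not re-verified the resulting assembly here, so check it on your own toolchain.

```rust
// `iter` borrows the array and yields `&u64`, so a closure is used here
// in place of the `u64::count_ones` function path.
pub fn f_iter2(arr: [u64; 2]) -> u32 {
    arr.iter().map(|x| x.count_ones()).sum()
}

// Same pattern for length 4, the case where `iter` alone was reported
// as insufficient without the codegen flag.
pub fn f_iter4(arr: [u64; 4]) -> u32 {
    arr.iter().map(|x| x.count_ones()).sum()
}
```

These compute the same result as the `into_iter` version. The change is intended only to affect how LLVM sees the loop.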

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged +A-array +A-codegen +A-iterators +A-LLVM +A-simd +I-slow +O-x86_64 +perf-regression

