Function operating on arrays performs a lot worse for arrays of length 2 #102274

Open
@Canleskis

Hello, after some discussion on Stack Overflow, it was suggested that I file an issue here.

The following function performs unexpectedly poorly with N = 2.

fn fold<const N: usize>(v: Vec<Array<N>>) -> Vec<Array<N>> {
    let result = v.iter().map(|a1| {
        v.iter().fold(Array::default(), |acc, a2| {
            let d = *a2 - *a1;

            acc + d
        })
    });

    result.collect()
}

Array is a simple wrapper implementing Add and Sub around an array of f32s. The results are the same without that wrapper.
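For readers who want to reproduce this without the linked repository, here is a minimal sketch of what such a wrapper could look like. The exact definition of `Array` is not shown in this issue, so the tuple-struct layout, the derives, and the element-wise loops below are assumptions, not the original code. The `fold` function is the one from above, condensed.

```rust
use std::ops::{Add, Sub};

/// Hypothetical reconstruction of the `Array` wrapper (assumed layout:
/// a tuple struct around `[f32; N]`).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Array<const N: usize>([f32; N]);

impl<const N: usize> Default for Array<N> {
    // `#[derive(Default)]` is not available for arbitrary `N`, so zero-fill by hand.
    fn default() -> Self {
        Array([0.0; N])
    }
}

impl<const N: usize> Add for Array<N> {
    type Output = Self;

    // Element-wise addition.
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
            *o += *r;
        }
        Array(out)
    }
}

impl<const N: usize> Sub for Array<N> {
    type Output = Self;

    // Element-wise subtraction.
    fn sub(self, rhs: Self) -> Self {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
            *o -= *r;
        }
        Array(out)
    }
}

/// For every element `a1`, sum the differences `a2 - a1` over all elements `a2`.
fn fold<const N: usize>(v: Vec<Array<N>>) -> Vec<Array<N>> {
    v.iter()
        .map(|a1| v.iter().fold(Array::default(), |acc, a2| acc + (*a2 - *a1)))
        .collect()
}
```

Calling `fold::<2>` and `fold::<3>` on vectors of the same length should be enough to compare the two cases in a benchmark.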

Here is a graph generated with criterion of the benchmark of that function:

[Image: criterion benchmark graph]

Someone on the Stack Overflow discussion suggested that this was because:
"The shl, shr and or operations on rdx and rsi suggest that for N = 2 the two floats are stored in one 64 bit general purpose register, whereas in the other cases the value of a1 is persisted in N separate xmm registers".
Another user suggested that this poor optimization is triggered by opt-level=3, which would place the problem on the LLVM side.
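To make the first suggestion more concrete, here is a rough sketch of what keeping two f32 values in a single 64-bit integer involves at the source level. This is only an illustration of the kind of shift and or work described in the quote, not the code the compiler actually emits. The function names `pack` and `unpack` and the choice of which half holds which element are assumptions.

```rust
/// Pack two f32 values into one u64 (assumed layout: `lo` in bits 0..32,
/// `hi` in bits 32..64). Needs a shift and an or.
fn pack(lo: f32, hi: f32) -> u64 {
    (lo.to_bits() as u64) | ((hi.to_bits() as u64) << 32)
}

/// Recover both floats from the packed value. Needs a shift and a truncation.
fn unpack(packed: u64) -> (f32, f32) {
    let lo = f32::from_bits(packed as u32);
    let hi = f32::from_bits((packed >> 32) as u32);
    (lo, hi)
}
```

If the value of a1 were really kept in this form, each use of it in floating-point arithmetic would first need work like `unpack`, which is extra integer work that the other cases would not need if a1 stays in separate xmm registers there.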

Here is the assembly code where a user marked important areas.

Here is a repository of the full code for the benchmark.

This was observed on both stable and nightly.

Labels

A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
A-array (Area: `[T; N]`)
A-codegen (Area: Code generation)
C-bug (Category: This is a bug.)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)