
Slow code generated for _mm256_mulhi_epi16 #130782

Open
@turalcar

Description

I suspect this is an issue in upstream LLVM. The SSE2 version (_mm_mulhi_epi16) and the unsigned version (_mm256_mulhi_epu16) show the same problem. If a wider register is available (xmm -> ymm -> zmm), it is used for the widened values; otherwise they are split across two registers.
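For reference, the two variants mentioned above can be sketched as follows. This is a sketch under stated assumptions: the names `bad_sse2`, `bad_unsigned` and the helper `run_sse2` are hypothetical, and the code assumes an x86_64 target. SSE2 is part of the x86_64 baseline, so the SSE2 form needs no `target_feature`; the AVX2 form must only be called on CPUs that support AVX2.

```rust
#[cfg(target_arch = "x86_64")]
mod variants {
    use std::arch::x86_64::*;

    /// Hypothetical SSE2 form of `bad`: eight i16 lanes in an xmm register.
    pub unsafe fn bad_sse2(a: __m128i) -> __m128i {
        // Clear the sign bit of each lane.
        let a = _mm_and_si128(a, _mm_set1_epi16(0x7FFF));
        // Signed high-half multiply (pmulhw).
        _mm_mulhi_epi16(a, _mm_set1_epi16(1000))
    }

    /// Hypothetical unsigned AVX2 form of `bad` (vpmulhuw instead of vpmulhw).
    #[target_feature(enable = "avx2")]
    pub unsafe fn bad_unsigned(a: __m256i) -> __m256i {
        let a = _mm256_and_si256(a, _mm256_set1_epi16(0x7FFF));
        _mm256_mulhi_epu16(a, _mm256_set1_epi16(1000))
    }

    /// Hypothetical helper: runs `bad_sse2` on a plain array of eight lanes.
    pub fn run_sse2(lanes: [i16; 8]) -> [i16; 8] {
        let mut out = [0i16; 8];
        unsafe {
            // Unaligned load/store, so the arrays need no special alignment.
            let v = _mm_loadu_si128(lanes.as_ptr() as *const __m128i);
            let r = bad_sse2(v);
            _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, r);
        }
        out
    }
}
```

Pasting either function into the same Godbolt setup should make it possible to compare its assembly with the AVX2 signed version.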

Code

https://godbolt.org/z/9Eqb45Keq

I tried this code:

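use std::arch::x86_64::*;
#[target_feature(enable = "avx2")]
pub unsafe fn bad(a: __m256i) -> __m256i {
    // Clear the sign bit of each i16 lane, so every lane is non-negative.
    let a = _mm256_and_si256(a, _mm256_set1_epi16(0x7FFF));
    // Signed 16x16-bit multiply; keep the high 16 bits of each 32-bit product.
    _mm256_mulhi_epi16(a, _mm256_set1_epi16(1000))
}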
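To make the semantics concrete, here is a portable scalar model of what `bad` computes per lane. It is a sketch for reasoning about results, not what rustc or LLVM emits; `mulhi_epi16_lane` and `bad_scalar` are hypothetical names. The model itself widens to i32, because that is the exact-arithmetic definition of the operation; the complaint in this issue is that the generated machine code also performs that widening instead of issuing a single vpmulhw.

```rust
/// Scalar model of one `_mm256_mulhi_epi16` lane: sign-extend both i16
/// operands to i32, multiply exactly, keep the high 16 bits.
fn mulhi_epi16_lane(a: i16, b: i16) -> i16 {
    // The product of two i16 values always fits in i32; `>> 16` on i32 is an
    // arithmetic shift, and the result always fits back into i16.
    ((a as i32 * b as i32) >> 16) as i16
}

/// Scalar model of `bad` over the sixteen i16 lanes of a __m256i.
fn bad_scalar(a: [i16; 16]) -> [i16; 16] {
    let mut out = [0i16; 16];
    for (o, &x) in out.iter_mut().zip(a.iter()) {
        // `& 0x7FFF` clears the sign bit, so each lane is in 0..=32767.
        *o = mulhi_epi16_lane(x & 0x7FFF, 1000);
    }
    out
}
```

For example, an all-ones input (every lane `-1`) is masked to 32767, and 32767 * 1000 >> 16 gives 499 in every lane.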

I expected to see this happen: roughly the same codegen as when the multiplier is -1000.

Instead, this happened: the vector appears to be widened to i32 lanes for no apparent reason.
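One observation that may be relevant, offered as an assumption rather than a diagnosis (the issue itself does not identify the cause): after the sign bit is cleared and the multiplier is the positive constant 1000, both operands are non-negative, so the signed and unsigned high-half products are identical for every input. With -1000 they are not. A plausible reading is that LLVM's known-bits reasoning exploits this and rewrites the multiply into a generic widened form that the x86 backend then fails to match back to vpmulhw or vpmulhuw. The check below only verifies the arithmetic equivalence; the function names are hypothetical.

```rust
/// Signed high half of one lane (what `_mm256_mulhi_epi16` computes).
fn mulhi_i16(a: i16, b: i16) -> i16 {
    ((a as i32 * b as i32) >> 16) as i16
}

/// Unsigned high half of one lane (what `_mm256_mulhi_epu16` computes),
/// treating both lanes as u16 bit patterns and returning the bits as i16.
fn mulhi_u16(a: i16, b: i16) -> i16 {
    ((a as u16 as u32 * b as u16 as u32) >> 16) as u16 as i16
}

/// True if signed and unsigned high halves agree for every masked input
/// (sign bit cleared, as in `bad`) with multiplier `b`.
fn signed_matches_unsigned(b: i16) -> bool {
    (0..=i16::MAX).all(|a| mulhi_i16(a, b) == mulhi_u16(a, b))
}
```

With `b = 1000` every masked input agrees; with `b = -1000` they already differ at `a = 1` (signed gives -1, unsigned gives 0), so only the signed instruction is valid in that case.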

Version it worked on

It most recently worked on: Rust 1.74

Version with regression

I checked on Godbolt with 1.75 through 1.81 and with the current beta and nightly.

Metadata

    Labels

    A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
    C-optimization: Category: An issue highlighting optimization opportunities or PRs implementing such.
    E-needs-test: Call for participation: An issue has been fixed and does not reproduce, but no test has been added.
    I-slow: Issue: Problems and improvements with respect to performance of generated code.
    P-medium: Medium priority.
    regression-untriaged: Untriaged performance or correctness regression.