Slow code generated for _mm256_mulhi_epi16

I suspect this is an upstream LLVM issue. The SSE2 version (`_mm_mulhi_epi16`) and the unsigned version (`_mm256_mulhi_epu16`) show the same problem. If a wider register is available (xmm -> ymm -> zmm), the widened values are placed in it instead of being split across two registers.

### Code

https://godbolt.org/z/9Eqb45Keq

I tried this code:

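```rust
use std::arch::x86_64::*;

pub unsafe fn bad(a: __m256i) -> __m256i {
    // Clear the sign bit so every 16-bit lane is non-negative.
    let a = _mm256_and_si256(a, _mm256_set1_epi16(0x7FFF));
    // High 16 bits of each signed 16x16-bit product.
    _mm256_mulhi_epi16(a, _mm256_set1_epi16(1000))
}
```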

I expected to see this happen: more or less the same codegen as with -1000 as the multiplier.
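A sketch of that comparison case, assuming it differs from `bad` only in the sign of the constant. The name `good` and the `#[target_feature]` attribute are assumptions for illustration and are not taken from the Godbolt link:

```rust
use std::arch::x86_64::*;

/// Hypothetical comparison case: same as `bad`, but with a negative multiplier.
#[target_feature(enable = "avx2")]
pub unsafe fn good(a: __m256i) -> __m256i {
    // Same mask as in `bad`: every 16-bit lane becomes non-negative.
    let a = _mm256_and_si256(a, _mm256_set1_epi16(0x7FFF));
    // Only the sign of the constant changes.
    _mm256_mulhi_epi16(a, _mm256_set1_epi16(-1000))
}
```

Comparing the assembly of `bad` and `good` side by side should make the difference in the multiply sequence easy to spot.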

Instead, this happened: it looks like the `i16` lanes are widened to `i32` for no good reason.
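To make "widened to `i32`" concrete, here is a scalar sketch of what one lane of the multiply computes: extend both operands to 32 bits, multiply, and keep the upper 16 bits of the product. The helper names are hypothetical and only model the arithmetic; they say nothing about how LLVM arrives at the widened code.

```rust
/// Scalar model of one lane of `_mm256_mulhi_epi16` (hypothetical helper):
/// the high 16 bits of the signed 32-bit product.
pub fn mulhi_i16(a: i16, b: i16) -> i16 {
    ((a as i32 * b as i32) >> 16) as i16
}

/// Scalar model of one lane of `_mm256_mulhi_epu16` (hypothetical helper):
/// the high 16 bits of the unsigned 32-bit product.
pub fn mulhi_u16(a: u16, b: u16) -> u16 {
    ((a as u32 * b as u32) >> 16) as u16
}

/// One lane of `bad`: clear the sign bit, then take the high half of the
/// product with 1000.
pub fn bad_lane(a: i16) -> i16 {
    mulhi_i16(a & 0x7FFF, 1000)
}
```

The intrinsic performs this per 16-bit lane, so the result always fits back into the original lane width.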

### Version it worked on

It most recently worked on: Rust 1.74

### Version with regression

I checked on Godbolt with 1.75 through 1.81, plus the current beta and nightly as of today.

Slow code generated for _mm256_mulhi_epi16 #130782
