Originally reported in rust-lang/rust-clippy#12826.
Related PR spawned from that issue: #125455.
cc @blyxyas
Clamping an `i32` and casting it to `u8` with `clamp(0, 255) as u8` produces unnecessary instructions compared to `.max(0).min(255) as u8` when the loop is autovectorized.
clippy's `manual_clamp` lint in the beta toolchain warns on the `.max(0).min(255)` pattern and suggests `clamp` instead, which can regress performance.
Minimal example
#[inline(never)]
pub fn manual_clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.max(0).min(255) as u8;
    }
}

#[inline(never)]
pub fn clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.clamp(0, 255) as u8;
    }
}
https://rust.godbolt.org/z/zf73jsqjq
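The two forms differ in generated code, not in results. A minimal sketch of that check on boundary values follows; the helper names `manual_clamp_scalar`, `clamp_scalar`, and `forms_agree` are hypothetical and do not appear in the original report.

```rust
/// Hypothetical scalar helper: the hand-written form from the minimal example.
pub fn manual_clamp_scalar(i: i32) -> u8 {
    i.max(0).min(255) as u8
}

/// Hypothetical scalar helper: the form suggested by the `manual_clamp` lint.
pub fn clamp_scalar(i: i32) -> u8 {
    i.clamp(0, 255) as u8
}

/// Returns true if both forms produce the same byte for every sample.
pub fn forms_agree(samples: &[i32]) -> bool {
    samples
        .iter()
        .all(|&i| manual_clamp_scalar(i) == clamp_scalar(i))
}
```

Calling `forms_agree` with values around both bounds (for instance `-1`, `0`, `255`, `256`, `i32::MIN`, `i32::MAX`) should return `true`, which suggests the difference reported here is purely one of instruction selection.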
Manual clamp
.LBB0_4:
movdqu xmm0, xmmword ptr [rdi + 4*r8]
packssdw xmm0, xmm0
packuswb xmm0, xmm0
movdqu xmm1, xmmword ptr [rdi + 4*r8 + 16]
packssdw xmm1, xmm1
packuswb xmm1, xmm1
movd dword ptr [rdx + r8], xmm0
movd dword ptr [rdx + r8 + 4], xmm1
add r8, 8
cmp rsi, r8
jne .LBB0_4
`Ord::clamp`
.LBB0_4:
movdqu xmm6, xmmword ptr [rdi + 4*r8]
movdqu xmm5, xmmword ptr [rdi + 4*r8 + 16]
pxor xmm3, xmm3
pcmpgtd xmm3, xmm6
packssdw xmm3, xmm3
packsswb xmm3, xmm3
pxor xmm4, xmm4
pcmpgtd xmm4, xmm5
packssdw xmm4, xmm4
packsswb xmm4, xmm4
movdqa xmm7, xmm6
pxor xmm7, xmm0
movdqa xmm8, xmm1
pcmpgtd xmm8, xmm7
pand xmm6, xmm8
pandn xmm8, xmm2
por xmm8, xmm6
packuswb xmm8, xmm8
packuswb xmm8, xmm8
pandn xmm3, xmm8
movdqa xmm6, xmm5
pxor xmm6, xmm0
movdqa xmm7, xmm1
pcmpgtd xmm7, xmm6
pand xmm5, xmm7
pandn xmm7, xmm2
por xmm7, xmm5
packuswb xmm7, xmm7
packuswb xmm7, xmm7
pandn xmm4, xmm7
movd dword ptr [rdx + r8], xmm3
movd dword ptr [rdx + r8 + 4], xmm4
add r8, 8
cmp rsi, r8
jne .LBB0_4
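One way to read the shorter manual-clamp listing: `packssdw` and `packuswb` are saturating packs, so chaining them already clamps each lane to 0..=255 with no compare or select needed. The following is a scalar sketch of that reasoning for a single lane only. The helper names are hypothetical, and it models the instruction semantics rather than reproducing what LLVM does internally.

```rust
/// Scalar model of one lane of `packssdw`: signed saturation from i32 to i16.
pub fn sat_i32_to_i16(x: i32) -> i16 {
    x.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

/// Scalar model of one lane of `packuswb`: a signed i16 saturated to an unsigned u8.
pub fn sat_i16_to_u8(x: i16) -> u8 {
    x.clamp(0, 255) as u8
}

/// The two packs chained, mirroring the manual-clamp loop body.
pub fn pack_chain(x: i32) -> u8 {
    sat_i16_to_u8(sat_i32_to_i16(x))
}
```

Under this model `pack_chain(x)` equals `x.clamp(0, 255) as u8` for every `i32`, so in principle the same two-instruction sequence could serve the `Ord::clamp` loop as well.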
Real code examples from functions in the image-webp crate:
https://rust.godbolt.org/z/3rnY8d94v
https://rust.godbolt.org/z/53T7n9PGx