See: https://llvm.godbolt.org/z/4KdejfEsG
The following two functions:
declare <4 x i16> @llvm.smax.v4i16(<4 x i16>, <4 x i16>)
declare <4 x i16> @llvm.smin.v4i16(<4 x i16>, <4 x i16>)
declare <8 x i16> @llvm.smax.v8i16(<8 x i16>, <8 x i16>)
declare <8 x i16> @llvm.smin.v8i16(<8 x i16>, <8 x i16>)

define <4 x i8> @saturate4(<4 x i16> %x) {
  %1 = tail call <4 x i16> @llvm.smax.v4i16(<4 x i16> %x, <4 x i16> zeroinitializer)
  %2 = tail call <4 x i16> @llvm.smin.v4i16(<4 x i16> %1, <4 x i16> <i16 255, i16 255, i16 255, i16 255>)
  %3 = trunc <4 x i16> %2 to <4 x i8>
  ret <4 x i8> %3
}

define <8 x i8> @saturate8(<8 x i16> %x) {
  %1 = tail call <8 x i16> @llvm.smax.v8i16(<8 x i16> %x, <8 x i16> zeroinitializer)
  %2 = tail call <8 x i16> @llvm.smin.v8i16(<8 x i16> %1, <8 x i16> <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>)
  %3 = trunc <8 x i16> %2 to <8 x i8>
  ret <8 x i8> %3
}
produce the following x86-64 assembly (Intel syntax):
.LCPI0_0:
        .short 255 # 0xff
        .short 255 # 0xff
        .short 255 # 0xff
        .short 255 # 0xff
        .zero 2
        .zero 2
        .zero 2
        .zero 2
saturate4: # @saturate4
        pxor xmm1, xmm1
        pmaxsw xmm0, xmm1
        pminsw xmm0, xmmword ptr [rip + .LCPI0_0]
        packuswb xmm0, xmm0
        ret

saturate8: # @saturate8
        packuswb xmm0, xmm0
        ret
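The saturate4 function produces an extra min/max pair (pmaxsw/pminsw). This pair is redundant, because packuswb already saturates each signed 16-bit lane to the unsigned range [0, 255], as the saturate8 output shows. I believe the trunc followed by a shufflevector is being optimized before the saturating truncation can be detected.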
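The pmaxsw/pminsw pair in saturate4 adds nothing. PACKUSWB converts each signed 16-bit lane to an unsigned byte saturated to [0, 255], which is exactly what smax(x, 0), then smin(_, 255), then trunc computes. Below is a small Python model of that per-lane rule. It is an illustrative sketch only, not LLVM code, and the helper names (packus_word, packuswb, ir_clamp_trunc, clamp_is_redundant) are hypothetical. It checks every i16 value exhaustively to confirm that clamping before the pack never changes the result.

```python
# Scalar model of the x86 PACKUSWB saturation rule and of the IR clamp in
# saturate4/saturate8. Illustrative sketch; all helper names are hypothetical.

I16_MIN, I16_MAX = -(1 << 15), (1 << 15) - 1


def packus_word(w: int) -> int:
    """PACKUSWB per-lane rule: signed 16-bit word -> unsigned byte with saturation."""
    if w < 0:
        return 0
    if w > 255:
        return 255
    return w


def packuswb(a: list, b: list) -> list:
    """Pack 8 words from `a` into bytes 0-7 and 8 words from `b` into bytes 8-15."""
    assert len(a) == 8 and len(b) == 8
    return [packus_word(w) for w in a + b]


def ir_clamp_trunc(w: int) -> int:
    """The IR sequence: smax(w, 0), smin(_, 255), then trunc i16 -> i8 (unsigned view)."""
    clamped = min(max(w, 0), 255)
    return clamped & 0xFF  # trunc keeps only the low 8 bits


def clamp_is_redundant() -> bool:
    """For every i16 value, clamping before the pack changes nothing."""
    return all(
        packus_word(min(max(w, 0), 255)) == packus_word(w) == ir_clamp_trunc(w)
        for w in range(I16_MIN, I16_MAX + 1)
    )


if __name__ == "__main__":
    print(clamp_is_redundant())  # True
    # Low 4 lanes stand in for saturate4's <4 x i16> input; the rest are padding.
    x = [-5, 0, 300, 128, 0, 0, 0, 0]
    print(packuswb(x, x)[:4])  # [0, 0, 255, 128]
```

Because the check holds for all 65536 inputs, a single packuswb would be sufficient for saturate4 as well, just as it is for saturate8.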
Discovered in a comment on rust-lang/portable-simd#369.