Description
Bugzilla Link | 31872 |
Version | trunk |
OS | Linux |
CC | @hyp,@compnerd,@lesshaste,@hfinkel,@joker-eph,@RKSimon,@rotateright,@TNorthover |
Extended Description
Consider:
#include <complex.h>

complex float f(complex float x, complex float y) {
  return x/y;
}
Clang trunk with -O3 -march=core-avx2, both with and without -ffast-math, gives:
f: # @f
vmovaps xmm2, xmm1
vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3]
vmovshdup xmm3, xmm2 # xmm3 = xmm2[1,1,3,3]
jmp __divsc3 # TAILCALL
However, both gcc and ICC attempt to optimise this code when -ffast-math (or equivalent) is enabled, instead of calling the library routine.
ICC appears to give the fastest code:
f:
vcvtps2pd xmm2, xmm1 #3.12
vcvtps2pd xmm4, xmm0 #3.12
vmulpd xmm8, xmm2, xmm2 #3.12
vunpckhpd xmm3, xmm2, xmm2 #3.12
vmulpd xmm6, xmm3, xmm4 #3.12
vmovddup xmm7, xmm2 #3.12
vshufpd xmm5, xmm4, xmm4, 1 #3.12
vshufpd xmm9, xmm8, xmm8, 1 #3.12
vfmaddsub213pd xmm7, xmm5, xmm6 #3.12
vaddpd xmm11, xmm8, xmm9 #3.12
vshufpd xmm10, xmm7, xmm7, 1 #3.12
vdivpd xmm12, xmm10, xmm11 #3.12
vcvtpd2ps xmm0, xmm12 #3.12
ret
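
For reference, reading the ICC listing instruction by instruction, it appears to widen both operands to double, apply the textbook formula (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i)/(c^2+d^2), and narrow the result back to float. Below is a rough C sketch of that computation. The name `div_fast` is hypothetical and not from any compiler or library, and the sketch only assumes the listing does what it seems to; it skips the inf/NaN recovery that `__divsc3` performs, so it is only a fair comparison under -ffast-math-style assumptions.

```c
#include <complex.h>

/* Hypothetical helper (not part of clang, gcc, or ICC): textbook complex
   division evaluated in double precision, with no inf/NaN special-casing. */
float complex div_fast(float complex x, float complex y) {
    double a = crealf(x), b = cimagf(x);   /* x = a + bi, widened to double */
    double c = crealf(y), d = cimagf(y);   /* y = c + di, widened to double */
    double denom = c * c + d * d;          /* |y|^2 */
    double re = (a * c + b * d) / denom;   /* real part of x/y */
    double im = (b * c - a * d) / denom;   /* imaginary part of x/y */
    return (float)re + (float)im * I;      /* narrow back to float */
}
```

Doing the arithmetic in double means the intermediate products and |y|^2 cannot overflow or underflow for finite float inputs, which is presumably why the widened form can drop the scaling that a float-only implementation would need.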