Inefficient x64 codegen for conversion instructions #190
Description
Certain SIMD conversions seem to have inefficient lowerings on x64. `f32x4.convert_i32x4_u` is lowered to 8 instructions by V8. The signed version, `f32x4.convert_i32x4_s`, on the other hand, is lowered to a single instruction.
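To illustrate why the unsigned case needs more work, here is a rough scalar model of one lane. It is only a sketch: `convert_i32_s` stands in for the signed conversion (`CVTDQ2PS` on x64), and `convert_i32_u` is one hypothetical way to build the unsigned conversion on top of it by splitting the lane into 16-bit halves. This is not V8's actual sequence, and the helper names are mine.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def convert_i32_s(bits: int) -> float:
    """Model one lane of a signed i32 -> f32 conversion (assumed CVTDQ2PS-like)."""
    signed = bits - (1 << 32) if bits & 0x80000000 else bits
    return to_f32(float(signed))


def convert_i32_u(bits: int) -> float:
    """Unsigned i32 -> f32 using only the signed conversion (hypothetical lowering)."""
    lo = bits & 0xFFFF  # low 16 bits, always non-negative when read as signed
    hi = bits >> 16     # high 16 bits, always non-negative when read as signed
    # hi * 65536 is exact in f32, so the final add is the only rounding step.
    return to_f32(convert_i32_s(hi) * 65536.0 + convert_i32_s(lo))


if __name__ == "__main__":
    x = 0xFFFFFFFF
    print(convert_i32_s(x))  # signed reading of the same bits
    print(convert_i32_u(x))  # unsigned reading of the same bits
```

Even in this simplified model the unsigned path needs a mask, a shift, two signed conversions, a multiply and an add per vector, which is in the same ballpark as the 8 instructions mentioned above.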
I can't find the V8 implementation for `i32x4.trunc_sat_f32x4_s` and `i32x4.trunc_sat_f32x4_u`, but I think the situation is the same: the signed version should have a single-instruction lowering to `CVTTPS2DQ` and the unsigned version will require some longer sequence. [edit: this is incorrect, see #173 for a more correct discussion of this inefficiency]
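For context on the edit note, here is a small scalar sketch of how a bare `CVTTPS2DQ` lane differs from the saturating semantics of `i32x4.trunc_sat_f32x4_s`. This is my own reading and not taken from #173: `CVTTPS2DQ` returns `0x80000000` for NaN and out-of-range inputs, while the saturating truncation maps NaN to 0 and clamps to the i32 range. The function names are made up for illustration.

```python
import math

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def cvttps2dq_lane(x: float) -> int:
    """Model one lane of CVTTPS2DQ: NaN or out of range yields 0x80000000."""
    if math.isnan(x) or not (-(2.0**31) <= x < 2.0**31):
        return INT32_MIN  # the "integer indefinite" value
    return math.trunc(x)


def trunc_sat_s_lane(x: float) -> int:
    """Model one lane of i32x4.trunc_sat_f32x4_s: NaN -> 0, otherwise clamp."""
    if math.isnan(x):
        return 0
    if x >= 2.0**31:
        return INT32_MAX
    if x <= -(2.0**31):
        return INT32_MIN
    return math.trunc(x)


if __name__ == "__main__":
    for x in (1.9, -1.9, 3e9, -3e9, float("nan")):
        print(x, cvttps2dq_lane(x), trunc_sat_s_lane(x))
```

The two models agree for in-range inputs and for negative overflow, but differ for NaN and positive overflow, so some fix-up around the hardware instruction would be needed even for the signed version.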
The 64x2 versions of these instructions were dropped in #178. For similar reasons (@ngzhian: "because it is uncommon for such instructions to be used, and hardware support is not widespread"), should we remove the unsigned versions?