Inefficient x64 codegen for swizzle #93
Description
Looking at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices I discovered that it would take me more than one instruction to implement v128.swizzle on x86. I had assumed, like @stoklund in #11, that I would be able to use PSHUFB as-is. However, I am now convinced that the assumptions of #11 may be incorrect:
Lanes with an out-of-range selector become 0 in the output vector.
According to the Intel manual (and some experiments I ran), PSHUFB uses the four least significant bits of each mask byte to decide which lane to grab from a vector. If the most significant bit is one (e.g. 0b10000000), then the result is zeroed. But index values strictly between 0x0f and 0x80 (that is, 0x10 through 0x7f) will use the four least significant bits as an index and will not zero the value. To correctly implement the spec as it currently reads, we would need to copy the swizzle mask to another register, do a greater-than comparison to get a bit in the most significant position, and OR this with the original swizzle mask before using the PSHUFB instruction: four instructions instead of one.
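To make the mismatch concrete, here is a rough scalar model in Python. It is only a sketch, not actual codegen: the function names (wasm_swizzle, pshufb, pcmpgtb, swizzle_via_pshufb) are made up for illustration, vectors are modeled as lists of 16 bytes, and I am assuming the greater-than comparison would be a signed byte compare against 15 (as PCMPGTB does).

```python
def wasm_swizzle(a, s):
    """Spec behavior as quoted above: out-of-range selectors (>= 16) give 0."""
    return [a[i] if i < 16 else 0 for i in s]


def pshufb(a, mask):
    """Scalar model of PSHUFB: bit 7 set zeroes the lane,
    otherwise the low four bits select the source lane."""
    return [0 if m & 0x80 else a[m & 0x0F] for m in mask]


def pcmpgtb(x, y):
    """Scalar model of a signed per-byte greater-than: 0xFF if x > y, else 0x00."""
    def signed(v):
        return v - 256 if v >= 128 else v
    return [0xFF if signed(p) > signed(q) else 0x00 for p, q in zip(x, y)]


def swizzle_via_pshufb(a, s):
    """Hypothetical four-step lowering: copy, compare, OR, shuffle."""
    tmp = list(s)                            # 1. copy the swizzle mask
    gt = pcmpgtb(tmp, [15] * 16)             # 2. 0xFF where selector is 16..127
    fixed = [m | g for m, g in zip(s, gt)]   # 3. OR into the original mask
    return pshufb(a, fixed)                  # 4. the actual shuffle


a = list(range(100, 116))
s = [0, 15, 16, 0x7F, 0x80, 0xFF] + [1] * 10

naive = pshufb(a, s)              # lanes 2 and 3 wrap around instead of zeroing
fixed = swizzle_via_pshufb(a, s)  # matches wasm_swizzle(a, s)
```

In this model, selectors 0x80 and above compare as negative, so the comparison leaves them alone, but they already have the high bit set and are zeroed by the shuffle anyway.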
Should v128.swizzle change to allow more optimal implementations? Are there considerations for other architectures that I am not aware of?