Description
I'm resurrecting the shuffle part of WebAssembly/simd#255. (Broadcasts have `*.splat`.)

`v8x16.shuffle` instructions are always 18 bytes long (a two-byte opcode followed by 16 byte-sized lane indices), which is excessive in most cases. For comparison, native shuffle instructions on x86-64 are at most five bytes long, even though they are more restricted in the shuffle patterns they can express.
As an example, here's an implementation of a 4x4 matrix transpose in WebAssembly SIMD and in SSE intrinsics:
```c
#include <wasm_simd128.h>

void transpose4x4(float matrix[16]) {
  const v128_t row0 = wasm_v128_load(matrix);
  const v128_t row1 = wasm_v128_load(matrix + 4);
  const v128_t row2 = wasm_v128_load(matrix + 8);
  const v128_t row3 = wasm_v128_load(matrix + 12);

  const v128_t tmp0 = wasm_v32x4_shuffle(row0, row1, 0, 4, 1, 5);
  const v128_t tmp1 = wasm_v32x4_shuffle(row2, row3, 0, 4, 1, 5);
  const v128_t tmp2 = wasm_v32x4_shuffle(row0, row1, 2, 6, 3, 7);
  const v128_t tmp3 = wasm_v32x4_shuffle(row2, row3, 2, 6, 3, 7);

  const v128_t col0 = wasm_v32x4_shuffle(tmp0, tmp1, 0, 1, 4, 5);
  const v128_t col1 = wasm_v32x4_shuffle(tmp0, tmp1, 2, 3, 6, 7);
  const v128_t col2 = wasm_v32x4_shuffle(tmp2, tmp3, 0, 1, 4, 5);
  const v128_t col3 = wasm_v32x4_shuffle(tmp2, tmp3, 2, 3, 6, 7);

  wasm_v128_store(matrix, col0);
  wasm_v128_store(matrix + 4, col1);
  wasm_v128_store(matrix + 8, col2);
  wasm_v128_store(matrix + 12, col3);
}
```
```c
#include <xmmintrin.h>

void transpose(float matrix[16]) {
  const __m128 row0 = _mm_loadu_ps(matrix);
  const __m128 row1 = _mm_loadu_ps(matrix + 4);
  const __m128 row2 = _mm_loadu_ps(matrix + 8);
  const __m128 row3 = _mm_loadu_ps(matrix + 12);

  const __m128 tmp0 = _mm_unpacklo_ps(row0, row1);
  const __m128 tmp2 = _mm_unpacklo_ps(row2, row3);
  const __m128 tmp1 = _mm_unpackhi_ps(row0, row1);
  const __m128 tmp3 = _mm_unpackhi_ps(row2, row3);

  const __m128 col0 = _mm_movelh_ps(tmp0, tmp2);
  const __m128 col1 = _mm_movehl_ps(tmp2, tmp0);
  const __m128 col2 = _mm_movelh_ps(tmp1, tmp3);
  const __m128 col3 = _mm_movehl_ps(tmp3, tmp1);

  _mm_storeu_ps(matrix, col0);
  _mm_storeu_ps(matrix + 4, col1);
  _mm_storeu_ps(matrix + 8, col2);
  _mm_storeu_ps(matrix + 12, col3);
}
```
The x86-64 SSE version compiles to 67 bytes of code, while the WebAssembly version compiles to 231 bytes, roughly 3.5x larger, according to the linked issue.
Admittedly, this is a poor example: it should be using contiguous loads and strided stores. But it works well enough for illustration.
I propose the following new opcodes:
- Permute:
  - `i8x16.permute lanes:(byte)^8 | (v128) -> (v128)`, with 4-bit lane indices
  - `i32x4.permute lanes:byte | (v128) -> (v128)`, with 2-bit lane indices
  - No `i64x2.permute`, as it can be expressed in terms of `i32x4.permute`.
- Lane shifts:
  - `i8x16.lane_shift_up count:8 | (v128) -> (v128)`, zero-fills from the first lane
  - `i8x16.lane_shift_down count:8 | (v128) -> (v128)`, zero-fills from the last lane
- Compact shuffles:
  - `i32x4.shuffle_pairs lanes:byte | (v128, v128) -> (v128)`, with 2-bit lane indices
  - `i64x2.shuffle lanes:byte | (v128, v128) -> (v128)`, with 2-bit lane indices
  - `i32x4.shuffle4 lanes:(byte)^2 | (v128, v128, v128, v128) -> (v128)`, with four 4-bit lane indices
- Multi-byte swizzles:
  - `i16x8.swizzle | (v128, v128, v128) -> (v128)`
  - `i32x4.swizzle | (v128, v128, v128) -> (v128)`
  - `i64x2.swizzle | (v128, v128, v128) -> (v128)`
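To make the intended semantics concrete, here is a scalar sketch of two of the proposed opcodes. This is my reading of the list above (shufps-style lane selection for `i32x4.shuffle_pairs`, matching the x86-64 mapping further down); the `v128` struct and function names are illustrative only, not part of any real API.

```c
#include <stdint.h>

/* Illustrative stand-in for a v128 value; not a real API type. */
typedef struct { uint8_t b[16]; } v128;

/* i32x4.shuffle_pairs, assuming shufps-style selection: the two low
   result lanes come from a, the two high result lanes from b, each
   picked by a 2-bit index packed into the byte immediate. */
static v128 i32x4_shuffle_pairs(v128 a, v128 b, uint8_t lanes) {
  v128 r;
  for (int i = 0; i < 4; i++) {
    int src = (lanes >> (2 * i)) & 3;              /* 2-bit lane index */
    const uint8_t *from = (i < 2 ? a.b : b.b) + 4 * src;
    for (int j = 0; j < 4; j++)
      r.b[4 * i + j] = from[j];
  }
  return r;
}

/* i8x16.lane_shift_up: moves lanes toward higher indices and
   zero-fills from the first lane, like pslldq. */
static v128 i8x16_lane_shift_up(v128 a, int count) {
  v128 r;
  for (int i = 0; i < 16; i++)
    r.b[i] = (i >= count) ? a.b[i - count] : 0;
  return r;
}
```

Under this reading, `lanes = 0x88` in `i32x4_shuffle_pairs` selects the even lanes of both inputs, i.e. one half of a deinterleave.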
All of these patterns are exceptionally common, common enough to merit dedicated opcodes that reduce generated code size:
- The permutes replace `local.tee $v; local.get $v; v8x16.shuffle lanes...` and `local.get $v; local.get $v; v8x16.shuffle lanes...` sequences.
- The compact shuffles and lane shifts reduce code size in very common cases.
- The multi-byte swizzles have somewhat broad hardware support. Worst case, they can be lowered onto `i8x16.swizzle` by scaling the lane indices by the lane size and adding 0/1/2/3/... to resolve byte indices; see the sketch at the end of this section.
- 4x4 32-bit matrix transpose is extremely common, and it can be fully represented in a single opcode (`i32x4.shuffle4`). A few ISAs even have optimized sequences for this case in particular.
- x86-64 has direct equivalents for most of these:
  - `i32x4.permute` -> `shufps a, a, imm8`
  - `i8x16.lane_shift_up` -> `pslldq a, imm`
  - `i8x16.lane_shift_down` -> `psrldq a, imm`
  - `i32x4.shuffle_pairs` -> `shufps a, b, imm8`
  - `i64x2.shuffle` -> `shufpd a, b, imm8`
  - `i16x8.swizzle` -> `pshufw a, b`
  - `i32x4.swizzle` -> `pshufd a, b`
  - `i64x2.swizzle` -> `pshufq a, b`
- 32-bit ARM and AArch64 both have efficient ways to implement most of these, too. Notably, they can combine VTBL/TBL with VLD4/LD4 to implement `i32x4.shuffle4` in two instructions.
- x86-64, 32-bit ARM, and AArch64 can all build a 128-bit vector from two scalar doubles in a single opcode.
- While RISC-V's V extension has no opcode for simple 32x4-bit mixes or shuffles (my biggest criticism of the V extension's design), every shuffle and permute across eight registers or fewer can be fused into a single `vsetvli; vrgather.vv` sequence with a constant lane-selector operand for `vrgather.vv`, since permutes index a multi-register group as a single contiguous array when LMUL > 1. Even processors with just Zve64d can implement `i32x4.shuffle4` efficiently this way.
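As a concrete version of the worst-case swizzle lowering mentioned above, here is a sketch of `i32x4.swizzle` built on top of the existing `i8x16.swizzle` using current `wasm_simd128.h` intrinsics. The function name is hypothetical, and it assumes in-range lane indices (0..3); out-of-range indices would need whatever handling the proposed opcode ends up specifying.

```c
#include <wasm_simd128.h>

/* Hypothetical lowering of the proposed i32x4.swizzle onto the existing
   i8x16.swizzle. Assumes each 32-bit lane of idx holds an in-range lane
   index (0..3). */
v128_t i32x4_swizzle_lowered(v128_t v, v128_t idx) {
  /* Scale each lane index by the lane size: idx * 4. */
  v128_t base = wasm_i32x4_shl(idx, 2);
  /* Broadcast the low byte of each 32-bit lane across all four of its
     bytes (idx * 4 fits in one byte, so the multiply has no carries). */
  base = wasm_i32x4_mul(base, wasm_i32x4_splat(0x01010101));
  /* Add the 0/1/2/3 byte offsets within each 32-bit lane. */
  v128_t bytes = wasm_i8x16_add(base, wasm_i32x4_splat(0x03020100));
  /* A single byte-granular swizzle resolves the whole selection. */
  return wasm_i8x16_swizzle(v, bytes);
}
```

The same scale-broadcast-add pattern generalizes to `i16x8.swizzle` and `i64x2.swizzle` with different shift amounts and offset constants, which is why a dedicated opcode mostly buys code size rather than new expressiveness.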