
The current sparse encoding of shuffles blows up code size in graphics-heavy and audio-heavy code #1559

@dead-claudia


I'm resurrecting the shuffle part of WebAssembly/simd#255. (Broadcasts have *.splat.)

v128.shuffle instructions are always 18 bytes long, which is excessive in most cases. For comparison, native shuffle instructions on x86-64 are at most five bytes long, even though they are more restricted in shuffle patterns.
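
For reference, the 18 bytes break down as a one-byte SIMD prefix, the shuffle opcode, and a full 16-byte lane-index immediate, no matter how coarse the actual pattern is. Here is a sketch of how the first shuffle in the example below encodes (byte values reflect my reading of the SIMD proposal's encoding, so treat them as illustrative):

/* Approximate encoding of the i8x16.shuffle behind
   wasm_v32x4_shuffle(row0, row1, 0, 4, 1, 5): 18 bytes total. */
static const unsigned char shuffle_encoding[18] = {
	0xFD, 0x0D,                     /* SIMD prefix byte + shuffle opcode */
	0,  1,  2,  3,  16, 17, 18, 19, /* 16 byte-granular lane indices,    */
	4,  5,  6,  7,  20, 21, 22, 23, /* even for a 32-bit-lane shuffle    */
};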

As an example, here's an implementation of a 4x4 matrix transpose in WebAssembly SIMD intrinsics and in SSE intrinsics:

#include <wasm_simd128.h>

void transpose4x4(float matrix[16]) {
	const v128_t row0 = wasm_v128_load(matrix);
	const v128_t row1 = wasm_v128_load(matrix + 4);
	const v128_t row2 = wasm_v128_load(matrix + 8);
	const v128_t row3 = wasm_v128_load(matrix + 12);

	const v128_t tmp0 = wasm_v32x4_shuffle(row0, row1, 0, 4, 1, 5);
	const v128_t tmp1 = wasm_v32x4_shuffle(row2, row3, 0, 4, 1, 5);
	const v128_t tmp2 = wasm_v32x4_shuffle(row0, row1, 2, 6, 3, 7);
	const v128_t tmp3 = wasm_v32x4_shuffle(row2, row3, 2, 6, 3, 7);

	const v128_t col0 = wasm_v32x4_shuffle(tmp0, tmp1, 0, 1, 4, 5);
	const v128_t col1 = wasm_v32x4_shuffle(tmp0, tmp1, 2, 3, 6, 7);
	const v128_t col2 = wasm_v32x4_shuffle(tmp2, tmp3, 0, 1, 4, 5);
	const v128_t col3 = wasm_v32x4_shuffle(tmp2, tmp3, 2, 3, 6, 7);

	wasm_v128_store(matrix, col0);
	wasm_v128_store(matrix + 4, col1);
	wasm_v128_store(matrix + 8, col2);
	wasm_v128_store(matrix + 12, col3);
}

#include <xmmintrin.h>

void transpose(float matrix[16]) {
	const __m128 row0 = _mm_loadu_ps(matrix);
	const __m128 row1 = _mm_loadu_ps(matrix + 4);
	const __m128 row2 = _mm_loadu_ps(matrix + 8);
	const __m128 row3 = _mm_loadu_ps(matrix + 12);

	const __m128 tmp0 = _mm_unpacklo_ps(row0, row1);
	const __m128 tmp2 = _mm_unpacklo_ps(row2, row3);
	const __m128 tmp1 = _mm_unpackhi_ps(row0, row1);
	const __m128 tmp3 = _mm_unpackhi_ps(row2, row3);
	
	const __m128 col0 = _mm_movelh_ps(tmp0, tmp2);
	const __m128 col1 = _mm_movehl_ps(tmp2, tmp0);
	const __m128 col2 = _mm_movelh_ps(tmp1, tmp3);
	const __m128 col3 = _mm_movehl_ps(tmp3, tmp1);

	_mm_storeu_ps(matrix, col0);
	_mm_storeu_ps(matrix + 4, col1);
	_mm_storeu_ps(matrix + 8, col2);
	_mm_storeu_ps(matrix + 12, col3);
}

The x86-64 SSE version compiles to 67 bytes of code, while the WebAssembly version compiles to 231 bytes of code, about 3.5x larger, according to the linked issue.

Admittedly, this isn't an ideal example; it should really use contiguous loads and strided stores. But it works well enough as a demonstration.

I propose the following new opcodes:

  • Permute:
    • i8x16.permute lanes:(byte)^8 | (v128) -> (v128), with 4-bit lane indices
    • i32x4.permute lanes:byte | (v128) -> (v128), with 2-bit lane indices
    • No i64x2.permute, as it can be expressed in terms of the above.
  • Lane shifts:
    • i8x16.lane_shift_up count:byte | (v128) -> (v128), zero-fills from the first lane
    • i8x16.lane_shift_down count:byte | (v128) -> (v128), zero-fills from the last lane
  • Compact shuffles:
    • i32x4.shuffle_pairs lanes:byte | (v128, v128) -> (v128), with 2-bit lane indices (see the sketch after this list)
    • i64x2.shuffle lanes:byte | (v128, v128) -> (v128), with 2-bit lane indices
    • i32x4.shuffle4 lanes:(byte)^2 | (v128, v128, v128, v128) -> (v128), with four 4-bit lane indices
  • Multi-byte swizzle:
    • i16x8.swizzle | (v128, v128, v128) -> (v128)
    • i32x4.swizzle | (v128, v128, v128) -> (v128)
    • i64x2.swizzle | (v128, v128, v128) -> (v128)
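
For a concrete sense of the savings, here is a sketch of the transpose above rewritten against a hypothetical wasm_i32x4_shuffle_pairs(a, b, i0, i1, i2, i3) intrinsic for the proposed i32x4.shuffle_pairs, where i0/i1 select lanes of a and i2/i3 select lanes of b, mirroring shufps. The intrinsic name and exact semantics are assumptions for illustration, not an existing API:

#include <wasm_simd128.h>

/* NOTE: wasm_i32x4_shuffle_pairs is hypothetical; i0/i1 pick lanes of the
   first operand, i2/i3 pick lanes of the second, like shufps. */
void transpose4x4_proposed(float matrix[16]) {
	const v128_t row0 = wasm_v128_load(matrix);
	const v128_t row1 = wasm_v128_load(matrix + 4);
	const v128_t row2 = wasm_v128_load(matrix + 8);
	const v128_t row3 = wasm_v128_load(matrix + 12);

	/* Pair up the low and high halves of each pair of rows. */
	const v128_t tmp0 = wasm_i32x4_shuffle_pairs(row0, row1, 0, 1, 0, 1);
	const v128_t tmp1 = wasm_i32x4_shuffle_pairs(row2, row3, 0, 1, 0, 1);
	const v128_t tmp2 = wasm_i32x4_shuffle_pairs(row0, row1, 2, 3, 2, 3);
	const v128_t tmp3 = wasm_i32x4_shuffle_pairs(row2, row3, 2, 3, 2, 3);

	/* Each column then takes one more pair shuffle. */
	const v128_t col0 = wasm_i32x4_shuffle_pairs(tmp0, tmp1, 0, 2, 0, 2);
	const v128_t col1 = wasm_i32x4_shuffle_pairs(tmp0, tmp1, 1, 3, 1, 3);
	const v128_t col2 = wasm_i32x4_shuffle_pairs(tmp2, tmp3, 0, 2, 0, 2);
	const v128_t col3 = wasm_i32x4_shuffle_pairs(tmp2, tmp3, 1, 3, 1, 3);

	wasm_v128_store(matrix, col0);
	wasm_v128_store(matrix + 4, col1);
	wasm_v128_store(matrix + 8, col2);
	wasm_v128_store(matrix + 12, col3);
}

Assuming a prefix byte, an opcode, and a single packed-immediate byte, each of these eight shuffles would encode in roughly 3 or 4 bytes instead of 18, removing on the order of 120 bytes of immediates from this one function.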

All of these are exceptionally common, common enough to merit opcodes to reduce generated code size.

  • The permutes replace local.tee $v; local.get $v; v8x16.shuffle lanes... and local.get $v; local.get $v; v8x16.shuffle lanes... sequences.
  • The compact shuffles and lane shifts reduce code size requirements for very common cases.
  • Lane-granular swizzles have reasonably broad hardware support. Worst case, an engine can scale each lane index by the lane size and add 0/1/2/3/... to resolve byte indices (see the sketch after this list).
  • A 4x4 32-bit matrix transpose is extremely common, and with i32x4.shuffle4 each transposed row can be produced by a single instruction.
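
As a sketch of that worst-case fallback in today's intrinsics, a dynamic i32x4-granular swizzle (each 32-bit lane of idx holding a lane number 0..3) can be lowered onto the existing byte-granular wasm_i8x16_swizzle; the helper name here is mine, not a proposed instruction:

#include <wasm_simd128.h>

v128_t i32x4_swizzle_fallback(v128_t v, v128_t idx) {
	/* Scale lane numbers to byte offsets: lane * 4. */
	const v128_t scaled = wasm_i32x4_shl(idx, 2);
	/* Broadcast the low byte of each 32-bit lane to all four of its bytes. */
	const v128_t spread = wasm_i8x16_swizzle(scaled,
		wasm_i8x16_const(0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8, 12, 12, 12, 12));
	/* Add the 0/1/2/3 byte offsets within each destination lane. */
	const v128_t byte_idx = wasm_i8x16_add(spread,
		wasm_i8x16_const(0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3));
	/* Finish with a plain byte swizzle. */
	return wasm_i8x16_swizzle(v, byte_idx);
}

The same scale-and-offset trick extends to 16- and 64-bit lanes, which is why the multi-byte swizzles stay cheap even on hardware that only has a byte-granular table lookup.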

And a few ISAs even have optimized sequences for these in particular.

  • x86-64 has direct equivalents for most of this:
    • i32x4.permute -> shufps a, a, imm8
    • i8x16.lane_shift_up -> pslldq a, imm
    • i8x16.lane_shift_down -> psrldq a, imm
    • i32x4.shuffle_pairs -> shufps a, b, imm8
    • i64x2.shuffle -> shufpd a, b, imm8
    • i16x8.swizzle -> pshufw a, b
    • i32x4.swizzle -> pshufd a, b
    • i64x2.swizzle -> pshufq a, b
  • 32-bit ARM and AArch64 both have efficient ways to implement most of these, too. Notably, they can use VTBL/TBL and VLD4/LD4 to implement i32x4.shuffle4 in two instructions.
  • x86-64, 32-bit ARM, and AArch64 all can build a 128-bit vector from two scalar doubles in a single opcode.
  • While RISC-V's V extension has no opcode for simple 32x4-bit mixes or shuffles (my biggest criticism of the V extension's design), every single shuffle or permute spanning 8 registers or fewer can be fused into a single vsetvli; vrgather.vv sequence with a constant lane-selector operand for vrgather.vv, since gathers index a multi-register group as a single contiguous array when LMUL > 1. Even processors with only Zve64d can implement i32x4.shuffle4 efficiently this way.
