movemask instruction #131
Description
SSE2 exposes the movemask instruction (pmovmskb), which extracts the most significant bit from each of the 16 bytes of a vector and returns a 16-bit integer with all of these bits combined.
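To make the semantics concrete, here is a minimal scalar sketch of what the instruction computes. The name `movemask_ref` is made up for illustration; it is a reference model, not a proposed implementation.

```c
#include <stdint.h>

/* Reference semantics: bit i of the result is the most significant
   bit of byte i of the 16-byte input. (Hypothetical helper name.) */
static uint16_t movemask_ref(const uint8_t bytes[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= (uint16_t)((bytes[i] >> 7) << i);
    return mask;
}
```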
This instruction is very useful for certain types of processing. For example, when performing byte-wise processing on strings, such as scanning a string for a specific character (memchr), pmovmskb can be used to produce a mask of the positions where the character occurs; bsf/bsr (the WebAssembly counterparts are ctz/clz) can then be used to quickly iterate over the bits set in this mask (see "Regular expression search" in https://zeux.io/2019/04/20/qgrep-internals/ as one example).
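The scanning pattern can be sketched in portable C as follows. This is only an illustration: `match_mask16` is a scalar stand-in for "compare 16 bytes for equality, then movemask", `ctz16` is a portable stand-in for bsf/ctz, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for a vector byte compare followed by movemask:
   bit i is set when block[i] == c. */
static unsigned match_mask16(const uint8_t *block, uint8_t c) {
    unsigned mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= (unsigned)(block[i] == c) << i;
    return mask;
}

/* Portable count-trailing-zeros; the mask must be nonzero. */
static int ctz16(unsigned mask) {
    int n = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++n;
    }
    return n;
}

/* memchr-like scan: index of the first occurrence of c, or -1. */
static ptrdiff_t find_byte(const uint8_t *s, size_t len, uint8_t c) {
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        unsigned mask = match_mask16(s + i, c);
        if (mask != 0)
            return (ptrdiff_t)(i + (size_t)ctz16(mask));
    }
    for (; i < len; ++i) /* tail shorter than one block */
        if (s[i] == c)
            return (ptrdiff_t)i;
    return -1;
}

/* Visit every match in one 16-byte block by repeatedly taking the
   lowest set bit and clearing it. Returns the number of matches. */
static size_t collect_matches(const uint8_t *block, uint8_t c, int out[16]) {
    unsigned mask = match_mask16(block, c);
    size_t n = 0;
    while (mask != 0) {
        out[n++] = ctz16(mask);
        mask &= mask - 1; /* clear the lowest set bit */
    }
    return n;
}
```

With a native movemask, `match_mask16` becomes two vector instructions, and the rest of the loop stays the same.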
It is also used in fast integer decoding in my vertex data decompressor; in its absence I have to emulate it using scalar math, see https://github.com/zeux/meshoptimizer/blob/master/src/vertexcodec.cpp#L712. On x64, using the same fallback results in a ~15% performance penalty in the overall benchmark, even though the instruction does not otherwise dominate the execution cost.
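For a sense of what scalar emulation can look like, here is one well-known approach based on a 64-bit multiply. It is a sketch under my own assumptions and is not necessarily what the linked fallback does; the helper names are hypothetical.

```c
#include <stdint.h>

/* Assemble 8 bytes into a 64-bit value with byte 0 in the low bits,
   independent of host endianness. */
static uint64_t load_le64(const uint8_t *p) {
    uint64_t x = 0;
    for (int i = 0; i < 8; ++i)
        x |= (uint64_t)p[i] << (8 * i);
    return x;
}

/* Gather the top bit of each of the 8 bytes of x into an 8-bit mask.
   After masking, byte i contributes a single bit at position 8*i + 7.
   The multiplier has bits at 0, 7, 14, ..., 49, so the term with
   shift 7*(7 - i) moves that bit to position 56 + i. All partial
   products below bit 64 land on distinct positions, so no carries
   disturb the top byte. */
static unsigned movemask8_mul(uint64_t x) {
    const uint64_t high_bits = 0x8080808080808080ull;
    const uint64_t magic = 0x0002040810204081ull;
    return (unsigned)(((x & high_bits) * magic) >> 56);
}

/* 16-byte movemask built from two 8-byte halves. */
static unsigned movemask16_scalar(const uint8_t bytes[16]) {
    return movemask8_mul(load_le64(bytes)) |
           (movemask8_mul(load_le64(bytes + 8)) << 8);
}
```

Even in this compact form the emulation costs several scalar operations per 8 bytes, plus moving the data out of vector registers, where pmovmskb is a single instruction.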
On x86/x64, movemask directly maps to pmovmskb (SSE2).
On PowerPC, movemask can be implemented with the vbpermq instruction, typically either as lvsl+vector shift+vbpermq or as load+vbpermq.
On NEON, movemask isn't available natively, but it can be easily synthesized using the horizontal adds available on AArch64: you take the input vector, replace each byte that has its high bit set with a power of two corresponding to the byte index within its half (this takes a couple of vector shifts), and use vaddv_u8 on each half of the vector. On ARMv7 with NEON you can emulate the two vaddv_u8 with three vpadd_u8, so the cost is still somewhat reasonable (6 vector instructions + a couple of scalar instructions to create a 16-bit mask from two 8-bit lanes).
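The AArch64 strategy can be sketched as below. The function name `movemask_hadd` and the particular choice of shifts are assumptions for illustration; on other targets the same function falls back to a scalar model of the identical steps so the logic can be checked anywhere.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Movemask via per-byte weights and a horizontal add per half.
   (Hypothetical name; a sketch, not a tuned implementation.) */
static unsigned movemask_hadd(const uint8_t bytes[16]) {
#if defined(__aarch64__)
    /* Left-shift amount for each lane: its index within its half. */
    static const int8_t shifts[16] = {0, 1, 2, 3, 4, 5, 6, 7,
                                      0, 1, 2, 3, 4, 5, 6, 7};
    uint8x16_t v = vld1q_u8(bytes);
    uint8x16_t bit = vshrq_n_u8(v, 7);                 /* 0 or 1 per byte */
    uint8x16_t w = vshlq_u8(bit, vld1q_s8(shifts));    /* 0 or 1 << (i & 7) */
    unsigned lo = vaddv_u8(vget_low_u8(w));            /* sum of bytes 0..7 */
    unsigned hi = vaddv_u8(vget_high_u8(w));           /* sum of bytes 8..15 */
    return lo | (hi << 8);
#else
    /* Scalar model of the same steps. */
    unsigned lo = 0, hi = 0;
    for (int i = 0; i < 8; ++i) {
        lo += (unsigned)(bytes[i] >> 7) << i;
        hi += (unsigned)(bytes[i + 8] >> 7) << i;
    }
    return lo | (hi << 8);
#endif
}
```

Because every lane within a half carries a distinct power of two, each horizontal sum is at most 255 and equals the bitwise OR of the lanes, which is why an add can stand in for a bit gather.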
I'm not sure what the emulation strategy would be on MIPS / RISC-V.
I wanted to file this to get a sense of whether this instruction strikes an acceptable balance with respect to "performance cliffs" across the various architectures.
I'm generally happy with WASM SIMD, but the problem with movemask is that there are no other instructions in WASM SIMD that provide a reasonable emulation path (in particular, there are no horizontal adds, which are a bit less exotic than vbpermq). Of course, horizontal adds also have a non-trivial cost on various architectures, including x64, so emulating movemask through horizontal adds on x64 is bound to perform worse than a natively supported instruction.