This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Inefficient x64 codegen for swizzle #93

Closed
@abrown


Looking at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices I discovered that implementing v128.swizzle on x86 would take more than one instruction. I had assumed, like @stoklund in #11, that I would be able to use PSHUFB as-is. However, I am now convinced that the assumptions of #11 may be incorrect, specifically with respect to this line of the spec:

Lanes with an out-of-range selector become 0 in the output vector.

According to the Intel manual (and some experiments I ran), PSHUFB uses the four least significant bits of each selector byte to decide which lane to grab from a vector. If the most significant bit of the selector is one (e.g. 0b10000000), then the result lane is zeroed. But selector values in between 0x0f and 0x80 will still use the four least significant bits as an index and will not zero the lane. To correctly implement the spec as it currently reads we would need to copy the swizzle mask to another register, do a greater-than comparison to get a bit in the most significant position, and OR this with the original swizzle mask before using the PSHUFB instruction: four instructions instead of one.
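To make the mismatch concrete, here is a small scalar model in Python of the two behaviours and of the fix-up sequence described above. It is only a sketch of my reading of the semantics, not actual codegen: the function names are made up for illustration, vectors are modelled as lists of 16 unsigned bytes, and the comparison is assumed to be a signed byte compare (as PCMPGTB is) against a constant vector of 15s.

```python
LANES = 16

def wasm_swizzle(a, s):
    """Spec as currently written: any selector >= 16 yields 0."""
    return [a[i] if i < LANES else 0 for i in s]

def pshufb(a, mask):
    """x86 PSHUFB model: bit 7 set zeroes the lane,
    otherwise only the low 4 bits of the selector are used."""
    return [0 if m & 0x80 else a[m & 0x0F] for m in mask]

def pcmpgtb(x, y):
    """Signed per-byte greater-than: 0xFF where x > y, else 0x00."""
    def s8(v):
        return v - 256 if v >= 128 else v
    return [0xFF if s8(p) > s8(q) else 0x00 for p, q in zip(x, y)]

def por(x, y):
    """Per-byte bitwise OR."""
    return [p | q for p, q in zip(x, y)]

def swizzle_via_pshufb(a, s):
    """Hypothetical four-step lowering: copy, compare, OR, shuffle."""
    tmp = list(s)                      # copy the swizzle mask
    gt = pcmpgtb(tmp, [15] * LANES)    # 0xFF for selectors 16..127
    fixed = por(gt, s)                 # selectors 128..255 already have bit 7 set
    return pshufb(a, fixed)

a = list(range(100, 116))
sel = [0x10] * LANES                   # out of range, but bit 7 is clear
print(pshufb(a, sel))                  # lane 0 of `a` in every lane
print(wasm_swizzle(a, sel))            # all zeros
print(swizzle_via_pshufb(a, sel))      # all zeros, matching the spec
```

In this model a selector such as 0x10 makes plain PSHUFB return lane 0 rather than zero, while the compare-and-OR step forces bit 7 on for every selector from 16 to 127, so the final shuffle agrees with the spec for all 256 selector values.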

Should v128.swizzle change to allow more optimal implementations? Are there considerations for other architectures that I am not aware of?
