This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Accelerated shuffle masks #199

Open
@penzn

Description


In #196 there was a discussion about which shuffle patterns are accelerated by hardware on various platforms. Right now those patterns are handled inconsistently by the toolchain, and regardless of which remedy we pick, it is good to know what gets accelerated and what does not.

Tentative list from #196 (comment) and #196 (comment):

  • Shuffles with wider lanes
  • Pack/Unpack (ABCDABCD -> AABBCCDD, reverse and the like)
  • Byte shift (shift a 128-bit value by a number of bytes, with or without wraparound)
  • Blends between two vectors with 32-, 16-, or 8-bit masks; equivalent to bitselect with a constant mask, but bitselect is slower (a constant load plus 3 instructions)
  • Restricted shuffles where the first two components come from the first vector and the last two from the second vector (SSE2)
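To make the list above concrete, here is a rough sketch of how these patterns could be recognized in the 16-entry immediate of a wasm `i8x16.shuffle` (indices 0-15 select bytes of the first operand, 16-31 bytes of the second). All function names are hypothetical; this is not how any particular toolchain or engine does its pattern matching. It only covers the two-operand form of unpack (not a single-operand ABCDABCD -> AABBCCDD or lane reversal), and it does not decide whether a match is actually faster on a given target, which is the open question of this issue.

```python
from typing import List, Optional

# Hypothetical helpers (not from any toolchain): spot some of the listed
# patterns in a 16-entry i8x16.shuffle mask. Indices 0-15 pick bytes of
# the first operand, 16-31 bytes of the second.


def as_lanes(mask: List[int], width: int) -> Optional[List[int]]:
    """Mask as `width`-byte lane indices, or None if any output lane is
    not one aligned, in-order input lane."""
    lanes = []
    for start in range(0, 16, width):
        first = mask[start]
        if first % width or any(mask[start + k] != first + k for k in range(width)):
            return None
        lanes.append(first // width)  # second operand's lanes come after the first's
    return lanes


def widest_lane(mask: List[int]) -> int:
    """Widest lane width in bytes that expresses the mask (1 = bytes only)."""
    return next((w for w in (8, 4, 2) if as_lanes(mask, w) is not None), 1)


def byte_shift_amount(mask: List[int]) -> Optional[int]:
    """s if the mask takes bytes s..s+15 of the concatenation first:second."""
    s = mask[0]
    ok = 0 < s < 16 and all(mask[i] == s + i for i in range(16))
    return s if ok else None


def rotate_amount(mask: List[int]) -> Optional[int]:
    """s if the mask rotates a single operand by s bytes (wraparound)."""
    base = 16 if mask[0] >= 16 else 0
    s = mask[0] - base
    ok = s > 0 and all(mask[i] == base + (s + i) % 16 for i in range(16))
    return s if ok else None


def blend_width(mask: List[int]) -> Optional[int]:
    """Widest lane width at which every byte stays in place, from either operand."""
    if any(m not in (i, i + 16) for i, m in enumerate(mask)):
        return None
    return widest_lane(mask)


def interleave_width(mask: List[int]) -> Optional[int]:
    """Lane width if the mask interleaves the low or high halves of both operands."""
    for width in (8, 4, 2, 1):
        lanes = as_lanes(mask, width)
        if lanes is None:
            continue
        n = 16 // width  # lanes per operand
        for half in (0, n // 2):
            expected = [x for k in range(n // 2) for x in (half + k, n + half + k)]
            if lanes == expected:
                return width
    return None


def is_two_plus_two(mask: List[int]) -> bool:
    """As 32-bit lanes: outputs 0-1 from the first operand, 2-3 from the second."""
    lanes = as_lanes(mask, 4)
    return lanes is not None and max(lanes[:2]) < 4 and min(lanes[2:]) >= 4


def classify(mask: List[int]) -> List[str]:
    """Names of every listed pattern the mask matches (possibly none)."""
    if len(mask) != 16 or any(not 0 <= m < 32 for m in mask):
        raise ValueError("expected 16 lane indices in 0..31")
    found = []
    if widest_lane(mask) > 1:
        found.append(f"wider-lanes/{widest_lane(mask) * 8}-bit")
    for name, fn in [("blend", blend_width), ("unpack", interleave_width)]:
        w = fn(mask)
        if w is not None:
            found.append(f"{name}/{w * 8}-bit")
    for name, fn in [("byte-shift", byte_shift_amount), ("rotate", rotate_amount)]:
        s = fn(mask)
        if s is not None:
            found.append(f"{name}/{s}")
    if is_two_plus_two(mask):
        found.append("two-plus-two/32-bit")
    return found
```

For example, `classify(list(range(3, 19)))` returns `['byte-shift/3']`, and the byte interleave mask `[0, 16, 1, 17, ..., 7, 23]` returns `['unpack/8-bit']`. A mask can match several categories at once (an identity mask is trivially a 64-bit blend), so a real lowering would still need a target-specific preference order.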

@zeux, thank you for your list.
