Maximal vectorization #1548

Draft
wants to merge 81 commits into
base: develop

Conversation

maddyscientist
Member

This PR is a significant cleanup and reworking of the native QUDA accessors.

  • The explicit FLOAT2, FLOAT4, and FLOAT8 data orderings are gone; there is now a single NATIVE ordering
  • The data ordering is now set for all fields by CMake parameters
    • QUDA_ORDER_DOUBLE, QUDA_ORDER_SINGLE, QUDA_ORDER_HALF, QUDA_ORDER_QUARTER
    • The values of these correspond to the inner vector length desired, e.g., 4 would be a FLOAT4 accessor
  • For fields whose degrees of freedom are not a multiple of the vector length, the remainder is handled explicitly
    • E.g., for an SU(3) field with 18 real numbers and a FLOAT4 accessor, we would have 4x FLOAT4 ld/st instructions and 1x FLOAT2 remainder.
  • Vector lengths 2, 4, 8, and 16 are supported (up to 256-bit in total length)
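
The remainder handling described above can be illustrated with a small host-side C++ sketch. This is not QUDA's implementation: `Chunk`, `decompose`, and `fits_256bit` are hypothetical names, and the policy of covering the remainder with progressively narrower power-of-two vectors is an assumption that happens to reproduce the 4x FLOAT4 + 1x FLOAT2 split for 18 reals. The width check is one reading of the "up to 256-bit" limit, taken here as vector length times element size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of QUDA): one vectorized access of
// `width` reals starting at real-number offset `offset` within a site.
struct Chunk {
  std::size_t offset;
  std::size_t width;
};

// Split `length` reals into full vectors of width `n`, then cover whatever
// is left with progressively narrower power-of-two vectors (assumed policy).
std::vector<Chunk> decompose(std::size_t length, std::size_t n)
{
  assert(n > 0 && (n & (n - 1)) == 0); // n must be a power of two (2, 4, 8, 16)
  std::vector<Chunk> chunks;
  std::size_t offset = 0;
  for (std::size_t w = n; w >= 1; w /= 2) {
    while (length - offset >= w) {
      chunks.push_back({offset, w});
      offset += w;
    }
  }
  return chunks;
}

// Assumed interpretation of the 256-bit cap: vector length times element
// size in bits must not exceed 256.
constexpr bool fits_256bit(std::size_t bytes_per_real, std::size_t n)
{
  return bytes_per_real * n * 8 <= 256;
}
```

Under these assumptions, `decompose(18, 4)` yields four chunks of width 4 at offsets 0, 4, 8, and 12, followed by one chunk of width 2 at offset 16, matching the SU(3) example in the list. With 8-byte doubles, `fits_256bit(8, 4)` holds while `fits_256bit(8, 8)` does not.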

The motivation for this work is to increase the use of vectorized loads and stores, to improve performance on more recent GPUs.

@maddyscientist maddyscientist requested a review from a team as a code owner April 21, 2025 19:59
@weinbe2 weinbe2 marked this pull request as draft April 28, 2025 21:28
… for all load/store to shared memory to be done using immediates. Left disabled for now