Maximal vectorization #1548

maddyscientist · 2025-03-26T20:54:20Z

This PR is a significant cleanup and reworking of the native QUDA accessors.

The explicit FLOAT2, FLOAT4, and FLOAT8 data orderings are gone; there is now a single NATIVE ordering.
The data ordering is now set for all fields by CMake parameters:
- QUDA_ORDER_DOUBLE, QUDA_ORDER_SINGLE, QUDA_ORDER_HALF, QUDA_ORDER_QUARTER
- The value of each parameter is the desired inner vector length, e.g., 4 gives a FLOAT4 accessor
For fields whose degrees of freedom are not a multiple of the vector length, the remainder is handled explicitly.
- E.g., for an SU(3) field with 18 real numbers and a FLOAT4 accessor, we would have 4x FLOAT4 ld/st instructions and 1x FLOAT2 remainder.
Vector lengths 2, 4, 8, and 16 are supported (up to 256 bits in total length).
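
To make the remainder handling concrete, here is a minimal host-side sketch of the arithmetic described above. The names `VectorSplit`, `split`, and `load_site` are hypothetical and are not QUDA's accessor API. The sketch also assumes a contiguous per-site layout, whereas the real accessors work on strided device memory, and `std::memcpy` merely stands in for a vector load.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not QUDA's API): how n_real real numbers per site
// split into full length-N vector accesses plus one narrower remainder access.
struct VectorSplit {
  std::size_t n_full;    // number of full length-N vector loads/stores
  std::size_t remainder; // leftover reals handled by a single narrower access
};

constexpr VectorSplit split(std::size_t n_real, std::size_t N)
{
  return VectorSplit{n_real / N, n_real % N};
}

// SU(3) link: 3x3 complex = 18 reals. With N = 4: four FLOAT4 plus one FLOAT2.
static_assert(split(18, 4).n_full == 4, "expected four full vectors");
static_assert(split(18, 4).remainder == 2, "expected a length-2 remainder");
// N = 2 divides 18 exactly, so there is no remainder.
static_assert(split(18, 2).n_full == 9 && split(18, 2).remainder == 0, "");

// Copy one site's worth of data in chunks of N, then the remainder.
// Each memcpy stands in for one vectorized load; layout assumed contiguous.
template <std::size_t N, std::size_t n_real>
void load_site(float (&out)[n_real], const float *in)
{
  constexpr VectorSplit s = split(n_real, N);
  for (std::size_t v = 0; v < s.n_full; v++)
    std::memcpy(out + v * N, in + v * N, N * sizeof(float));
  if (s.remainder > 0)
    std::memcpy(out + s.n_full * N, in + s.n_full * N, s.remainder * sizeof(float));
}
```

Calling `load_site<4>(out, in)` with `float out[18]` performs four length-4 copies followed by one length-2 copy, mirroring the 4x FLOAT4 + 1x FLOAT2 example in the description.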

The motivation for this work is to increase the use of vectorized loads and stores, to improve performance on more recent GPUs.

…essors. Remove any explicit use of gauge::FloatNOrder in "user code" to ensure that the order is not hard-coded anywhere. Maximal vectorization not enabled anywhere yet (since not applied to fine-grained accessor yet)

maddyscientist added 11 commits March 13, 2025 09:36

Add array-variants of vector_load / vector_store functions. This will…

e14982e

… allow us to remove the type punning in the accessors

Fix bug in mom accessor

fb9bcf7

Some fixes to maximal vectorization for gauge::FloatNOrder. Exposed o…

e943fed

…rder used for double and single precision as CMake parameters (QUDA_ORDER_DOUBLE and QUDA_ORDER_SINGLE). Removed some casting of stack variables

Rewrite of colorspinor::FloatNOrder to remove casting of stack variables

78d5559

Rewrite of clover::FloatNOrder to remove casting of stack variables

38dfe2b

Maximal vectorization now complete and verified (including fine-grain…

95b3c0b

…ed accessors)

Merge branch 'develop' of github.com:lattice/quda into feature/vectorize

c0fc78e

Maximal vectorization applied to colorspinor

8e60631

Apply maximal vectorization to clover fields

b2cd5f9

QUDA now supports 256-bit ld/st

f39f0b9

maddyscientist added feature clean-up labels Mar 26, 2025

maddyscientist requested review from a team as code owners March 26, 2025 20:54

maddyscientist added 2 commits March 26, 2025 16:22

Fix CI warning

99a4f8e

Add inline ptx support for 256-bit ld/st on Blackwell+

39ef1f0

maddyscientist requested a review from a team as a code owner April 21, 2025 19:59

maddyscientist added 3 commits April 21, 2025 13:04

Blackwell is cc 1000 not 10000

58225f9

Remove all use of make_double / make_float calls: unnecessary and les…

851e663

…s portable

Cleanup legacy code

d5e9f7d

weinbe2 marked this pull request as draft April 28, 2025 21:28

maddyscientist added 8 commits April 29, 2025 11:55

Qualify blas vector lengths

c7f322b

Add support for legacy data vector ordering (where no remainder is pr…

7950c91

…esent). To enable, set QUDA_ORDER to 0

DslashArg::nFace is now a static member

e692ceb

Remove unnecessary DslashArg::dim (DslashArg::DslashConstant::X has t…

3222df2

…he same information

Fix bug with blas kernels that use site unrolling

404fa1c

Absorb nFace factor into DslashArg::ghostFaceCB and PackArg::ghostFac…

17b7352

…eCB, to reduce IMAD overhead. Remove DslashArg::ghostFace

Fix long-standing bug in Ndeg and DWF fermions when using inlined hal…

af9b6eb

…o packer

Remove unused file

908643f

maddyscientist added 30 commits June 3, 2025 14:03

Fix bug with shared memory carve out tuning for dslash kernels that u…

51437f8

…se shared memory

Fix alt I2F for large arrays

502c5c3

WAR for compiler error

27b12de

Fix bug in HIP math helper

8b2e77e

Fix minor bug in tests/CMakeLists.txt

ea0122a

Try again to silence compiler bugs

b737fae

Add new byte_array type which is an indexable length-4 array of bytes.

3e4aca4

Replace use of thread_array with byte_array

a2cd51f

Use byte_array for gauge_heatbath: this fixes the stackframe

14296d1

Move constexpr_for to its own header

0021c24

Fix stack frame with DeGrandRossi contraction

e3af673

llfat compute staple kernel now uses byte_array

2a97890

Add float8 type and associated load/store functions

1e5856c

Fix clang warning

fd605c9

Fix clang warning

4fec599

Merge branch 'develop' of github.com:lattice/quda into feature/vectorize

d75f05b

Disable flush denormals to zero for nvcc (temporary)

93399bd

Merge branch 'develop' of github.com:lattice/quda into feature/vectorize

fe8e385

Fixes for CUDA clang

8a7e34d

Merge branch 'develop' into feature/vectorize

fa6c257

Fix mismatch between device used for CUDA and device monitored: use P…

b84dbe1

…CIe bus id to ensure consistency

Add experimental warp_stride option to SharedMemoryCache which allows…

41ea54f

… for all load/store to shared memory to be done using immediates. Left disabled for now

Merge branch 'develop' of github.com:lattice/quda into feature/vectorize

a005d96

Merge branch 'hotfix/deprecated' of github.com:lattice/quda into feat…

f45f962

…ure/vectorize

__maxnreg__ requires CUDA 12.4 or above

16a54ba

Merge branch 'develop' of github.com:lattice/quda into feature/vectorize

44e003e

Fix warnings in nvc++

7bd5126

Suppress warning from nvc++ resulting from misdiagnosis

785f5b5

nvc++ does not support __maxnreg__ at present

2e47468

Add sm_100 to list of SMs

cfcec2f
