Conversation

@maddyscientist
Member

This work is the latest towards optimizing QUDA for Blackwell:

  • Adds support for "spatial prefetching", where we over-fetch data into L2 when issuing a global load. Exposed as an optional template parameter to vector_load. At present, not deployed anywhere.
  • Adds support for prefetching instructions, in the form of both per-thread prefetching (which works on all CUDA architectures) and TMA-based prefetching, which is Hopper+ only. The prefetching type is set using the QUDA_DSLASH_PREFETCH CMake parameter, with 0=per-thread, 1=TMA bulk, and 2=TMA descriptor.
  • Adds an experimental L1 prefetch (using LDGSTS). Disabled, but left for future experiments.
  • Adds the single-threaded execution-region helper function target::is_thread_zero(), which should be used for TMA issuance.
  • Optionally stores the backward-shifted gauge field. This simplifies all dslash indexing, as all spatial indices then correspond to "this" site. Enabled with QUDA_DSLASH_DOUBLE_STORE=ON, which is required for TMA-based prefetching (for alignment reasons).
  • Prefetching is exposed for both ColorSpinorFields and GaugeFields, though only the latter is actually used at present.
  • Adds prefetching support to both the Wilson and Staggered dslash kernels, parameterized using the QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED CMake parameters.
  • Optimizes the neighbor indexing for the dslash kernels, reducing integer instruction overhead.
  • Reduces pointer-arithmetic overhead (using more 32-bit integer operations where possible). Adds three-operand and four-operand variants of vector_load and vector_store, respectively, to this end.
  • Optimizes FFMA2 issuance to reduce the total number of floating-point instructions on Blackwell.
  • Optimizes short <-> float conversion to reduce instruction overhead.
  • Optimizes the staggered packing kernels (replacing division by int with division by fast_intdiv).
  • Extends host-side OpenMP parallelization that was previously missing.
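The fast_intdiv replacement in the staggered packing kernels can be illustrated with a minimal host-side sketch. This shows the general magic-number (multiply-and-shift) technique, not QUDA's actual implementation; the struct layout and operator are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Minimal sketch, in the spirit of QUDA's fast_intdiv: replace runtime
// integer division by a fixed positive divisor d with a multiply by a
// precomputed "magic" number followed by a shift.  Exact for any
// non-negative 31-bit dividend.
struct fast_intdiv {
  uint64_t magic; // ceil(2^shift / d)
  int shift;      // 32 + ceil(log2(d))
  int d;

  explicit fast_intdiv(int d_) : d(d_)
  {
    int l = 0;
    while ((1u << l) < static_cast<unsigned>(d)) l++; // l = ceil(log2(d))
    shift = 32 + l;
    // magic = ceil(2^shift / d); fits in 64 bits for any positive int d
    magic = ((static_cast<uint64_t>(1) << shift) + d - 1) / d;
  }

  // quotient via one 64-bit multiply and one shift, no division instruction
  friend int operator/(int x, const fast_intdiv &f)
  {
    return static_cast<int>((static_cast<uint64_t>(x) * f.magic) >> f.shift);
  }
};
```

On GPUs this trades an expensive integer-divide sequence for a single wide multiply-add, which is why it pays off in index-heavy packing kernels.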

The end result of this work is that both the Staggered and Wilson dslash kernels can sustain over 90% of memory bandwidth for most variants. Still outstanding are the half-precision variants using reconstruction, which continue to lag; these will be the focus of a subsequent PR.
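The prefetch-distance idea used in the dslash kernels can be sketched on the host: while processing element i, request element i + distance so the data is (hopefully) already in cache when it is needed. The kernels use per-thread or TMA prefetch instructions; __builtin_prefetch below is only the portable host analogue, and the function name is illustrative:

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of software prefetching with a compile-time distance.
// The prefetch is purely a performance hint: the result is identical
// with or without it.
template <int distance>
double strided_sum(const std::vector<double> &v)
{
  double sum = 0.0;
  const size_t n = v.size();
  for (size_t i = 0; i < n; i++) {
    if (i + distance < n)
      __builtin_prefetch(&v[i + distance], /*rw=*/0, /*locality=*/3);
    sum += v[i];
  }
  return sum;
}
```

The right distance depends on the latency being hidden, which is why the PR exposes it per-kernel via the QUDA_DSLASH_PREFETCH_DISTANCE_* CMake parameters rather than hard-coding one value.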

…tead of logic operations when computing the neighboring index; this is branch-free and uses fewer operations
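Branch-free periodic neighbor indexing along one lattice dimension can be sketched as below; this is an illustration of the technique, not QUDA's actual code, and the function names are invented. The conditional expressions compile to select instructions rather than branches:

```cpp
#include <cassert>

// Hedged sketch: wraparound neighbor indices on a periodic dimension of
// extent X, written so the compiler emits a select (no branch).
inline int neighbor_plus(int x, int X)
{
  int y = x + 1;
  return y == X ? 0 : y; // wrap X-1 -> 0
}

inline int neighbor_minus(int x, int X)
{
  return x == 0 ? X - 1 : x - 1; // wrap 0 -> X-1
}
```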
…for executing single-thread regions of code. On CUDA, install the latest version of CCCL via CPM, since we need some new features
…slash kernels. Disabled by default (set with the Arg::prefetch_distance parameter); TMA prefetch will be added in the next push
…ith QUDA_DSLASH_PREFETCH_BULK=ON). Prefetch distance is now set via CMake (QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED)
…ants of vector_load and vector_store: these allow the pointer offset and the index to be computed together first in 32-bit, before accumulation onto the pointer in 64-bit, reducing pointer-arithmetic overheads
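The 32-bit indexing idea can be illustrated with a minimal sketch; the name and signature below are illustrative, not QUDA's actual vector_load API:

```cpp
#include <cstdint>

// Hedged sketch of a three-operand load: the site index and component
// offset are folded into a single 32-bit multiply-add, and the 32->64-bit
// widening happens only once, when the offset is applied to the pointer.
template <typename T>
inline T vector_load(const T *base, int site, int component, int stride)
{
  int offset = component * stride + site;        // stays in 32-bit registers
  return *(base + static_cast<int64_t>(offset)); // single widening operation
}
```

Keeping the intermediate arithmetic in 32-bit registers halves the register pressure and instruction count of the address computation relative to doing everything in 64-bit.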
…d and vector_store to reduce indexing overheads
TMA (Tensor Memory Accelerator) is only available on Hopper (sm_90) and
later architectures. This commit wraps the cuTensorMapEncodeTiled calls
in a compile-time guard to prevent runtime errors on Volta/Ampere GPUs.
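The shape of such a guard is sketched below; the exact macro test is an assumption for illustration, not the commit's literal code:

```cpp
// Hedged sketch: compile the TMA descriptor path only when the toolkit
// provides it, so pre-Hopper builds fall back cleanly.
#if defined(QUDA_TARGET_CUDA) && (__CUDACC_VER_MAJOR__ >= 12)
  // Hopper+ path: encode the tensor map for TMA-based prefetching
  // cuTensorMapEncodeTiled(&tensor_map, ...);
#else
  // pre-Hopper path: TMA unavailable, use per-thread prefetching instead
#endif
```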
…configs should be chosen when doing full dslash (whether or not TMA is used)
CPMAddPackage(
  NAME CCCL
  GITHUB_REPOSITORY nvidia/cccl
  GIT_TAG main # Fetches the latest commit on the main branch
)
Member Author

Fix this with a specific tag

…nd legacy architectures. Updated some deprecated calls to modern equivalents
… not include RHS dimension). Remove legacy dslash constants no longer used
…100, e.g., we only tune over max L1 or max shared mem. No observed effect on performance, and the default can be overridden with an environment variable