Conversation

@maddyscientist
Member

This work is the latest towards optimizing QUDA for Blackwell:

  • Adds support for "spatial prefetching", where we over-fetch data into L2 when issuing a global load. Exposed as an optional template parameter to vector_load. At present, not deployed anywhere.
  • Adds support for prefetching instructions, in the form of both per-thread prefetching (which works on all CUDA architectures) and TMA-based prefetching, which is Hopper+ only. The prefetching type is set using the QUDA_DSLASH_PREFETCH CMake parameter, with 0=per-thread, 1=TMA bulk, and 2=TMA descriptor.
  • Adds an experimental L1 prefetch (using LDGSTS). Disabled, but left for future experiments.
  • Adds the single-threaded execution-region helper function target::is_thread_zero(), which should be used for TMA issuance.
  • Optionally stores the backward-shifted gauge field. This simplifies all dslash indexing, as all spatial indices then correspond to "this" site. Enabled with QUDA_DSLASH_DOUBLE_STORE=ON, which is required for TMA-based prefetching (for alignment reasons).
  • Prefetching is exposed for both ColorSpinorFields and GaugeFields, though only the latter is actually used at present.
  • Adds prefetching support to both the Wilson and Staggered dslash kernels, parameterized using the QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED CMake parameters.
  • Optimizes the neighbor indexing for the dslash kernels, reducing integer instruction overhead.
  • Reduces pointer-arithmetic overhead (using more 32-bit integer operations where possible). Adds three-operand and four-operand variants of vector_load and vector_store, respectively, to this end.
  • Optimizes FFMA2 issuance to reduce the total number of floating-point instructions on Blackwell.
  • Optimizes short <-> float conversion to reduce instruction overhead.
  • Optimizes the staggered packing kernels (replacing division by int with division by fast_intdiv).
  • Extends host-side OpenMP parallelization that was previously missing.
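The fast_intdiv replacement in the staggered packing kernels can be illustrated with a minimal host-side sketch. This shows the general magic-number (multiply-and-shift) technique, not QUDA's actual implementation; the struct layout and operator are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Minimal sketch, in the spirit of QUDA's fast_intdiv: replace runtime
// integer division by a fixed positive divisor d with a multiply by a
// precomputed "magic" number followed by a shift.  Exact for any
// non-negative 31-bit dividend.
struct fast_intdiv {
  uint64_t magic; // ceil(2^shift / d)
  int shift;      // 32 + ceil(log2(d))
  int d;

  explicit fast_intdiv(int d_) : d(d_)
  {
    int l = 0;
    while ((1u << l) < static_cast<unsigned>(d)) l++; // l = ceil(log2(d))
    shift = 32 + l;
    // magic = ceil(2^shift / d); fits in 64 bits for any positive int d
    magic = ((static_cast<uint64_t>(1) << shift) + d - 1) / d;
  }

  // quotient via one 64-bit multiply and one shift, no division instruction
  friend int operator/(int x, const fast_intdiv &f)
  {
    return static_cast<int>((static_cast<uint64_t>(x) * f.magic) >> f.shift);
  }
};
```

On GPUs this trades an expensive integer-divide sequence for a single wide multiply-add, which is why it pays off in index-heavy packing kernels.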

The end result of this work is that both the Staggered and Wilson dslash kernels can sustain over 90% of memory bandwidth for most variants. Still outstanding are the half-precision variants using reconstruction, which continue to lag; these will be the focus of a subsequent PR.
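The prefetch-distance idea used in the dslash kernels can be sketched on the host: while processing element i, request element i + distance so the data is (hopefully) already in cache when it is needed. The kernels use per-thread or TMA prefetch instructions; __builtin_prefetch below is only the portable host analogue, and the function name is illustrative:

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of software prefetching with a compile-time distance.
// The prefetch is purely a performance hint: the result is identical
// with or without it.
template <int distance>
double strided_sum(const std::vector<double> &v)
{
  double sum = 0.0;
  const size_t n = v.size();
  for (size_t i = 0; i < n; i++) {
    if (i + distance < n)
      __builtin_prefetch(&v[i + distance], /*rw=*/0, /*locality=*/3);
    sum += v[i];
  }
  return sum;
}
```

The right distance depends on the latency being hidden, which is why the PR exposes it per-kernel via the QUDA_DSLASH_PREFETCH_DISTANCE_* CMake parameters rather than hard-coding one value.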

…tead of logic operations when computing the neighboring index; this is branch-free and uses fewer operations
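Branch-free periodic neighbor indexing along one lattice dimension can be sketched as below; this is an illustration of the technique, not QUDA's actual code, and the function names are invented. The conditional expressions compile to select instructions rather than branches:

```cpp
#include <cassert>

// Hedged sketch: wraparound neighbor indices on a periodic dimension of
// extent X, written so the compiler emits a select (no branch).
inline int neighbor_plus(int x, int X)
{
  int y = x + 1;
  return y == X ? 0 : y; // wrap X-1 -> 0
}

inline int neighbor_minus(int x, int X)
{
  return x == 0 ? X - 1 : x - 1; // wrap 0 -> X-1
}
```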
…for executing single-thread regions of code. On CUDA, install the latest version of CCCL via CPM, since we need some new features
…slash kernels. Disabled by default (set with the Arg::prefetch_distance parameter); TMA prefetch will be added in the next push
…ith QUDA_DSLASH_PREFETCH_BULK=ON). Prefetch distance is now set via CMake (QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED)
…ants of vector_load and vector_store: these allow the pointer offset and the index to be computed together first in 32-bit, before accumulation onto the pointer in 64-bit, reducing pointer-arithmetic overheads
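The 32-bit indexing idea can be illustrated with a minimal sketch; the name and signature below are illustrative, not QUDA's actual vector_load API:

```cpp
#include <cstdint>

// Hedged sketch of a three-operand load: the site index and component
// offset are folded into a single 32-bit multiply-add, and the 32->64-bit
// widening happens only once, when the offset is applied to the pointer.
template <typename T>
inline T vector_load(const T *base, int site, int component, int stride)
{
  int offset = component * stride + site;        // stays in 32-bit registers
  return *(base + static_cast<int64_t>(offset)); // single widening operation
}
```

Keeping the intermediate arithmetic in 32-bit registers halves the register pressure and instruction count of the address computation relative to doing everything in 64-bit.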
…d and vector_store to reduce indexing overheads
TMA (Tensor Memory Accelerator) is only available on Hopper (sm_90) and
later architectures. This commit wraps the cuTensorMapEncodeTiled calls
in a compile-time guard to prevent runtime errors on Volta/Ampere GPUs.
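The shape of such a guard is sketched below; the exact macro test is an assumption for illustration, not the commit's literal code:

```cpp
// Hedged sketch: compile the TMA descriptor path only when the toolkit
// provides it, so pre-Hopper builds fall back cleanly.
#if defined(QUDA_TARGET_CUDA) && (__CUDACC_VER_MAJOR__ >= 12)
  // Hopper+ path: encode the tensor map for TMA-based prefetching
  // cuTensorMapEncodeTiled(&tensor_map, ...);
#else
  // pre-Hopper path: TMA unavailable, use per-thread prefetching instead
#endif
```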
…configs should be chosen when doing full dslash (whether or not TMA is used)
CPMAddPackage(
  NAME CCCL
  GITHUB_REPOSITORY nvidia/cccl
  GIT_TAG main # Fetches the latest commit on the main branch
)
Member Author

Fix this with a specific tag

…nd legacy architectures. Updated some deprecated calls to modern equivalents
… not include RHS dimension). Remove legacy dslash constants no longer used
…100, e.g., we only tune over max L1 or max shared mem. No observed effect on performance, and the default can be overridden with an environment variable