🚀 The feature, motivation and pitch
The CUDA toolkit with Blackwell support has been released, and we're working to rapidly upstream the fixes/upgrades required to support Blackwell (e.g., SM 10.0, SM 12.0).
Build fixes (these are needed to prevent kernels from crashing or to enable existing backend support):
- enable Blackwell compute capabilities in the build: [NVIDIA] Full Family Blackwell Support codegen #145436, Add support for blackwell codegen #141724
- gate sm90-specific kernels to sm90 for now (see the gating sketch after this list): [ATen][Native][CUDA][SCALED_MM] limit f8f8bf16 rowwise scaled matmul to sm_90 #145728
- limit the number of threads in avg_pool2d backward to prevent a crash on launch: [CUDA][B200] Update the number of threads in avg_pool2d backward for SM 10.0 #145669
- SDPA kernel SM gating: [ATen][CUDA][Transformers] Add Blackwell support to SDPA #145602
- CUDA 12.8 upgrade, incl. CI: Add CUDA 12.8 installation and manylinux-cuda12.8 #145567
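
To make the gating concrete, here is a minimal standalone sketch of the pattern these fixes rely on: a compile-time `__CUDA_ARCH__` guard plus a host-side compute-capability check. The kernel name `sm90_only_kernel` is hypothetical, and PyTorch's actual gating lives in ATen's dispatch code; this only illustrates the idea.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for an sm_90-only kernel (e.g. the f8f8bf16 rowwise
// scaled matmul); the body compiles to a no-op on every other architecture.
__global__ void sm90_only_kernel() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 900
    // Hopper-specific instructions would go here.
#endif
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Host-side gate: dispatch only on exactly SM 9.0 so that Blackwell
    // devices (SM 10.0 / SM 12.0) fall back to a generic path rather than
    // launching a kernel that may crash.
    if (prop.major == 9 && prop.minor == 0) {
        sm90_only_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    } else {
        std::printf("sm_%d%d: skipping sm_90-only kernel\n",
                    prop.major, prop.minor);
    }
    return 0;
}
```
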
Library upgrades (these are needed to enable Blackwell support in the math libraries; a toolkit version-check sketch follows this list):
- cuDNN upgrade to 9.7.0+
- cuBLAS upgrade (will implicitly happen with upgrade to CUDA 12.8+)
- NCCL upgrade to 2.25.1: Update to NCCL 2.25.1 for 12.8 #145776
- CUTLASS upgrade to 3.8.0: [BE]: Update Cutlass submodule to 3.8 candidate for SM100+ support #145741
- Triton upgrade to main/old pin w/ Blackwell support: Triton pin update for PyTorch 2.7 / Triton 3.3: Upgrading PyTorch-Triton to a version that Supports Blackwell #146518 (CC @drisspg)
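
As a quick sanity check for the toolkit side of these upgrades, a minimal sketch that queries the CUDA runtime and driver versions through the runtime API; cuDNN and NCCL expose analogous `cudnnGetVersion()` / `ncclGetVersion()` calls, omitted here to keep the example header-light. The `12080` threshold is just the encoded form of CUDA 12.8.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtime_ver = 0, driver_ver = 0;
    // Versions are encoded as 1000 * major + 10 * minor, so CUDA 12.8 -> 12080.
    cudaRuntimeGetVersion(&runtime_ver);
    cudaDriverGetVersion(&driver_ver);
    std::printf("CUDA runtime %d, driver %d\n", runtime_ver, driver_ver);
    if (runtime_ver < 12080) {
        std::printf("runtime < 12.8: Blackwell (SM 10.0 / SM 12.0) "
                    "is not supported\n");
    }
    return 0;
}
```
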
Performance upgrades (existing kernels w/ improved implementation on Blackwell):
- 128-bit vectorization (sketch below): [ATen][CUDA] Implement 128 bit vectorization v2 #145746
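
A minimal sketch of what 128-bit vectorization means at the kernel level: each thread moves one `float4` (16 bytes) per load/store instead of a single `float`, cutting the number of memory transactions by 4x. ATen's implementation vectorizes its elementwise kernels generically and handles alignment and tail elements; this standalone copy kernel only illustrates the access pattern and assumes the element count is a multiple of 4.

```cuda
#include <cuda_runtime.h>

// Copy kernel using 128-bit (float4) loads and stores: one vectorized
// transaction replaces four scalar ones.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n_vec4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec4) {
        out[i] = in[i];  // single 128-bit load + single 128-bit store
    }
}

int main() {
    const int n = 1 << 20;   // float count; assumed divisible by 4
    const int n_vec4 = n / 4;
    float4 *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    const int threads = 256;
    const int blocks = (n_vec4 + threads - 1) / threads;
    copy_vec4<<<blocks, threads>>>(in, out, n_vec4);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```
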