release 2.11 (NVIDIA#703)

BAfsharmanesh · Nov 19, 2022 · c975e2c
1 parent 3c90f6a
commit c975e2c
Showing 329 changed files with 47,268 additions and 10,543 deletions.
diff --git a/.github/labeler.yml b/.github/labeler.yml
diff --git a/.github/workflows/labeler.yml b/.github/workflows/labeler.yml
@@ -5,8 +5,7 @@ on:
 jobs:
   triage:
     runs-on: ubuntu-latest
-    permissions: read-all|write-all
     steps:
-    - uses: actions/labeler@master
+    - uses: actions/labeler@main
       with:
         repo-token: "${{ secrets.GITHUB_TOKEN }}"
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,27 @@
 # NVIDIA CUTLASS Changelog
 
+## [2.11.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.11.0) (2022-11-19)
+* Stream-K, a new, general way to do split-K.  It can improve performance and can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
+* [Fused multi-head attention Kernel](/examples/41_fused_multi_head_attention).  It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses group GEMM for variable sequence lengths.  Each variant needs only one kernel.
+* [Dual GEMM](/examples/45_dual_gemm), which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
+* Hopper improves [double precision matrix multiplication](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm90.cu) by 2x compared to Ampere at iso-clocks. It has been supported since CUDA 11.8.
+* [BLAS3](/test/unit/gemm/device/hemm_cf64_cf64_cf64_tensor_op_f64_sm90.cu) functions with Hopper's new double precision matrix multiplication instructions.
+* [ELL Block Sparse GEMM](/examples/43_ell_block_sparse_gemm), which uses an [ELL matrix](https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/) to describe the sparsity of the A matrix.  The B and output matrices are still dense. The block size can be arbitrary.
+* Optimized [Group Conv](/examples/42_ampere_tensorop_group_conv) for SingleGroup mode, which requires that the number of output channels per group be a multiple of the threadblock tile N.
+* [Optimized DepthWise Conv](/examples/46_depthwise_simt_conv2dfprop/depthwise_simt_conv2dfprop.cu).  Two new modes are added:
+  * [kOptimized](/test/unit/conv/device/depthwise_conv2d_fprop_direct_conv_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) - uses direct convolution instead of implicit GEMM.
+    * The restrictions are: 1) the input channel, output channel, and group counts should be multiples of (128 / sizeof(input element)); 2) the input filter size should be the same as the template parameter configuration.
+  * [kFixedStrideDilation](/test/unit/conv/device/depthwise_conv2d_fprop_direct_conv_fixed_stride_dilation_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended.
+    * The restrictions are: 1) the input channel, output channel, and group counts should be multiples of (128 / sizeof(input element)); 2) the input filter size, stride, and dilation should be the same as the template parameter configuration.
+* [Scripts](/examples/44_multi_gemm_ir_and_codegen) to fuse multiple back-to-back GEMMs.  The implementation was discussed in a GTC'22 Spring [talk](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41606/).
+* [FP8 data type definition](/include/cutlass/float8.h) and [conversion routines](/include/cutlass/numeric_conversion.h#L1274-2115).
+* Updates and bugfixes from the community (thanks!).  Big shout out to Meta's [xFormers](https://github.com/facebookresearch/xformers).
+
+* **Deprecation announcement:** CUTLASS plans to deprecate the following:
+  * Maxwell and Pascal GPU architectures
+  * Ubuntu 16.04
+  * CUDA 10.2
+
 ## [2.10.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.10.0) (2022-08-23)
 * [CUTLASS Python](/examples/40_cutlass_py) now supports GEMM, CONV, Group GEMM for different data types as well as different epilogue flavours.  
 * Optimizations for CUTLASS's [Grouped GEMM](examples/24_gemm_grouped/gemm_grouped.cu) kernel.  Threadblock scheduling part is improved.  Some computation can be moved to the host side if applicable.  [Grouped Syr2k](examples/38_syr2k_grouped/syr2k_grouped.cu) kernels are added, too.
@@ -16,11 +38,6 @@
 * Optimal performance using [**CUDA 11.6u2**](https://developer.nvidia.com/cuda-downloads)
 * Updates and bugfixes from the community (thanks!)
 
-* **Deprecation announcement:** CUTLASS plans to deprecate the following:
-  * Maxwell and Pascal GPU architectures
-  * Ubuntu 16.04
-  * CUDA 10.2
-
 ## [2.9.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.9.0) (2022-04-21)
 
 * [First layer Convolution kernels](/test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) specialized for small channel counts and reduced alignment
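
Review note on the Dual GEMM entry in the 2.11.0 changelog above: the entry says A x B and A x C are fused into one kernel and that the two GEMMs have no producer-consumer dependency. The following is a minimal pure-Python reference sketch of that math only, not the CUTLASS kernel. The function names `matmul` and `dual_gemm_reference` are hypothetical and chosen for illustration; the point is that each row of A can be read once and reused for both independent products.

```python
def matmul(x, y):
    """Naive reference GEMM on matrices stored as lists of rows."""
    inner = len(y)
    assert all(len(row) == inner for row in x), "inner dimensions must match"
    cols = len(y[0])
    return [[sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
            for row in x]


def dual_gemm_reference(a, b, c):
    """Compute D0 = A x B and D1 = A x C in a single pass over the rows of A.

    Hypothetical illustration: the two products share the operand A and
    neither result feeds the other, so they can be produced side by side.
    """
    inner = len(b)
    assert len(c) == inner, "B and C must share the K dimension with A"
    d0, d1 = [], []
    for row in a:  # each row of A is visited once and used for both products
        d0.append([sum(row[k] * b[k][j] for k in range(inner))
                   for j in range(len(b[0]))])
        d1.append([sum(row[k] * c[k][j] for k in range(inner))
                   for j in range(len(c[0]))])
    return d0, d1


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]  # identity, so A x C should equal A
d0, d1 = dual_gemm_reference(a, b, c)
```

Because the outputs are independent, the fused result should match two separate reference GEMMs, which makes `matmul` a convenient check when validating a fused implementation.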

diff --git a/CITATION.cff b/CITATION.cff
@@ -73,10 +73,10 @@ abstract: >-
 keywords:
   - 'cutlass, tensor cores, cuda'
 license: BSD-3-Clause
-license-url: https://github.com/NVIDIA/cutlass/blob/v2.10.0/LICENSE.txt
-version: '2.10.0'
-date-released: '2022-09-15'
+license-url: https://github.com/NVIDIA/cutlass/blob/v2.11.0/LICENSE.txt
+version: '2.11.0'
+date-released: '2022-11-19'
 identifiers:
   - type: url
-    value: "https://github.com/NVIDIA/cutlass/tree/v2.10.0"
-    description: The GitHub release URL of tag 2.10.0
+    value: "https://github.com/NVIDIA/cutlass/tree/v2.11.0"
+    description: The GitHub release URL of tag 2.11.0
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -38,7 +38,7 @@ endif()
 
 message(STATUS "CMake Version: ${CMAKE_VERSION}")
 
-project(CUTLASS VERSION 2.10.0 LANGUAGES CXX)
+project(CUTLASS VERSION 2.11.0 LANGUAGES CXX)
 include(${CMAKE_CURRENT_SOURCE_DIR}/CUDA.cmake)
 
 if (CUDA_VERSION VERSION_LESS 10.2)
@@ -87,6 +87,7 @@ set(CUTLASS_ENABLE_EXAMPLES ${CUTLASS_ENABLE_EXAMPLES_INIT} CACHE BOOL "Enable C
 set(CUTLASS_ENABLE_TOOLS ${CUTLASS_ENABLE_TOOLS_INIT} CACHE BOOL "Enable CUTLASS Tools")
 set(CUTLASS_ENABLE_LIBRARY ${CUTLASS_ENABLE_LIBRARY_INIT} CACHE BOOL "Enable CUTLASS Library")
 set(CUTLASS_ENABLE_PROFILER ${CUTLASS_ENABLE_LIBRARY} CACHE BOOL "Enable CUTLASS Profiler")
+set(CUTLASS_ENABLE_PERFORMANCE ${CUTLASS_ENABLE_PROFILER} CACHE BOOL "Enable CUTLASS Performance")
 
 if(${CMAKE_PROJECT_NAME} STREQUAL ${PROJECT_NAME})
   set(CUTLASS_ENABLE_TESTS_INIT ${CUTLASS_ENABLE_LIBRARY})
@@ -122,6 +123,9 @@ endif()
 if (NOT CUDA_VERSION VERSION_LESS 11.1 AND NOT CUDA_COMPILER MATCHES "[Cc]lang")
   list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 86)
 endif()
+if (NOT CUDA_VERSION VERSION_LESS 11.8 AND NOT CUDA_COMPILER MATCHES "[Cc]lang")
+  list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 90)
+endif()
 set(CUTLASS_NVCC_ARCHS ${CUTLASS_NVCC_ARCHS_SUPPORTED} CACHE STRING "The SM architectures requested.")
 set(CUTLASS_NVCC_ARCHS_ENABLED ${CUTLASS_NVCC_ARCHS} CACHE STRING "The SM architectures to build code for.")
 
@@ -569,6 +573,9 @@ install(DIRECTORY DESTINATION ${CUTLASS_TEST_INSTALL_PREFIX}/ctest)
 
 ################################################################################
 
+set(CUTLASS_ENABLE_CUBLAS OFF CACHE BOOL "cuBLAS usage for tests")
+set(CUTLASS_ENABLE_CUDNN OFF CACHE BOOL "cuDNN usage for tests")
+
 include(${CMAKE_CURRENT_SOURCE_DIR}/cuBLAS.cmake)
 
 if (CUTLASS_ENABLE_CUBLAS)
@@ -732,7 +739,7 @@ if (CUTLASS_ENABLE_TOOLS)
   add_subdirectory(tools)
   if (CUTLASS_ENABLE_PROFILER)
     add_dependencies(test_all test_profiler)
-  endif()  
+  endif()
 endif()
 if (CUTLASS_ENABLE_EXAMPLES)
   add_subdirectory(examples)

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
@@ -7,10 +7,10 @@
 This is the official list of CUTLASS developers and contributors.
 
 ## DEVELOPERS
-Andrew Kerr  
-Haicheng Wu  
-Manish Gupta  
-Dustyn Blasig  
+Andrew Kerr 
+Haicheng Wu 
+Manish Gupta
+Dustyn Blasig
 Pradeep Ramani
 Cris Cecka
 Vijay Thakkar
@@ -20,52 +20,50 @@ Ethan Yan
 Zhaodong Chen
 Jack Kosaian
 Yujia Zhai
-Naila Farooqui  
-Piotr Majcher  
-Paul Springer  
-Jin Wang   
-Chinmay Talegaonkar  
-Shang Zhang   
-Scott Yokim      
-Markus Hohnerbach  
-Aditya Atluri  
-David Tanner  
-Manikandan Ananth  
+Naila Farooqui
+Piotr Majcher
+Paul Springer
+Jin Wang
+Chinmay Talegaonkar
+Shang Zhang
+Scott Yokim
+Markus Hohnerbach
+Aditya Atluri
+David Tanner
+Manikandan Ananth
 
 ## CUTLASS Product Manager
 Matthew Nicely
 
 ## CONTRIBUTORS
-Timothy Costa  
-Julien Demouth  
-Brian Fahs  
-Michael Goldfarb  
-Mostafa Hagog  
-Fei Hu  
-Alan Kaatz  
-Tina Li  
-Timmy Liu  
-Duane Merrill  
-Kevin Siu  
-Markus Tavenrath  
-John Tran  
-Vicki Wang  
-Junkai Wu  
-Fung Xie  
-Albert Xu  
-Jack Yang  
-Xiuxia Zhang  
-Nick Zhao  
+Timothy Costa
+Julien Demouth
+Brian Fahs
+Michael Goldfarb
+Mostafa Hagog
+Fei Hu
+Alan Kaatz
+Tina Li
+Timmy Liu
+Duane Merrill
+Kevin Siu
+Markus Tavenrath
+John Tran
+Vicki Wang
+Junkai Wu
+Fung Xie
+Albert Xu
+Jack Yang
+Xiuxia Zhang
+Nick Zhao
 
 ## ACKNOWLEDGEMENTS
 
-Girish Bharambe  
-Luke Durant  
-Olivier Giroux  
-Stephen Jones  
-Rishkul Kulkarni  
-Bryce Lelbach  
-Joel McCormack  
-Kyrylo Perelygin  
-
-
+Girish Bharambe
+Luke Durant
+Olivier Giroux
+Stephen Jones
+Rishkul Kulkarni
+Bryce Lelbach
+Joel McCormack
+Kyrylo Perelygin