CUTLASS 2.10 (NVIDIA#615)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
ANIKET-SHIVAM authored Sep 3, 2022
1 parent ca23ff7 commit b72cbf9
Showing 289 changed files with 43,705 additions and 2,510 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -20,4 +20,4 @@ A clear and concise description of what you expected to happen.
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]

**Additional context**
Add any other context about the problem here.
Add any other context about the problem here.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/documentation_request.md
@@ -32,4 +32,4 @@ A clear and concise description of what documentation you believe it is needed a
A clear and concise description of what you want to happen.

**Steps taken to search for needed documentation**
List any steps you have taken:
List any steps you have taken:
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/submit_question.md
@@ -7,4 +7,4 @@ assignees: ''

---

**What is your question?**
**What is your question?**
2 changes: 1 addition & 1 deletion .github/workflows/labeler.yml
@@ -8,4 +8,4 @@ jobs:
steps:
- uses: actions/labeler@main
with:
repo-token: "${{ secrets.GITHUB_TOKEN }}"
repo-token: "${{ secrets.GITHUB_TOKEN }}"
2 changes: 1 addition & 1 deletion .github/workflows/new-issues-to-triage-projects.yml
@@ -32,4 +32,4 @@ jobs:
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_PROJECT_URL: https://github.com/NVIDIA/cutlass
GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'
GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'
2 changes: 1 addition & 1 deletion .github/workflows/stale.yml
@@ -54,4 +54,4 @@ jobs:
exempt-pr-labels: "0 - Blocked,0 - Backlog,good first issue"
days-before-pr-stale: 90
days-before-pr-close: -1
operations-per-run: 50
operations-per-run: 50
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,18 @@
# NVIDIA CUTLASS Changelog

## [2.10.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.10.0) (2022-08-23)
* [Grouped convolution targeting implicit GEMM](test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu)
* [Depthwise separable convolution](test/unit/conv/device/depthwise_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu)
* Optimizations for CUTLASS's [Grouped GEMM](examples/24_gemm_grouped/gemm_grouped.cu) kernel
* [Grouped GEMM for Multihead Attention](examples/50_multi_head_attention)
* [GEMM + Layer norm fusion for Ampere](examples/37_gemm_layernorm_gemm_fusion/)
* Updates and bugfixes from the community (thanks!)

* **Deprecation announcement:** CUTLASS plans to deprecate the following:
* Maxwell and Pascal GPU architectures
* Ubuntu 16.04
* CUDA 10.2

## [2.9.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.9.0) (2022-04-21)

* [First layer Convolution kernels](/test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) specialized for small channel counts and reduced alignment
@@ -37,6 +50,7 @@
* Optimal performance using [**CUDA 11.7**](https://developer.nvidia.com/cuda-downloads)
* Updates and bugfixes from the community (thanks!)


## [2.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.8.0) (2021-11-19)

* **TF32x3:** emulated single-precision using Tensor Cores
82 changes: 0 additions & 82 deletions CITATION.cff

This file was deleted.

2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -38,7 +38,7 @@ endif()

message(STATUS "CMake Version: ${CMAKE_VERSION}")

project(CUTLASS VERSION 2.9.0 LANGUAGES CXX)
project(CUTLASS VERSION 2.10.0 LANGUAGES CXX)
include(${CMAKE_CURRENT_SOURCE_DIR}/CUDA.cmake)

if (CUDA_VERSION VERSION_LESS 10.2)
12 changes: 9 additions & 3 deletions CONTRIBUTORS.md
@@ -11,12 +11,19 @@ Andrew Kerr
Haicheng Wu
Manish Gupta
Dustyn Blasig
Pradeep Ramani
Pradeep Ramani
Cris Cecka
Vijay Thakkar
Aniket Shivam
Honghao Lu
Ethan Yan
Zhaodong Chen
Jack Kosaian
Yujia Zhai
Naila Farooqui
Piotr Majcher
Paul Springer
Jin Wang
Aniket Shivam
Chinmay Talegaonkar
Shang Zhang
Scott Yokim
@@ -53,7 +60,6 @@ Nick Zhao
## ACKNOWLEDGEMENTS

Girish Bharambe
Cris Cecka
Luke Durant
Olivier Giroux
Stephen Jones
44 changes: 17 additions & 27 deletions README.md
@@ -1,8 +1,8 @@
![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")

# CUTLASS 2.9
# CUTLASS 2.10

_CUTLASS 2.9 - April 2022_
_CUTLASS 2.10 - August 2022_

CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-multiplication (GEMM) and related computations at all levels
@@ -18,7 +18,9 @@ To support a wide variety of applications, CUTLASS provides extensive support fo
mixed-precision computations, providing specialized data-movement and
multiply-accumulate abstractions for half-precision floating
point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32),
single-precision floating point (FP32), double-precision floating
single-precision floating point (FP32),
[FP32 emulation via tensor core instruction](/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
double-precision floating
point (FP64) types, integer data types (4b and 8b), and binary data types (1b).
CUTLASS demonstrates warp-synchronous matrix multiply operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
@@ -34,26 +36,14 @@ See the [Quick Start Guide](/media/docs/quickstart.md) to get started quickly.
See the [functionality listing](/media/docs/functionality.md) for the list of operations
supported at each level of the execution model hierarchy.

# What's New in CUTLASS 2.9

CUTLASS 2.9 is an update to CUTLASS adding:
- [First layer Convolution kernels](/test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) specialized for small channel counts and reduced alignment
- [BLAS3](https://docs.nvidia.com/cuda/cublas/index.html#cublas-level-3-function-reference) operators accelerated by Tensor Cores
- [SYRK](/test/unit/gemm/device/syrk_f32n_f32t_tensor_op_fast_f32_sm80.cu), [HERK](/test/unit/gemm/device/herk_cf32h_cf32n_tensor_op_fast_f32_sm80.cu),
- [SYR2K](/test/unit/gemm/device/syr2k_f32n_f32n_tensor_op_fast_f32_sm80.cu), [HER2K](/test/unit/gemm/device/her2k_cf32h_cf32n_tensor_op_fast_f32_sm80.cu),
- [Out-of-place TRMM](/test/unit/gemm/device/trmm_f32n_f32t_f32t_tensor_op_fast_f32_ls_sm80.cu), and
- [SYMM](/test/unit/gemm/device/symm_f32n_f32n_tensor_op_fast_f32_ls_sm80.cu), [HEMM](/test/unit/gemm/device/hemm_cf32h_cf32n_tensor_op_fast_f32_ls_sm80.cu)
- [CUTLASS Python](/examples/40_cutlass_py) demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using [CUDA Python](https://developer.nvidia.com/cuda-python)
- [GEMM + Softmax example](/examples/35_gemm_softmax)
- [Gather and Scatter Fusion with GEMM](/examples/36_gather_scatter_fusion) can gather inputs and scatters outputs based on indices vectors in the same GEMM kernel.
- [Back-to-back GEMM/CONV](examples/13_two_tensor_op_fusion) fully supports buffering the first GEMM/CONV results in the shared memory for the latter one to use. Bias Vector add is also supported in the first GEMM/CONV.
- [Transposed Convolution](/examples/34_transposed_conv2d) (a.k.a Deconvolution) support which reuses Dgrad implementation.
- [Utility functions](/tools/util/include/cutlass/util) that can pad NHWC and convert between NCHW and NHWC.
- [Small alignment implicit gemm](https://github.com/NVIDIA/cutlass/issues/242) support for Fprop/Dgrad/Wgrad so that padding is no longer mandated to use tensor cores.
- Epilogue enhancement with performance improvement, more activation functions, and more fusion patterns.
- [Group GEMM](/examples/24_gemm_grouped) thread block number calculation fix.
- Optimal performance using [CUDA 11.7](https://developer.nvidia.com/cuda-downloads)
- [Parallel GEMM splitk](https://github.com/NVIDIA/cutlass/pull/277) support in the CUTLASS profiler.
# What's New in CUTLASS 2.10

CUTLASS 2.10 is an update to CUTLASS adding:
- [Grouped convolution targeting implicit GEMM](test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu)
- [Depthwise separable convolution](test/unit/conv/device/depthwise_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu)
- Optimizations for CUTLASS's [Grouped GEMM](examples/24_gemm_grouped/gemm_grouped.cu) kernel
- [Grouped GEMM for Multihead Attention](examples/50_multi_head_attention)
- [GEMM + Layer norm fusion for Ampere](examples/37_gemm_layernorm_gemm_fusion/)
- Updates and bugfixes from the community (thanks!)
- **Deprecation announcement:** CUTLASS plans to deprecate the following:
- Maxwell and Pascal GPU architectures
@@ -249,15 +239,15 @@ examples/
12_gemm_bias_relu/ # example demonstrating GEMM fused with bias and relu
13_fused_two_gemms/ # example demonstrating two GEMms fused in one kernel
13_fused_two_gemms/ # example demonstrating two GEMMs fused in one kernel
22_ampere_tensorop_conv2dfprop/ # example demonstrating integer implicit GEMM convolution (forward propagation) using Ampere Tensor Cores
31_basic_syrk # example demonstrating Symetric rank-K update
31_basic_syrk # example demonstrating Symmetric Rank-K update
32_basic_trmm #
32_basic_trmm # example demonstrating Triangular Matrix-Matrix multiplication
33_ampere_3xtf32_tensorop_symm #
33_ampere_3xtf32_tensorop_symm # example demonstrating Symmetric Matrix-Matrix multiplication with FP32 emulation
35_gemm_softmax # example demonstrating GEMM fused with Softmax in mixed precision using Ampere Tensor Cores
5 changes: 2 additions & 3 deletions examples/12_gemm_bias_relu/gemm_bias_relu.cu
@@ -54,12 +54,11 @@ using ElementInputA = cutlass::half_t; // <- data type of elements
using ElementInputB = cutlass::half_t; // <- data type of elements in input matrix B
using ElementOutput = float; // <- data type of elements in output matrix D

// The code section below describes matrix layout of input and output matrices.
// Column Major for Matrix A, B and C.

// Note that if the output is column major, the bias has to be per row. i.e. every row has different bias.
// If the output is row major, the bias has to be per column, i.e. every column has different bias.
// Below list some other notices:
//
// Note this example only works for ColumnMajor output because
// 1) we only have row major epilogue.
// 2) we swap A and B if the output is column major then we can still use the
// row major epilogue.
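The swap described in (2) can be checked with a small host-side example. The sketch below is plain C++ for illustration only — it is not the CUTLASS device API, and the matrix sizes and fill values are arbitrary. It shows that storing D^T = B^T * A^T in row-major order produces exactly the same bytes as storing D = A * B in column-major order, which is why reusing the row-major epilogue on the swapped problem yields the ColumnMajor output this example requires.

```cpp
// Plain C++ illustration (not CUTLASS code): a row-major store of D^T = B^T * A^T
// is byte-for-byte identical to a column-major store of D = A * B.
#include <cassert>
#include <vector>

int main() {
  const int M = 2, N = 3, K = 4;
  std::vector<float> A(M * K), B(K * N);                    // both row-major
  for (int i = 0; i < M * K; ++i) A[i] = float(i + 1);
  for (int i = 0; i < K * N; ++i) B[i] = float(2 * i - 3);

  // D = A * B, stored column-major: element (m, n) lives at d_col[m + n * M].
  std::vector<float> d_col(M * N, 0.f);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        d_col[m + n * M] += A[m * K + k] * B[k * N + n];

  // Swapped problem: D^T = B^T * A^T, stored row-major: element (n, m) lives at
  // dt_row[n * M + m].
  std::vector<float> dt_row(N * M, 0.f);
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m)
      for (int k = 0; k < K; ++k)
        dt_row[n * M + m] += B[k * N + n] * A[m * K + k];

  // Same bytes in the same order, so a row-major epilogue on the swapped problem
  // produces the ColumnMajor output; a per-row bias of D becomes a per-column bias of D^T.
  for (int i = 0; i < M * N; ++i) assert(d_col[i] == dt_row[i]);
  return 0;
}
```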
@@ -457,9 +457,13 @@ Result profile_convolution(Options const &options) {
ElementInputB(-8),
0);

// Fill tensor C on host with zeros
cutlass::reference::host::TensorFill(
tensor_c.host_view());
// Fill tensor C on host with uniform-distribution random data
cutlass::reference::host::TensorFillRandomUniform(
tensor_c.host_view(),
1,
ElementOutput(7),
ElementOutput(-8),
0);
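// The argument order is assumed to mirror the fills of tensors A and B above:
// destination view, RNG seed, upper bound, lower bound, and a bits parameter
// controlling the precision of the generated values.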

// Fill tensor D on host with zeros
cutlass::reference::host::TensorFill(
@@ -686,7 +690,7 @@ int main(int argc, char const **args) {
cudaDeviceProp props;
CUDA_CHECK(cudaGetDeviceProperties(&props, 0));

if (!(props.major > 8 || (props.major == 8 && props.minor >= 0))) {
if (!(props.major >= 8)) {
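    // Compute capability 8.0 (Ampere) or newer is required; since props.minor is
    // never negative, this check is equivalent to the longer form
    // (props.major > 8 || (props.major == 8 && props.minor >= 0)) used previously.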
std::cerr << "Ampere Tensor Ops must be run on a machine with compute capability at least 80."
<< std::endl;
notSupported = true;
@@ -290,7 +290,7 @@ int main(int argc, char const **args) {
cudaDeviceProp props;
CUDA_CHECK(cudaGetDeviceProperties(&props, 0));

if (!(props.major > 8 || (props.major == 8 && props.minor >= 0))) {
if (!(props.major >= 8)) {
std::cerr << "Ampere Tensor Ops must be run on a machine with compute capability at least 80."
<< std::endl;
notSupported = true;
@@ -326,7 +326,7 @@ int main(int argc, char const **args) {
cudaDeviceProp props;
CUDA_CHECK(cudaGetDeviceProperties(&props, 0));

if (!(props.major > 8 || (props.major == 8 && props.minor >= 0))) {
if (!(props.major >= 8)) {
std::cerr << "Ampere Tensor Ops must be run on a machine with compute capability at least 80."
<< std::endl;
notSupported = true;
@@ -32,7 +32,7 @@
/**
The example demonstrates how to reduce one of the operands of the GEMM along the k-dimension while
computing the GEMM. So the output also contains either an Mx1 or a 1xN vector. It only works with Ampere
HMMA 16x8x16 FP16 tensor cores, though it is not difficult to apply to other Turing/Ampere tensor
16x8x16 FP16/BF16 tensor cores, though it is not difficult to apply to other Turing/Ampere tensor
core instructions.
Most of the reduction is done in gemm/warp level, see gemm/warp/mma_with_reduction_tensor_op.h
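For reference, a minimal host-side sketch of the fused semantics is given below. This is plain C++ for illustration only, not the CUTLASS kernel; the choice to reduce operand A (giving the Mx1 vector) and the row-major layouts are assumptions made for clarity — reducing operand B instead would produce the 1xN variant.

```cpp
// Conceptual sketch: compute D = A * B and, in the same pass, reduce operand A
// along the k-dimension into an M x 1 vector.
#include <vector>

void gemm_with_k_reduction(int M, int N, int K,
                           const std::vector<float> &A,    // row-major M x K
                           const std::vector<float> &B,    // row-major K x N
                           std::vector<float> &D,          // row-major M x N
                           std::vector<float> &reduce_a) { // M x 1 = sum of A over k
  D.assign(M * N, 0.f);
  reduce_a.assign(M, 0.f);
  for (int m = 0; m < M; ++m) {
    for (int k = 0; k < K; ++k) {
      reduce_a[m] += A[m * K + k];                          // k-dimension reduction
      for (int n = 0; n < N; ++n)
        D[m * N + n] += A[m * K + k] * B[k * N + n];        // regular GEMM accumulation
    }
  }
}
```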
@@ -67,9 +67,9 @@ epilogue/threadblock/epilogue_gemm_k_reduction.h
// elements
using ElementAccumulator = float; // Data type of accumulator
using ElementComputeEpilogue = ElementAccumulator; // Data type of epilogue computation
using ElementInputA = cutlass::half_t; // Data type of elements in input tensor
using ElementInputB = cutlass::half_t; // Data type of elements in input tensor
using ElementOutput = cutlass::half_t; // Data type of elements in output tensor
using ElementInputA = cutlass::bfloat16_t; // Data type of elements in input tensor
using ElementInputB = cutlass::bfloat16_t; // Data type of elements in input tensor
using ElementOutput = cutlass::bfloat16_t; // Data type of elements in output tensor

using LayoutInputA = cutlass::layout::ColumnMajor;
using LayoutInputB = cutlass::layout::RowMajor;
@@ -369,22 +369,22 @@ Result profile(Options const &options) {
cutlass::reference::host::TensorFillRandomUniform(
tensor_a.host_view(),
1,
ElementInputA(4),
ElementInputA(-4),
ElementInputA(2),
ElementInputA(-2),
0); // <- Fill tensor A on host with uniform-distribution random data

cutlass::reference::host::TensorFillRandomUniform(
tensor_b.host_view(),
1,
ElementInputB(4),
ElementInputB(-4),
ElementInputB(2),
ElementInputB(-2),
0); // <- Fill tensor B on host with uniform-distribution random data

cutlass::reference::host::TensorFillRandomUniform(
tensor_c.host_view(),
1,
ElementOutput(4),
ElementOutput(-4),
ElementOutput(2),
ElementOutput(-2),
0); // <- Fill matrix C on host with uniform-distribution random data
cutlass::reference::host::TensorFill(
tensor_d.host_view()); // <- fill matrix D on host with zeros
@@ -612,10 +612,10 @@ Result profile(Options const &options) {

if (options.reference_check) {
output_workspace << "Reference D = \n" << tensor_ref_d.host_view() << "\n\n";
output_workspace << "Reference reduction vector= \n" << tensor_ref_reduction.host_view() << "\n\n";
output_workspace << "Reference reduction vector = \n" << tensor_ref_reduction.host_view() << "\n\n";
}

output_workspace << "Computed = \n" << tensor_d.host_view() << std::endl;
output_workspace << "Computed D = \n" << tensor_d.host_view() << std::endl;
output_workspace << "Computed reduction vector = \n" << tensor_reduction.host_view() << std::endl;

std::cout << "Results written to '" << ss.str() << "'." << std::endl;
@@ -699,7 +699,7 @@ int main(int argc, char const **args) {
cudaDeviceProp props;
CUDA_CHECK(cudaGetDeviceProperties(&props, 0));

if (!(props.major > 8 || (props.major == 8 && props.minor >= 0))) {
if (!(props.major >= 8)) {
std::cerr << "Ampere Tensor Ops must be run on a machine with compute capability at least 80."
<< std::endl;
notSupported = true;