Fix some typos (NVIDIA#791)
* fix typo

* fix a dead link to code
MARD1NO authored Feb 16, 2023
1 parent 9fb38ac commit a101ac2
Showing 2 changed files with 2 additions and 2 deletions.
media/docs/cutlass_3x_backwards_compatibility.md: 2 changes (1 addition, 1 deletion)
@@ -293,7 +293,7 @@ mapping of 2.x layout tags to corresponding M-major, N-major, or K-major strides
| Matrix | CUTLASS 2.x layout | 2.x Shape | Logical major mode| 3.x Shape/Stride | Major ordinal |
| --- | --- | --- | --- | --- | --- |
| A | `ColumnMajor` | M x K | M major | M x K x L | 0 (outer) |
- | A | `RowMajor` | M x K | K major | N x K x L | 1 (inner) |
+ | A | `RowMajor` | M x K | K major | M x K x L | 1 (inner) |
| B | `RowMajor` | K x N | N major | N x K x L | 0 (outer) |
| B | `ColumnMajor` | K x N | K major | N x K x L | 1 (inner) |
| C | `ColumnMajor` | M x N | M major | M x N x L | 0 (outer) |
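
As an illustrative aside, the corrected mapping for a 2.x `RowMajor` A operand can be written out as a CuTe layout. This is only a minimal sketch, assuming arbitrary extents `M`, `K`, `L` and the `make_layout` / `make_shape` / `make_stride` helpers from `include/cute/layout.hpp`; the function name is made up and the code is not part of the changed files.

```c++
#include <cute/layout.hpp>

// Illustrative only: build the 3.x M x K x L layouts that the table rows above
// describe for the A operand, for arbitrary runtime extents.
void a_operand_layouts_sketch(int M, int K, int L) {
  using namespace cute;

  // 2.x RowMajor A (M x K) is "K major" in 3.x terms: the K mode is the
  // contiguous (inner, stride-1) mode of the M x K x L shape.
  auto layout_A_row = make_layout(make_shape(M, K, L),
                                  make_stride(K, Int<1>{}, M * K));  // (M-, K-, L-stride)

  // 2.x ColumnMajor A (M x K) is "M major": the M mode is the stride-1 mode.
  auto layout_A_col = make_layout(make_shape(M, K, L),
                                  make_stride(Int<1>{}, M, M * K));

  (void)layout_A_row;
  (void)layout_A_col;
}
```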
media/docs/efficient_gemm.md: 2 changes (1 addition, 1 deletion)
@@ -229,7 +229,7 @@ as part of the kernel design. A thread block is partitioned into two sets of war
**Warp-Specialized Persistent kernel design**

Another flavor of the Warp Specialized kernel design, introduced starting with Hopper, is the [*Warp-Specialized Persistent*](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp) kernel. As in the Warp Specialized kernel, the concepts of warp groups and barrier synchronization between warp groups remain the same in the persistent design. The distinctive features of the Warp-Specialized Persistent kernel are the following:
-* Persistent thread blocks launched to occupy as many SMs as mentioned in the [KernelHardwareInfo](include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads which are typical of all kernels.
+* Persistent thread blocks launched to occupy as many SMs as mentioned in the [KernelHardwareInfo](/include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads which are typical of all kernels.
* Presence of two *consumer* warp groups, which allows the *epilogue* of one *consumer* warp group to be overlapped with the math operations of the other *consumer* warp group, thus maximizing tensor core utilization.

Each *consumer* warp group is assigned a different output tile. The *producer* warp group synchronizes using the [Ordered Sequence Barrier](/include/cutlass/pipeline.hpp) to fill the buffers of the two *consumer* warp groups one after the other, in order. Since each thread block now computes multiple output tiles, the shape of the grid launch and the scheduling of tiles to the thread blocks are managed using the new [*Tile Scheduler*](/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp). The *Tile Scheduler* considers the shape of the *clusters* as well as the number of available SMs to compute a valid scheduling of the output tiles to the launched thread blocks.
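
The persistent pattern described above is easier to see in a stripped-down form. The sketch below is hypothetical and greatly simplified: the kernel, `tile_coord`, and `launch_persistent` names are made up, and the real *Tile Scheduler* also honors the cluster shape. It only shows a grid sized to the SM count whose blocks each loop over multiple output tiles.

```c++
#include <cuda_runtime.h>

// Hypothetical linear -> (m, n) tile mapping; the real scheduler in
// sm90_tile_scheduler.hpp also accounts for the cluster shape.
__device__ int2 tile_coord(int linear_idx, int tiles_m) {
  return make_int2(linear_idx % tiles_m, linear_idx / tiles_m);
}

__global__ void persistent_gemm_sketch(int tiles_m, int tiles_n) {
  int num_tiles = tiles_m * tiles_n;
  // Each persistent thread block strides over the output tiles by the grid
  // size, so one launch amortizes launch and prologue cost over many tiles.
  for (int t = blockIdx.x; t < num_tiles; t += gridDim.x) {
    int2 coord = tile_coord(t, tiles_m);
    // ... producer warps would fill shared-memory buffers here, and the two
    //     consumer warp groups would run MMA + epilogue for tile (coord.x, coord.y) ...
    (void)coord;
  }
}

void launch_persistent(int tiles_m, int tiles_n) {
  int device = 0, sm_count = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);
  // Launch one block per SM so blocks persist and loop over tiles, mirroring
  // what a grid sized from KernelHardwareInfo achieves.
  persistent_gemm_sketch<<<sm_count, 128>>>(tiles_m, tiles_n);
}
```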
