Commit b94fccd

fix numbering
Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
1 parent: b3cb205

1 file changed, 3 insertions(+), 3 deletions(-)

_posts/2024-07-11-flashattention-3.md

@@ -27,17 +27,17 @@ While FlashAttention-2 can achieve up to 70% theoretical max FLOPS on Ampere (A1
-1. WGMMA (Warpgroup Matrix Multiply-Accumulate). This new feature makes use of the new Tensor Cores on Hopper, with much higher throughput[^1] than the older mma.sync instruction in Ampere (image from the [H100 white paper](https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper?ncid=no-ncid)).
+1\. WGMMA (Warpgroup Matrix Multiply-Accumulate). This new feature makes use of the new Tensor Cores on Hopper, with much higher throughput[^1] than the older mma.sync instruction in Ampere (image from the [H100 white paper](https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper?ncid=no-ncid)).

 ![image from the H100 white paper](/assets/images/flashattention-3/fg2.png){:style="width:100%"}

-2. TMA (Tensor Memory Accelerator). This is a special hardware unit that accelerates the transfer of data between global memory and shared memory, taking care of all index calculation and out-of-bound predication. This frees up registers, a valuable resource for increasing tile size and efficiency.
+2\. TMA (Tensor Memory Accelerator). This is a special hardware unit that accelerates the transfer of data between global memory and shared memory, taking care of all index calculation and out-of-bound predication. This frees up registers, a valuable resource for increasing tile size and efficiency.

 ![block diagram](/assets/images/flashattention-3/fg3.png){:style="width:100%"}

-3. Low-precision with FP8. This doubles the Tensor Core throughput (e.g. 989 TFLOPS with FP16 and 1978 TFLOPS with FP8), but trades off accuracy by using fewer bits to represent floating point numbers.
+3\. Low-precision with FP8. This doubles the Tensor Core throughput (e.g. 989 TFLOPS with FP16 and 1978 TFLOPS with FP8), but trades off accuracy by using fewer bits to represent floating point numbers.

 ![6x throughput](/assets/images/flashattention-3/fg4.png){:style="width:100%"}
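To make the TMA point in the diff above more concrete, here is a minimal, hypothetical CUDA sketch (not taken from the FlashAttention-3 kernels) of the asynchronous global-to-shared copy pattern that TMA is designed to accelerate. It uses the public cooperative_groups::memcpy_async API, so a whole thread block issues one copy request instead of per-thread address calculation and loads; the kernel name tile_sum, the TILE size, and the launch configuration are illustrative assumptions, and FlashAttention-3 itself drives TMA through CUTLASS/CuTe abstractions rather than this API.

```cuda
// Hypothetical illustration: stage tiles of a global array into shared memory
// with an asynchronous, block-wide copy, then reduce them. The async copy is
// the pattern TMA accelerates on Hopper (cp.async serves a similar role on
// Ampere); registers stay free for compute while the copy is in flight.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstdio>

namespace cg = cooperative_groups;

constexpr int TILE = 1024;  // elements staged in shared memory per iteration (assumption)

__global__ void tile_sum(const float* __restrict__ in, float* out, int n) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    float acc = 0.f;
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        int count = min(TILE, n - base);
        // One cooperative, asynchronous copy for the whole block: no per-thread
        // index math or explicit loads for the staging step.
        cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
        cg::wait(block);  // block until the staged tile is ready

        for (int i = threadIdx.x; i < count; i += blockDim.x)
            acc += tile[i];
        block.sync();  // everyone done reading before the next copy reuses the buffer
    }
    atomicAdd(out, acc);
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.f;
    *out = 0.f;

    tile_sum<<<64, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %f (expected %d)\n", *out, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Built with, e.g., nvcc -arch=sm_90, the asynchronous copy takes a hardware-accelerated path on recent GPUs; on older architectures the API falls back to ordinary synchronous copies. Whether this particular API lowers to the TMA unit on a given toolkit is not guaranteed, which is why production kernels use the CUTLASS/CuTe TMA primitives directly.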
