Performance: Larger BlockTile optimizations enable 1470+ TF FP8 on the "H800"-SXM #74

Merged 14 commits on Mar 25, 2025

@sazczmh (Collaborator) commented on Mar 25, 2025

This PR adds a larger BlockTile configuration (BM x BN of 128x160 instead of 128x128 in the benchmarks below). The larger tile reduces L2 cache pressure and increases data reuse, and with it FP8 GEMM on the H800-SXM peaks at 1470+ TFLOPS.
@LyricZhao
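
A rough way to see the data-reuse argument is to count FLOPs per operand byte loaded for one output tile. The sketch below assumes 1-byte FP8 operands and counts only the A and B tile loads per K-step; it ignores scaling factors, the output write, and TMA multicast, so it illustrates the trend rather than modelling the actual kernel. The function name `flops_per_byte` is made up for this example.

```python
def flops_per_byte(bm: int, bn: int, bytes_per_elem: int = 1) -> float:
    """FLOPs per operand byte loaded for one BM x BN output tile.

    Per K-step the tile loads BM elements of A and BN elements of B,
    and performs 2 * BM * BN FLOPs (one multiply and one add per output).
    Assumption: operands are FP8, i.e. 1 byte per element.
    """
    flops = 2 * bm * bn
    bytes_loaded = (bm + bn) * bytes_per_elem
    return flops / bytes_loaded


base = flops_per_byte(128, 128)  # tile size used as the baseline in this PR
opt = flops_per_byte(128, 160)   # larger tile used in the optimized runs

print(f"128x128: {base:.1f} FLOPs/byte")
print(f"128x160: {opt:.1f} FLOPs/byte")
print(f"operand bytes per FLOP drop by {(1 - base / opt) * 100:.1f}%")
```

Under these assumptions the larger tile needs fewer operand bytes for the same amount of math, which is consistent with the reduced L2 pressure described above.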

Normal GEMMs for dense models

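| M | N | K | Base BMxBN | Base computation | Optimized BMxBN | Optimized computation | Speedup |
|---|---|---|---|---|---|---|---|
| 4096 | 24576 | 1536 | 128x128 | 999 TFLOPS | 128x160 | 1166 TFLOPS | 16.72% |
| 4096 | 32768 | 512 | 128x128 | 591 TFLOPS | 128x160 | 748 TFLOPS | 26.57% |
| 4096 | 7168 | 16384 | 128x128 | 1404 TFLOPS | 128x160 | 1470 TFLOPS | 4.70% |
| 4096 | 7168 | 2048 | 128x128 | 1031 TFLOPS | 128x160 | 1204 TFLOPS | 16.78% |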
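
To relate the throughput figures in the dense table to kernel time, the sketch below converts TFLOPS back into microseconds. It assumes the usual convention of 2*M*N*K FLOPs per GEMM, which the PR does not state explicitly, and the helper name `kernel_time_us` is made up for this example.

```python
# (M, N, K, base TFLOPS, optimized TFLOPS), copied from the dense GEMM table.
DENSE_ROWS = [
    (4096, 24576, 1536, 999, 1166),
    (4096, 32768, 512, 591, 748),
    (4096, 7168, 16384, 1404, 1470),
    (4096, 7168, 2048, 1031, 1204),
]


def kernel_time_us(m: int, n: int, k: int, tflops: float) -> float:
    """Estimated kernel time in microseconds.

    Assumption: one GEMM costs 2 * M * N * K FLOPs.
    """
    flops = 2 * m * n * k
    return flops / (tflops * 1e12) * 1e6


for m, n, k, base_tf, opt_tf in DENSE_ROWS:
    base_us = kernel_time_us(m, n, k, base_tf)
    opt_us = kernel_time_us(m, n, k, opt_tf)
    print(f"M={m} N={n} K={k}: {base_us:7.1f} us -> {opt_us:7.1f} us")
```

These are derived estimates, not measured timings from the PR.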

Grouped GEMMs for MoE models (contiguous layout)

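| Groups | M | N | K | Base BMxBN | Base computation | Optimized BMxBN | Optimized computation | Speedup |
|---|---|---|---|---|---|---|---|---|
| 4 | 8192 | 4096 | 7168 | 128x128 | 1317 TFLOPS | 128x160 | 1381 TFLOPS | 4.86% |
| 4 | 8192 | 7168 | 2048 | 128x128 | 1114 TFLOPS | 128x160 | 1262 TFLOPS | 13.29% |
| 8 | 4096 | 4096 | 7168 | 128x128 | 1317 TFLOPS | 128x160 | 1383 TFLOPS | 5.01% |
| 8 | 4096 | 7168 | 2048 | 128x128 | 1107 TFLOPS | 128x160 | 1259 TFLOPS | 13.73% |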
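
The Speedup column appears to be the relative throughput gain, optimized / base - 1. The sketch below recomputes it for the grouped GEMM rows under that assumption; `speedup_percent` is a name made up for this example.

```python
# (groups, M, N, K, base TFLOPS, optimized TFLOPS, reported speedup in %),
# copied from the grouped GEMM table.
GROUPED_ROWS = [
    (4, 8192, 4096, 7168, 1317, 1381, 4.86),
    (4, 8192, 7168, 2048, 1114, 1262, 13.29),
    (8, 4096, 4096, 7168, 1317, 1383, 5.01),
    (8, 4096, 7168, 2048, 1107, 1259, 13.73),
]


def speedup_percent(base_tf: float, opt_tf: float) -> float:
    """Relative throughput gain of the optimized run over the base run, in %."""
    return (opt_tf / base_tf - 1.0) * 100.0


for groups, m, n, k, base_tf, opt_tf, reported in GROUPED_ROWS:
    computed = speedup_percent(base_tf, opt_tf)
    print(f"groups={groups} M={m} N={n} K={k}: "
          f"computed {computed:.2f}%, reported {reported:.2f}%")
```

In these rows the K=2048 shapes gain noticeably more than the K=7168 shapes.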

Tested on “H800”-SXM with CUDA 12.8.1.

@sazczmh sazczmh added the enhancement New feature or request label Mar 25, 2025
@LyricZhao LyricZhao merged commit a5645d7 into main Mar 25, 2025
@LyricZhao (Collaborator) commented

This PR introduced a bug (a wrong TMA multicast condition for grouped contiguous GEMM); it is fixed in b4ecf9c.

@LyricZhao LyricZhao deleted the larger-block branch April 11, 2025 03:35