
Performance: BlockTile 256x128 optimizations enable 1500+ TF FP8 #81


Merged
LyricZhao merged 9 commits into main from blocktile-256x128 on Apr 9, 2025

Conversation

@sazczmh (Collaborator) commented Apr 8, 2025

By reusing the Tensor Core accumulator registers to implement a 256x128 BlockTile structure, this change significantly increases data reuse, reduces L2 cache and HBM traffic, and improves the SM's compute frequency, ultimately achieving FP8 performance exceeding 1,500 TFLOPS.

| M    | N     | K     | Base BMxBN | Base (TFLOPS) | Opti BMxBN | Opti (TFLOPS) | Speedup |
|------|-------|-------|------------|---------------|------------|---------------|---------|
| 4096 | 24576 | 1536  | 128x160    | 1162          | 256x128    | 1204          | +3.61%  |
| 4096 | 32768 | 512   | 128x160    | 801           | 256x128    | 777           | -3.00%  |
| 4096 | 7168  | 16384 | 128x160    | 1451          | 256x128    | 1500          | +3.38%  |
| 4096 | 4096  | 7168  | 128x160    | 1304          | 256x128    | 1377          | +5.60%  |
| 4096 | 7168  | 2048  | 128x160    | 1185          | 256x128    | 1159          | -2.19%  |

Tested on H800-SXM with CUDA 12.8.1.
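
For a rough sense of why the larger tile helps, here is a back-of-the-envelope data-reuse model (illustrative only, not code from this PR; only the tile shapes are taken from the table above). For a BMxBN block tile stepping over K in chunks of BK, each step loads (BM + BN) * BK elements into shared memory while performing 2 * BM * BN * BK FLOPs, so the reuse factor is 2 * BM * BN / (BM + BN) FLOPs per element fetched from L2/HBM:

```python
# Back-of-the-envelope data-reuse model (illustrative only, not from this PR).
# Per K-step a BMxBN block tile loads BM*BK elements of A and BN*BK elements
# of B, and performs 2*BM*BN*BK FLOPs, so the FLOPs per element fetched from
# L2/HBM is 2*BM*BN / (BM + BN), independent of BK.

def flops_per_element(bm: int, bn: int) -> float:
    return 2.0 * bm * bn / (bm + bn)

base = flops_per_element(128, 160)  # previous 128x160 block tile -> ~142.2
opti = flops_per_element(256, 128)  # new 256x128 block tile      -> ~170.7

print(f"128x160: {base:6.1f} FLOPs per loaded element")
print(f"256x128: {opti:6.1f} FLOPs per loaded element")
print(f"traffic per FLOP: {base / opti:.2f}x (~{(1 - base / opti) * 100:.0f}% less)")
```

Under this simplified model the 256x128 tile moves roughly 17% fewer bytes per FLOP than 128x160; the measured numbers above also reflect occupancy and wave-quantization effects that the model ignores, which may explain the small regressions on some shapes.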

@sazczmh sazczmh added the perf label Apr 8, 2025
@sazczmh sazczmh self-assigned this Apr 8, 2025
@LyricZhao LyricZhao force-pushed the blocktile-256x128 branch from 1eeb98a to 48a5f07 on April 9, 2025 02:01
@LyricZhao LyricZhao requested a review from zheanxu April 9, 2025 03:10
@LyricZhao LyricZhao merged commit fed3e4d into main Apr 9, 2025
@LyricZhao LyricZhao deleted the blocktile-256x128 branch April 11, 2025 03:35