Commit 6d766a9

Add matmul benchmark documentation and fix swizzle setup
1 parent 17bd0a6 commit 6d766a9

4 files changed, 49 additions and 8 deletions

benchmark/matmul/README.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# FP16 Matmul Benchmark (8192×8192)
+
+This document records the throughput achieved by `benchmark_matmul.py` when multiplying FP16 matrices sized `M = N = 8192` across different `K` dimensions using the default autotuning search space.
+
+## Environment
+
+- Repository commit: `17bd0a6c651f599bec1397e0b91830c3ddc93076`
+- GPUs: `NVIDIA H800 SXM` on driver `560.35.05`
+
+## How to Reproduce
+
+```bash
+cd benchmark/matmul
+python - <<'PY'
+from benchmark_matmul import matmul
+
+M = 8192
+N = 8192
+for K in [256, 512, 1024, 2048, 4096, 8192, 16384]:
+    res = matmul(M, N, K, False)
+    tflops = 2 * M * N * K / res.latency * 1e-9  # res.latency is in milliseconds
+    print(f"K={K:5d} latency={res.latency:.6f}ms TFlops={tflops:.3f}")
+PY
+```
+
+## Results
+
+| K     | Latency (ms) | Throughput (TFLOPs) |
+|-------|--------------|---------------------|
+| 256   | 0.089056     | 386                 |
+| 512   | 0.132064     | 520                 |
+| 1024  | 0.218816     | 628                 |
+| 2048  | 0.390112     | 705                 |
+| 4096  | 0.746752     | 736                 |
+| 8192  | 1.449888     | 758                 |
+| 16384 | 2.871168     | 766                 |
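
As a sanity check on the FP16 results, the throughput column can be recomputed from the latency column. The sketch below assumes the latencies are in milliseconds; that unit is inferred from the magnitudes, since `2 * M * N * K` FLOPs divided by these latencies reproduces the listed throughput only under that assumption. The helper name `tflops_from_latency_ms` is illustrative and not part of the repository.

```python
# Recompute throughput from the reported latencies.
# Assumption: latencies are in milliseconds (inferred, not stated by the source).
M = N = 8192

# (K, latency, reported TFLOPs) copied from the results table
RESULTS = [
    (256, 0.089056, 386),
    (512, 0.132064, 520),
    (1024, 0.218816, 628),
    (2048, 0.390112, 705),
    (4096, 0.746752, 736),
    (8192, 1.449888, 758),
    (16384, 2.871168, 766),
]


def tflops_from_latency_ms(m, n, k, latency_ms):
    """Return TFLOPs for an m x n x k matmul that took latency_ms milliseconds."""
    flops = 2 * m * n * k          # one multiply and one add per inner-product term
    seconds = latency_ms * 1e-3    # milliseconds -> seconds
    return flops / seconds * 1e-12


for k, latency_ms, reported in RESULTS:
    computed = tflops_from_latency_ms(M, N, k, latency_ms)
    print(f"K={k:5d} latency={latency_ms:.6f} ms computed={computed:.1f} reported={reported}")
```

Each computed value rounds to the reported figure, which supports reading the latency column as milliseconds.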

benchmark/matmul/benchmark_matmul.py

Lines changed: 3 additions & 0 deletions
@@ -2,6 +2,7 @@
 import itertools
 import logging
 
+import tilelang
 import tilelang.language as T
 from tilelang.autotuner import autotune
 from tilelang import jit
@@ -187,6 +188,8 @@ def main(
 
 # Enable (or disable) swizzling optimization
 T.use_swizzle(panel_size=10, enable=enable_rasteration)
+# to utilize swizzle tma layout
+T.annotate_layout({C_shared: tilelang.layout.make_swizzled_layout(C_shared)})
 
 # Clear out the accumulation buffer
 T.clear(C_local)
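
The added `T.annotate_layout` call requests a swizzled layout for `C_shared`. For rough intuition about what a swizzled shared-memory layout buys, the sketch below applies a plain XOR swizzle to a tile of 8 rows by 8 chunks. It is a generic illustration and not TileLang's `make_swizzled_layout` implementation; the tile shape, the bank model, and the function names `linear_chunk` and `swizzled_chunk` are assumptions made for the example.

```python
# Generic XOR swizzle illustration (NOT TileLang's actual layout function).
# Assumption: each tile row holds 8 chunks, and chunks that land in the same
# physical chunk column map to the same shared-memory banks.
CHUNKS_PER_ROW = 8


def linear_chunk(row, col):
    """Physical chunk column under a plain row-major layout."""
    return col


def swizzled_chunk(row, col):
    """Physical chunk column after XOR-ing the column with the row index."""
    return col ^ (row % CHUNKS_PER_ROW)


# Access pattern: 8 threads read the same logical column from 8 different rows.
col = 3
linear_cols = {linear_chunk(r, col) for r in range(8)}
swizzled_cols = {swizzled_chunk(r, col) for r in range(8)}

print("distinct chunk columns, row-major:", len(linear_cols))
print("distinct chunk columns, swizzled: ", len(swizzled_cols))
```

Under the row-major mapping all eight accesses fall in one chunk column, while the XOR mapping spreads them over eight, and each row remains a permutation of its own chunks so no data is lost.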

benchmark/matmul_fp8/README.md

Lines changed: 7 additions & 7 deletions
@@ -27,10 +27,10 @@ PY
 
 | K     | Latency (ms) | Throughput (TFLOPs) |
 |-------|--------------|---------------------|
-| 256   | 0.091488     | 376                 |
-| 512   | 0.110496     | 622                 |
-| 1024  | 0.148256     | 927                 |
-| 2048  | 0.234080     | 1174                |
-| 4096  | 0.398944     | 1378                |
-| 8192  | 0.752416     | 1461                |
-| 16384 | 1.443808     | 1523                |
+| 256   | 0.060352     | 569                 |
+| 512   | 0.080096     | 858                 |
+| 1024  | 0.121696     | 1129                |
+| 2048  | 0.204672     | 1343                |
+| 4096  | 0.374816     | 1467                |
+| 8192  | 0.729664     | 1507                |
+| 16384 | 1.427264     | 1541                |
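
To make the FP8 change easier to read, the speedup per `K` can be computed from the removed and added latency rows. The sketch below only divides the old latency by the new one, so it does not depend on the latency unit; the names `OLD`, `NEW`, and `speedup` are illustrative.

```python
# Latencies copied from the removed (-) and added (+) rows of the FP8 table.
OLD = {256: 0.091488, 512: 0.110496, 1024: 0.148256, 2048: 0.234080,
       4096: 0.398944, 8192: 0.752416, 16384: 1.443808}
NEW = {256: 0.060352, 512: 0.080096, 1024: 0.121696, 2048: 0.204672,
       4096: 0.374816, 8192: 0.729664, 16384: 1.427264}


def speedup(k):
    """Ratio of old latency to new latency for a given K (values above 1 mean faster)."""
    return OLD[k] / NEW[k]


for k in sorted(OLD):
    print(f"K={k:5d} speedup={speedup(k):.3f}x")
```

The ratio is largest at `K = 256` (about 1.5x) and shrinks toward 1 as `K` grows. That pattern is consistent with a roughly fixed per-tile saving being amortized over a longer `K` loop, although the commit itself does not state the cause.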

benchmark/matmul_fp8/benchmark_matmul.py

Lines changed: 3 additions & 1 deletion
@@ -1,7 +1,7 @@
 import argparse
 import itertools
 import logging
-
+import tilelang
 import tilelang.language as T
 from tilelang.autotuner import autotune
 from tilelang import jit
@@ -190,6 +190,8 @@ def main(
 
 # Enable (or disable) swizzling optimization
 T.use_swizzle(panel_size=10, enable=enable_rasteration)
+# to utilize swizzle tma layout
+T.annotate_layout({C_shared: tilelang.layout.make_swizzled_layout(C_shared)})
 
 # Clear out the accumulation buffer
 T.clear(C_local)
