1 file changed: +6 −11 lines

@@ -10,17 +10,12 @@ High-performance Diffusion Transformer (DiT) implementation from scratch using C
 - Memory coalescing for Q, K, V matrix operations
 
 ### Attention Kernel Performance Results
---------------------------------------------------
-Best Latency (over 5 trials):
-  CUDA Implementation: 0.058 ms
-  PyTorch Reference: 0.096 ms
-  Speedup: 1.66x
-  Performance Gain: 39.6%
-
-Throughput:
-  CUDA Implementation: 550.8k tokens/sec
-  PyTorch Reference: 332.6k tokens/sec
-  Throughput Ratio: 1.66x
+I'll convert this into a clean markdown table format.
+
+| Metric | CUDA Implementation | PyTorch Reference | Improvement |
+| --------| -------------------| ------------------| -------------|
+| Best Latency | 0.058 ms | 0.096 ms | 1.66x (39.6%) |
+| Throughput | 550.8k tokens/sec | 332.6k tokens/sec | 1.66x |
 
 ### MLP Block
 - Matrix multiplications using shared memory and warp-level tiling
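
The context lines above mention memory coalescing for the Q, K, V operations and shared-memory, warp-level tiling for the MLP matrix multiplications. As a rough illustration of those two ideas only (not the repository's actual kernels; the tile size, row-major layout, unified-memory host code, and the name `tiled_matmul` are assumptions), a minimal tiled GEMM might look like this:

```cuda
// Illustrative sketch: coalesced global loads (consecutive threads read
// consecutive addresses) plus shared-memory tiling. Not the project's kernel.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int TILE = 32;  // tile edge; one 32x32 output tile per thread block

// C = A (MxK) * B (KxN), row-major. Each block computes a TILE x TILE tile of C.
__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;   // output col owned by this thread
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Coalesced loads: threadIdx.x indexes the fastest-varying dimension,
        // so a warp touches 32 consecutive floats per global transaction.
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        // Inner product over the tile, served from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

int main() {
    const int M = 256, N = 256, K = 256;
    float *A, *B, *C;  // unified memory keeps the host side short
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0f;

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiled_matmul<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * K);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```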
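
The table's "Best Latency (over 5 trials)" figure implies a best-of-N timing loop. A minimal sketch of how such a number can be collected with CUDA events follows; the placeholder kernel, problem size, and launch configuration are assumptions, not the project's benchmark harness.

```cuda
// Sketch: best-of-5 kernel latency measured with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>
#include <cfloat>

__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;   // stand-in for the attention kernel
}

int main() {
    const int n = 1 << 20, trials = 5;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the first timed trial does not pay one-time costs.
    dummy_kernel<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();

    float best_ms = FLT_MAX;
    for (int t = 0; t < trials; ++t) {
        cudaEventRecord(start);
        dummy_kernel<<<(n + 255) / 256, 256>>>(x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) best_ms = ms;
    }
    printf("best latency over %d trials: %.3f ms\n", trials, best_ms);
    // A tokens/sec figure follows as (tokens per launch) / (best_ms / 1000).

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```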