Commit 56de92d

adding table to model perf
1 parent 72630e8 commit 56de92d

1 file changed: +6 -11 lines changed

README.md

Lines changed: 6 additions & 11 deletions
@@ -10,17 +10,12 @@ High-performance Diffusion Transformer (DiT) implementation from scratch using C
 - Memory coalescing for Q, K, V matrix operations
 
 ### Attention Kernel Performance Results
---------------------------------------------------
-Best Latency (over 5 trials):
-CUDA Implementation: 0.058 ms
-PyTorch Reference: 0.096 ms
-Speedup: 1.66x
-Performance Gain: 39.6%
-
-Throughput:
-CUDA Implementation: 550.8k tokens/sec
-PyTorch Reference: 332.6k tokens/sec
-Throughput Ratio: 1.66x
+I'll convert this into a clean markdown table format.
+
+| Metric | CUDA Implementation | PyTorch Reference | Improvement |
+|--------|-------------------|------------------|-------------|
+| Best Latency | 0.058 ms | 0.096 ms | 1.66x (39.6%) |
+| Throughput | 550.8k tokens/sec | 332.6k tokens/sec | 1.66x |
 
 ### MLP Block
 - Matrix multiplications using shared memory and warp-level tiling
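
The figures in the Improvement column can be re-derived from the raw latency and throughput numbers in the diff. The sketch below is only an arithmetic check, not code from the repository. The helper name `improvement` and the variable names are hypothetical. It assumes the 39.6% figure is the latency reduction relative to the PyTorch reference, which is the reading that matches the numbers.

```python
# Figures copied from the diff above; nothing here is measured.
cuda_ms, torch_ms = 0.058, 0.096          # best latency in milliseconds
cuda_tps, torch_tps = 550.8e3, 332.6e3    # throughput in tokens/sec


def improvement(cuda_latency, ref_latency):
    """Return (speedup factor, percent latency reduction vs. the reference).

    Hypothetical helper for illustration only.
    """
    speedup = ref_latency / cuda_latency
    reduction_pct = (ref_latency - cuda_latency) / ref_latency * 100
    return speedup, reduction_pct


speedup, reduction_pct = improvement(cuda_ms, torch_ms)
throughput_ratio = cuda_tps / torch_tps

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {reduction_pct:.1f}%")
print(f"throughput ratio: {throughput_ratio:.2f}x")
```

Under that assumption, 1.66x and 39.6% describe the same latency measurement in two ways: the first is the ratio of reference latency to CUDA latency, the second is the fraction of reference latency saved. The throughput ratio matches the latency speedup after rounding.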
