1 file changed: +6 −11 lines

@@ -10,17 +10,12 @@ High-performance Diffusion Transformer (DiT) implementation from scratch using C
 - Memory coalescing for Q, K, V matrix operations
 
 ### Attention Kernel Performance Results
---------------------------------------------------
-Best Latency (over 5 trials):
-  CUDA Implementation: 0.058 ms
-  PyTorch Reference: 0.096 ms
-  Speedup: 1.66x
-  Performance Gain: 39.6%
-
-Throughput:
-  CUDA Implementation: 550.8k tokens/sec
-  PyTorch Reference: 332.6k tokens/sec
-  Throughput Ratio: 1.66x
+I'll convert this into a clean markdown table format.
+
+| Metric | CUDA Implementation | PyTorch Reference | Improvement |
+| --------| -------------------| ------------------| -------------|
+| Best Latency | 0.058 ms | 0.096 ms | 1.66x (39.6%) |
+| Throughput | 550.8k tokens/sec | 332.6k tokens/sec | 1.66x |
 
 ### MLP Block
 - Matrix multiplications using shared memory and warp-level tiling
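
The context lines above mention memory coalescing for the Q, K, V operations and shared-memory, warp-level tiling for the MLP matrix multiplications. As a rough illustration of those two ideas only (not the repository's actual kernels; the tile size, row-major layout, unified-memory host code, and the name `tiled_matmul` are assumptions), a minimal tiled GEMM might look like this:

```cuda
// Illustrative sketch: coalesced global loads (consecutive threads read
// consecutive addresses) plus shared-memory tiling. Not the project's kernel.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int TILE = 32;  // tile edge; one 32x32 output tile per thread block

// C = A (MxK) * B (KxN), row-major. Each block computes a TILE x TILE tile of C.
__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;   // output col owned by this thread
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Coalesced loads: threadIdx.x indexes the fastest-varying dimension,
        // so a warp touches 32 consecutive floats per global transaction.
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        // Inner product over the tile, served from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

int main() {
    const int M = 256, N = 256, K = 256;
    float *A, *B, *C;  // unified memory keeps the host side short
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0f;

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiled_matmul<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * K);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```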
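
The table's "Best Latency (over 5 trials)" figure implies a best-of-N timing loop. A minimal sketch of how such a number can be collected with CUDA events follows; the placeholder kernel, problem size, and launch configuration are assumptions, not the project's benchmark harness.

```cuda
// Sketch: best-of-5 kernel latency measured with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>
#include <cfloat>

__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;   // stand-in for the attention kernel
}

int main() {
    const int n = 1 << 20, trials = 5;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the first timed trial does not pay one-time costs.
    dummy_kernel<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();

    float best_ms = FLT_MAX;
    for (int t = 0; t < trials; ++t) {
        cudaEventRecord(start);
        dummy_kernel<<<(n + 255) / 256, 256>>>(x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) best_ms = ms;
    }
    printf("best latency over %d trials: %.3f ms\n", trials, best_ms);
    // A tokens/sec figure follows as (tokens per launch) / (best_ms / 1000).

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```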