README.md (+1 -1)
@@ -12,7 +12,7 @@ One of the primary objectives for this project is to develop a solution that can
### **Current Implementation:**
- This currently supports unconditional diffusion model training, and the end-to-end training loop currently runs at about 55% of the speed of PyTorch with `torch.compile` on a single H100. Our main bottleneck is memory bandwidth saturation during shared memory loads for convolutions, but we can potentially optimize it by tweaking tiling, exploring register blocking, and maybe even leveraging the H100's Transformer Engine and FP8 precision.
+ This currently supports unconditional diffusion model training, and the end-to-end training loop currently runs at about 55% of the speed of PyTorch with `torch.compile` on a single H100. Our main bottleneck is memory bandwidth saturation during shared memory loads for convolutions, but I think we can bypass that by writing the data directly to global memory with plain non-atomic writes. Then, we'll use a reduction kernel to combine the partial results and finish the calculations.
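For context on the proposed fix, here is a minimal sketch of the non-atomic-write-then-reduce pattern (hypothetical code, not taken from this repo, and simplified to a plain sum rather than a convolution): each block writes its partial result to its own slot in a global scratch buffer with an ordinary store, and a second reduction kernel combines the partials.

```cuda
// Hypothetical sketch of the "non-atomic writes + reduction kernel" idea, not code
// from this repo: each block writes its partial result to a private slot in a global
// scratch buffer (a plain store, no atomics), and a second kernel reduces the
// scratch buffer into the final value. Names and sizes are illustrative.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

constexpr int BLOCK = 256;

// Pass 1: each block reduces its slice in shared memory, then makes exactly one
// non-atomic store into partials[blockIdx.x]. No two blocks write the same slot,
// so no atomics are needed.
__global__ void partial_accumulate(const float* __restrict__ in,
                                   float* __restrict__ partials, int n) {
    __shared__ float smem[BLOCK];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    smem[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partials[blockIdx.x] = smem[0];  // plain global store
}

// Pass 2: a single block combines all per-block partials into the final result.
__global__ void reduce_partials(const float* __restrict__ partials,
                                float* __restrict__ out, int num_partials) {
    __shared__ float smem[BLOCK];
    int tid = threadIdx.x;

    float acc = 0.0f;
    for (int i = tid; i < num_partials; i += blockDim.x) acc += partials[i];
    smem[tid] = acc;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0) *out = smem[0];
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + BLOCK - 1) / BLOCK;

    std::vector<float> host_in(n, 1.0f);  // all ones, so the expected sum is n
    float *in, *partials, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&partials, blocks * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemcpy(in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    partial_accumulate<<<blocks, BLOCK>>>(in, partials, n);
    reduce_partials<<<1, BLOCK>>>(partials, out, blocks);

    float result = 0.0f;
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);

    cudaFree(in); cudaFree(partials); cudaFree(out);
    return 0;
}
```

The point of the pattern is that the first pass never needs atomics or cross-block synchronization, since every block owns its own output slot; the trade-off is one extra kernel launch plus the traffic through the scratch buffer.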