This project is a from-scratch implementation of diffusion model training in C++/CUDA, inspired by Andrej Karpathy's [llm.c](https://github.com/karpathy/llm.c). The implementation is based on the U-Net architecture from the paper [Diffusion Models Beat GANs on Image Synthesis](https://arxiv.org/abs/2105.05233).
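For context on what a diffusion training step actually computes: each iteration samples a timestep, noises a clean image via x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and trains the U-Net to predict the noise eps. Below is a minimal CUDA sketch of that forward-noising step; the kernel and buffer names are illustrative placeholders, not this repo's actual API.

```cuda
#include <cuda_runtime.h>

// Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
// where eps is pre-sampled Gaussian noise and alpha_bar holds the cumulative
// noise-schedule product for each image's sampled timestep.
// All names here are hypothetical, not this project's actual kernels.
__global__ void forward_noise_kernel(const float* __restrict__ x0,        // clean images, N*C*H*W floats
                                     const float* __restrict__ eps,       // pre-sampled noise, same shape
                                     const float* __restrict__ alpha_bar, // cumulative alpha per image, N floats
                                     float* __restrict__ xt,              // noised output, same shape
                                     int img_size,                        // C*H*W
                                     long long total) {                   // N*C*H*W
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total) return;
    float ab = alpha_bar[i / img_size];  // which image this element belongs to
    xt[i] = sqrtf(ab) * x0[i] + sqrtf(1.0f - ab) * eps[i];
}
```

The U-Net then consumes `xt` (together with a timestep embedding) and is trained with an MSE loss against `eps`.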
### **My Motivation:**
As a Python programmer, I was fascinated by diffusion models but found the math and implementation details challenging. Meanwhile, my interest in ML systems and infrastructure made me want to learn CUDA and understand how to get the most out of GPUs. This project was born out of my desire to learn by doing, and to see if I could achieve performance comparable to, or even exceeding, PyTorch. Python can be slow, especially for computationally intensive tasks like training diffusion models, so the appeal of C++/CUDA's speed was undeniable.
### **My Goal: Beating `torch.compile`**
One of the primary objectives of this project is to surpass the performance of PyTorch's `torch.compile`. `torch.compile` leverages advanced optimization techniques such as just-in-time (JIT) graph compilation, operator fusion, and low-level kernel optimizations to speed up PyTorch models, and it even runs heuristics directly on your hardware to squeeze out every bit of performance. The result is significantly faster execution, especially on NVIDIA GPUs. It's a tough challenge, but I'm excited to see how close I can get!
### **Current Implementation:**
This project currently supports unconditional diffusion model training, and the end-to-end training loop runs at about 55% of the speed of PyTorch with `torch.compile` on a single H100. The main bottleneck is memory bandwidth saturation during shared memory loads for convolutions; potential optimizations include tweaking the tiling scheme, exploring register blocking, and maybe even leveraging the H100's Transformer Engine and FP8 precision.
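To make that bottleneck concrete, here is a hedged sketch of the shared-memory tiling pattern in question: each thread block stages its output tile plus a one-pixel halo into shared memory once, and every thread then reads its 3x3 neighborhood out of shared memory rather than global memory. This is a generic single-channel illustration of the technique, not the project's actual multi-channel kernel.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Generic single-channel 3x3 convolution with shared-memory tiling.
// Each block stages a (TILE+2)^2 input patch (output tile plus halo)
// into shared memory once, so the nine reads per output pixel hit
// shared memory instead of global memory. Launch with TILE x TILE
// thread blocks. Illustrative only; the real kernels in this repo
// are multi-channel and more heavily optimized.
__global__ void conv3x3_tiled(const float* __restrict__ in,  // H*W input
                              const float* __restrict__ w,   // 9 filter taps
                              float* __restrict__ out,       // H*W output
                              int H, int W) {
    __shared__ float smem[TILE + 2][TILE + 2];

    int ox = blockIdx.x * TILE + threadIdx.x;  // output pixel owned by this thread
    int oy = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the (TILE+2)^2 input patch, zero-padding the borders.
    for (int y = threadIdx.y; y < TILE + 2; y += blockDim.y) {
        for (int x = threadIdx.x; x < TILE + 2; x += blockDim.x) {
            int iy = blockIdx.y * TILE + y - 1;  // -1 accounts for the halo
            int ix = blockIdx.x * TILE + x - 1;
            bool inside = (iy >= 0 && iy < H && ix >= 0 && ix < W);
            smem[y][x] = inside ? in[iy * W + ix] : 0.0f;
        }
    }
    __syncthreads();

    if (ox >= W || oy >= H) return;

    // Accumulate the 3x3 window entirely out of shared memory.
    float acc = 0.0f;
    for (int ky = 0; ky < 3; ++ky) {
        for (int kx = 0; kx < 3; ++kx) {
            acc += w[ky * 3 + kx] * smem[threadIdx.y + ky][threadIdx.x + kx];
        }
    }
    out[oy * W + ox] = acc;
}
```

Register blocking would build on this by having each thread compute a small grid of outputs per shared-memory load, raising arithmetic intensity and easing pressure on the shared memory load path.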
### **Learning Resources That Helped Me:**
If you're interested in learning more about diffusion models and CUDA programming, here are some resources that I found incredibly helpful:
* **Understanding Diffusion Models:**
  - [https://www.youtube.com/watch?v=W-O7AZNzbzQ](https://www.youtube.com/watch?v=W-O7AZNzbzQ) - This video provides a great explanation of the research paper.
  - [https://www.youtube.com/watch?v=HoKDTa5jHvg](https://www.youtube.com/watch?v=HoKDTa5jHvg) - If you're struggling with the math behind diffusion models, like I was, this video is a lifesaver.
* **GPU Programming:**
  - **Programming Massively Parallel Processors (Book & Lecture Series):** [https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4](https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4) - This is a great starting point for learning the fundamentals of GPU programming/HPC.
  - **Getting Started with CUDA for Python Programmers:** [https://www.youtube.com/watch?v=nOxKexn3iBo](https://www.youtube.com/watch?v=nOxKexn3iBo) - A great introductory YouTube series specifically for Python programmers venturing into CUDA.
  - **My Optimization Bible: CUDA Matrix Multiplication Optimization Tutorial:** [https://siboehm.com/articles/22/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM) - This tutorial is where I learned the majority of the optimization techniques I used in this project. Highly recommended!
Huge thanks to Andrej Karpathy for his inspiring [llm.c](https://github.com/karpathy/llm.c) project and to the authors of the research paper [Diffusion Models Beat GANs on Image Synthesis](https://arxiv.org/abs/2105.05233). Also, credit to [clu0/unet.cu](https://github.com/clu0/unet.cu) and [siboehm.com/articles/22/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM) for providing valuable code inspiration.