# diffusion.cu

This project is a from-scratch implementation of diffusion model training in C++/CUDA, inspired by Andrej Karpathy's [llm.c](https://github.com/karpathy/llm.c). The implementation is based on the U-Net architecture in the paper [Diffusion Models Beat GANs on Image Synthesis](https://arxiv.org/abs/2105.05233).

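To make the training objective concrete, here is a minimal sketch of the DDPM-style forward-noising step that this kind of training loop is built around (illustrative names, not this repo's actual code): a clean image `x0` is mixed with Gaussian noise `eps` at a sampled timestep, and the U-Net is trained to predict that noise.

```cuda
// Hypothetical sketch of the forward-noising step:
//   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
// The U-Net is then trained with an MSE loss to recover eps from x_t.
__global__ void forward_noise(const float* x0, const float* eps, float* xt,
                              float alpha_bar_t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        xt[i] = sqrtf(alpha_bar_t) * x0[i]
              + sqrtf(1.0f - alpha_bar_t) * eps[i];
    }
}
```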
**My Motivation:**

As a Python programmer, I was fascinated by diffusion models but found the math and implementation details challenging. Meanwhile, because of my interest in ML systems and infrastructure, I also wanted to learn CUDA and understand how to get the most out of GPUs. This project was born out of my desire to learn by doing, and to see if I could achieve performance comparable to, or even exceeding, PyTorch. Python can be slow, especially for computationally intensive tasks like training diffusion models, so the appeal of C++/CUDA's speed was undeniable.

**My Goal: Beating `torch.compile`**

One of the primary objectives for this project is to surpass the performance of PyTorch's `torch.compile`. `torch.compile` leverages optimization techniques such as just-in-time (JIT) graph compilation, operator fusion, and low-level kernel optimizations, and it even runs autotuning heuristics directly on your hardware to squeeze out every bit of performance. The result is significantly faster execution, especially on NVIDIA GPUs. It's a tough challenge, but I'm excited to see how close I can get!

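To illustrate what operator fusion buys (a hand-written sketch, not code from this repo or from `torch.compile`'s actual output): fusing a bias-add and a ReLU into one kernel removes a full round trip through global memory, because the intermediate tensor is never written out and read back.

```cuda
// Unfused: two kernel launches, with the intermediate result making a
// full round trip through global memory between them.
// (Layout is illustrative: a flat tensor whose innermost dim has c channels.)
__global__ void add_bias(float* x, const float* bias, int n, int c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += bias[i % c];
}
__global__ void relu(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused: one launch, one round trip -- the kind of rewrite torch.compile
// applies automatically across chains of elementwise ops.
__global__ void add_bias_relu(float* x, const float* bias, int n, int c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i] + bias[i % c], 0.0f);
}
```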
**Current Implementation:**

This currently supports unconditional diffusion model training, and the end-to-end training loop runs at about 55% the speed of PyTorch with `torch.compile` on a single H100. The main bottleneck is memory bandwidth saturation during shared memory loads for the convolutions; promising next steps include tweaking the tiling scheme, exploring register blocking, and maybe even leveraging the H100's Transformer Engine and FP8 precision.

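For context on that bottleneck, the shared-memory tiling pattern in question looks roughly like the sketch below: a simplified single-channel 3x3 example, not this repo's actual kernel. Each block stages an input tile plus a one-pixel halo into shared memory so neighboring threads reuse each other's loads instead of all hitting global memory; register blocking would go a step further and have each thread compute several outputs so values are reused in registers.

```cuda
// Simplified single-channel 3x3 convolution with shared-memory tiling.
// Launch with blockDim = (TILE, TILE). Illustrative only.
#define TILE 16
#define R 1  // halo radius for a 3x3 filter

__global__ void conv3x3_tiled(const float* in, const float* w, float* out,
                              int height, int width) {
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    int tx = threadIdx.x, ty = threadIdx.y;
    int col = blockIdx.x * TILE + tx;
    int row = blockIdx.y * TILE + ty;

    // Cooperative load of the tile plus halo, zero-padded at image borders.
    for (int y = ty; y < TILE + 2 * R; y += TILE) {
        for (int x = tx; x < TILE + 2 * R; x += TILE) {
            int gy = blockIdx.y * TILE + y - R;
            int gx = blockIdx.x * TILE + x - R;
            tile[y][x] = (gy >= 0 && gy < height && gx >= 0 && gx < width)
                             ? in[gy * width + gx]
                             : 0.0f;
        }
    }
    __syncthreads();

    // Each output element reads 9 values from shared memory; without
    // tiling, all 9 reads per thread would go to global memory.
    if (row < height && col < width) {
        float acc = 0.0f;
        for (int ky = 0; ky < 3; ++ky)
            for (int kx = 0; kx < 3; ++kx)
                acc += w[ky * 3 + kx] * tile[ty + ky][tx + kx];
        out[row * width + col] = acc;
    }
}
```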
**Learning Resources That Helped Me:**

If you're interested in learning more about diffusion models and CUDA programming, here are some resources that I found incredibly helpful:

* **Understanding Diffusion Models:**
  * [https://www.youtube.com/watch?v=W-O7AZNzbzQ](https://www.youtube.com/watch?v=W-O7AZNzbzQ) - This video provides a great explanation of the research paper.
  * [https://www.youtube.com/watch?v=HoKDTa5jHvg](https://www.youtube.com/watch?v=HoKDTa5jHvg) - If you're struggling with the math behind diffusion models, like I was, this video is a lifesaver.

* **My Journey into CUDA:**
  * **Programming Massively Parallel Processors (Book & Lecture Series):** [https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4](https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4) - This is a great starting point for learning the fundamentals of GPU programming/HPC.
  * **Getting Started with CUDA for Python Programmers:** [https://www.youtube.com/watch?v=nOxKexn3iBo](https://www.youtube.com/watch?v=nOxKexn3iBo) - A great introductory YouTube series specifically for Python programmers venturing into CUDA.
  * **My Optimization Bible: CUDA Matrix Multiplication Optimization Tutorial:** [https://siboehm.com/articles/22/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM) - This tutorial is where I learned the majority of the optimization techniques I used in this project. Highly recommended!

## High Quality Resources on GPU Programming/Architecture

### Articles/Blogs

- [The CUDA Parallel Programming Model](https://fabiensanglard.net/cuda/)
- [A History of NVidia Stream Multiprocessor](https://fabiensanglard.net/cuda/index.html)
- [Parallel Thread Execution](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html)
- [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM)
- [Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html)
- [CUDA Matrix Multiplication Optimization](https://leimao.github.io/article/CUDA-Matrix-Multiplication-Optimization/)
- [What Every Developer Should Know About GPU Computing](https://codeconfessions.substack.com/p/gpu-computing)
- [A minimal GPU design in Verilog to learn how GPUs work from the ground up](https://github.com/adam-maj/tiny-gpu)
- [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)
- [Understanding GPU internals](https://cmeraki.github.io/gpu-part1.html)
- [Understanding the GPU programming model](https://cmeraki.github.io/gpu-part2.html)

### Tutorials

- [Intro to Parallel Programming](https://developer.nvidia.com/udacity-cs344-intro-parallel-programming)

### Notebooks

- [GPU Puzzles](https://github.com/srush/GPU-Puzzles)

### Videos

- [How GPU Computing Works](https://www.youtube.com/watch?v=3l10o0DYJXg)
- [Getting Started With CUDA for Python Programmers](https://youtu.be/nOxKexn3iBo?si=nung2_X-TXsnK4YK)
- [Programming Massively Parallel Processors - Lecture Series by the Book Author](https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4)
- [Programming Massively Parallel Processors: A Hands-on Approach, followed by this YouTube series](https://m.youtube.com/playlist?list=PL6RdenZrxrw-zNX7uuGppWETdxt_JxdMj&si=ZqKCQgFef-v3JBv8)
- [Programming Parallel Computers](https://youtube.com/playlist?list=PL2RY7P3JxZN-Pz1nwvnoJ9uEHmOmv4jmi&si=-7hc_4fQfFrMc8VZ)
- [GPU Programming Lectures](https://youtube.com/playlist?list=PL3xCBlatwrsXCGW4SfEoLzKiMSUCE7S_X&si=2vIw6R0JpZjBt8pR)
- [From Scratch CUDA](https://youtube.com/playlist?list=PLxNPSjHT5qvvwoy6KXzUbLaF5A8NdJvuo&si=rvc52nc-VAPVwhNh)
- [CUDA Programming](https://www.youtube.com/watch?v=xwbD6fL5qC8)
- [CUDA MODE Lectures](https://www.youtube.com/@CUDAMODE/videos)

**Acknowledgments:**

Huge thanks to Andrej Karpathy for his inspiring [llm.c](https://github.com/karpathy/llm.c) project and to the authors of the research paper [Diffusion Models Beat GANs on Image Synthesis](https://arxiv.org/abs/2105.05233). Also, credit to [clu0/unet.cu](https://github.com/clu0/unet.cu) and [siboehm.com/articles/22/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM) for providing valuable code inspiration.
