1 | 1 | # diffusion.cu |
2 | 2 |
3 | | -This project is a from-scratch implementation of diffusion model training in raw C++/CUDA. It is currently in progress, with support for both the classic UNet architecture, based on [Diffusion Models Beat GANs on Image Synthesis](https://arxiv.org/abs/2105.05233), and the transformer architecture (DiT), as detailed in [Scalable Diffusion Models with Transformers](https://arxiv.org/abs/2212.09748). My work is focused on developing the DiT model from scratch, while also enhancing Chen Lu's [unet.cu](https://github.com/clu0/unet.cu) by adding distributed training support and optimizations such as mixed precision training. The project is inspired by Andrej Karpathy's [llm.c](https://github.com/karpathy/llm.c). |
4 | | - |
5 | | -## Training |
6 | | - |
7 | | -UNet currently supports training. You can train it on the elephant images from the ImageNet 64x64 dataset via:
8 | | - |
| 3 | +High-performance Diffusion Transformer (DiT) implementation written from scratch in CUDA/C++, featuring optimized CUDA kernels for the core transformer blocks:
| 4 | + |
| 5 | +### MLP Block |
| 6 | +- Matrix multiplications using shared memory and warp-level tiling |
| 7 | +- Persistent threadblocks that stay resident and loop over output tiles, avoiding repeated kernel launches
| 8 | +- Tensor Core acceleration via the WMMA API
| 9 | +- Kernel fusion for SiLU activation and bias addition (see the sketch after this list)
| 10 | +- Mixed precision (FP16) computation |
| 11 | + |
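Purely as an illustration of the Tensor Core path with a fused bias + SiLU epilogue: the sketch below is not the repository's actual kernel (the name `fused_mlp_gemm`, the 16x16x16 tiles, and the one-warp-per-block launch are assumptions), and it omits the shared-memory A/B tiling and persistent-threadblock scheduling listed above.

```cuda
// Sketch only: D = SiLU(A @ B + bias), FP16 inputs, FP32 accumulation.
// A is MxK row-major, B is KxN row-major, bias has N entries.
// Assumes M, N, K are multiples of 16 and an sm_70+ GPU.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int WM = 16, WN = 16, WK = 16;

__device__ __forceinline__ float silu(float x) {
    return x / (1.0f + __expf(-x));              // SiLU(x) = x * sigmoid(x)
}

__global__ void fused_mlp_gemm(const half* A, const half* B, const float* bias,
                               float* D, int M, int N, int K) {
    // One 32-thread block (a single warp) per 16x16 output tile.
    int tiles_n = N / WN;
    int tile_m  = (blockIdx.x / tiles_n) * WM;
    int tile_n  = (blockIdx.x % tiles_n) * WN;

    wmma::fragment<wmma::matrix_a, WM, WN, WK, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WM, WN, WK, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, WM, WN, WK, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    // Accumulate over K in 16-wide steps on the Tensor Cores.
    for (int k = 0; k < K; k += WK) {
        wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
    }

    // Stage the accumulator tile in shared memory so each thread can address
    // its row/column, then apply the fused bias + SiLU epilogue on the way out.
    __shared__ float tile[WM * WN];
    wmma::store_matrix_sync(tile, acc, WN, wmma::mem_row_major);
    __syncthreads();

    for (int i = threadIdx.x; i < WM * WN; i += blockDim.x) {
        int r = i / WN, c = i % WN;
        D[(tile_m + r) * N + (tile_n + c)] = silu(tile[i] + bias[tile_n + c]);
    }
}

// Launch: fused_mlp_gemm<<<(M / 16) * (N / 16), 32>>>(A, B, bias, D, M, N, K);
```

The point of the fused epilogue is that the bias add and SiLU run on values that are already on chip, so the activation never costs a separate kernel launch or an extra round trip through global memory.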
| 12 | +### Attention Block |
| 13 | +- Optimized scaled dot-product attention using shared memory tiling |
| 14 | +- Fused softmax using the standard max-subtraction trick for numerical stability (a sketch follows this list)
| 15 | +- Efficient parallel reductions for attention scores |
| 16 | +- Block-level parallelism for multi-head attention |
| 17 | +- Memory coalescing for Q, K, V matrix operations |
| 18 | + |
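As with the MLP sketch, the kernel below only illustrates the fused, numerically stable softmax idea (names, launch shape, and the FP32 score layout are assumptions; the real attention kernels also tile Q, K and V through shared memory). One warp owns one row of the score matrix, the row max is subtracted before exponentiating so `__expf` cannot overflow, and the max and sum come from warp-shuffle reductions; lanes read consecutive columns, so accesses stay coalesced.

```cuda
// Sketch only: row-wise softmax of an already materialized score matrix S,
// out[row] = softmax(scale * S[row]).  One warp per row, FP32 scores assumed.
#include <math_constants.h>

__device__ __forceinline__ float warp_reduce_max(float v) {
    for (int off = 16; off > 0; off >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffffu, v, off));
    return v;
}

__device__ __forceinline__ float warp_reduce_sum(float v) {
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, off);
    return v;
}

__global__ void fused_softmax_rows(const float* scores, float* out,
                                   int rows, int cols, float scale) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= rows) return;
    const float* s = scores + (size_t)row * cols;
    float*       o = out    + (size_t)row * cols;

    // Pass 1: row maximum, so exp() below never sees a large positive input.
    float m = -CUDART_INF_F;
    for (int c = lane; c < cols; c += 32) m = fmaxf(m, scale * s[c]);
    m = warp_reduce_max(m);

    // Pass 2: exponentiate the shifted scores and accumulate the row sum.
    float sum = 0.0f;
    for (int c = lane; c < cols; c += 32) {
        float e = __expf(scale * s[c] - m);
        o[c] = e;
        sum += e;
    }
    sum = warp_reduce_sum(sum);

    // Pass 3: normalize in place.
    float inv = 1.0f / sum;
    for (int c = lane; c < cols; c += 32) o[c] *= inv;
}

// Launch (4 rows per 128-thread block):
// fused_softmax_rows<<<(rows + 3) / 4, 128>>>(S, P, rows, cols, 1.0f / sqrtf((float)head_dim));
```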
| 19 | +## Usage |
| 20 | + |
| 21 | +```python |
| 22 | +from src.inference import generate_images |
| 23 | + |
| 24 | +images = generate_images( |
| 25 | +    prompt="A photo of a cat",  # or class index (0-999)
| 26 | +    image_size=256,             # or 512
| 27 | +    num_samples=4
| 28 | +) |
9 | 29 | ``` |
10 | | -gunzip unet/data/elephant_train.bin.gz |
11 | | -python unet/train_diffusion.py --init_model_only True |
12 | | -make -C unet train_diffusion |
13 | | -./unet/train_diffusion |
14 | | -``` |
15 | | - |
16 | | -### **Current Implementation:** |
17 | | - |
18 | | -This currently supports unconditional diffusion model training, and the end-to-end training loop runs at about 42% of the speed of PyTorch with `torch.compile` on a single H100. More detailed benchmarking is needed to identify the bottlenecks and tune the implementation accordingly. I do think mixed-precision training (FP16 with loss scaling) can be incorporated here, though.
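Purely as an illustration of that loss-scaling idea (none of this is the repo's code; `unscale_and_check` and `update_loss_scale` are hypothetical names), the moving parts are: multiply the loss by a scale factor before the backward pass, divide the gradients by the same factor while copying them into FP32 master gradients, skip the optimizer step if anything came out non-finite, and adjust the scale dynamically.

```cuda
// Sketch only (hypothetical names, not this repo's training loop): FP16 loss
// scaling.  The loss is multiplied by `scale` before the backward pass so small
// gradients survive in FP16; before the optimizer step the gradients are divided
// back by `scale` into FP32 master copies, and any Inf/NaN aborts the step.
#include <cuda_fp16.h>

__global__ void unscale_and_check(const half* grad_fp16, float* grad_fp32,
                                  float inv_scale, int n, int* found_nonfinite) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float g = __half2float(grad_fp16[i]) * inv_scale;    // back to unscaled FP32
    if (!isfinite(g)) atomicExch(found_nonfinite, 1);    // overflow in backward
    grad_fp32[i] = g;
}

// Host-side dynamic schedule (AMP-style): halve on overflow, double after a
// long run of clean steps.
void update_loss_scale(float& scale, int& good_steps, bool overflow) {
    if (overflow) {                  // the optimizer step is skipped this iteration
        scale *= 0.5f;
        good_steps = 0;
    } else if (++good_steps >= 2000) {
        scale *= 2.0f;
        good_steps = 0;
    }
}
```

Skipping the occasional overflowed step is the usual price for keeping most of the arithmetic in FP16.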
19 | | - |
20 | | -| Platform | Time per training step on H100 (ms) |
21 | | -|--------------------------------------|-------------------| |
22 | | -| This repo (CUDA implementation) | 56.98 | |
23 | | -| PyTorch (w/ `torch.compile`) | 23.68 | |
24 | | - |
25 | | - |
26 | | -In Progress: |
27 | | -- support for distributed training via MPI in UNet |
28 | | -- support for mixed precision training in UNet |
29 | | -- support for full-fledged DiT training
30 | | - |
31 | | -### **My Motivation:** |
32 | | - |
33 | | -I've always been intrigued by diffusion models but found the math and implementation challenging. My interest in ML systems and GPU programming led me to start this project. Inspired by Karpathy's llm.c, I aimed to directly program the GPU for faster, more efficient training. |
34 | | - |
35 | | -My goal is to develop an implementation that could eventually surpass PyTorch's `torch.compile`, which optimizes model execution on NVIDIA GPUs through JIT compilation, operator fusion, and kernel-level optimizations that reduce overhead and maximize hardware utilization.
36 | | - |
37 | | - |
38 | | -### Learning Resources That Helped Me: |
39 | | - |
40 | | -If you're interested in learning more about diffusion models and CUDA programming, here are some resources that I found incredibly helpful: |
41 | | - |
42 | | - * **Understanding Diffusion Models:** |
43 | | - - [https://www.youtube.com/watch?v=W-O7AZNzbzQ](https://www.youtube.com/watch?v=W-O7AZNzbzQ) - This video provides a great explanation of the research paper. |
44 | | - - [https://www.youtube.com/watch?v=HoKDTa5jHvg](https://www.youtube.com/watch?v=HoKDTa5jHvg) - If you're struggling with the math behind diffusion models, like I was, this video is a lifesaver. |
45 | | - |
46 | | -* **GPU Programming:** |
47 | | - * **Programming Massively Parallel Processors (Book & Lecture Series):** [https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4](https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4) - This is a great starting point for learning the fundamentals of GPU Programming/HPC. |
48 | | - * **Getting Started with CUDA for Python Programmers:** [https://www.youtube.com/watch?v=nOxKexn3iBo](https://www.youtube.com/watch?v=nOxKexn3iBo) - Great introductory YouTube series specifically for Python programmers venturing into CUDA. |
49 | | - * **My Optimization Bible: CUDA Matrix Multiplication Optimization Tutorial:** [https://siboehm.com/articles/22/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM) - This tutorial is where I learned the majority of the optimization techniques I used in this project. Highly recommended! |
50 | | - |
51 | | -### **More CUDA/GPU Programming Resources:** |
52 | 30 |
53 | | -Articles/Blogs |
| 31 | +This automatically downloads pretrained DiT-XL/2 weights and runs inference with them.
54 | 32 |
55 | | -- [GPU Programming](https://enccs.github.io/gpu-programming/) |
56 | | -- [The CUDA Parallel Programming Model](https://fabiensanglard.net/cuda/) |
57 | | -- [A HISTORY OF NVIDIA STREAM MULTIPROCESSOR](https://fabiensanglard.net/cuda/index.html) |
58 | | -- [Parallel Thread Execution](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) |
59 | | -- [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM) |
60 | | -- [Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html) |
61 | | -- [CUDA Matrix Multiplication Optimization](https://leimao.github.io/article/CUDA-Matrix-Multiplication-Optimization/) |
62 | | -- [What Every Developer Should Know About GPU Computing](https://codeconfessions.substack.com/p/gpu-computing) |
63 | | -- [A minimal GPU design in Verilog to learn how GPUs work from the ground up](https://github.com/adam-maj/tiny-gpu) |
64 | | -- [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/) |
65 | | -- [Understanding GPU internals](https://cmeraki.github.io/gpu-part1.html) |
66 | | -- [Understanding the GPU programming model](https://cmeraki.github.io/gpu-part2.html) |
67 | | - |
68 | | -Tutorials |
69 | | -- [Intro to Parallel Programming](https://developer.nvidia.com/udacity-cs344-intro-parallel-programming) |
| 33 | +## Requirements |
| 34 | +- CUDA 11.0+ |
| 35 | +- PyTorch 2.0+ |
| 36 | +- NVIDIA GPU with Tensor Cores (compute capability 7.0 or newer; see the check below)
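Concretely, the Tensor Core requirement means compute capability 7.0 (Volta) or newer for the WMMA path. A standalone check along these lines (illustrative, not necessarily part of the repo) is:

```cuda
// Sketch only: verify the device can run the WMMA (Tensor Core) kernels,
// i.e. compute capability 7.0 or newer.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, /*device=*/0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (prop.major < 7) {
        std::fprintf(stderr, "Tensor Cores (sm_70+) are required for the WMMA kernels.\n");
        return 1;
    }
    return 0;
}
```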
70 | 37 |
71 | | -Notebooks |
72 | | -- [GPU Puzzles](https://github.com/srush/GPU-Puzzles) |
73 | | - |
74 | | -Videos |
75 | | -- [How GPU Computing Works](https://www.youtube.com/watch?v=3l10o0DYJXg) |
76 | | -- [Getting Started With CUDA for Python Programmers](https://youtu.be/nOxKexn3iBo?si=nung2_X-TXsnK4YK) |
77 | | -- [Programming Massively Parallel Processors - Lecture Series by the Book Author](https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4) |
78 | | -- [Programming Massively Parallel Processors: A Hands-on Approach, followed by this YouTube series](https://m.youtube.com/playlist?list=PL6RdenZrxrw-zNX7uuGppWETdxt_JxdMj&si=ZqKCQgFef-v3JBv8)
79 | | -- [Programming Parallel Computers](https://youtube.com/playlist?list=PL2RY7P3JxZN-Pz1nwvnoJ9uEHmOmv4jmi&si=-7hc_4fQfFrMc8VZ) |
80 | | -- [GPU Programming Lectures](https://youtube.com/playlist?list=PL3xCBlatwrsXCGW4SfEoLzKiMSUCE7S_X&si=2vIw6R0JpZjBt8pR) |
81 | | -- [From Scratch CUDA](https://youtube.com/playlist?list=PLxNPSjHT5qvvwoy6KXzUbLaF5A8NdJvuo&si=rvc52nc-VAPVwhNh) |
82 | | -- [CUDA Programming](https://www.youtube.com/watch?v=xwbD6fL5qC8) |
83 | | -- [CUDA MODE Lectures](https://www.youtube.com/@CUDAMODE/videos) |
| 38 | +## Acknowledgments |
84 | 39 |
| 40 | +This implementation is based on: |
| 41 | +- [Scalable Diffusion Models with Transformers (DiT)](https://arxiv.org/abs/2212.09748) by William Peebles and Saining Xie |
| 42 | +- [Official PyTorch DiT Implementation](https://github.com/facebookresearch/DiT) by Facebook Research, used for benchmarking and validation |
85 | 43 |
86 | | -### **Acknowledgments:** |
87 | | -- OpenAI's reference implementation of the guided-diffusion paper: https://github.com/openai/guided-diffusion
88 | | -- Most of the complex CUDA kernels (attention, GroupNorm, etc.) were adapted from: https://github.com/karpathy/llm.c
89 | | -- The custom UNet implementation builds on: https://github.com/clu0/unet.cu
| 44 | +The CUDA kernels in this repository are written from scratch but validated against the official implementation to ensure correctness. |