KAIST CS492d: Diffusion Models and Their Applications
Programming Assignment
Instructor: Minhyuk Sung (mhsung [at] kaist.ac.kr)
TA: Yuseung Lee (phillip0701 [at] kaist.ac.kr)
In this programming assignment, you will gain hands-on experience with two powerful techniques for training diffusion models for conditional generation: ControlNet and LoRA.
(1) ControlNet enhances text-to-image diffusion models, such as Stable Diffusion, by allowing them to incorporate additional conditions beyond text prompts, such as sketches or depth maps. The main objective of Task 1 is to implement the core mechanism of ControlNet and train a ControlNet model on a simple dataset consisting of condition inputs and corresponding images.
(2) LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique for neural networks that enables the customization of diffusion models with relatively small datasets, ranging from a few images to a few thousand. The main objective of Task 2 is to become familiar with the LoRA fine-tuning process for diffusion models. Instead of implementing LoRA from scratch, you will utilize a pre-existing LoRA module available in the diffusers library. This task will allow you to creatively develop and fine-tune a diffusion model tailored to your specific data and task requirements. Moreover, you will experiment with DreamBooth to create personalized diffusion models based on a particular subject of your choice.
This assignment is heavily based on the diffusers library. You may refer to the relevant materials while working on the tasks below. However, it is strictly forbidden to simply copy, reformat, or refactor the necessary code blocks when making your submission. You must implement the functionalities on your own with a clear understanding of how your code works. As noted on the course website, we will detect such cases with a specialized tool, and plagiarism in any form will result in a zero score.
Install the required packages listed in requirements.txt.
NOTE: Install PyTorch according to the CUDA version of your environment (See PyTorch Previous Versions)
conda create -n cs492d python=3.8
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
ControlNet (task_1_controlnet)
task_1_controlnet/
├── diffusion
│ ├── unets
│ │ ├── unet_2d_condition.py <--- (TODO) Integrate ControlNet outputs into UNet
│ │ └── unet_2d_blocks.py <--- Basic UNet components
│ ├── controlnet.py <--- (TODO) Implement ControlNet
│ └── pipeline_controlnet.py <--- Diffusion model pipeline with ControlNet
├── train.py <--- Training code for ControlNet
├── train.sh <--- Script with hyperparameters
└── inference.ipynb <--- Inference code for ControlNet
LoRA (task_2_lora)
task_2_lora/
├── scripts
│ ├── train_lora.sh <--- Training script for LoRA
│ ├── train_lora_custom.sh <--- Training script for LoRA w/ custom data
│ └── train_dreambooth_lora.sh <--- Training script for DreamBooth + LoRA
├── train_lora.py <--- Training code for LoRA
├── train_dreambooth_lora.py <--- Training code for DreamBooth + LoRA
└── inference.ipynb <--- Inference code for trained LoRA
Before diving into the main tasks, we will have a look at Hugging Face, an open-source platform that serves as a hub for machine learning applications. Diffusers is Hugging Face's go-to library for pretrained diffusion models. Since we will download the pretrained Stable Diffusion model from Hugging Face, you need an access token.
Before starting the assignment, please do the following:
- Sign into Hugging Face.
- Obtain your Access Token at https://huggingface.co/settings/tokens.
- In your terminal, log into Hugging Face by running
$ huggingface-cli login
and entering your Access Token.
You can check whether you have access to Hugging Face using the below code, which downloads Stable Diffusion from Hugging Face and generates an image with it.
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
ControlNet adds spatial conditioning (e.g., edge maps or depth maps) to a pretrained text-to-image diffusion model while keeping the original weights frozen. Let $\epsilon_\theta$ denote the noise-prediction UNet of Stable Diffusion, $z_t$ a noisy latent at timestep $t$, $c_t$ a text prompt, and $c_f$ a task-specific condition such as an edge map. ControlNet is trained with the standard denoising objective

$$\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0,1)} \left[ \lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \rVert_2^2 \right],$$

which is equal to simply adding an extra condition $c_f$ to the objective used to train the original text-to-image model.
Let's first see how Stable Diffusion generates images from text prompts. Using the 5 text prompts given in ./task_1_controlnet/data/test_prompts.json, generate images with the Stable Diffusion model downloaded in Task 0.
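For reference, here is a minimal sketch of this step. It assumes that test_prompts.json stores a plain list of prompt strings; check the actual file and adjust the parsing if its structure differs.

import json
import torch
from diffusers import StableDiffusionPipeline

# Reuse the Stable Diffusion pipeline from Task 0.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Assumption: the JSON file contains a list of prompt strings.
with open("task_1_controlnet/data/test_prompts.json") as f:
    prompts = json.load(f)

for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    image.save(f"sd_text_only_{i}.png")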
Now, based on Stable Diffusion, you will implement ControlNet and train it on the Fill50K dataset. This dataset consists of (i) images of circles, (ii) edge-map conditions of those circles, and (iii) text prompts describing each image (see the image below). Note that you don't have to download the dataset manually, as the load_dataset() function in train.py will automatically retrieve it from Hugging Face. You can check the details of the dataset here.
Your TODOs for implementing ControlNet are listed below. This assignment is heavily based on ControlNet and its implementation in the diffusers library. You may refer to the relevant materials while working on the tasks below. However, it is strictly forbidden to simply copy, reformat, or refactor the necessary code blocks when making your submission.
- Generate 5 images with Stable Diffusion using the text prompts in data/test_prompts.json.
- Implement zero-convolution for ControlNet (diffusion/controlnet.py - TODO (1)); a minimal sketch is provided below this list.
- Initialize ControlNet using a pretrained UNet model (diffusion/controlnet.py - TODO (2)).
- Apply zero-convolution to the residual features of each ControlNet block (diffusion/controlnet.py - TODO (3)).
- Integrate the outputs from the ControlNet blocks into the UNet of Stable Diffusion (diffusion/unets/unet_2d_condition.py - TODO (4)).
- Train ControlNet by running $ sh train.sh.
- Generate images with 5 different condition inputs from ./data/test_conditions and text prompts from data/test_prompts.json (inference code is in inference.ipynb).
(Credit: ControlNet)
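As a reference point for TODO (1), below is a minimal sketch of a zero-convolution layer, assuming the common implementation as a 1x1 convolution whose weight and bias start at zero; the exact interface expected in diffusion/controlnet.py may differ.

import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # A 1x1 convolution whose parameters are initialized to zero, so the
    # ControlNet branch contributes nothing to the frozen UNet at the start
    # of training and grows its influence gradually.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Sanity check: at initialization the output is all zeros for any input.
x = torch.randn(1, 320, 32, 32)
assert torch.all(zero_conv(320)(x) == 0)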
After successfully training your ControlNet on the Fill50K dataset, you can obtain bonus points (5pt) by training another ControlNet on a more complex dataset.
You have the flexibility to use an open-source dataset or even create your own using image processing techniques such as edge detection. For examples of conditions to train ControlNet on, refer to this website.
(Credit: Medium)
Low-Rank Adaptation (LoRA) enables efficient training of specific layers within a neural network by learning a low-rank decomposition of the changes to the pretrained weights. LoRA was first introduced by Hu et al. in the context of large language models (LLMs) and was later applied to diffusion models by cloneofsimo.
The key idea behind LoRA is to optimize the rank decomposition matrices of the "updates" to the neural network's pretrained weights during training, while keeping the original weights frozen. This is particularly effective because pretrained models often have a low "intrinsic dimension," meaning they can still learn efficiently even after their parameter space is reduced by projecting it onto a smaller subspace.
Consider a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$. LoRA constrains its update $\Delta W$ to a low-rank decomposition $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$, so that the forward pass becomes

$$h = W_0 x + \Delta W x = W_0 x + BAx.$$

During training, the pretrained weight matrix $W_0$ is frozen and receives no gradient updates; only the much smaller matrices $A$ and $B$ are trained.
This approach significantly reduces the number of parameters that need to be trained, making the training process more efficient while still leveraging the knowledge encoded in the pretrained model.
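To make the idea concrete, here is a toy PyTorch illustration of a LoRA-wrapped linear layer. This is only a sketch of the math above, not the diffusers/PEFT implementation you will actually use in this task.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)  # freeze the pretrained weights W_0
        # Low-rank factors: A is randomly initialized, B starts at zero so
        # that the initial update BA is zero and the model is unchanged.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # h = W_0 x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B are trainable: 2 * 4 * 768 parameters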
Your goal in this task is to create a customized diffusion model by training two distinct types of LoRA models, each using a different dataset:
[Task 2-1] Train LoRA on a Specific Visual Style
Train a LoRA for a specific "style". It could be artistic, cinematic, photographic, or any other visual style. A sample dataset in ./sample_data/artistic-custom
can be used for testing, but you should use a different dataset for your submission.
[Task 2-2] Train DreamBooth with LoRA on a Specific Identity
Train DreamBooth with LoRA to capture and reproduce the identity of a specific subject. This could be your own face, your dog, your doll, or any other identifiable subject.
Train a LoRA for a specific "style". It could be artistic, cinematic, photographic, or any other visual style. You have two options for choosing a dataset to train LoRA on.
(Option 1) Use an open-source dataset
You can utilize various open-source image-caption datasets for LoRA training, many of which are available on Hugging Face. For instance, you can explore datasets listed here. By replacing the DATASET_NAME
argument with the desired dataset, you can seamlessly train LoRA with new data. Additionally, you are welcome to use any other open-source datasets, provided that you clearly cite the appropriate references.
We provide the code for LoRA training of Stable Diffusion 1.4 in train_lora.py. You can simply run the code using:
$ sh scripts/train_lora.sh
The default training dataset is set as:
$ export DATASET_NAME="lambdalabs/naruto-blip-captions"
which consists of Naruto images with synthetic captions generated with BLIP-2. The below image shows the outputs of Stable Diffusion before and after LoRA training on this dataset. You can first check if LoRA works properly based on this dataset. The validation images after each epoch will be stored in {$output_dir}/validation/.
A simple inference code for Stable Diffusion with LoRA is provided in inference_lora.ipynb; a minimal sketch of the LoRA loading step is also shown below.
(Credit: lambdalabs/naruto-blip-captions)
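A minimal sketch of loading trained LoRA weights for inference is shown below; the paths and prompt are placeholders, and the provided notebook may organize this differently.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Point this to the directory containing pytorch_lora_weights.safetensors.
pipe.load_lora_weights("path/to/your/output_dir")

image = pipe("a ninja in the style of the training images").images[0]
image.save("lora_sample.png")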
(Option 2) Prepare your own dataset
We highly encourage you to try training LoRA on your own creative dataset! To do so, refer to scripts/train_lora_custom.sh for the necessary arguments to train on a custom dataset. Be sure to update TRAIN_DATA_DIR to point to the directory containing your data. We provide a sample dataset in ./sample_data/artistic-custom consisting of four images, which were generated using GPT-4. The image below showcases the results of LoRA training on this sample dataset.
Refer to this link for how to organize the dataset folder (a possible layout is also sketched after the training command below). Then, train the LoRA using:
$ sh scripts/train_lora_custom.sh
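One common layout, assuming the Hugging Face imagefolder format with captions stored in a metadata.jsonl file (the file names and captions below are placeholders), looks like this:

my_dataset/
├── metadata.jsonl
├── 0001.png
├── 0002.png
└── 0003.png

Each line of metadata.jsonl pairs an image with its caption:

{"file_name": "0001.png", "text": "a painting of a castle in a dreamy watercolor style"}
{"file_name": "0002.png", "text": "a painting of a forest in a dreamy watercolor style"}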
Note that during both the training and inference phases of DreamBooth, an "indicator" token is used in the text prompts. For instance, in our train_dreambooth_lora.sh script, the token sks serves as the indicator for the specific cat. Refer to the original DreamBooth paper for details.
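For example, assuming the sample cat dataset, the prompts would look like the following (the exact prompts used in the provided script may differ):

"a photo of sks cat"                        <--- training (instance) prompt containing the indicator
"a photo of sks cat wearing a wizard hat"   <--- inference prompt reusing the same indicator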
Your objective is to train a DreamBooth model with LoRA to accurately capture and reproduce the identity of a specific subject. This subject could be anything personally meaningful to you, such as your own face, your pet, a favorite doll, or any other identifiable entity.
DreamBooth is a fine-tuning technique that allows diffusion models to learn a unique identifier for a specific subject, enabling the generation of images featuring that exact subject post-training. By combining DreamBooth with LoRA, you can achieve this fine-tuning with only a small number of images and in a relatively short time frame.
You can run DreamBooth + LoRA using:
$ sh scripts/train_dreambooth_lora.sh
We provide a sample dataset for training DreamBooth in ./sample_data/dreambooth-cat/. To use your own data, simply update the INSTANCE_DIR to point to your data directory.
- Train LoRA on a specific visual style.
- Generate 5 images with different text prompts.
- Train DreamBooth with LoRA on a specific identity.
- Generate 5 images with different text prompts.
For Task 1, you are required to submit the code. For Task 2, you should submit the checkpoint (pytorch_lora_weights.safetensors) of each LoRA model (each should be about 3 MB).
Moreover, you are required to submit a maximum 2-page PDF report that includes the following sections:
- Task 1:
  - 5 different condition inputs, corresponding text prompts, and the generated images.
  - A brief analysis of the results for each condition.
  - (Optional Task 1-1) 5 different condition inputs, corresponding text prompts, and the generated images.
  - (Optional Task 1-1) A brief explanation of the training dataset and the training results.
- Task 2:
  - (Task 2-1) Description of the dataset used, including its source.
  - (Task 2-1) Visualization of training images and generated images with the corresponding text prompts.
  - (Task 2-2) Description of the dataset used, including its source.
  - (Task 2-2) Visualization of training images and generated images with the corresponding text prompts.
Submit a zip file named {NAME}_{STUDENT_ID}.zip on GradeScope. It should contain the implemented ControlNet code, the LoRA checkpoints, and the report PDF, organized as follows:
.
├── 2024XXXX.pdf <-- report (max. 2 pages)
├── task_1_controlnet <-- code for Task 1
├── lora_1 <-- checkpoints for Task 2
│ └── pytorch_lora_weights.safetensors
└── lora_2
└── pytorch_lora_weights.safetensors
Submission Item List
- Code for Task 1
- Checkpoints for Task 2
- Report
You will receive a zero score if:
- you do not submit,
- your code is not executable in the Python environment we provided, or
- you modify any code outside of the sections marked with TODO.
The scores for each task are detailed as follows:
- Task 1 (10pt):
  - [0pt] Either the code or the report is not submitted.
  - [5pt] Generated images do not align with the input conditions.
  - [10pt] Generated images accurately align with the input conditions.
- Task 2 (10pt):
  - [0pt] The report is not submitted.
  - [5pt] Outputs of either one of the LoRAs do not align with the training data.
  - [10pt] Outputs of both LoRAs accurately align with the training data.
While you are encouraged to explore creative possibilities using the above methods, it is crucial that you do not use these personalization techniques for harmful purposes, such as generating content that includes nudity, violence, or targets specific identities. It is your responsibility to ensure that this method is applied ethically.
If you are interested in this topic, we encourage you to check out the materials below.
- Adding Conditional Control to Text-to-Image Diffusion Models
- ControlNet Github
- ControlNet Hugging Face Documentation
- GLIGEN: Open-Set Grounded Text-to-Image Generation
- Style Aligned Image Generation via Shared Attention
- StyleDrop: Text-to-Image Generation in Any Style
- LoRA: Low-Rank Adaptation of Large Language Models
- Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
- Multi-Concept Customization of Text-to-Image Diffusion