A conditional Denoising Diffusion Probabilistic Model (DDPM) for generating 16x16 pixel art sprites with class-based control and real-time visualization.
Try the Live Demo on Hugging Face | View Training Notebook on Kaggle
This project operates in two phases: a training phase (detailed in `Training.ipynb`) and an inference/application phase (detailed in `app.py`). The model from the first phase is loaded into the second to create an interactive application for generating pixel art sprites.
The core of this project is a conditional Denoising Diffusion Probabilistic Model (DDPM). The process can be broken down into data handling, model architecture, training, and inference.
- Data Handling: The model is trained on 16x16 pixel art sprites. The `PixelArtDataset` class in the training notebook is custom-built for this data.
- Noise Schedule: A `DiffusionSchedule` class implements a cosine noise schedule, which defines how noise is added to an image over `T=1000` timesteps. The model's job is to learn how to reverse this process, starting from pure noise and gradually denoising it back to a clean image.
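For reference, the sketch below shows how a cosine schedule of this kind is typically built (after Nichol & Dhariwal). The notebook's actual `DiffusionSchedule` / `make_cosine_schedule` code may differ in details such as clipping and dtype.

```python
import torch

def make_cosine_schedule(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Return per-step betas so that alpha_bar_t follows a squared-cosine curve."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]                 # normalise so alpha_bar_0 = 1
    betas = 1.0 - (alpha_bar[1:] / alpha_bar[:-1])
    return betas.clamp(max=0.999).float()

betas = make_cosine_schedule(T=1000)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
# Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
```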
The model's "brain" is the ContextUNet. This architecture is specifically designed to handle and be controlled by external information.
- U-Net Structure: It is a standard U-Net with a downsampling path, a bottleneck, and an upsampling path. Skip-connections link the downsampling layers to the upsampling layers, which helps the model preserve fine details (crucial for pixel art).
- Context Injection: This is the "Context" part of the name. The model is given three pieces of information at every step:
  - The Noisy Image (`x_t`): The current image at timestep `t`.
  - The Timestep (`t`): The model needs to know how much noise is in the image to remove the correct amount. The timestep `t` is passed through its own small neural network (`time_mlp`) to create a "time embedding".
  - The Class Condition (`c`): This is the control mechanism. The desired class (e.g., "Characters" or "Monsters") is provided as an integer ID. This ID is passed through an `nn.Embedding` layer (`label_emb`) to create a "class embedding".
- Embedding Combination: The time embedding and class embedding are added together (`emb = t_emb + c_emb`). This combined context vector is then injected into every single `ResidualBlock` throughout the U-Net, so at every stage of processing the model is constantly reminded of what it is supposed to be drawing and how much denoising it needs to do. A simplified sketch of this pattern follows this list.
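The sketch below illustrates the injection pattern. It is deliberately simplified relative to the real `ContextUNet`; the embedding size, the `time_mlp` layout, and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """Simplified residual block: the context vector is projected and added to the feature map."""
    def __init__(self, channels: int, emb_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, channels)   # project context to the channel dimension

    def forward(self, x, emb):
        h = torch.relu(self.conv1(x))
        h = h + self.emb_proj(emb)[:, :, None, None]   # broadcast context over all pixels
        return x + torch.relu(self.conv2(h))

emb_dim, num_classes = 128, 5                          # illustrative sizes
time_mlp = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
label_emb = nn.Embedding(num_classes + 1, emb_dim)     # +1 slot for the "null" class used by CFG

x_t = torch.randn(4, 64, 16, 16)                       # a batch of noisy feature maps
t = torch.randint(0, 1000, (4, 1)).float() / 1000.0    # normalised timesteps
c = torch.randint(0, num_classes, (4,))                # class IDs

emb = time_mlp(t) + label_emb(c)                       # emb = t_emb + c_emb
out = TinyResidualBlock(64, emb_dim)(x_t, emb)         # context injected into the block
```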
The training loop in `Training.ipynb` teaches the model its core task.
- A clean image `x` and its label `c` are loaded from the dataset.
- A random timestep `t` (from 1 to 1000) is chosen.
- The correct amount of noise for timestep `t` (defined by the cosine schedule) is added to the clean image `x`, creating the noisy image `x_t`.
- The noisy image `x_t`, the timestep `t`, and the label `c` are all fed into the `ContextUNet`.
- The model's goal is to predict the original noise that was added.
- The loss is a simple Mean Squared Error (`F.mse_loss`) between the model's predicted noise and the actual noise.
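Put together, one training step looks roughly like the sketch below. The `train_step` helper, the `model(x_t, t, c)` call signature, the `alphas_bar` tensor layout, and the null-class index value are illustrative assumptions standing in for the notebook's actual objects.

```python
import torch
import torch.nn.functional as F

def train_step(model, alphas_bar, x, c, p_uncond=0.1, null_class_idx=5):
    """One DDPM training step: add noise at a random timestep and regress the noise."""
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (x.shape[0],), device=x.device)     # random timestep per image
    noise = torch.randn_like(x)                                  # the target the model must predict
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * noise        # forward-diffuse the clean image
    # Randomly drop the label so the model also learns unconditional denoising (needed for CFG).
    drop = torch.rand(c.shape, device=c.device) < p_uncond
    c = torch.where(drop, torch.full_like(c, null_class_idx), c)
    pred = model(x_t, t, c)                                      # assumed signature: (x_t, t, c) -> noise
    return F.mse_loss(pred, noise)
```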
The `app.py` file uses the trained model to generate new images. This is where Classifier-Free Guidance (CFG) comes into play, a technique that allows for explicit control over the generation.
- Start: The process begins with a 16x16 tensor of pure random noise (`x = torch.randn(...)`).
- Denoising Loop: The model iterates backward from timestep `T-1` down to `0`.
- CFG at each step: For each step in the loop, the model runs twice:
  - Conditional Run: It predicts the noise using the user's chosen category (e.g., "Characters"). This is `eps_cond`.
  - Unconditional Run: It predicts the noise using a special "null" class. This is `eps_uncond`.
- Guidance: The final noise prediction is a guided combination of the two: `eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)`. The `guidance_scale` (the slider in the UI) determines how strongly the model "sticks" to the category. A high value forces the model to strictly follow the prompt, while a low value allows for more creative (but less accurate) results. This combination is sketched after this list.
- Step: The model uses this guided `eps` to clean the image by a small amount, producing the image for the next, less-noisy step.
- Finish: After all 1000 steps, the noise is gone, and the final clean image remains.
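A minimal sketch of that guidance combination is given below. The `guided_noise` helper name and the `model(x, t, c)` call signature are assumptions; the actual loop in `app.py` also applies the reverse-diffusion update after this prediction.

```python
import torch

@torch.no_grad()
def guided_noise(model, x, t, c, null_c, guidance_scale: float):
    """Classifier-free guidance: blend conditional and unconditional noise predictions."""
    eps_cond = model(x, t, c)          # run conditioned on the chosen class
    eps_uncond = model(x, t, null_c)   # run with the special "null" class
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```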
Several deliberate design choices in these files contribute to the model's success on this task.
- Cosine Noise Schedule: Instead of a simple linear schedule, the training uses `make_cosine_schedule`. A cosine schedule adds noise more gradually and is known to improve sample quality and training stability, especially for smaller diffusion models.
- Classifier-Free Guidance (CFG): This is the most significant improvement for usability. To make CFG work, the model was trained to handle it. In `Training.ipynb`, 10% of the time (`p_uncond = 0.1`), the true class label was randomly replaced with a `NULL_CLASS_IDX`. This forced the model to learn how to denoise both with and without a class, enabling the guided inference method in `app.py`.
- Exponential Moving Average (EMA): Training can be noisy, and the model's weights at the very last step might not be the best. The `EMA` class in `Training.ipynb` keeps a "shadow" copy of the model's weights, which is a slowly updating average. The `ema_shadow.pth` file, which is loaded by `app.py`, contains these averaged weights; they are less "jumpy" and almost always produce higher-quality, more stable-looking final images. (A minimal EMA sketch follows this list.)
- Appropriate Interpolation: When loading the data in `Training.ipynb`, the `T.Resize` transform explicitly uses `interpolation=Image.NEAREST`. For pixel art, standard (bilinear) interpolation would create blurry, averaged colors, corrupting the data. `NEAREST` preserves the sharp, blocky nature of pixel art, leading to a much better-trained model. The same method is used in `app.py` to scale the 16x16 output to 256x256 for viewing.
- Attention Blocks: The `ContextUNet` isn't just `Conv2d` layers. In its deeper (smaller-resolution) layers, it uses `AttentionBlock` modules. This allows the model to learn long-range spatial relationships: for example, to understand that a pixel on the left side of the image (e.g., a "hand") is related to a pixel on the right side (e.g., a "shoulder").
- Live-Updating Generator: In `app.py`, the inference loop `sample_loop_generator` is a Python generator (it uses `yield`). Instead of only returning the final image after 1000 steps, it yields its prediction of the clean image every 20 steps. The Gradio UI catches these yielded images, allowing the user to see the image "fade in" from noise in real time. This is a major user-experience improvement that makes the underlying process visible. (A sketch of the generator pattern also follows this list.)
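As referenced in the EMA bullet above, here is a minimal sketch of the idea; the notebook's actual `EMA` class, decay value, and handling of non-float buffers may differ.

```python
import torch

class EMA:
    """Keep a slowly updating "shadow" copy of a model's weights."""
    def __init__(self, model, decay: float = 0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                # shadow <- decay * shadow + (1 - decay) * current
                self.shadow[k].mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                self.shadow[k].copy_(v)   # integer buffers are copied verbatim

# After training, the shadow weights can be saved (e.g. torch.save(ema.shadow, "ema_shadow.pth"))
# and loaded by the inference app in place of the raw final weights.
```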
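And, as referenced in the last bullet, a sketch of the generator pattern behind the live preview. The `sample_loop_generator` name comes from `app.py`, but the signature here is simplified, and `denoise_step` is a hypothetical stand-in for one guided reverse-diffusion step, passed in only to keep the example self-contained.

```python
import torch

def sample_loop_generator(denoise_step, T: int = 1000, yield_every: int = 20):
    """Yield intermediate images during sampling so the Gradio UI can stream progress."""
    x = torch.randn(1, 3, 16, 16)      # start from pure noise
    for t in reversed(range(T)):
        x = denoise_step(x, t)         # hypothetical single reverse-diffusion step
        if t % yield_every == 0:
            yield x.clamp(-1, 1)       # each yielded tensor becomes one UI frame

# Dummy step just to demonstrate the streaming behaviour:
for frame in sample_loop_generator(lambda x, t: 0.995 * x, T=100):
    print(frame.shape)                 # the UI would render each of these frames
```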
- Architecture: Conditional U-Net with attention blocks
- Diffusion Timesteps: 1000
- Resolution: 16x16 pixels (upscaled to 256x256 for display)
- Guidance Method: Classifier-Free Guidance (CFG)
- Noise Schedule: Cosine schedule for improved quality
This project is licensed under the MIT License - see the LICENSE file for details.
This implementation draws inspiration from modern diffusion model research, including DDPM and classifier-free guidance techniques.