This project explores different generative models for image synthesis, including Convolutional Neural Networks (CNNs), Encoder-only Transformers, Generative Adversarial Networks (GANs), and Denoising Diffusion Probabilistic Models (DDPMs). We implement and experiment with these architectures, analyzing their effectiveness for image generation and inpainting tasks.
├── project/
│ ├── data/
│ ├── diffusion.py
│ ├── main.py
│ ├── requirements.txt
│ ├── run_in_cloud.ipynb
│ ├── trainer.py
│ ├── unet.py
│ ├── utils.py
├── README.md
To set up the environment locally, follow these steps:
- Install Python dependencies:
pip install torch einops clean-fid
- Run the main training script:
python main.py
CNNs are widely used for image-based generative tasks, most commonly within encoder-decoder architectures such as U-Net.
U-Net employs skip connections between the encoder and decoder layers to preserve spatial details during reconstruction. In this task, we use U-Net for inpainting, where missing pixels are filled based on surrounding image features.
- Input: Partially masked image with a binary mask indicating missing pixels.
- Output: Reconstructed image with missing pixels filled.
- Loss Functions:
- MSE Loss ensures that predicted pixels match the original ones (see the sketch after this list).
- Adversarial Loss (when used with a discriminator) improves realism.
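For illustration, a masked reconstruction loss of this kind takes only a few lines of PyTorch. The `unet` call, the tensor shapes, and the channel-concatenation of the mask below are assumptions made for the sketch, not the project's actual interface:

```python
import torch
import torch.nn.functional as F

def inpainting_mse_loss(unet, image, mask):
    """Masked MSE between the U-Net prediction and the ground-truth pixels.

    image: (B, C, H, W) ground-truth images
    mask:  (B, 1, H, W) binary mask, 1 where pixels are missing
    """
    masked_input = image * (1 - mask)                    # zero out the missing region
    pred = unet(torch.cat([masked_input, mask], dim=1))  # condition the network on the mask (assumed input format)
    # Penalize only the region the network had to fill in
    return F.mse_loss(pred * mask, image * mask)
```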
Transformers process entire sequences in parallel, making them effective for structured image representations.
We implement an encoder-only Transformer for part-of-speech tagging and analyze how these models differ from decoder-only variants.
- Encoder-only Models (e.g., BERT) generate contextual embeddings for all input tokens simultaneously.
- Decoder-only Models (e.g., GPT) use autoregressive generation, predicting tokens sequentially.
- Encoder-only: Classification, segmentation, token-wise prediction (see the tagger sketch below).
- Decoder-only: Text/image generation, machine translation.
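A minimal encoder-only tagger can be sketched with PyTorch's built-in `nn.TransformerEncoder`. The vocabulary size, model dimensions, and tag count below are placeholder values, and positional encodings are omitted for brevity; this is not the course starter code:

```python
import torch
import torch.nn as nn

class EncoderTagger(nn.Module):
    """Encoder-only Transformer that emits one tag prediction per input token."""

    def __init__(self, vocab_size=10_000, d_model=256, n_heads=4, n_layers=2, n_tags=17):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_tags)

    def forward(self, tokens):                  # tokens: (B, T) integer ids
        h = self.encoder(self.embed(tokens))    # contextual embeddings for every position at once
        return self.head(h)                     # (B, T, n_tags) token-wise logits

# Toy usage: a batch of 2 "sentences" with 8 tokens each
logits = EncoderTagger()(torch.randint(0, 10_000, (2, 8)))
```

Because every position attends to the full sequence in a single pass, the model produces all tag logits simultaneously rather than autoregressively.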
GANs use an adversarial setup where a generator learns to create realistic samples while a discriminator tries to distinguish generated images from real ones.
For inpainting tasks, we use:
- Generator (U-Net-based): Predicts missing pixels given an input mask.
- Discriminator: Distinguishes inpainted images from real ones.
- Generator Loss:
L_G = E_(x,m)[∥m ⊙ (y − x')∥^2] − λ · E_x[log D(x')]
- Discriminator Loss:
L_D = E_x[log D(x)] + E_(x,m)[log(1 − D(x'))]
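The two objectives above translate fairly directly into PyTorch. The sketch below assumes the discriminator returns raw logits and uses binary cross-entropy for the log terms; `generator`, `discriminator`, and the mask-concatenation input format are illustrative, not the repository's actual classes:

```python
import torch
import torch.nn.functional as F

def gan_inpainting_losses(generator, discriminator, image, mask, lam=0.01):
    """Generator: masked reconstruction + adversarial term. Discriminator: real vs. inpainted."""
    masked = image * (1 - mask)
    fake = generator(torch.cat([masked, mask], dim=1))        # inpainted image x'

    d_real = discriminator(image)                             # D(x), raw logits
    d_fake = discriminator(fake)                              # D(x')

    # L_G = E[||m ⊙ (y - x')||^2] - λ E[log D(x')]  (the -log term is BCE with target 1)
    g_loss = F.mse_loss(fake * mask, image * mask) \
             + lam * F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    # L_D = E[log D(x)] + E[log(1 - D(x'))], written as a BCE minimization
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
             + F.binary_cross_entropy_with_logits(d_fake.detach(), torch.zeros_like(d_fake))
    return g_loss, d_loss
```

Detaching the fake image in the discriminator term keeps discriminator updates from back-propagating into the generator.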
Diffusion models generate images by gradually denoising random noise through a learned reverse process.
The Diffusion class implements forward and reverse diffusion using:
- Cosine noise schedule to control variance.
- U-Net architecture for the denoising function.
- Reparameterization trick for efficient sampling.
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
x_hat_0 = (x_t - sqrt(1 - alpha_bar_t) * epsilon) / sqrt(alpha_bar_t)
(Figure: diffusion reverse process)
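These equations are the forward noising step and the x_0 estimate recovered from a predicted noise epsilon. A minimal sketch, assuming a precomputed `alpha_bar` tensor of cumulative products indexed by timestep (an illustrative name, not necessarily the attribute used in diffusion.py):

```python
import torch

def q_sample(x0, t, alpha_bar):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * epsilon."""
    a = alpha_bar[t].view(-1, 1, 1, 1)        # broadcast per-sample a_bar_t over (B, C, H, W)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return xt, noise

def predict_x0(xt, t, eps, alpha_bar):
    """Invert the forward step: x_hat_0 = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)."""
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return (xt - (1 - a).sqrt() * eps) / a.sqrt()
```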
Training minimizes the L1 loss between predicted and actual noise:
loss = F.l1_loss(pred_noise, noise)

During sampling, the model iteratively refines noisy images to generate realistic outputs.
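Putting the pieces together, a training step and a simplified DDPM-style sampling loop might look like the following. `model`, the schedule tensors `alpha`/`alpha_bar`, the fixed variance choice, and the omitted device handling are placeholder assumptions rather than the exact implementation in diffusion.py and trainer.py:

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, alpha_bar, T=1000):
    """One optimization step: noise a clean batch, predict the noise, take an L1 penalty."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward-diffuse to a random timestep
    pred_noise = model(xt, t)
    return F.l1_loss(pred_noise, noise)

@torch.no_grad()
def sample(model, shape, alpha, alpha_bar, T=1000):
    """Reverse process: start from pure noise and denoise one timestep at a time."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, tt)
        a, ab = alpha[t], alpha_bar[t]
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
        # Add noise with variance beta_t = 1 - alpha_t at every step except the last
        x = mean + (1 - a).sqrt() * torch.randn_like(x) if t > 0 else mean
    return x
```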
- Train the diffusion model:
python main.py --train
- Evaluate FID score:
python main.py (...) --fid
- CNNs (U-Net) effectively reconstruct missing image regions.
- Transformers capture contextual dependencies in structured tasks.
- GANs produce sharper inpainted images but can be unstable.
- Diffusion models generate high-quality images with iterative refinement.
This project is part of 10-623 Generative AI at Carnegie Mellon University, with datasets and starter code provided by the course instructors.







