Commit
garg-aayush committed Aug 17, 2023
2 parents a09f6a5 + a860d61 commit f9a6a43
Showing 68 changed files with 460 additions and 47 deletions.
Binary file modified: .DS_Store (not shown)
61 changes: 14 additions & 47 deletions README.md
@@ -3,69 +3,36 @@ This repository houses my personal summaries and notes on a variety of academic


## Papers
### 1. Attention Is All You Need, Vaswani et al.

The paper introduces the `Transformer` model, a neural network architecture that relies solely on self-attention mechanisms, eliminating the need for recurrent or convolutional layers. This approach achieves SOTA results on a number of NLP tasks, revolutionizing the field through the power of attention mechanisms.

- [[`Archive link`](https://arxiv.org/abs/1706.03762)] [[`Paper explanation video: Yanic Kilcher`](https://www.youtube.com/watch?v=iDulhoQ2pro&t=2s)] [[`Basic annotated implementation`](http://nlp.seas.harvard.edu/annotated-transformer/)]
- [**`Summary notes`**](Summaries/Attention_Is_All_You_Need.md)

### 2. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al.

The paper introduces the concept of using Transformers, originally designed for NLP, for image recognition tasks. By dividing images into patches and leveraging self-attention mechanisms, this approach achieves competitive results on large-scale image recognition benchmarks, challenging the traditional convolutional neural network paradigm.

### 3. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Liu et al.

The paper proposes a hierarchical vision Transformer architecture that uses shifted windows to capture both local and global information in images. By leveraging hierarchical representations and efficient computation, Swin Transformer achieves strong performance on various vision tasks, surpassing previous Transformer-based models while maintaining computational efficiency.

### 4. Denoising Diffusion Probabilistic Models, Ho et al.

### 1. Denoising Diffusion Probabilistic Models, Ho et al.
It presents a generative model that employs denoising diffusion processes to learn and generate realistic images. By iteratively adding noise and removing it, the model learns a diffusion process that captures the underlying distribution of complex image data, enabling high-quality image synthesis.

- [[`Archive link`](https://arxiv.org/abs/2006.11239)] [[`Paper explanation video: Yanic Kilcher`](https://www.youtube.com/watch?v=W-O7AZNzbzQ)] [[`Basic annotated implementation`](https://nn.labml.ai/diffusion/ddpm/index.html)]
- [**`Summary notes`**](Summaries/DDPM.md)

### 5. Denoising Diffusion Implicit Models, Song et al.

It presents a more efficient alternative sampling scheme (DDIM) compared to DDPMs for high-quality image generation. By constructing non-Markovian diffusion processes, DDIMs achieve faster sampling, enabling trade-offs between computation and sample quality, and facilitating meaningful image interpolation in the latent space.
- [**`Summary notes`**](Summaries/DDPM.md)

### 2. Improved Denoising Diffusion Probabilistic Models, Nichol A. and Dhariwal P.


### 6. High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al.
### 3. Diffusion Models Beat GANs on Image Synthesis, Dhariwal P. and Nichol A.

It introduces the approach behind `Stable Diffusion`: performing the diffusion process in the latent space of pretrained autoencoders, enabling near-optimal complexity reduction, detail preservation, and flexible generation for various conditioning inputs with improved visual fidelity.


### 7. Adding Conditional Control to Text-to-Image Diffusion Models, Lvmin Zhang and Maneesh Agrawala

The authors propose an architecture called ControlNet that enhances control over the image generation process in diffusion/stable-diffusion models, enabling the generation of specific and desired images. This is achieved by incorporating conditional inputs, such as edge maps, segmentation maps, and keypoints, into the diffusion model.
### 4. Denoising Diffusion Implicit Models, Song et al.
It presents a more efficient alternative sampling scheme (DDIM) compared to DDPMs for high-quality image generation. By constructing non-Markovian diffusion processes, DDIMs achieve faster sampling, enabling trade-offs between computation and sample quality, and facilitating meaningful image interpolation in the latent space.

### 5. High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al.

### 8. Null-text Inversion for Editing Real Images using Guided Diffusion Models, Mokady et al.
### 6. Prompt-to-Prompt Image Editing with Cross Attention Control, Hertz et al.

### 7. Null-text Inversion for Editing Real Images using Guided Diffusion Models, Mokady et al.
The paper introduces an accurate inversion technique for text-guided diffusion models, enabling intuitive and versatile text-based image modification without tuning model weights. The proposed method demonstrates high-fidelity editing of real images through pivotal inversion and NULL-text optimization, showcasing its efficacy in prompt-based editing scenarios.

### 9. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, Podell et al.

The paper introduces an enhanced stable diffusion model that surpasses the generative capabilities of previous versions. This is achieved by incorporating a larger UNet backbone and introducing novel conditioning schemes in the training stage.

### 10. Photoswap: Personalized Subject Swapping in Images, Gu et al.

The paper discusses a novel approach that leverages pre-trained diffusion models for personalized subject swapping in images, allowing users to seamlessly replace subjects while preserving the composition. The approach revolves around swapping and manipulating the UNet's attention maps in a training-free manner.
### 8. Adding Conditional Control to Text-to-Image Diffusion Models, Lvmin Zhang and Maneesh Agrawala
The authors propose an architecture called ControlNet that enhances control over the image generation process in diffusion/stable-diffusion models, enabling the generation of specific and desired images. This is achieved by incorporating conditional inputs, such as edge maps, segmentation maps, and keypoints, into the diffusion model.
- [[`Github repository`](https://github.com/lllyasviel/ControlNet)] [[`Huggingface blog`](https://huggingface.co/blog/controlnet)]

### 11. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Meng et al.
### 9. DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion, Karras et al.

11. Barbershop: GAN-based Image Compositing using Segmentation Masks, Zhu et al.
- [[`Archive link`](https://arxiv.org/abs/2106.01505)]
- [[`Very very short summary`](#barbershop)]
- [[`Summary notes`](Summaries/Barbershop.md)]
### 10. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, Podell et al.
The paper introduces an enhanced stable diffusion model that surpasses the generative capabilities of previous versions. This is achieved by incorporating a larger UNet backbone and introducing novel conditioning schemes in the training stage.

12. StyleGAN-Human: A Data-Centric Odyssey of Human Generation, Fu et al.
- [[`Archive link`](https://arxiv.org/abs/2204.11823)] [[`Github repository`](https://github.com/stylegan-human/StyleGAN-Human)] [[`Project page`](https://stylegan-human.github.io/)]
- [[`Very very short summary`](#stylegan-human)]
- [[`Summary notes`](Summaries/StyleGAN-Human.md)]




113 changes: 113 additions & 0 deletions Summaries/DDIM.md
@@ -0,0 +1,113 @@
# Summary Notes (DDIM)

- Diffusion-based DDPM processes have demonstrated the ability to produce high-resolution samples comparable to GANs. However, one major drawback of these models is that they require many iterations to produce a high-quality sample. For example, DDPM requires `1000` inference steps to produce a single sample; this is orders of magnitude slower than GANs, which require just one inference pass.
- The paper presents DDIMs, implicit probabilistic models that can produce high-quality samples **10x** to **50x** faster in terms of wall-clock time than DDPMs. DDIMs can produce high-quality samples in as few as `50` time steps.

![ddim](images/ddim/example_5.png)

- DDIMs generalize DDPMs by using a class of non-Markovian diffusion processes that lead to the same training objective as DDPMs.
- These non-Markovian diffusion processes lead to "short" generative Markov chains that can simulate image generation in a small number of steps.
- Moreover, the authors show that, since the training objective in DDIM is the same as in DDPM, one can use any pretrained DDPM model with DDIM or with other generative processes that generate images in fewer steps.
- Since there is no stochastic noise factor in DDIM sampling, samples maintain the same high-level semantic information across different scheduling trajectories.

![ddim](images/ddim/example_2.png)

- This further allows for meaningful latent code interpolation and latent code inversion similar to GAN inversion.

![ddim](images/ddim/example_3.png)

## Notations
- $q(x_0)$ : the real data distribution
- $\bar{x}_0$ : a data point sampled from the real data distribution
- $\bar{x}_T$ : the final pure Gaussian noise $\mathcal{N}(\bar{x}_T; 0, \mathbf{I})$ after the forward diffusion process
- $q(\bar{x}_{1:T} | \bar{x}_{0})$ : forward diffusion process
- $p_\theta(\bar{x}_{0:T})$ : reverse diffusion process
- $\beta_t$ : the fixed variance schedule in the diffusion process
- $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$
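
As a concrete reference for this notation, here is a minimal NumPy sketch of the schedule quantities, using the `1000`-step linear $\beta_t$ schedule from the DDPM paper (the variable names are illustrative and reused in the later sketches):

```python
import numpy as np

T = 1000                            # number of diffusion steps (DDPM default)
betas = np.linspace(1e-4, 0.02, T)  # linear variance schedule beta_t (DDPM endpoints)
alphas = 1.0 - betas                # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)     # alpha_bar_t = prod_{i<=t} alpha_i
```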

## Recap of DDPM

### Forward diffusion process following a Markov chain with Gaussian transitions
From the DDPM paper:
![ddpm](images/ddim/ddpm_1.png)

This can be further reduced to:
![ddpm](images/ddim/ddpm_3.png)
![ddpm](images/ddim/ddpm_2.png)

Thus, $\bar{x}_t$ can be represented as:
![ddpm](images/ddim/ddpm_4.png)
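
A minimal sketch of this closed-form forward sample, reusing the `alpha_bars` array from the notation sketch above (no need to iterate through all $t$ steps):

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng=np.random.default_rng(0)):
    """Draw x_t ~ q(x_t | x_0) in closed form: sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps."""
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps                      # eps is also the training target
```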

### Reverse diffusion process (generative process)
![ddpm](images/ddim/ddpm_5.png)

Here, $\theta$ are the learned parameters, fit to $q(x_0)$ by maximizing the variational lower bound:
![ddpm](images/ddim/ddpm_6.png)

Making simplifying assumptions, with trainable means and fixed variances for all the conditional Gaussians, the objective reduces to:
![ddpm](images/ddim/ddpm_7.png)
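
In code, this reduced objective is just a noise-prediction regression. A PyTorch-style sketch, where `eps_model` is a hypothetical network predicting $\epsilon$ from $(\bar{x}_t, t)$ and `alpha_bars` is a tensor version of the schedule above:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alpha_bars):
    """Simplified DDPM objective: MSE between true and predicted noise at a random t."""
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],))  # one random step per sample
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)                   # broadcast over (B, C, H, W)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps         # closed-form forward sample
    return F.mse_loss(eps_model(x_t, t), eps)
```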

## Non-Markovian process

### Non-Markovian forward process
The DDPM objective in the form of $L_\gamma$ only depends on the marginals $q(\bar{x}_t|\bar{x}_0)$, but not directly on the joint $q(\bar{x}_{1:T}|\bar{x}_0)$. Since there are many inference distributions (joints) with the same marginals, the inference process can be expressed as a non-Markovian one, which leads to new generative processes.

![non](images/ddim/non_markovian_1.png)
![non](images/ddim/non_markovian_2.png)
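
For reference, the reverse conditional that defines this family (reconstructed here in this document's notation, not copied verbatim from the paper) is:

$$q_\sigma(\bar{x}_{t-1} \mid \bar{x}_t, \bar{x}_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,\bar{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\frac{\bar{x}_t - \sqrt{\bar{\alpha}_t}\,\bar{x}_0}{\sqrt{1-\bar{\alpha}_t}},\; \sigma_t^2\,\mathbf{I}\right)$$

Each choice of $\sigma_t$ yields the same marginals $q(\bar{x}_t|\bar{x}_0)$ but a different joint, hence a different (generally non-Markovian) forward process.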


### Non-Markovian generative process
![non](images/ddim/non_markovian_3.png)

### Unified variational inference objective
![non](images/ddim/non_markovian_4.png)


> **Note: they show that ${J}_{\sigma}$ is equivalent to $L_{\gamma}$ for certain weights $\gamma$. Thus, non-Markovian inference processes lead to the same surrogate objective function as DDPM.**

> "With L1 as the objective, we are not only learning a generative process for the Markovian inference process considered in Sohl-Dickstein et al. (2015) and Ho et al. (2020), but also generative processes for many non-Markovian forward processes parametrized by σthat we have described. Therefore, we can essentially use pretrained DDPM models as the solutions to the new objectives, and focus on finding a generative process that is better at producing samples subject to our needs by changing $\sigma$"
## DDIM
Sampling procedure:
![ddim](images/ddim/ddim_1.png)

For $\sigma = 0$ for all $t$, the forward process becomes deterministic given $\bar{x}_{t-1}$ and $\bar{x}_0$, except for $t = 1$; in the generative process, the coefficient before the random noise becomes zero. The resulting model is an implicit probabilistic model, where **samples are generated from latent variables with a fixed procedure** (from $\bar{x}_T$ to $\bar{x}_0$). This model is called the **denoising diffusion implicit model (DDIM)** because it is an implicit probabilistic model trained with the DDPM objective (despite the forward process no longer being a diffusion).
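
Written out (a reconstruction in this document's notation), the sampling update is:

$$\bar{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{\bar{x}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\bar{x}_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\;\epsilon_\theta(\bar{x}_t, t) + \sigma_t\epsilon_t$$

where the first term is the predicted $\bar{x}_0$ and $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$. Setting $\sigma_t = 0$ drops the last term, which is exactly the deterministic case described above.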

![ddim](images/ddim/ddim_2.png)

Note, when ${\sigma}_t = \sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}}\sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$
the forward process becomes Markovian, and the generative process becomes a DDPM.

![ddim](images/ddim/ddim_3.png)

The authors also show that one can train a model with an arbitrary number of forward steps but sample from only a subset of them in the generative process. Therefore, the trained model could consider many more steps than what is considered in (Ho et al., 2020) or even a continuous time variable $t$ (Chen et al., 2020).
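
A minimal sketch of deterministic DDIM sampling ($\sigma_t = 0$) over a sub-sequence $\tau$ of the trained steps; `eps_model` is the same hypothetical noise-prediction network as in the training sketch, and `alpha_bars` the NumPy schedule array from earlier:

```python
import numpy as np

def ddim_sample(eps_model, x_T, alpha_bars, num_steps=50):
    """Deterministic DDIM sampling (sigma_t = 0) using only num_steps of the T trained steps."""
    tau = np.linspace(0, len(alpha_bars) - 1, num_steps, dtype=int)  # sub-sequence tau
    x = x_T
    for i in reversed(range(1, len(tau))):
        t, t_prev = tau[i], tau[i - 1]
        eps = eps_model(x, t)                                        # predicted noise
        x0_pred = (x - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        # sigma_t = 0: no random-noise term, so the update is fully deterministic
        x = np.sqrt(alpha_bars[t_prev]) * x0_pred + np.sqrt(1 - alpha_bars[t_prev]) * eps
    return x
```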

## DDIM Inversion
The DDIM sampling equation can be written as a simple ordinary differential equation (ODE) as follows:
![ddim](images/ddim/ddim_inversion_eq_1.png)

In the limit of small discretization steps, one can reverse the generation process, encoding $\bar{x}_0$ to $\bar{x}_T$ by simulating the reverse of the ODE. Thus, unlike DDPM, one can use DDIM to obtain encodings of the observations, which is useful for downstream applications that require latent representations.

![ddim](images/ddim/ddim_inversion_eq_2.png)

A similar inversion to generate the latent code is not possible with DDPM due to its stochastic nature.
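
A sketch of the corresponding inversion loop, which runs the same deterministic update in the opposite direction to encode a real $\bar{x}_0$ into a latent $\bar{x}_T$ (it relies on the usual approximation that $\epsilon_\theta$ changes slowly between adjacent steps):

```python
import numpy as np

def ddim_invert(eps_model, x0, alpha_bars, num_steps=50):
    """Encode x_0 into x_T by running the deterministic DDIM update in reverse."""
    tau = np.linspace(0, len(alpha_bars) - 1, num_steps, dtype=int)
    x = x0
    for i in range(len(tau) - 1):
        t, t_next = tau[i], tau[i + 1]
        eps = eps_model(x, t)  # approximation: treat eps_theta as locally constant in t
        x0_pred = (x - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        x = np.sqrt(alpha_bars[t_next]) * x0_pred + np.sqrt(1 - alpha_bars[t_next]) * eps
    return x  # approximately the latent code x_T of the input image
```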

## Examples
For the examples, $\sigma$ is defined as:
$${\sigma}_t = \eta\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}}\sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$$
Here, $\sigma_t$ controls the stochasticity, which is tuned via $\eta$ ($\eta = 0$: DDIM, $\eta = 1$: DDPM).
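
A direct transcription of this $\sigma_t$ into code, reusing the `alpha_bars` array from the earlier sketches:

```python
import numpy as np

def sigma(alpha_bars, t, t_prev, eta):
    """Noise scale interpolating between DDIM (eta = 0) and DDPM (eta = 1)."""
    return (eta * np.sqrt((1 - alpha_bars[t_prev]) / (1 - alpha_bars[t]))
                * np.sqrt(1 - alpha_bars[t] / alpha_bars[t_prev]))
```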


![ddim](images/ddim/example_1.png)
![ddim](images/ddim/example_2.png)

- DDIM ($\eta = 0$) achieves the best sample quality when $\mathrm{dim}(\tau)$ is small, and DDPM ($\eta = 1$) typically has worse sample quality compared to its less stochastic counterparts, especially for shorter time-step schedules.
- Images generated from the same initial $\bar{x}_T$ share most high-level features, regardless of the generative trajectory.

![ddim](images/ddim/example_3.png)

- The high-level features of a DDIM sample are encoded by $\bar{x}_T$ without any stochastic influence, so semantic interpolation in the latent space is similarly meaningful.

![ddim](images/ddim/example_4.png)
- DDIM inversion can be used to generate the latent code of an input image.
