Commit b8b009a: correct minor errors in readme and summaries
garg-aayush committed Sep 13, 2023 · 1 parent a17f591
Showing 7 changed files with 24 additions and 25 deletions.
18 changes: 9 additions & 9 deletions README.md
This repository houses my personal summaries and notes on a variety of academic papers.

### 1. Denoising Diffusion Probabilistic Models, Ho et al.
| [**`Summary notes`**](Summaries/Diffusion/DDPM.md) | [`Paper explanation video: Yannic Kilcher`](https://www.youtube.com/watch?v=W-O7AZNzbzQ) |
|---|---|
| [**`arXiv link`**](https://arxiv.org/abs/2006.11239) | [**`Basic annotated implementation`**](https://nn.labml.ai/diffusion/ddpm/index.html) |
<br></br>


### 2. Denoising Diffusion Implicit Models, Song et al.
- Presents DDIMs, implicit probabilistic models that produce high-quality samples **10X** to **50X** faster (in about 50 steps) than DDPMs
- Generalizes DDPMs via a class of non-Markovian diffusion processes that lead to "short" generative Markov chains able to simulate image generation in a small number of steps
- Since the DDIM training objective matches DDPM's, any pretrained DDPM model can be used with DDIM or other generative processes that generate images in fewer steps (see the sketch after the table below)
| [**`Summary notes`**](Summaries/Diffusion/DDIM.md) | [`arXiv link`](https://arxiv.org/abs/2010.02502) | [`Github repo`](https://github.com/ermongroup/ddim) |
|---|---|---|
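To make the speed-up concrete, here is a minimal sampling sketch, assuming the `diffusers` library and the public `google/ddpm-cifar10-32` DDPM checkpoint; `DDIMPipeline` swaps the DDPM sampler for DDIM so the same pretrained model can generate in ~50 steps:

```python
# Hedged sketch: reuse a pretrained DDPM checkpoint with the DDIM sampler.
# Assumes `pip install diffusers torch` and the public google/ddpm-cifar10-32 weights.
from diffusers import DDIMPipeline

pipe = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")

# eta=0.0 gives the deterministic DDIM sampler; 50 steps instead of the
# ~1000 steps used by ancestral DDPM sampling.
image = pipe(num_inference_steps=50, eta=0.0).images[0]
image.save("ddim_sample.png")
```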
<br></br>


### 3. High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al.
<br></br>


### 4. Prompt-to-Prompt Image Editing with Cross Attention Control, Hertz et al.
- Introduces a textual editing method to semantically edit images in pre-trained text-conditioned diffusion models via Prompt-to-Prompt manipulations
- The approach allows editing the image while preserving its original composition and addressing the content of the new prompt
- The key idea is that one can edit images by injecting the cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text during which diffusion steps
| [**`Summary notes`**](Summaries/Diffusion/Prompt-to-prompt.md) | [`arXiv link`](https://arxiv.org/abs/2208.01626) | [`Github repo`](https://github.com/google/prompt-to-prompt/) |
|---|---|---|
<br></br>


### 5. Null-text Inversion for Editing Real Images using Guided Diffusion Models, Mokady et al.
- Introduces an accurate inversion scheme for **real input images**, enabling intuitive and versatile text-based image modification without tuning model weights.
| [**`Summary notes`**](Summaries/Diffusion/Null-TextInversion.md) | [`arXiv link`](https://arxiv.org/abs/2211.09794) |
|---|---|
| [**`Paper walkthrough video: Original author`**](https://www.youtube.com/watch?v=qzTlzrMWU2M&t=52s) | [**`Github repo`**](https://github.com/google/prompt-to-prompt/#null-text-inversion-for-editing-real-images) |
<br></br>


### 6. Adding Conditional Control to Text-to-Image Diffusion Models, Lvmin Zhang and Maneesh Agrawala et al.
- Adds control to large pre-trained diffusion models, such as Stable Diffusion, by accepting extra visual conditions such as edge maps, segmentation masks, and depth maps (see the usage sketch after the table below)
|---|---|---|
| [**`HF usage example`**](https://huggingface.co/blog/controlnet) |[**`Controlnet SD1.5 1.0 and 1.1 ckpts`**](https://huggingface.co/lllyasviel) | [**`Controlnet SDXL ckpts`**](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) |
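A minimal usage sketch along the lines of the linked HF example (the checkpoint names and the example input URL are taken from the HF blog; treat the exact arguments as indicative rather than authoritative):

```python
# Hedged sketch: Canny-edge ControlNet on top of Stable Diffusion 1.5,
# following the HF blog example linked above.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Extract Canny edges from an example image to use as the visual condition
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
edges = cv2.Canny(np.array(image), 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

out = pipe(
    "a portrait of a robot, best quality", image=canny_image, num_inference_steps=20
).images[0]
out.save("controlnet_canny.png")
```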

<br></br>


### 7. DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion, Karras et al.
- An image-and-pose conditioned diffusion method based on Stable Diffusion that turns fashion photographs into realistic, animated videos
- Introduces a pose conditioning approach that greatly improves temporal consistency across frames
- Uses CLIP image and VAE encoders, instead of a text encoder, which increases output fidelity to the conditioning image
| [**`Summary notes`**](Summaries/Diffusion/DreamPose.md) | [`arXiv link`](https://arxiv.org/abs/2304.06025) | [`Github repo`](https://github.com/johannakarras/DreamPose)|
|---|---|---|
<br></br>


### 8. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, Podell et al.
- Introduces an enhanced Stable Diffusion model that surpasses the generating capabilities of previous versions
- Uses a larger UNet backbone and introduces novel conditioning schemes in the training stage
- Probably the best open-source text-to-image model at the moment (Aug 2023)
| [**`Summary notes`**](Summaries/Diffusion/SDXL.md) | [`arXiv link`](https://arxiv.org/abs/2307.01952) |
|---|---|
| [**`Paper walkthrough video: Two minute papers`**](https://www.youtube.com/watch?v=kkYaikeLJdc) | [**`HF usage example`**](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl) |

2 changes: 1 addition & 1 deletion Summaries/Diffusion/DDIM.md

## Examples
For the examples, $\sigma$ is defined as:
${\sigma}_t = \eta\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}}\sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$
Here, $\sigma_t$ controls the stochasticity, which is set through $\eta$ ($\eta = 0$: DDIM, $\eta = 1$: DDPM).
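A small numerical sketch of this interpolation (the linear $\beta$ schedule below is an assumption for illustration, not taken from the notes):

```python
# Hedged sketch: sigma_t from the formula above, for an assumed linear beta schedule.
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)        # assumed beta schedule
alphas_bar = np.cumprod(1.0 - betas)         # \bar{alpha}_t

def sigma_t(t, eta):
    """eta = 0 -> 0 (deterministic DDIM); eta = 1 -> DDPM noise level."""
    return (eta
            * np.sqrt((1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]))
            * np.sqrt(1 - alphas_bar[t] / alphas_bar[t - 1]))

print(sigma_t(500, eta=0.0))   # 0.0
print(sigma_t(500, eta=1.0))   # the DDPM-equivalent sigma at t=500
```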


16 changes: 8 additions & 8 deletions Summaries/Diffusion/DDPM.md
- $q(x_0)$ : the real data distribution
- $\bar{x}_0$ : a data point sampled from the real data distribution
- $\bar{x}_T$ : the final pure Gaussian noise $\mathcal{N}(\bar{x}_T; 0, \mathbf{I})$ after the forward diffusion process
- $q(\bar{x}_{1:T} \vert \bar{x}_{0})$ : forward diffusion process
- $\beta_t$ : the fixed variance schedule in the diffusion process

## Forward diffusion process
For a sample $\bar{x}_0$ from the given real distribution $q(x_0)$, we define a forward diffusion process $q(\bar{x}_{1:T} \vert \bar{x}_{0})$ in which we add a small amount of Gaussian noise to $\bar{x}_0$ over $T$ steps, producing a sequence of noisy samples $\bar{x}_1$, $\bar{x}_2$, ..., $\bar{x}_T$ according to a pre-defined variance schedule $\{\beta_t \in (0,1) \}_{t=1}^{T}$. The data sample gradually loses its features as the step approaches $T$, such that $\bar{x}_T$ is equivalent to isotropic Gaussian noise.

![forward process](images/ddpm/forwardprocess.png)

Since the forward process is a Markov chain:

$$q(\bar{x}_{1:T} \vert \bar{x}_{0}) = \prod_{t=1}^{T}q(\bar{x}_{t} \vert \bar{x}_{t-1})$$

Since the distributions are Gaussian:
$$q(\bar{x}_{t} \vert \bar{x}_{t-1}) = \mathcal{N}(\bar{x}_t; \sqrt{1-\beta_t}\bar{x}_{t-1}, \beta_t\mathbf{I})$$

Moreover, $\bar{x}_t$ can be sampled directly from $\bar{x}_0$ as follows:

$$q(\bar{x}_{t} \vert \bar{x}_{0}) = \mathcal{N}(\bar{x}_t; \sqrt{\bar{\alpha}_t}\bar{x}_{0}, (1-\bar{\alpha}_t)\mathbf{I})$$

Here, $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.
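A minimal sketch of sampling $\bar{x}_t$ directly from $\bar{x}_0$ with this closed form (PyTorch; the linear $\beta$ schedule is an assumption for illustration):

```python
# Hedged sketch: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def q_sample(x0, t):
    abar = alphas_bar[t].view(-1, 1, 1, 1)       # broadcast over (C, H, W)
    eps = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.rand(4, 3, 32, 32)                       # dummy batch of "images"
xt = q_sample(x0, torch.tensor([0, 99, 499, 999]))  # noisier as t grows
```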

The following shows how to derive the forward diffusion process.
![forward diffusion process](images/ddpm/derivation1.jpg)

## Reverse diffusion process
If we knew the conditional distribution $p(\bar{x}_{t-1} \vert \bar{x}_{t})$, we could reverse the forward process: starting from pure noise and gradually "denoising" it, we would end up with a sample from the real distribution.

However, this is intractable, as computing the conditional probability requires knowing the actual data distribution of the images. Hence, we use a neural network to approximate (learn) the conditional distribution $p_{\theta}(\bar{x}_{t-1} \vert \bar{x}_{t})$.

Starting with the pure Gaussian noise $p(\bar{x}_T) = \mathcal{N}(\bar{x}_T; 0, \mathbf{I})$, and assuming the reverse process to be Gaussian and Markov, the joint distribution $p_{\theta}(\bar{x}_{0:T})$ is given as follows:

$$ p_{\theta}(\bar{x}_{0:T}) = p(\bar{x}_T) \prod_{t=1}^{T}p_\theta(\bar{x}_{t-1} \vert \bar{x}_{t}) $$

$$ p_{\theta}(\bar{x}_{0:T}) = p(\bar{x}_T) \prod_{t=1}^{T} \mathcal{N}(\bar{x}_{t-1}; \mu_{\theta}(\bar{x}_t,t), \Sigma_{\theta}(\bar{x}_t,t))$$
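A sketch of a single learned reverse step under the noise-prediction parameterization used in the paper, with the simple fixed variance choice $\Sigma_\theta = \beta_t \mathbf{I}$ (here `eps_model` stands for a trained noise-prediction network; its interface is an assumption):

```python
# Hedged sketch: one reverse step x_t -> x_{t-1}, with
# mu_theta = (x_t - beta_t / sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(alpha_t)
import torch

def p_sample(eps_model, x_t, t, betas, alphas_bar):
    # t: integer timestep index; betas, alphas_bar: 1-D schedule tensors
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    abar_t = alphas_bar[t]
    eps = eps_model(x_t, t)                                    # predicted noise
    mean = (x_t - beta_t / (1.0 - abar_t).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                                            # no noise added at the final step
    return mean + beta_t.sqrt() * torch.randn_like(x_t)        # + sigma_t * z, sigma_t^2 = beta_t
```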

6 changes: 3 additions & 3 deletions Summaries/Diffusion/Null-TextInversion.md
The paper introduces an accurate inversion scheme, achieving near-perfect reconstruction.

### Classifier-free guidance
Classifier-Free Guidance (CFG) is a lightweight technique to encourage prompt-adherence in text-to-image generation. In diffusion models, in each step, the prediction is performed twice: once unconditionally and once with the text condition. These predictions are then extrapolated to amplify the effect of the text guidance. The CFG prediction is defined as:
$\bar{\epsilon}_\theta(z_t, t, \phi) = w \cdot \epsilon_\theta(z_t, t, C) + (1-w) \cdot \epsilon_\theta(z_t, t, \phi)$
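A minimal sketch of this prediction, assuming a diffusers-style conditional UNet whose forward call takes `encoder_hidden_states`; `cond_emb` and `null_emb` stand for the prompt and empty-string ("null") embeddings:

```python
# Hedged sketch: classifier-free guidance exactly as written above.
def cfg_noise_pred(unet, z_t, t, cond_emb, null_emb, w=7.5):
    eps_cond = unet(z_t, t, encoder_hidden_states=cond_emb).sample    # conditional prediction
    eps_null = unet(z_t, t, encoder_hidden_states=null_emb).sample    # unconditional prediction
    # w * eps(z_t, t, C) + (1 - w) * eps(z_t, t, phi)
    return w * eps_cond + (1.0 - w) * eps_null
```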

### DDIM Inversion
DDIM inversion is a simple inversion technique that is the reverse of DDIM sampling, based on the assumption that the ODE process can be reversed in the limit of small steps. The diffusion process is performed in the reverse direction, that is $z_0 \rightarrow z_T$ instead of $z_T \rightarrow z_0$.
### Pivotal inversion
- In particular, the authors perform the optimization around a pivotal noise vector, which is a good approximation and thus allows a more efficient inversion.
- **For this, the authors use DDIM inversion with guidance scale $w = 1$ as a rough approximation of the original image, which is highly editable but far from accurate.**
- The DDIM inversion with $w=1$ is called the pivot trajectory, and the optimization is performed around it. The optimization aims to maximize the similarity to the original image:
$ \min||{z}_{t-1}^{*} - z_{t-1} ||_2^2 $
- Note that a separate optimization is performed for each timestep $t$, from $t=T \rightarrow t=1$, with the endpoint of the previous step's optimization as the starting point for the current $t$.

### Null-text optimization
- As mentioned before, the authors optimize only the unconditional embedding $\phi$ as part of null-text optimization, while the model and the conditional textual embedding are kept unchanged.
- The authors observed that optimizing a different "null embedding" for each timestep $t$ significantly improves the reconstruction quality and is best suited for pivotal inversion.
- Therefore, the unconditional text embeddings $\{\phi_t\}_{t=1}^{T}$ are optimized over all steps, with the previous timestep's result as the starting point:
$ \min||{z}_{t-1}^{*} - z_{t-1}(\bar{z}_t, \phi_t, C) ||_2^2 $
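A condensed sketch of this per-timestep loop (all helper names here — `ddim_step`, the pivot latent list, the embeddings — are assumptions for illustration, not the authors' API):

```python
# Hedged sketch of per-timestep null-text optimization around the DDIM pivot trajectory.
# pivot_latents[i] is z*_t for the i-th timestep (t = T ... 0) from DDIM inversion with w = 1,
# and ddim_step(z_t, t, cond_emb, null_emb, w) performs one CFG-guided DDIM step z_t -> z_{t-1}.
import torch
import torch.nn.functional as F

def null_text_optimization(pivot_latents, timesteps, cond_emb, null_emb,
                           ddim_step, inner_steps=10, lr=1e-2, w=7.5):
    null_embs, z_t = [], pivot_latents[0]                 # start from z*_T
    for i, t in enumerate(timesteps):                     # timesteps ordered T ... 1
        phi_t = null_emb.clone().detach().requires_grad_(True)
        opt = torch.optim.Adam([phi_t], lr=lr)
        z_target = pivot_latents[i + 1]                   # pivot z*_{t-1}
        for _ in range(inner_steps):                      # minimize ||z*_{t-1} - z_{t-1}||^2
            loss = F.mse_loss(ddim_step(z_t, t, cond_emb, phi_t, w), z_target)
            opt.zero_grad(); loss.backward(); opt.step()
        null_embs.append(phi_t.detach())
        with torch.no_grad():                             # move to the next timestep
            z_t = ddim_step(z_t, t, cond_emb, phi_t, w)
        null_emb = phi_t.detach()                         # warm-start phi for the next t
    return null_embs
```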

The full algorithm can be summarized as follows:
![](images/null-inversion/algorithm.png)
\[
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\]

![](images/attention/1_attention_mechanism.png)


- The block `Mask (opt.)` represents the optional masking of specific entries in the attention matrix. This is for instance used if we stack multiple sequences with different lengths into a batch, which helps with parallelization in PyTorch. The masking is also used in the self-attention mechanism of the decoder part of the Transformer to allow information flow only from previous tokens and prevent attending to future tokens (a minimal sketch follows).
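A minimal sketch of scaled dot-product attention with such an optional mask (plain PyTorch, not the exact code of any particular implementation):

```python
# Hedged sketch: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with optional masking.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (..., len_q, len_k)
    if mask is not None:                                     # masked entries get ~zero weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v, attn

# causal (decoder) mask: each position attends only to itself and earlier positions
q = k = v = torch.randn(1, 5, 16)
causal_mask = torch.tril(torch.ones(5, 5))
out, weights = scaled_dot_product_attention(q, k, v, mask=causal_mask)
```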

- The `Transformer model` introduced in the paper consists of an encoder and a decoder.

![](images/attention/1_transformer_model.png)

- The encoder processes the input sequence, while the decoder generates the output sequence.
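For reference, a minimal sketch of this encoder-decoder layout with the base hyperparameters reported in the paper, using PyTorch's built-in module; token embedding and positional encoding are omitted, and random embeddings are fed directly:

```python
# Hedged sketch: encoder-decoder Transformer with the base configuration
# (d_model=512, 8 heads, 6 encoder + 6 decoder layers, FFN width 2048).
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, batch_first=True)

src = torch.randn(2, 10, 512)   # (batch, source length, d_model) -- already-embedded tokens
tgt = torch.randn(2, 7, 512)    # (batch, target length, d_model)
out = model(src, tgt)           # (2, 7, 512): one decoder output per target position
```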


## Acknowledgments
- Transformers United 2023: Introduction to Transformers w/ Andrej Karpathy: https://www.youtube.com/watch?v=XfpMkf4rD6E
- Original paper: https://arxiv.org/abs/1706.03762
