From b8b009a81bfce0e5bb5134588a8aff05e5251f29 Mon Sep 17 00:00:00 2001 From: Aayush Garg Date: Wed, 13 Sep 2023 16:34:55 +0530 Subject: [PATCH] correct minor errors in readme and summaries --- README.md | 18 +++++++++--------- Summaries/Diffusion/DDIM.md | 2 +- Summaries/Diffusion/DDPM.md | 16 ++++++++-------- Summaries/Diffusion/Null-TextInversion.md | 6 +++--- .../{Transformers.md => Attention.md} | 7 +++---- .../1_attention_mechanism.png | Bin .../1_transformer_model.png | Bin 7 files changed, 24 insertions(+), 25 deletions(-) rename Summaries/Transformers/{Transformers.md => Attention.md} (97%) rename Summaries/Transformers/images/{transformers => attention}/1_attention_mechanism.png (100%) rename Summaries/Transformers/images/{transformers => attention}/1_transformer_model.png (100%) diff --git a/README.md b/README.md index 4cf4c27..df52cde 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ This repository houses my personal summaries and notes on a variety of academic | [**`Summary notes`**](Summaries/Diffusion/DDPM.md) | [`Paper explanation video: Yannic Kilcher`](https://www.youtube.com/watch?v=W-O7AZNzbzQ) | |---|---| | [**`Archive link`**](https://arxiv.org/abs/2006.11239) | [**`Basic annotated implementation`**](https://nn.labml.ai/diffusion/ddpm/index.html) | -

+ ### 2. Denoising Diffusion Implicit Models, Song et. al. - Presents DDIMs, which are implicit probabilistic models that can produce high-quality samples **10X** to **50X** faster (in about 50 steps) in comparison to DDPM @@ -18,10 +18,10 @@ This repository houses my personal summaries and notes on a variety of academic - The training objective in DDIM is similar to DDPM's, so one can use any pretrained DDPM model with DDIM or other generative processes that can generate images in fewer steps | [**`Summary notes`**](Summaries/Diffusion/DDIM.md) | [`Archive link`](https://arxiv.org/abs/2010.02502) | [`Github repo`](https://github.com/ermongroup/ddim) | |---|---|---| -

+ ### 3. High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et. al. -

+ ### 4. Prompt-to-Prompt Image Editing with Cross Attention Control, Hertz et. al. - Introduces a textual editing method to semantically edit images in pre-trained text-conditioned diffusion models via Prompt-to-Prompt manipulations @@ -29,7 +29,7 @@ This repository houses my personal summaries and notes on a variety of academic - The key idea is that one can edit images by injecting the cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text during which diffusion steps. | [**`Summary notes`**](Summaries/Diffusion/Prompt-to-prompt.md) | [`Archive link`](https://arxiv.org/abs/2208.01626) | [`Github repo`](https://github.com/google/prompt-to-prompt/) | |---|---|---| -
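To make the cross-attention injection idea in the Prompt-to-Prompt entry above concrete, here is a toy single-head NumPy sketch (an illustration only, not the authors' implementation; all shapes and variable names are assumed): the attention maps computed for the source prompt are reused while the values come from the edited prompt, so the spatial layout is preserved while the content changes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, attn_override=None):
    """Single-head cross-attention between image queries and prompt tokens.
    If attn_override is given, reuse those attention maps instead of the
    ones computed from q and k (the injection idea)."""
    if attn_override is None:
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    else:
        attn = attn_override
    return attn @ v, attn

rng = np.random.default_rng(0)
q = rng.normal(size=(64, 32))       # 64 image tokens (8x8 latent), dim 32
k_src = rng.normal(size=(8, 32))    # 8 tokens of the source prompt
v_src = rng.normal(size=(8, 32))
k_edit = rng.normal(size=(8, 32))   # 8 tokens of the edited prompt
v_edit = rng.normal(size=(8, 32))

# Source pass: record which pixels attend to which prompt tokens.
_, src_attn = cross_attention(q, k_src, v_src)

# Edited pass: inject the source attention maps so the layout is kept
# while the content comes from the edited prompt's values.
out_edit, _ = cross_attention(q, k_edit, v_edit, attn_override=src_attn)
print(out_edit.shape)  # (64, 32)
```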

+ ### 5. Null-text Inversion for Editing Real Images using Guided Diffusion Models, Mokady et. al. - Introduces an accurate inversion scheme for **real input images**, enabling intuitive and versatile text-based image modification without tuning model weights. @@ -38,7 +38,7 @@ This repository houses my personal summaries and notes on a variety of academic | [**`Summary notes`**](Summaries/Diffusion/Null-TextInversion.md) | [`Archive link`](https://arxiv.org/abs/2211.09794) | |---|---| | [**`Paper walkthrough video: Original author`**](https://www.youtube.com/watch?v=qzTlzrMWU2M&t=52s) | [**`Github repo`**](https://github.com/google/prompt-to-prompt/#null-text-inversion-for-editing-real-images) | -

+ ### 6. Adding Conditional Control to Text-to-Image Diffusion Models, Lvmin Zhang and Maneesh Agrawala et. al. - Allows additional control over large pre-trained diffusion models, such as Stable Diffusion, by supporting input visual conditions such as edge maps, segmentation masks, depth maps, etc. @@ -49,21 +49,21 @@ This repository houses my personal summaries and notes on a variety of academic |---|---|---| | [**`HF usage example`**](https://huggingface.co/blog/controlnet) | [**`Controlnet SD1.5 1.0 and 1.1 ckpts`**](https://huggingface.co/lllyasviel) | [**`Controlnet SDXL ckpts`**](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) | -

+ ### 7. DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion, Karras et. al. - An image-and-pose conditioned diffusion method based upon Stable Diffusion to turn fashion photographs into realistic, animated videos - Introduces a pose conditioning approach that greatly improves temporal consistency across frames - Uses CLIP image and VAE encoders, instead of a text encoder, which increases the output fidelity to the conditioning image - | [**`Summary notes`**](Summaries/Diffusion/SDXL.md) | [`Archive link`](https://arxiv.org/abs/2304.06025) | [`Github repo`](https://github.com/johannakarras/DreamPose)| + | [**`Summary notes`**](Summaries/Diffusion/DreamPose.md) | [`Archive link`](https://arxiv.org/abs/2304.06025) | [`Github repo`](https://github.com/johannakarras/DreamPose)| |---|---|---| -

+ ### 8. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, Podell et. al. - Introduces an enhanced Stable Diffusion model that surpasses the generative capabilities of previous versions - Uses a larger UNet backbone and introduces novel conditioning schemes in the training stage - Probably the best public-domain, open-source text-to-image model at this moment (Aug 2023) - | [**`Summary notes`**](Summaries/Diffusion/DreamPose.md) | [`Archive link`](https://arxiv.org/abs/2307.01952) | + | [**`Summary notes`**](Summaries/Diffusion/SDXL.md) | [`Archive link`](https://arxiv.org/abs/2307.01952) | |---|---| | [**`Paper walkthrough video: Two Minute Papers`**](https://www.youtube.com/watch?v=kkYaikeLJdc) | [**`HF usage example`**](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl) | diff --git a/Summaries/Diffusion/DDIM.md b/Summaries/Diffusion/DDIM.md index cef740a..1d2a4ae 100644 --- a/Summaries/Diffusion/DDIM.md +++ b/Summaries/Diffusion/DDIM.md @@ -95,7 +95,7 @@ We can't do the similar inversion to generate the latent code using DDPM due to ## Examples For the examples, $\sigma$ is defined as: -$${\sigma}_t = \eta\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}}\sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$$ +${\sigma}_t = \eta\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}}\sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$ Here, $\sigma$ controls the stochasticity, which is controlled using $\eta$ (0: DDIM, 1: DDPM) diff --git a/Summaries/Diffusion/DDPM.md b/Summaries/Diffusion/DDPM.md index 5fdaf20..a070fcc 100644 --- a/Summaries/Diffusion/DDPM.md +++ b/Summaries/Diffusion/DDPM.md @@ -10,24 +10,24 @@ - $q(x_0)$ : the real data distribution - $\bar{x}$ : a data point sampled from a real data distribution - $\bar{x}_T$ : the final pure Gaussian noise $\mathcal{N}(\bar{x}_T; 0, \mathbf{I})$ after the forward diffusion process -- $q(\bar{x}_{1:T} | \bar{x}_{0})$ : forward diffion process +- $q(\bar{x}_{1:T} \vert \bar{x}_{0})$ : forward diffusion process - $\beta_t$ : the fixed variance schedule in the diffusion process ## Forward diffusion process -For a sample $\bar{x}_0$ from the given real distribution, $q(x_0)$, we define a forward diffusion process, $q(\bar{x}_{1:T} | \bar{x}_{0})$, in which we add small amount of Gaussian noise to the $\bar{x}_0$ in $T$ steps, producing a sequence of noisy samples $\bar{x}_1$, $\bar{x}_2$,...,$\bar{x}_T$, according to a pre-defined variance schedule $\{\beta_t \in (0,1) \}_{t=1}^{T}$. The data sample gradually loses its features as the steps approaches $T$ such that $\bar{x}^T$ is equivalent to isotropic Gaussian noise. +For a sample $\bar{x}_0$ from the given real distribution, $q(x_0)$, we define a forward diffusion process, $q(\bar{x}_{1:T} \vert \bar{x}_{0})$, in which we add a small amount of Gaussian noise to $\bar{x}_0$ in $T$ steps, producing a sequence of noisy samples $\bar{x}_1$, $\bar{x}_2$,...,$\bar{x}_T$, according to a pre-defined variance schedule $\{\beta_t \in (0,1) \}_{t=1}^{T}$. The data sample gradually loses its features as $t$ approaches $T$ such that $\bar{x}_T$ is equivalent to isotropic Gaussian noise.
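To make the forward process above concrete before the formal expressions that follow, here is a minimal NumPy sketch (illustrative only, not the paper's code; the linear $\beta_t$ schedule and tensor shapes are assumptions) that noises a sample both step by step and via the closed-form $q(\bar{x}_t \vert \bar{x}_0)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # \bar{alpha}_t = prod_{i<=t} alpha_i

x0 = rng.normal(size=(3, 32, 32))    # a stand-in "image"

# Step-by-step forward process: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
x = x0.copy()
for t in range(T):
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * rng.normal(size=x.shape)

# Closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
t = T - 1
eps = rng.normal(size=x0.shape)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Both end up close to isotropic Gaussian noise for large T.
print(round(float(x.std()), 2), round(float(x_t.std()), 2))
```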
![forward process](images/ddpm/forwardprocess.png) As the forward process is a Markov chain: -$$q(\bar{x}_{1:T} | \bar{x}_{0}) = \prod_{t=1}^{T}q(\bar{x}_{t} | \bar{x}_{t-1})$$ +$$q(\bar{x}_{1:T} \vert \bar{x}_{0}) = \prod_{t=1}^{T}q(\bar{x}_{t} \vert \bar{x}_{t-1})$$ Since the distributions are Gaussian: -$$q(\bar{x}_{t} | \bar{x}_{t-1}) = \mathcal{N}(\bar{x}_t; \sqrt{1-\beta_t}\bar{x}_{t-1}, \beta_t\mathbf{I})$$ +$$q(\bar{x}_{t} \vert \bar{x}_{t-1}) = \mathcal{N}(\bar{x}_t; \sqrt{1-\beta_t}\bar{x}_{t-1}, \beta_t\mathbf{I})$$ The $\bar{x}_t$ can also be sampled using $\bar{x}_0$ as follows: -$$q(\bar{x}_{t} | \bar{x}_{0}) = \mathcal{N}(\bar{x}_0; \sqrt{\bar{\alpha}_t}\bar{x}_{0}, (1-\bar{\alpha}_t)\mathbf{I})$$ +$$q(\bar{x}_{t} \vert \bar{x}_{0}) = \mathcal{N}(\bar{x}_t; \sqrt{\bar{\alpha}_t}\bar{x}_{0}, (1-\bar{\alpha}_t)\mathbf{I})$$ Here, $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. @@ -35,13 +35,13 @@ The following shows how to derive the forward diffusion process. ![forward diffusion process](images/ddpm/derivation1.jpg) ## Reverse diffusion process -If we know the $p(\bar{x}_{t-1} | \bar{x}_{t})$ conditional process, then we can reverse the forward process starting from pure noise and gradually "denoising" it so that we end up with a sample from the real distribution. +If we know the $p(\bar{x}_{t-1} \vert \bar{x}_{t})$ conditional process, then we can reverse the forward process starting from pure noise and gradually "denoising" it so that we end up with a sample from the real distribution. -However, it is intractable and requires knowing the actual data distribution of the images in order to calculate this conditional probability. Hence, we use a neural network $p_\theta$ to approximate (learn) the $p_{\theta}(\bar{x}_{t-1} | \bar{x}_{t})$ conditional probability distribution. +However, it is intractable and requires knowing the actual data distribution of the images in order to calculate this conditional probability. Hence, we use a neural network $p_\theta$ to approximate (learn) the $p_{\theta}(\bar{x}_{t-1} \vert \bar{x}_{t})$ conditional probability distribution. Starting with the pure Gaussian noise $p(\bar{x}_T) = \mathcal{N}(\bar{x}_T; 0, \mathbf{I})$, assuming the reverse process to be Gaussian and Markov, the joint conditional distribution $p_{\theta}(\bar{x}_{0:T})$ is given as follows: -$$ p_{\theta}(\bar{x}_{0:T}) = p(\bar{x}_T) \prod_{t=1}^{T}p_\theta(\bar{x}_{t-1} | \bar{x}_{t}) $$ +$$ p_{\theta}(\bar{x}_{0:T}) = p(\bar{x}_T) \prod_{t=1}^{T}p_\theta(\bar{x}_{t-1} \vert \bar{x}_{t}) $$ $$ p_{\theta}(\bar{x}_{0:T}) = p(\bar{x}_T) \prod_{t=1}^{T} \mathcal{N}(\bar{x}_{t-1}; \mu_{\theta}(\bar{x}_t,t), \Sigma_{\theta}(\bar{x}_t,t))$$ diff --git a/Summaries/Diffusion/Null-TextInversion.md b/Summaries/Diffusion/Null-TextInversion.md index fd6b02b..b3d7658 100644 --- a/Summaries/Diffusion/Null-TextInversion.md +++ b/Summaries/Diffusion/Null-TextInversion.md @@ -29,7 +29,7 @@ The paper introduces an accurate inversion scheme, achieving near-perfect recons ### Classifier-free guidance Classifier-Free Guidance (CFG) is a lightweight technique to encourage prompt adherence in text-to-image generation. In diffusion models, in each step, the prediction is performed twice: once unconditionally and once with the text condition. These predictions are then extrapolated to amplify the effect of the text guidance.
The CFG prediction is defined as: -$$ \bar{\epsilon_\theta}(z_t, t, \phi)= w.\epsilon_\theta(z_t, t, C) + (1-w) .\epsilon_\theta(z_t, t, \phi)$$ +$ \bar{\epsilon}_\theta(z_t, t, \phi) = w \cdot \epsilon_\theta(z_t, t, C) + (1-w) \cdot \epsilon_\theta(z_t, t, \phi)$ ### DDIM Inversion DDIM inversion is a simple inversion technique that is the reverse of DDIM sampling, based on the assumption that the ODE process can be reversed in the limit of small steps. The diffusion process is performed in the reverse direction, that is, $z_0 \rightarrow z_T$ instead of $z_T \rightarrow z_0$ @@ -59,14 +59,14 @@ embedding with an optimized one, referred to as null-text optimization - In particular, authors aim to perform their optimization around a pivotal noise vector, which is a good approximation and thus allows a more efficient inversion. - **For this, authors use DDIM inversion with guidance scale w = 1 as a rough approximation of the original image, which is highly editable but far from accurate.** - The DDIM inversion with $w=1$ is called the pivot trajectory, and optimization is performed around it. The optimization aims to maximize the similarity to the original image. -$$ min||{z}_{t-1}^{*} - z_{t-1} ||_2^2 $$ +$ \min ||z_{t-1}^{*} - z_{t-1}||_2^2 $ - Note, a separate optimization is performed for each timestep $t$ from $t=T \rightarrow t=1$, with the endpoint of the previous step's optimization as the starting point for the current $t$. ### Null-text optimization - As mentioned before, authors optimize only the unconditional embedding $\phi$ as part of null-text optimization, while the model and the conditional textual embedding are kept unchanged. - Authors observed that optimizing a different “null embedding” for each timestamp $t$ significantly improves the reconstruction quality and is best suited for pivotal inversion. - Therefore, the unconditional text embeddings ($\{\phi_t\}_{t=1}^T$) at all steps are optimized, with the previous timestep's result as the starting point. -$$ min||{z}_{t-1}^{*} - z_{t-1}(\bar{z_t, \phi_t, C}) ||_2^2 $$ +$ \min ||z_{t-1}^{*} - z_{t-1}(\bar{z}_t, \phi_t, C)||_2^2 $ The full algorithm can be summarized as follows: ![](images/null-inversion/algorithm.png) diff --git a/Summaries/Transformers/Transformers.md b/Summaries/Transformers/Attention.md similarity index 97% rename from Summaries/Transformers/Transformers.md rename to Summaries/Transformers/Attention.md index 946f2d6..2e6ff7e 100644 --- a/Summaries/Transformers/Transformers.md +++ b/Summaries/Transformers/Attention.md @@ -45,7 +45,7 @@ \text{{head}}_i = \text{{Attention}}(QW_i^Q, KW_i^K, VW_i^V) \] -![](images/transformers/1_attention_mechanism.png) +![](images/attention/1_attention_mechanism.png) - The block `Mask (opt.)` represents the optional masking of specific entries in the attention matrix. This is for instance used if we stack multiple sequences with different lengths into a batch. This helps with parallelization in PyTorch. The masking is also used in the self-attention mechanism of the decoder part of the Transformer to allow information flow only from previous tokens and block any attention to future tokens @@ -65,7 +65,7 @@ - The `Transformer model` introduced in the paper consists of an encoder and a decoder. -![](images/transformers/1_transformer_model.png) +![](images/attention/1_transformer_model.png) - The encoder processes the input sequence, while the decoder generates the output sequence.
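As a rough companion to the attention and masking notes above, here is a minimal PyTorch sketch of scaled dot-product attention with the optional mask (an assumed, illustrative implementation; the function name and shapes are not the notes' reference code):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k); mask: boolean tensor broadcastable
    to the score matrix, where True marks positions that may be attended to."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # block future / padded tokens
    attn = torch.softmax(scores, dim=-1)
    return attn @ v, attn

# Causal mask as used in the decoder: token i may only attend to tokens <= i.
batch, heads, seq, d_k = 2, 4, 6, 16
q = torch.randn(batch, heads, seq, d_k)
k = torch.randn(batch, heads, seq, d_k)
v = torch.randn(batch, heads, seq, d_k)
causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))

out, attn = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape, attn.shape)  # torch.Size([2, 4, 6, 16]) torch.Size([2, 4, 6, 6])
```

The causal mask here corresponds to the decoder-side masking described above: position $i$ may only attend to positions $\le i$, so no information flows from future tokens.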
@@ -91,5 +91,4 @@ ## Acknowledgments - Transformers United 2023: Introduction to Transformers w/ Andrej Karpathy: https://www.youtube.com/watch?v=XfpMkf4rD6E -- Original paper: https://arxiv.org/abs/1706.03762 -- ChatGPT +- Original paper: https://arxiv.org/abs/1706.03762 \ No newline at end of file diff --git a/Summaries/Transformers/images/transformers/1_attention_mechanism.png b/Summaries/Transformers/images/attention/1_attention_mechanism.png similarity index 100% rename from Summaries/Transformers/images/transformers/1_attention_mechanism.png rename to Summaries/Transformers/images/attention/1_attention_mechanism.png diff --git a/Summaries/Transformers/images/transformers/1_transformer_model.png b/Summaries/Transformers/images/attention/1_transformer_model.png similarity index 100% rename from Summaries/Transformers/images/transformers/1_transformer_model.png rename to Summaries/Transformers/images/attention/1_transformer_model.png