
Commit 9bfc01c

small updates
1 parent 79f9bb0 commit 9bfc01c

File tree

2 files changed: +194 additions, −225 deletions


Diffusers_library.ipynb

Lines changed: 36 additions & 30 deletions
@@ -66,8 +66,18 @@
 " - in training, generate noisy images for training \n",
 " - in inference, compute the next sample given the model's output\n",
 "\n",
-"![Screenshot%202022-09-04%20221316.png](attachment:Screenshot%202022-09-04%20221316.png)\n",
-"\n",
+"![Screenshot%202022-09-04%20221316.png](attachment:Screenshot%202022-09-04%20221316.png)"
+]
+},
+{
+"cell_type": "markdown",
+"id": "9a393e0d",
+"metadata": {
+"slideshow": {
+"slide_type": "subslide"
+}
+},
+"source": [
 "Models and schedulers are kept as independent from each other as possible:\n",
 "- A scheduler should never accept a model as an input and vice-versa. "
 ]
@@ -1026,14 +1036,14 @@
 }
 },
 "source": [
-"All schedulers provide one (or more) ``step()`` methods that can be used to compute the slightly less noisy image, i.e., the next sample in the backward process. \n",
-"\n",
-"The ``step()`` method may vary from one scheduler to another, but normally expects:\n",
+"All schedulers provide one (or more) ``step()`` methods to compute the slightly less noisy image. The ``step()`` method may vary from one scheduler to another, but normally expects:\n",
 "- the model output $\\tilde z_t$ (what we called ``noisy_residual``)\n",
 "- the ``timestep`` $t$\n",
 "- the current ``noisy_sample`` $\\tilde x_t$\n",
 "\n",
-"![photo_2022-09-06_19-04-27.jpg](attachment:photo_2022-09-06_19-04-27.jpg)"
+"<div>\n",
+"<img src=\"attachment:photo_2022-09-06_19-04-27.jpg\" width=\"700\"/>\n",
+"</div>"
 ]
 },
 {
@@ -1083,7 +1093,7 @@
 "source": [
 "Time to define the **denoising loop**.\n",
 "\n",
-"- We loop over ``scheduler.timesteps``, a tensor defining the sequence of timesteps over which to iterate during the denoising process. \n",
+"- We loop over ``scheduler.timesteps``, the sequence of timesteps for the denoising process. \n",
 "- Usually, the denoising process goes in decreasing order of timesteps (here from 1000 to 0).\n",
 "- To visualize what is going on, we print out the (less and less) noisy samples every 50 steps."
 ]
@@ -1558,7 +1568,8 @@
 "source": [
 "- It takes quite some time to produce a meaningful image\n",
 "- To speed-up the generation, we switch the DDPM scheduler with the DDIM scheduler\n",
-"- The DDIM scheduler removes stochasticity during sampling and updates the samples every $T/S$ steps, reducing the total number of inference steps from $T$ to $S$\n",
+"- The DDIM scheduler removes stochasticity during sampling and updates the samples every $T/S$ steps\n",
+"- The total number of inference steps is reduced from $T$ to $S$\n",
 "- Note that some schedulers follow different protocols and cannot be switched so easily like in this case"
 ]
 },
@@ -1757,7 +1768,7 @@
 "source": [
 "- In DDPM some noise with variance $\\sigma_t$ (or $\\tilde \\beta_t$) is added to get the next sample\n",
 "- Instead, the DDIM scheduler is deterministic\n",
-"- Starting from the same input $x_t$ gives the same output $x_0$"
+"- Starting from the same input $x_T$ gives the same output $x_0$"
 ]
 },
 {
@@ -1776,9 +1787,9 @@
 "\n",
 "There are 3 main components in the latent diffusion model.\n",
 "\n",
-"1. The U-Net (as in DDPM/DDIM)\n",
+"1. A tokenizer + text-encoder (CLIP)\n",
 "2. An autoencoder (VAE)\n",
-"3. A text-encoder (CLIP)"
+"3. The U-Net (as in DDPM/DDIM)"
 ]
 },
 {
@@ -1790,9 +1801,11 @@
 }
 },
 "source": [
+"Tokenizer + text-encoder:\n",
+"\n",
 "- The **text-encoder** is responsible for transforming a text prompt into an embedding space that can be understood by the U-Net \n",
-"- It is usually a transformer-based encoder that maps a sequence of tokens (generated with a **tokenizer**) into a sequence of latent text-embeddings\n",
-"- Stable Diffusion does not train the text-encoder and simply uses an already trained text encoder such as CLIP or BERT"
+"- It is usually a transformer-based encoder that maps a sequence of tokens (generated with a **tokenizer**) into a (large fixed size) text-embedding\n",
+"- Stable Diffusion does not train the text-encoder and simply uses an already trained one such as CLIP or BERT"
 ]
 },
 {
@@ -1805,8 +1818,8 @@
 },
 "source": [
 "The VAE model has two parts:\n",
-"- the **encoder** is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net\n",
-"- the **decoder** transforms the latent representation back into an image.\n",
+"- The **encoder** is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net\n",
+"- The **decoder** transforms the latent representation back into an image.\n",
 "- During **training**, the encoder is used to get the latent representations of the images for the forward diffusion process \n",
 "- During **inference**, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder\n",
 "- Working with latents is the key to the speed and memory efficiency of Stable Diffusion\n",
@@ -2036,15 +2049,17 @@
 "## Custom pipeline\n",
 "\n",
 "- As we did before, we will replace the ``scheduler`` in the pre-built pipeline.\n",
-"- The other 4 components, ``vae``, ``tokenizer``, ``text_encoder``, and ``unet`` are not changed but they could be switched as well (e.g., the CLIP ``text_encoder`` with BERT or a different type of ``vae``)\n",
+"- The other 4 components, ``vae``, ``tokenizer``, ``text_encoder``, and ``unet`` are not changed but they could also be switched (e.g., CLIP ``text_encoder`` with BERT or a different ``vae``)\n",
 "- We start by loading them"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 22,
 "id": "71424bf6",
-"metadata": {},
+"metadata": {
+"collapsed": true
+},
 "outputs": [
 {
 "data": {
@@ -3112,7 +3127,7 @@
 "\n",
 "- First, we get the embeddings for the prompt\n",
 "- These embeddings will be used to condition the UNet model and guide the image generation towards something that should resemble the input prompt\n",
-"- The text_embeddings are arrays of size $77 \\times 768$"
+"- The ``text_embeddings`` are arrays of size $77 \\times 768$"
 ]
 },
 {
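The text-encoder step can be sketched as follows — a hedged example using a randomly initialized CLIP text model from transformers with dimensions chosen to match the $77 \times 768$ shape above (Stable Diffusion v1 loads the pretrained "openai/clip-vit-large-patch14" weights instead):

```python
# Hedged sketch: random CLIP text model, sized to give (1, 77, 768) embeddings.
import torch
from transformers import CLIPTextConfig, CLIPTextModel

config = CLIPTextConfig(
    hidden_size=768,            # embedding dimension
    num_attention_heads=12,
    num_hidden_layers=2,        # truncated depth, illustration only
    max_position_embeddings=77, # prompt length after padding
)
text_encoder = CLIPTextModel(config)

# stand-in for tokenizer output: a (1, 77) batch of token ids
input_ids = torch.randint(0, config.vocab_size, (1, 77))
with torch.no_grad():
    text_embeddings = text_encoder(input_ids).last_hidden_state
```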
@@ -3192,13 +3207,12 @@
 "source": [
 "**Guidance**\n",
 "\n",
-"For classifier-free guidance, we need to do two forward passes: \n",
+"For classifier-free guidance, we need $\\tilde z = \\tilde z_x + \\gamma \\big( \\tilde z_{x|y} - \\tilde z_x \\big)$.\n",
+"We need two forward passes: \n",
 "- one with the conditioned input (``text_embeddings``) to get $\\tilde z_{x|y}$ (i.e., the score function $\\nabla_x p(x|y)$)\n",
 "- one with the unconditional embeddings (``uncond_embeddings``) to get $\\tilde z_x$ (i.e., the score function $\\nabla_x p(x)$)\n",
 "\n",
-"In practice, we can concatenate both into a single batch to avoid doing two forward passes\n",
-"\n",
-"The final predicted noise is $\\tilde z = \\tilde z_x + \\gamma \\big( \\tilde z_{x|y} - \\tilde z_x \\big)$"
+"In practice, we can concatenate both into a single batch to avoid doing two forward passes"
 ]
 },
 {
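The batched guidance trick in this hunk can be sketched as follows; a random tensor stands in for the UNet output, since the point is the batching and the guidance formula, not the model:

```python
# Hedged sketch of classifier-free guidance with a single batched forward pass.
import torch

guidance_scale = 7.5  # gamma

uncond_embeddings = torch.randn(1, 77, 768)
text_embeddings = torch.randn(1, 77, 768)
latents = torch.randn(1, 4, 64, 64)

# concatenate so one forward pass yields both predictions
embeddings = torch.cat([uncond_embeddings, text_embeddings])
latent_input = torch.cat([latents, latents])
# stand-in for unet(latent_input, t, encoder_hidden_states=embeddings)
noise_pred = torch.randn(2, 4, 64, 64)

z_x, z_xy = noise_pred.chunk(2)  # unconditional, conditional
z = z_x + guidance_scale * (z_xy - z_x)  # final predicted noise
```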
@@ -3476,14 +3490,6 @@
 "\n",
 "![v1-variants-scores.jpg](attachment:v1-variants-scores.jpg)"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "89107156",
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {

diffusion_from_scratch.ipynb

Lines changed: 158 additions & 195 deletions
Large diffs are not rendered by default.
