|
66 | 66 | " - in training, generate the noisy images used to train the model \n",
|
67 | 67 | " - in inference, compute the next sample given the model's output\n",
|
68 | 68 | "\n",
|
69 |
| - "\n", |
70 |
| - "\n", |
| 69 | + "" |
| 70 | + ] |
| 71 | + }, |
| 72 | + { |
| 73 | + "cell_type": "markdown", |
| 74 | + "id": "9a393e0d", |
| 75 | + "metadata": { |
| 76 | + "slideshow": { |
| 77 | + "slide_type": "subslide" |
| 78 | + } |
| 79 | + }, |
| 80 | + "source": [ |
71 | 81 | "Models and schedulers are kept as independent from each other as possible:\n",
|
72 | 82 | "- A scheduler should never accept a model as an input and vice-versa. "
|
73 | 83 | ]
|
|
1026 | 1036 | }
|
1027 | 1037 | },
|
1028 | 1038 | "source": [
|
1029 |
| - "All schedulers provide one (or more) ``step()`` methods that can be used to compute the slightly less noisy image, i.e., the next sample in the backward process. \n", |
1030 |
| - "\n", |
1031 |
| - "The ``step()`` method may vary from one scheduler to another, but normally expects:\n", |
| 1039 | + "All schedulers provide one (or more) ``step()`` methods to compute the slightly less noisy image. The ``step()`` method may vary from one scheduler to another, but normally expects:\n", |
1032 | 1040 | "- the model output $\\tilde z_t$ (what we called ``noisy_residual``)\n",
|
1033 | 1041 | "- the ``timestep`` $t$\n",
|
1034 | 1042 | "- the current ``noisy_sample`` $\\tilde x_t$\n",
|
1035 | 1043 | "\n",
|
1036 |
| - "" |
| 1044 | + "<div>\n", |
| 1045 | + "<img src=\"attachment:photo_2022-09-06_19-04-27.jpg\" width=\"700\"/>\n", |
| 1046 | + "</div>" |
1037 | 1047 | ]
|
1038 | 1048 | },
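A minimal sketch of what a single `step()` call looks like with the diffusers API; the checkpoint name is only an example (not necessarily the one used in this notebook), while `noisy_residual` and `noisy_sample` follow the names used above:

```python
# Minimal sketch of one scheduler step; the checkpoint name is an example.
import torch
from diffusers import DDPMScheduler, UNet2DModel

model = UNet2DModel.from_pretrained("google/ddpm-cat-256")
scheduler = DDPMScheduler(num_train_timesteps=1000)

sample_size = model.config.sample_size
noisy_sample = torch.randn(1, model.config.in_channels, sample_size, sample_size)
t = scheduler.timesteps[0]                              # the largest (noisiest) timestep

with torch.no_grad():
    noisy_residual = model(noisy_sample, t).sample      # model output (predicted noise)

# step() takes the model output, the timestep, and the current noisy sample,
# and returns the slightly less noisy sample for the previous timestep
less_noisy_sample = scheduler.step(noisy_residual, t, noisy_sample).prev_sample
```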
|
1039 | 1049 | {
|
|
1083 | 1093 | "source": [
|
1084 | 1094 | "Time to define the **denoising loop**.\n",
|
1085 | 1095 | "\n",
|
1086 |
| - "- We loop over ``scheduler.timesteps``, a tensor defining the sequence of timesteps over which to iterate during the denoising process. \n", |
| 1096 | + "- We loop over ``scheduler.timesteps``, the sequence of timesteps for the denoising process. \n", |
1087 | 1097 | "- Usually, the denoising process goes in decreasing order of timesteps (here from 1000 to 0).\n",
|
1088 | 1098 | "- To visualize what is going on, we print out the (less and less) noisy samples every 50 steps."
|
1089 | 1099 | ]
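A sketch of that loop, reusing the `model`, `scheduler`, and `noisy_sample` from the previous sketch; the visualization details are assumptions rather than the notebook's exact code:

```python
# Denoising loop sketch: iterate over scheduler.timesteps (decreasing order),
# predict the noise residual, take a scheduler step, and show an image every 50 steps.
import PIL.Image
import torch
import tqdm

sample = noisy_sample
for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
    with torch.no_grad():
        residual = model(sample, t).sample
    sample = scheduler.step(residual, t, sample).prev_sample

    if (i + 1) % 50 == 0:
        image = (sample / 2 + 0.5).clamp(0, 1)          # map from [-1, 1] to [0, 1]
        image = (image.permute(0, 2, 3, 1).cpu().numpy() * 255).round().astype("uint8")
        PIL.Image.fromarray(image[0]).show()
```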
|
|
1558 | 1568 | "source": [
|
1559 | 1569 | "- It takes quite some time to produce a meaningful image\n",
|
1560 | 1570 | "- To speed-up the generation, we switch the DDPM scheduler with the DDIM scheduler\n",
|
1561 |
| - "- The DDIM scheduler removes stochasticity during sampling and updates the samples every $T/S$ steps, reducing the total number of inference steps from $T$ to $S$\n", |
| 1571 | + "- The DDIM scheduler removes stochasticity during sampling and updates the samples every $T/S$ steps\n", |
| 1572 | + "- The total number of inference steps is reduced from $T$ to $S$\n", |
1562 | 1573 | "- Note that some schedulers follow different protocols and cannot be switched as easily as in this case"
|
1563 | 1574 | ]
|
1564 | 1575 | },
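One possible way to do the swap, as a sketch that assumes the `model` and `scheduler` variables from the earlier sketches and reuses the DDPM configuration:

```python
# Swap DDPM for DDIM: same model, different scheduler, far fewer inference steps.
import torch
from diffusers import DDIMScheduler

ddim_scheduler = DDIMScheduler.from_config(scheduler.config)   # reuse the DDPM config
ddim_scheduler.set_timesteps(num_inference_steps=50)           # S = 50 instead of T = 1000

sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
for t in ddim_scheduler.timesteps:
    with torch.no_grad():
        residual = model(sample, t).sample
    sample = ddim_scheduler.step(residual, t, sample).prev_sample
```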
|
|
1757 | 1768 | "source": [
|
1758 | 1769 | "- In DDPM some noise with variance $\\sigma_t$ (or $\\tilde \\beta_t$) is added to get the next sample\n",
|
1759 | 1770 | "- Instead, the DDIM scheduler is deterministic\n",
|
1760 |
| - "- Starting from the same input $x_t$ gives the same output $x_0$" |
| 1771 | + "- Starting from the same input $x_T$ gives the same output $x_0$" |
1761 | 1772 | ]
|
1762 | 1773 | },
|
1763 | 1774 | {
|
|
1776 | 1787 | "\n",
|
1777 | 1788 | "There are 3 main components in the latent diffusion model.\n",
|
1778 | 1789 | "\n",
|
1779 |
| - "1. The U-Net (as in DDPM/DDIM)\n", |
| 1790 | + "1. A tokenizer + text-encoder (CLIP)\n", |
1780 | 1791 | "2. An autoencoder (VAE)\n",
|
1781 |
| - "3. A text-encoder (CLIP)" |
| 1792 | + "3. The U-Net (as in DDPM/DDIM)" |
1782 | 1793 | ]
|
1783 | 1794 | },
|
1784 | 1795 | {
|
|
1790 | 1801 | }
|
1791 | 1802 | },
|
1792 | 1803 | "source": [
|
| 1804 | + "Tokenizer + test-encoder:\n", |
| 1805 | + "\n", |
1793 | 1806 | "- The **text-encoder** is responsible for transforming a text prompt into an embedding space that can be understood by the U-Net \n",
|
1794 |
| - "- It is usually a transformer-based encoder that maps a sequence of tokens (generated with a **tokenizer**) into a sequence of latent text-embeddings\n", |
1795 |
| - "- Stable Diffusion does not train the text-encoder and simply uses an already trained text encoder such as CLIP or BERT" |
| 1807 | + "- It is usually a transformer-based encoder that maps a sequence of tokens (generated with a **tokenizer**) into a (large fixed size) text-embedding\n", |
| 1808 | + "- Stable Diffusion does not train the text-encoder and simply uses an already trained one such as CLIP or BERT" |
1796 | 1809 | ]
|
1797 | 1810 | },
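A minimal sketch of this step, assuming the CLIP checkpoint used by Stable Diffusion v1; the prompt is only an example:

```python
# Tokenize a prompt and map it to text embeddings with a frozen CLIP text encoder.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = ["a photograph of an astronaut riding a horse"]
text_input = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,   # 77 tokens
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids)[0]   # shape [1, 77, 768]
```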
|
1798 | 1811 | {
|
|
1805 | 1818 | },
|
1806 | 1819 | "source": [
|
1807 | 1820 | "The VAE model has two parts:\n",
|
1808 |
| - "- the **encoder** is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net\n", |
1809 |
| - "- the **decoder** transforms the latent representation back into an image.\n", |
| 1821 | + "- The **encoder** is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net\n", |
| 1822 | + "- The **decoder** transforms the latent representation back into an image.\n", |
1810 | 1823 | "- During **training**, the encoder is used to get the latent representations of the images for the forward diffusion process \n",
|
1811 | 1824 | "- During **inference**, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder\n",
|
1812 | 1825 | "- Working with latents is the key to the speed and memory efficiency of Stable Diffusion\n",
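A sketch of both directions, assuming the Stable Diffusion v1 VAE and its usual 0.18215 latent scaling factor; the input tensor here is random, standing in for a real image normalized to [-1, 1]:

```python
# Encode an image into latents and decode latents back into an image with the VAE.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)    # stand-in for a real image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215   # 1x4x64x64 latents
    decoded = vae.decode(latents / 0.18215).sample               # back to 1x3x512x512
```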
|
|
2036 | 2049 | "## Custom pipeline\n",
|
2037 | 2050 | "\n",
|
2038 | 2051 | "- As we did before, we will replace the ``scheduler`` in the pre-built pipeline.\n",
|
2039 |
| - "- The other 4 components, ``vae``, ``tokenizer``, ``text_encoder``, and ``unet`` are not changed but they could be switched as well (e.g., the CLIP ``text_encoder`` with BERT or a different type of ``vae``)\n", |
| 2052 | + "- The other 4 components, ``vae``, ``tokenizer``, ``text_encoder``, and ``unet`` are not changed but they could also be switched (e.g., CLIP ``text_encoder`` with BERT or a different ``vae``)\n", |
2040 | 2053 | "- We start by loading them"
|
2041 | 2054 | ]
|
2042 | 2055 | },
|
2043 | 2056 | {
|
2044 | 2057 | "cell_type": "code",
|
2045 | 2058 | "execution_count": 22,
|
2046 | 2059 | "id": "71424bf6",
|
2047 |
| - "metadata": {}, |
| 2060 | + "metadata": { |
| 2061 | + "collapsed": true |
| 2062 | + }, |
2048 | 2063 | "outputs": [
|
2049 | 2064 | {
|
2050 | 2065 | "data": {
|
|
3112 | 3127 | "\n",
|
3113 | 3128 | "- First, we get the embeddings for the prompt\n",
|
3114 | 3129 | "- These embeddings will be used to condition the UNet model and guide the image generation towards something that should resemble the input prompt\n",
|
3115 |
| - "- The text_embeddings are arrays of size $77 \\times 768$" |
| 3130 | + "- The ``text_embeddings`` are arrays of size $77 \\times 768$" |
3116 | 3131 | ]
|
3117 | 3132 | },
|
3118 | 3133 | {
|
|
3192 | 3207 | "source": [
|
3193 | 3208 | "**Guidance**\n",
|
3194 | 3209 | "\n",
|
3195 |
| - "For classifier-free guidance, we need to do two forward passes: \n", |
| 3210 | + "For classifier-free guidance, we need $\\tilde z = \\tilde z_x + \\gamma \\big( \\tilde z_{x|y} - \\tilde z_x \\big)$.\n", |
| 3211 | + "We need two forward passes: \n", |
3196 | 3212 | "- one with the conditioned input (``text_embeddings``) to get $\\tilde z_{x|y}$ (i.e., the score function $\\nabla_x p(x|y)$)\n",
|
3197 | 3213 | "- one with the unconditional embeddings (``uncond_embeddings``) to get $\\tilde z_x$ (i.e., the score function $\\nabla_x p(x)$)\n",
|
3198 | 3214 | "\n",
|
3199 |
| - "In practice, we can concatenate both into a single batch to avoid doing two forward passes\n", |
3200 |
| - "\n", |
3201 |
| - "The final predicted noise is $\\tilde z = \\tilde z_x + \\gamma \\big( \\tilde z_{x|y} - \\tilde z_x \\big)$" |
| 3215 | + "In practice, we can concatenate both into a single batch to avoid doing two forward passes" |
3202 | 3216 | ]
|
3203 | 3217 | },
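A sketch of the guidance computation inside the denoising loop; the variable names (`unet`, `scheduler`, `latents`, `text_embeddings`, `uncond_embeddings`) mirror the ones used in this notebook, and `guidance_scale` plays the role of $\gamma$:

```python
# Classifier-free guidance: one batched forward pass yields both the unconditional
# and the conditional noise predictions, which are then combined with gamma.
import torch

guidance_scale = 7.5
embeddings = torch.cat([uncond_embeddings, text_embeddings])   # single batch of size 2

for t in scheduler.timesteps:
    latent_input = torch.cat([latents] * 2)                    # duplicate latents to match
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=embeddings).sample
    noise_uncond, noise_text = noise_pred.chunk(2)             # z_x and z_{x|y}
    noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```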
|
3204 | 3218 | {
|
|
3476 | 3490 | "\n",
|
3477 | 3491 | ""
|
3478 | 3492 | ]
|
3479 |
| - }, |
3480 |
| - { |
3481 |
| - "cell_type": "code", |
3482 |
| - "execution_count": null, |
3483 |
| - "id": "89107156", |
3484 |
| - "metadata": {}, |
3485 |
| - "outputs": [], |
3486 |
| - "source": [] |
3487 | 3493 | }
|
3488 | 3494 | ],
|
3489 | 3495 | "metadata": {
|
|