|
66 | 66 | " - in training, generate the noisy images used to train the model \n",
|
67 | 67 | " - in inference, compute the next sample given the model's output\n",
|
68 | 68 | "\n",
|
69 |
| - "\n", |
70 |
| - "\n", |
| 69 | + "" |
| 70 | + ] |
| 71 | + }, |
| 72 | + { |
| 73 | + "cell_type": "markdown", |
| 74 | + "id": "9a393e0d", |
| 75 | + "metadata": { |
| 76 | + "slideshow": { |
| 77 | + "slide_type": "subslide" |
| 78 | + } |
| 79 | + }, |
| 80 | + "source": [ |
71 | 81 | "Models and schedulers are kept as independent from each other as possible:\n",
|
72 | 82 | "- A scheduler should never accept a model as an input and vice-versa. "
|
73 | 83 | ]
|
|
1026 | 1036 | }
|
1027 | 1037 | },
|
1028 | 1038 | "source": [
|
1029 |
| - "All schedulers provide one (or more) ``step()`` methods that can be used to compute the slightly less noisy image, i.e., the next sample in the backward process. \n", |
1030 |
| - "\n", |
1031 |
| - "The ``step()`` method may vary from one scheduler to another, but normally expects:\n", |
| 1039 | + "All schedulers provide one (or more) ``step()`` methods to compute the slightly less noisy image. The ``step()`` method may vary from one scheduler to another, but normally expects:\n", |
1032 | 1040 | "- the model output $\\tilde z_t$ (what we called ``noisy_residual``)\n",
|
1033 | 1041 | "- the ``timestep`` $t$\n",
|
1034 | 1042 | "- the current ``noisy_sample`` $\\tilde x_t$\n",
|
1035 | 1043 | "\n",
|
1036 |
| - "" |
| 1044 | + "<div>\n", |
| 1045 | + "<img src=\"attachment:photo_2022-09-06_19-04-27.jpg\" width=\"700\"/>\n", |
| 1046 | + "</div>" |
1037 | 1047 | ]
|
1038 | 1048 | },
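A minimal sketch of what a single `step()` call looks like with the diffusers API; the checkpoint name is only an example (not necessarily the one used in this notebook), while `noisy_residual` and `noisy_sample` follow the names used above:

```python
# Minimal sketch of one scheduler step; the checkpoint name is an example.
import torch
from diffusers import DDPMScheduler, UNet2DModel

model = UNet2DModel.from_pretrained("google/ddpm-cat-256")
scheduler = DDPMScheduler(num_train_timesteps=1000)

sample_size = model.config.sample_size
noisy_sample = torch.randn(1, model.config.in_channels, sample_size, sample_size)
t = scheduler.timesteps[0]                              # the largest (noisiest) timestep

with torch.no_grad():
    noisy_residual = model(noisy_sample, t).sample      # model output (predicted noise)

# step() takes the model output, the timestep, and the current noisy sample,
# and returns the slightly less noisy sample for the previous timestep
less_noisy_sample = scheduler.step(noisy_residual, t, noisy_sample).prev_sample
```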
|
1039 | 1049 | {
|
|
1083 | 1093 | "source": [
|
1084 | 1094 | "Time to define the **denoising loop**.\n",
|
1085 | 1095 | "\n",
|
1086 |
| - "- We loop over ``scheduler.timesteps``, a tensor defining the sequence of timesteps over which to iterate during the denoising process. \n", |
| 1096 | + "- We loop over ``scheduler.timesteps``, the sequence of timesteps for the denoising process. \n", |
1087 | 1097 | "- Usually, the denoising process goes in decreasing order of timesteps (here from 1000 to 0).\n",
|
1088 | 1098 | "- To visualize what is going on, we print out the (less and less) noisy samples every 50 steps."
|
1089 | 1099 | ]
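A sketch of that loop, reusing the `model`, `scheduler`, and `noisy_sample` from the previous sketch; the visualization details are assumptions rather than the notebook's exact code:

```python
# Denoising loop sketch: iterate over scheduler.timesteps (decreasing order),
# predict the noise residual, take a scheduler step, and show an image every 50 steps.
import PIL.Image
import torch
import tqdm

sample = noisy_sample
for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
    with torch.no_grad():
        residual = model(sample, t).sample
    sample = scheduler.step(residual, t, sample).prev_sample

    if (i + 1) % 50 == 0:
        image = (sample / 2 + 0.5).clamp(0, 1)          # map from [-1, 1] to [0, 1]
        image = (image.permute(0, 2, 3, 1).cpu().numpy() * 255).round().astype("uint8")
        PIL.Image.fromarray(image[0]).show()
```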
|
|
1558 | 1568 | "source": [
|
1559 | 1569 | "- It takes quite some time to produce a meaningful image\n",
|
1560 | 1570 | "- To speed-up the generation, we switch the DDPM scheduler with the DDIM scheduler\n",
|
1561 |
| - "- The DDIM scheduler removes stochasticity during sampling and updates the samples every $T/S$ steps, reducing the total number of inference steps from $T$ to $S$\n", |
| 1571 | + "- The DDIM scheduler removes stochasticity during sampling and updates the samples every $T/S$ steps\n", |
| 1572 | + "- The total number of inference steps is reduced from $T$ to $S$\n", |
1562 | 1573 | "- Note that some schedulers follow different protocols and cannot be switched as easily as in this case"
|
1563 | 1574 | ]
|
1564 | 1575 | },
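One possible way to do the swap, as a sketch that assumes the `model` and `scheduler` variables from the earlier sketches and reuses the DDPM configuration:

```python
# Swap DDPM for DDIM: same model, different scheduler, far fewer inference steps.
import torch
from diffusers import DDIMScheduler

ddim_scheduler = DDIMScheduler.from_config(scheduler.config)   # reuse the DDPM config
ddim_scheduler.set_timesteps(num_inference_steps=50)           # S = 50 instead of T = 1000

sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
for t in ddim_scheduler.timesteps:
    with torch.no_grad():
        residual = model(sample, t).sample
    sample = ddim_scheduler.step(residual, t, sample).prev_sample
```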
|
|
1757 | 1768 | "source": [
|
1758 | 1769 | "- In DDPM some noise with variance $\\sigma_t$ (or $\\tilde \\beta_t$) is added to get the next sample\n",
|
1759 | 1770 | "- Instead, the DDIM scheduler is deterministic\n",
|
1760 |
| - "- Starting from the same input $x_t$ gives the same output $x_0$" |
| 1771 | + "- Starting from the same input $x_T$ gives the same output $x_0$" |
1761 | 1772 | ]
|
1762 | 1773 | },
|
1763 | 1774 | {
|
|
1776 | 1787 | "\n",
|
1777 | 1788 | "There are 3 main components in the latent diffusion model.\n",
|
1778 | 1789 | "\n",
|
1779 |
| - "1. The U-Net (as in DDPM/DDIM)\n", |
| 1790 | + "1. A tokenizer + text-encoder (CLIP)\n", |
1780 | 1791 | "2. An autoencoder (VAE)\n",
|
1781 |
| - "3. A text-encoder (CLIP)" |
| 1792 | + "3. The U-Net (as in DDPM/DDIM)" |
1782 | 1793 | ]
|
1783 | 1794 | },
|
1784 | 1795 | {
|
|
1790 | 1801 | }
|
1791 | 1802 | },
|
1792 | 1803 | "source": [
|
| 1804 | + "Tokenizer + test-encoder:\n", |
| 1805 | + "\n", |
1793 | 1806 | "- The **text-encoder** is responsible for transforming a text prompt into an embedding space that can be understood by the U-Net \n",
|
1794 |
| - "- It is usually a transformer-based encoder that maps a sequence of tokens (generated with a **tokenizer**) into a sequence of latent text-embeddings\n", |
1795 |
| - "- Stable Diffusion does not train the text-encoder and simply uses an already trained text encoder such as CLIP or BERT" |
| 1807 | + "- It is usually a transformer-based encoder that maps a sequence of tokens (generated with a **tokenizer**) into a (large fixed size) text-embedding\n", |
| 1808 | + "- Stable Diffusion does not train the text-encoder and simply uses an already trained one such as CLIP or BERT" |
1796 | 1809 | ]
|
1797 | 1810 | },
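A minimal sketch of this step, assuming the CLIP checkpoint used by Stable Diffusion v1; the prompt is only an example:

```python
# Tokenize a prompt and map it to text embeddings with a frozen CLIP text encoder.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = ["a photograph of an astronaut riding a horse"]
text_input = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,   # 77 tokens
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids)[0]   # shape [1, 77, 768]
```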
|
1798 | 1811 | {
|
|
1805 | 1818 | },
|
1806 | 1819 | "source": [
|
1807 | 1820 | "The VAE model has two parts:\n",
|
1808 |
| - "- the **encoder** is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net\n", |
1809 |
| - "- the **decoder** transforms the latent representation back into an image.\n", |
| 1821 | + "- The **encoder** is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net\n", |
| 1822 | + "- The **decoder** transforms the latent representation back into an image.\n", |
1810 | 1823 | "- During **training**, the encoder is used to get the latent representations of the images for the forward diffusion process \n",
|
1811 | 1824 | "- During **inference**, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder\n",
|
1812 | 1825 | "- Working with latents is the key to the speed and memory efficiency of Stable Diffusion\n",
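A sketch of both directions, assuming the Stable Diffusion v1 VAE and its usual 0.18215 latent scaling factor; the input tensor here is random, standing in for a real image normalized to [-1, 1]:

```python
# Encode an image into latents and decode latents back into an image with the VAE.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)    # stand-in for a real image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215   # 1x4x64x64 latents
    decoded = vae.decode(latents / 0.18215).sample               # back to 1x3x512x512
```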
|
|
2036 | 2049 | "## Custom pipeline\n",
|
2037 | 2050 | "\n",
|
2038 | 2051 | "- As we did before, we will replace the ``scheduler`` in the pre-built pipeline.\n",
|
2039 |
| - "- The other 4 components, ``vae``, ``tokenizer``, ``text_encoder``, and ``unet`` are not changed but they could be switched as well (e.g., the CLIP ``text_encoder`` with BERT or a different type of ``vae``)\n", |
| 2052 | + "- The other 4 components, ``vae``, ``tokenizer``, ``text_encoder``, and ``unet`` are not changed but they could also be switched (e.g., CLIP ``text_encoder`` with BERT or a different ``vae``)\n", |
2040 | 2053 | "- We start by loading them"
|
2041 | 2054 | ]
|
2042 | 2055 | },
|
2043 | 2056 | {
|
2044 | 2057 | "cell_type": "code",
|
2045 | 2058 | "execution_count": 22,
|
2046 | 2059 | "id": "71424bf6",
|
2047 |
| - "metadata": {}, |
| 2060 | + "metadata": { |
| 2061 | + "collapsed": true |
| 2062 | + }, |
2048 | 2063 | "outputs": [
|
2049 | 2064 | {
|
2050 | 2065 | "data": {
|
|
3112 | 3127 | "\n",
|
3113 | 3128 | "- First, we get the embeddings for the prompt\n",
|
3114 | 3129 | "- These embeddings will be used to condition the UNet model and guide the image generation towards something that should resemble the input prompt\n",
|
3115 |
| - "- The text_embeddings are arrays of size $77 \\times 768$" |
| 3130 | + "- The ``text_embeddings`` are arrays of size $77 \\times 768$" |
3116 | 3131 | ]
|
3117 | 3132 | },
|
3118 | 3133 | {
|
|
3192 | 3207 | "source": [
|
3193 | 3208 | "**Guidance**\n",
|
3194 | 3209 | "\n",
|
3195 |
| - "For classifier-free guidance, we need to do two forward passes: \n", |
| 3210 | + "For classifier-free guidance, we need $\\tilde z = \\tilde z_x + \\gamma \\big( \\tilde z_{x|y} - \\tilde z_x \\big)$.\n", |
| 3211 | + "We need two forward passes: \n", |
3196 | 3212 | "- one with the conditioned input (``text_embeddings``) to get $\\tilde z_{x|y}$ (i.e., the score function $\\nabla_x p(x|y)$)\n",
|
3197 | 3213 | "- one with the unconditional embeddings (``uncond_embeddings``) to get $\\tilde z_x$ (i.e., the score function $\\nabla_x p(x)$)\n",
|
3198 | 3214 | "\n",
|
3199 |
| - "In practice, we can concatenate both into a single batch to avoid doing two forward passes\n", |
3200 |
| - "\n", |
3201 |
| - "The final predicted noise is $\\tilde z = \\tilde z_x + \\gamma \\big( \\tilde z_{x|y} - \\tilde z_x \\big)$" |
| 3215 | + "In practice, we can concatenate both into a single batch to avoid doing two forward passes" |
3202 | 3216 | ]
|
3203 | 3217 | },
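A sketch of the guidance computation inside the denoising loop; the variable names (`unet`, `scheduler`, `latents`, `text_embeddings`, `uncond_embeddings`) mirror the ones used in this notebook, and `guidance_scale` plays the role of $\gamma$:

```python
# Classifier-free guidance: one batched forward pass yields both the unconditional
# and the conditional noise predictions, which are then combined with gamma.
import torch

guidance_scale = 7.5
embeddings = torch.cat([uncond_embeddings, text_embeddings])   # single batch of size 2

for t in scheduler.timesteps:
    latent_input = torch.cat([latents] * 2)                    # duplicate latents to match
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=embeddings).sample
    noise_uncond, noise_text = noise_pred.chunk(2)             # z_x and z_{x|y}
    noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```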
|
3204 | 3218 | {
|
|
3476 | 3490 | "\n",
|
3477 | 3491 | ""
|
3478 | 3492 | ]
|
3479 |
| - }, |
3480 |
| - { |
3481 |
| - "cell_type": "code", |
3482 |
| - "execution_count": null, |
3483 |
| - "id": "89107156", |
3484 |
| - "metadata": {}, |
3485 |
| - "outputs": [], |
3486 |
| - "source": [] |
3487 | 3493 | }
|
3488 | 3494 | ],
|
3489 | 3495 | "metadata": {
|
|