Commit 6e939b4

author
Yongyao Jiang
committed
Update how-unet-works.ipynb
1 parent 0146d65 commit 6e939b4

File tree

1 file changed: +9 −3 lines changed


guide/14-deep-learning/how-unet-works.ipynb

Lines changed: 9 additions & 3 deletions
@@ -61,12 +61,18 @@
     "source": [
      "U-net was originally invented and first used for biomedical image segmentation. Its architecture can be broadly thought of as an **encoder** network followed by a **decoder** network. Unlike classification, where the end result of the deep network is the only important thing, semantic segmentation not only requires discrimination at pixel level but also a mechanism to project the discriminative features learnt at different stages of the encoder onto the pixel space.\n",
      "\n",
-     "- The encoder is the first half in the architecture diagram (Figure 2). It usually is a pre-trained classification network like VGG/ResNet where you apply convolution blocks followed by a maxpool downsampling to encode the input image into feature representations at multiple different levels.\n",
+     "- The encoder is the first half in the architecture diagram (Figure 2). It usually is a pre-trained classification network like VGG/ResNet where you apply convolution blocks followed by a maxpool downsampling to encode the input image into feature representations at multiple different levels. \n",
+     "\n",
      "- The decoder is the second half of the architecture. The goal is to semantically project the discriminative features (lower resolution) learnt by the encoder onto the pixel space (higher resolution) to get a dense classification. The decoder consists of **upsampling** and **concatenation** followed by regular convolution operations. \n",
      "\n",
      "<center><img src=\"../../static/img/unet.png\" height=\"600\" width=\"600\"></center>\n",
-     "<center>Figure 2. U-net architecture. Blue boxes represent multi-channel feature maps, while white boxes represent copied feature maps. The arrows of different colors represent different operations</center>\n",
-     "\n",
+     "<center>Figure 2. U-net architecture. Blue boxes represent multi-channel feature maps, while white boxes represent copied feature maps. The arrows of different colors represent different operations</center>"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
      "Upsampling in CNN might be new to those of you who are used to classification and object detection architectures, but the idea is fairly simple. The intuition is that we would like to restore the condensed feature map to the original size of the input image; therefore we expand the feature dimensions. Upsampling is also referred to as transposed convolution, upconvolution, or deconvolution. There are a few ways of upsampling, such as Nearest Neighbor, Bilinear Interpolation, and Transposed Convolution, from simplest to most complex. For more details, please refer to “[A guide to convolution arithmetic for deep learning](https://arxiv.org/pdf/1603.07285.pdf)”, which we mentioned in the beginning. \n",
      "\n",
      "Specifically, we would like to upsample it to match the size of the corresponding concatenation blocks from the left. You may see the gray and green arrows, where we concatenate two feature maps together. The main [contribution](https://medium.com/@keremturgutlu/semantic-segmentation-u-net-part-1-d8d6f6005066) of U-Net in this sense is that while upsampling in the network we are also concatenating the higher-resolution feature maps from the encoder network with the upsampled features, in order to better learn representations with the following convolutions. Since upsampling is a sparse operation, we need a good prior from earlier stages to better represent the localization.\n",
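The upsampling idea described in the changed cell can be illustrated directly. Below is a minimal NumPy sketch (not part of the notebook or the commit) of the simplest variant, nearest-neighbor upsampling, which doubles the spatial size by repeating each pixel along both axes:

```python
import numpy as np

def upsample_nearest(x, factor=2):
    # Nearest-neighbor upsampling: repeat each pixel `factor` times
    # along the height axis, then along the width axis.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

x = np.array([[1, 2],
              [3, 4]])
y = upsample_nearest(x)
# y:
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

Bilinear interpolation and transposed convolution produce smoother (and, for the latter, learnable) results, but the shape change is the same.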
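The upsample-then-concatenate decoder step (the gray and green arrows in Figure 2) can also be sketched with array shapes. This is a hedged illustration, assuming a (channels, height, width) layout; the channel counts 64 and 128 are example values, not taken from the notebook:

```python
import numpy as np

# High-resolution features copied over from the encoder (gray arrow).
encoder_features = np.zeros((64, 32, 32))
# Coarse features arriving from the previous, lower-resolution decoder stage.
decoder_coarse = np.zeros((128, 16, 16))

# Step 1: upsample the coarse map to the encoder map's spatial size
# (nearest-neighbor here; a transposed convolution would be learnable).
upsampled = decoder_coarse.repeat(2, axis=1).repeat(2, axis=2)   # (128, 32, 32)

# Step 2: concatenate along the channel axis; the following convolutions
# then learn from both the upsampled and the high-resolution features.
merged = np.concatenate([upsampled, encoder_features], axis=0)   # (192, 32, 32)
```

The concatenation is what supplies the "good prior from earlier stages": the sparse upsampled map is paired with dense, well-localized encoder features before the next convolutions.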
