
Commit f2ed98a

Merge pull request #78 from bkhanal-11/master
Added conclusion in UNet article
2 parents 5d08efe + dac1113 commit f2ed98a

File tree: 2 files changed (+23, -15 lines)

_posts/Applied Machine Learning/2022-10-11-semantic-segmentation-unet.md renamed to _posts/Applied Machine Learning/2022-10-12-semantic-segmentation-unet.md

+23, -15 lines changed
@@ -1,7 +1,7 @@
---
title: "Semantic Segmentation using U-Net"
excerpt_separator: "<!--more-->"
-last_modified_at: 2022-10-11T14:36:02-05:00
+last_modified_at: 2022-10-12T14:36:02-05:00
categories:
  - Applied Machine Learning
tags:
@@ -16,10 +16,10 @@ author: bishwash

## U-Net

-U-Net is a u-shaped encoder-decoder network architecture, which consists of four encoder blocks
-and four decoder blocks that are connected via bridge. It is one of the most popularly used approaches
+U-Net is a U-shaped encoder-decoder network architecture, which consists of four encoder blocks
+and four decoder blocks that are connected via a bridge and skip connections. It is one of the most widely used approaches
in any semantic segmentation task. It was originally introduced by Olaf Ronneberger et al. in the publication
-"U-Net: Convolutional Networks for Biomedical Image Segmentation". It is a fully convolutional neural network
+"U-Net: Convolutional Networks for Biomedical Image Segmentation". It is built upon a fully convolutional neural network
that is designed to learn from fewer training samples.

<p align="center">
@@ -28,20 +28,20 @@ that is designed to learn from fewer training samples.
</p>

It has three main components, namely the encoder network, the decoder network and skip connections. The encoder network
-(contracting path) halfs the spatial dimensions and doubles the number of feature channels at each encoder block
-while the decoder network doubles the spatial dimensions and halfs the number of of feature channels. The skip
+(contracting path) halves the spatial dimensions and doubles the number of feature channels at each encoder block,
+while the decoder network doubles the spatial dimensions and halves the number of feature channels. The skip
connections connect the output of each encoder block with the corresponding input of the decoder block.
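
To make this halving and doubling concrete, the short Python sketch below tracks the feature-map shapes for a $256 \times 256$ RGB input, assuming padded $3 \times 3$ convolutions (so the spatial size only changes at pooling) and a first-block width of $64$ channels as in the original paper; the post's own implementation may use different widths.

```python
# Feature-map bookkeeping for the U-Net encoder described above.
# Assumptions: padded 3x3 convolutions, 64 channels after the first
# encoder block, and 2x2 max-pooling between blocks.
height = width = 256
channels = 64
for block in range(1, 5):
    print(f"encoder block {block}: {channels} channels, {height}x{width}")
    height, width = height // 2, width // 2  # max-pooling halves the spatial dims
    channels *= 2                            # the next block doubles the channels
print(f"bridge: {channels} channels, {height}x{width}")
```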


### Encoder Network

The encoder network acts as the feature extractor and learns an abstract representation of the input image through
-a sequence of the encoder blocks. Each encoder block consists of 3x3 convolutions where each convolution is followed by
+a sequence of encoder blocks. Each encoder block consists of $3 \times 3$ convolutions, where each convolution is followed by
a ReLU (Rectified Linear Unit) activation function. The ReLU function introduces non-linearity into the network, which
helps the network generalize better. The output of the ReLU acts as the skip connection for the corresponding
decoder block.

-Next follows a 2x2 max-pooling, where the spatial dimensions of the feature maps are reduced by half. This reduces the
+Next follows a $2 \times 2$ max-pooling, which reduces the spatial dimensions of the feature maps by half. This reduces the
computational cost by shrinking the feature maps that the subsequent layers have to process.
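
A minimal PyTorch-style sketch of one such encoder block is given below; the committed implementation is not shown in this diff, so the use of two convolutions per block, the padding, and the module names are assumptions.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU, then 2x2 max-pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2)  # halves the spatial dimensions

    def forward(self, x):
        skip = self.conv(x)           # kept as the skip connection for the decoder
        return self.pool(skip), skip  # pooled output feeds the next encoder block
```

Returning the pre-pooled feature map alongside the pooled one makes it easy to hand the former to the matching decoder block as the skip connection.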

### Skip Connection
@@ -53,10 +53,10 @@ The bridge connects the encoder and decoder network and completes the flow of in

### Decoder Network

-It is used to take the abstract representation and generate a semantic segmentation mask. The decoder block starts with 2x2 transpose
+The decoder network takes the abstract representation and generates the semantic segmentation mask. Each decoder block starts with a $2 \times 2$ transpose
convolution. Next, it is concatenated with the corresponding skip connection feature map from the encoder block. These skip
connections provide features from earlier layers that are sometimes lost due to the depth of the network. The output of the last
-decoder passes through 1x1 convolution with sigmoid activation. The sigmoid activation function gives the segmenation mask representing the
+decoder block passes through a $1 \times 1$ convolution with sigmoid activation. The sigmoid activation gives the segmentation mask representing the
pixel-wise classification.
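
A matching PyTorch-style sketch of a decoder block and the final output head follows, under the same assumptions as the encoder sketch above; the channel counts and the single-channel mask are illustrative, and a multi-label mask would use more output channels.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """2x2 transpose convolution, concatenation with the encoder skip
    feature map, then two 3x3 convolutions with ReLU."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_channels * 2, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # doubles the spatial dimensions
        x = torch.cat([x, skip], dim=1)  # reuse features from the encoder
        return self.conv(x)

# The last decoder output goes through a 1x1 convolution with sigmoid activation
# to produce the pixel-wise mask (one output channel here, purely for illustration).
output_head = nn.Sequential(nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid())
```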

It is preferred to use batch normalization between the convolution layer and the ReLU activation function. It reduces internal
@@ -69,22 +69,23 @@ This in turn helps the network to better generalize and prevent it from overfitt

This dataset provides images and labeled semantic segmentations captured via the CARLA self-driving car simulator. The data was
generated as part of the Lyft Udacity Challenge. This dataset can be used to train ML algorithms to identify semantic segmentation
-of cars, roads etc in an image. The data has 5 sets of 1000 images and corresponding labels. There are 23 different labels ranging from
+of cars, roads, etc. in an image. The data has $5$ sets of $1000$ images and corresponding labels. There are $23$ different labels ranging from
road, roadlines and sidewalk to building, pedestrians and fences.

<p align="center">
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/example.png" width="600"/>
</p>

-For our training, we select the first 13 labels. Then all the images were resized to 256x256. The train-validation-test split was
-0.6, 0.2 and 0.2. Learning rate was chosen to be 0.001 for Adam optimizer. The performance of the network was optimized with the help
-of Dice Loss which is defined as
+For our training, we chose the first $13$ labels. All the images and labels were resized to $256 \times 256$ to avoid any tensor shape mismatch while training.
+The train-validation-test split was $0.6$, $0.2$ and $0.2$. The learning rate was chosen to be $0.001$ for the Adam optimizer. The performance of the
+network was optimized with the help of the Dice loss, which is defined as

<p align="center">
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/diceloss.png" width="350"/>
</p>

-The performance of network for for first 25 epoch out of 100 is as follow.
+where $p_{true}$ and $p_{pred}$ are the ground truth and predicted labels, and $\epsilon$ is a small number $\leq 1$. The network was trained over $100$ epochs
+with other parameters set to `batch_size=8` and `device=gpu`. The performance of the network for the first $25$ epochs is shown in the plot below.
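
A soft Dice loss consistent with the formula and symbols above can be sketched as follows; the committed code is not part of this diff, so the flattening and the averaging over the batch are assumptions.

```python
import torch

def dice_loss(p_pred, p_true, eps=1.0):
    """Soft Dice loss: 1 - (2*intersection + eps) / (sum(p_pred) + sum(p_true) + eps)."""
    p_pred = p_pred.reshape(p_pred.shape[0], -1)  # flatten each sample in the batch
    p_true = p_true.reshape(p_true.shape[0], -1)
    intersection = (p_pred * p_true).sum(dim=1)
    denominator = p_pred.sum(dim=1) + p_true.sum(dim=1)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean()                      # average over the batch
```

Here `eps` plays the same stabilising role as $\epsilon$ in the formula above, preventing division by zero for empty masks.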

<p align="center">
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/loss.png" width="450"/>
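
The training loop itself is also not part of this diff; a minimal data-split and optimizer setup consistent with the hyperparameters stated above ($0.6/0.2/0.2$ split, Adam with a learning rate of $0.001$, batch size $8$) might look like the sketch below, where `CarlaSegmentationDataset` is a hypothetical `Dataset` yielding image/mask pairs for the selected $13$ labels.

```python
import torch
from torch.utils.data import DataLoader, random_split

def make_loaders(dataset, batch_size=8, seed=0):
    """Split a dataset 0.6 / 0.2 / 0.2 and wrap each part in a DataLoader."""
    n = len(dataset)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    lengths = [n_train, n_val, n - n_train - n_val]
    parts = random_split(dataset, lengths,
                         generator=torch.Generator().manual_seed(seed))
    # Only the training split is shuffled.
    return [DataLoader(p, batch_size=batch_size, shuffle=(i == 0))
            for i, p in enumerate(parts)]

# Hypothetical usage (dataset and model are not defined in this diff):
# train_dl, val_dl, test_dl = make_loaders(CarlaSegmentationDataset(...))
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate from the post
```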
@@ -96,6 +97,13 @@ Some predicted results for buildings with their ground truth is as follow.
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/results.png" width="600"/>
</p>

+## Conclusion
+
+In all, U-Net works exceptionally well for basic semantic segmentation tasks, even when we have fewer training samples.
+Improved versions of the original U-Net, such as [U-Net++](https://arxiv.org/abs/1807.10165) and
+[Dense-UNet](https://www.sciencedirect.com/science/article/abs/pii/S0030401821002200), have been introduced for specific tasks
+in medical and aerial imaging.
+
#### References

[1] [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/pdf/1505.04597.pdf)

assets/images/unet/loss.png (-18.5 KB)
