---
title: "Semantic Segmentation using U-Net"
excerpt_separator: "<!--more-->"
last_modified_at: 2022-10-12T14:36:02-05:00
categories:
- Applied Machine Learning
tags:
author: bishwash
---

## U-Net
U-Net is a U-shaped encoder-decoder network architecture, which consists of four encoder blocks
and four decoder blocks that are connected via a bridge and skip connections. It is one of the most widely used approaches
in semantic segmentation tasks. It was originally introduced by Olaf Ronneberger et al. in the publication
+ "U-Net: Convolutional Networks for Biomedical Image Segmentation". It is built upon fully convolutional neural network
that is designed to learn from fewer training samples.
<p align="center">
</p>

It has three main components, namely the encoder network, the decoder network, and the skip connections. The encoder network
(contracting path) halves the spatial dimensions and doubles the number of feature channels at each encoder block,
while the decoder network doubles the spatial dimensions and halves the number of feature channels. The skip
connections connect the output of each encoder block to the corresponding input of the decoder block.

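As a rough, non-authoritative illustration of this halving and doubling, the following Python snippet (not from the original post; the starting width of $64$ channels is an assumption borrowed from the original U-Net paper) prints how the feature-map shapes evolve for a $256 \times 256$ input:

```python
# Sketch only: trace how spatial size and channel count change across a four-level
# U-Net. The initial 64 channels are an assumed value, not stated in this post.
size, channels = 256, 64
for level in range(4):                       # encoder: halve the size, double the channels
    print(f"encoder block {level + 1}: {channels} channels at {size}x{size}")
    size, channels = size // 2, channels * 2
print(f"bridge         : {channels} channels at {size}x{size}")
for level in range(4):                       # decoder: double the size, halve the channels
    size, channels = size * 2, channels // 2
    print(f"decoder block {level + 1}: {channels} channels at {size}x{size}")
```
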
### Encoder Network
The encoder network acts as the feature extractor and learns an abstract representation of the input image through
a sequence of encoder blocks. Each encoder block consists of $3 \times 3$ convolutions, where each convolution is followed by
a ReLU (Rectified Linear Unit) activation function. The ReLU function introduces non-linearity into the network, which
helps the network generalize better beyond the training data. The output of the ReLU acts as a skip connection for the corresponding
decoder block.

Next follows a $2 \times 2$ max-pooling, where the spatial dimensions of the feature maps are reduced by half. This reduces the
computational cost of the deeper layers by shrinking the feature maps that the subsequent convolutions operate on.

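A minimal PyTorch sketch of one encoder block along these lines is shown below. It assumes two padded $3 \times 3$ convolutions per block and illustrative channel sizes, neither of which is pinned down by the post:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One U-Net encoder block: two 3x3 convolutions with ReLU, then 2x2 max-pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x):
        skip = self.convs(x)    # ReLU output, kept as the skip connection
        down = self.pool(skip)  # spatial dimensions halved before the next block
        return down, skip

# e.g. a 3-channel 256x256 image -> 64-channel features at 128x128, plus a 256x256 skip map
down, skip = EncoderBlock(3, 64)(torch.randn(1, 3, 256, 256))
```
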
### Skip Connection
The skip connections pass the intermediate encoder feature maps directly to the corresponding decoder blocks, preserving spatial detail
that would otherwise be lost during downsampling. The bridge connects the encoder and decoder network and completes the flow of information.

### Decoder Network
The decoder network takes the abstract representation and generates the semantic segmentation mask. Each decoder block starts with a $2 \times 2$ transpose
convolution. Next, its output is concatenated with the corresponding skip-connection feature map from the encoder block. These skip
connections provide features from earlier layers that are sometimes lost due to the depth of the network. The output of the last
decoder passes through a $1 \times 1$ convolution with sigmoid activation. The sigmoid activation gives the segmentation mask representing the
pixel-wise classification.

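A matching sketch of one decoder block and the final $1 \times 1$ prediction head, again with assumed channel counts and padded convolutions; the $13$ output channels simply mirror the number of labels used for training later in the post:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: 2x2 transpose convolution, concat with the skip, two 3x3 conv + ReLU."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv2d(out_channels * 2, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # double the spatial dims, halve the channels
        x = torch.cat([x, skip], dim=1)   # bring in the skip-connection features
        return self.convs(x)

# e.g. (1, 128, 128, 128) features + a (1, 64, 256, 256) skip map -> (1, 64, 256, 256)
out = DecoderBlock(128, 64)(torch.randn(1, 128, 128, 128), torch.randn(1, 64, 256, 256))

# Final 1x1 convolution with sigmoid; 13 output channels match the labels chosen later in the post.
head = nn.Sequential(nn.Conv2d(64, 13, kernel_size=1), nn.Sigmoid())
mask = head(out)                          # (1, 13, 256, 256) pixel-wise scores
```
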
It is preferred to use batch normalization between the convolution layer and the ReLU activation function. It reduces internal
covariate shift, which in turn helps the network to better generalize and prevents it from overfitting.

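As a small sketch of that ordering, each convolution in the blocks above could be swapped for a Conv + BatchNorm + ReLU unit like the following (channel sizes remain placeholders):

```python
import torch.nn as nn

def conv_bn_relu(in_channels, out_channels):
    """Convolution -> batch normalization -> ReLU, the ordering suggested above."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),  # bias is redundant before batch norm
        nn.BatchNorm2d(out_channels),  # normalizes activations before the non-linearity
        nn.ReLU(inplace=True),
    )
```
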
## Dataset

The dataset used here provides images and labeled semantic segmentation masks captured via the CARLA self-driving car simulator. The data was
generated as part of the Lyft Udacity Challenge. It can be used to train ML algorithms to identify the semantic segmentation
of cars, roads, etc. in an image. The data has $5$ sets of $1000$ images and corresponding labels. There are $23$ different labels ranging from
road, road lines, and sidewalks to buildings, pedestrians, and fences.

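As a hedged sketch of how such labels might be prepared for training, assuming each label is stored as an array of integer class ids in $[0, 22]$ (the exact file format is not described in the post), one could build one-hot targets that keep only the first $13$ classes used later:

```python
import torch
import torch.nn.functional as F

def to_target(label_ids, num_classes=23, num_kept=13):
    """label_ids: (H, W) integer class ids -> (num_kept, H, W) one-hot float mask."""
    one_hot = F.one_hot(label_ids.long(), num_classes=num_classes)  # (H, W, 23)
    return one_hot[..., :num_kept].permute(2, 0, 1).float()         # keep the first 13 classes
```
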
<p align="center">
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/example.png" width="600"/>
</p>

For our training, we chose the first $13$ labels. All the images and labels were resized to $256 \times 256$ to avoid any tensor shape mismatch while training.
The train-validation-test split was $0.6$, $0.2$, and $0.2$. The learning rate was chosen to be $0.001$ for the Adam optimizer (a code sketch of this setup follows the loss definition below). The performance of the
network was optimized with the help of the Dice loss, which is defined as

<p align="center">
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/diceloss.png" width="350"/>
</p>

where $p_{true}$ and $p_{pred}$ are the ground truth and predicted labels, and $\epsilon$ is a small number $\leq 1$ that prevents division by zero. The network was trained over $100$ epochs.

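The post gives the loss only as the figure above; a common soft-Dice implementation consistent with that description, together with the stated Adam settings, is sketched below (`model` and `train_loader` are placeholders, not names from the post):

```python
import torch

def dice_loss(p_pred, p_true, eps=1.0):
    """Soft Dice loss on sigmoid outputs; eps is the small smoothing term mentioned above."""
    p_pred = p_pred.reshape(p_pred.size(0), -1)
    p_true = p_true.reshape(p_true.size(0), -1)
    intersection = (p_pred * p_true).sum(dim=1)
    dice = (2.0 * intersection + eps) / (p_pred.sum(dim=1) + p_true.sum(dim=1) + eps)
    return 1.0 - dice.mean()

# Assumed training loop matching the hyper-parameters above:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for epoch in range(100):
#     for images, masks in train_loader:     # batch_size=8, 256x256 inputs on the GPU
#         loss = dice_loss(model(images), masks)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```
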
Other training parameters were `batch_size=8` and `device=gpu`. The performance of the network for the first $25$ epochs is shown below.

<p align="center">
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/loss.png" width="450"/>
</p>

Some predicted results for buildings with their ground truth are shown below.

<p align="center">
<img src="{{ site.url }}{{ site.baseurl }}/assets/images/unet/results.png" width="600"/>
</p>

## Conclusion

In all, U-Net works exceptionally well for basic semantic segmentation tasks, even when we have fewer training samples.
Improved versions of the original U-Net, such as [U-Net++](https://arxiv.org/abs/1807.10165) and
[Dense-UNet](https://www.sciencedirect.com/science/article/abs/pii/S0030401821002200), have been introduced for specific tasks
in medical and aerial imaging.

#### References
[1] [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/pdf/1505.04597.pdf)