The goal of this project is to implement a neural network with a prescribed architecture that predicts the content of an image.
The architecture is composed of a sequence of intermediate blocks B1, B2, ..., BK that are followed by an output block O, as shown in the following figure. These blocks are detailed in the following subsections.
An intermediate block receives an image x and outputs an image x′. Each block has L independent convolutional layers. Each convolutional layer Cl in a block receives the input image x and outputs an image Cl(x). Each of these images is combined into the single output image x′, which is given by
x′ = a1 C1(x) + a2 C2(x) + ... + aL CL(x),
where a = [a1, a2, ..., aL] is a vector of coefficients that is computed by the block.
Suppose that the input image x has c channels. In order to compute the vector a, the average value of each channel of x is computed and stored into a c-dimensional vector m. The vector m is the input to a fully connected layer that outputs the vector a. Note that this fully connected layer should have as many units as there are convolutional layers in the block. Each block in the basic architecture may have a different number of convolutional layers, and each convolutional layer may have different hyperparameters (within or across blocks). However, every convolutional layer within a block should output an image with the same shape.
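The computation performed by a single intermediate block can be sketched in a few lines of PyTorch. This is only an illustration under assumed sizes: the batch size, the 32×32 input resolution, the channel counts and the names `convs`, `fc` and `x_prime` are not taken from the project code. For a batch of images, the coefficients are computed and applied separately for each image.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

c_in, c_out, L = 3, 16, 3          # illustrative sizes (assumed)
kernel_sizes = [1, 3, 5]           # one kernel size per convolutional layer

# L independent convolutions; padding="same" keeps every output the same shape.
convs = nn.ModuleList(
    nn.Conv2d(c_in, c_out, k, padding="same") for k in kernel_sizes
)
fc = nn.Linear(c_in, L)            # one unit per convolutional layer

x = torch.randn(8, c_in, 32, 32)   # batch of 8 input images (assumed size)

m = x.mean(dim=(2, 3))             # (8, c_in): average value of each channel
a = fc(m)                          # (8, L): one coefficient per convolutional layer
outs = torch.stack([conv(x) for conv in convs], dim=1)  # (8, L, c_out, 32, 32)

# Weighted sum of the L convolution outputs, one coefficient vector per image.
x_prime = (a[:, :, None, None, None] * outs).sum(dim=1)  # (8, c_out, 32, 32)
```

Stacking the outputs along a new dimension is only possible because every convolutional layer in the block produces an image of the same shape.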
The output block receives an image x (output of the last intermediate block) and outputs a logits vector o. Suppose that the input image x has c channels. In order to compute the vector o, the average value of each channel of x is computed and stored into a c-dimensional vector m. The vector m is the input to a sequence of zero or more fully connected layer(s) that output the vector o.
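A minimal sketch of such an output block is given below. The class name `OutputBlock`, the `hidden_sizes` argument and the ReLU between layers are assumptions made for illustration. The sketch also assumes that a final linear layer always produces the logits, so "zero or more" is read here as the number of optional hidden layers.

```python
import torch
import torch.nn as nn

class OutputBlock(nn.Module):
    """Channel-wise averaging followed by fully connected layers (hypothetical sketch)."""

    def __init__(self, in_channels, num_classes, hidden_sizes=()):
        super().__init__()
        layers = []
        prev = in_channels
        for h in hidden_sizes:                 # optional hidden layers
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        layers.append(nn.Linear(prev, num_classes))  # final layer outputs the logits o
        self.fc = nn.Sequential(*layers)

    def forward(self, x):
        m = x.mean(dim=(2, 3))                 # (N, c): average value of each channel
        return self.fc(m)                      # (N, num_classes): logits vector o

block = OutputBlock(in_channels=128, num_classes=10, hidden_sizes=(64,))
o = block(torch.randn(4, 128, 32, 32))         # o has shape (4, 10)
```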
The model used in this project has 4 intermediate blocks, and each block consists of 3 parallel convolutional layers. In all the layers the padding is set to ‘same’ so that the input and output have the same spatial size. Using ‘same’ padding required the stride to be set to 1. Therefore, no alteration has been made to the stride or padding. The setup of each block is given in the table below. The numbers of input and output channels were inspired by VGG16.
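The effect of ‘same’ padding and the stride restriction can be checked directly in PyTorch. The snippet below is a small sketch that assumes a 32×32 RGB input, which is not stated in the text. It applies the three kernel sizes used in the model and then tries to combine ‘same’ padding with a stride of 2.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)      # assumed input size, for illustration only

shapes = []
for k in (1, 3, 5):
    conv = nn.Conv2d(3, 16, kernel_size=k, stride=1, padding="same")
    shapes.append(tuple(conv(x).shape))   # height and width stay at 32 for every kernel size

# PyTorch rejects padding="same" for strided convolutions.
try:
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding="same")
    strided_same_allowed = True
except ValueError:
    strided_same_allowed = False
```

Because all three kernel sizes give the same output shape, the parallel outputs in a block can be combined element-wise.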
| Block | Num. layers | Kernel size 1 | Kernel size 2 | Kernel size 3 | In channels | Out channels |
|---|---|---|---|---|---|---|
| 1 | 3 | 1 | 3 | 5 | 3 | 16 |
| 2 | 3 | 1 | 3 | 5 | 16 | 32 |
| 3 | 3 | 1 | 3 | 5 | 32 | 64 |
| 4 | 3 | 1 | 3 | 5 | 64 | 128 |
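As a rough worked example, the number of trainable parameters implied by this table can be counted by hand. The sketch below assumes that every convolutional and fully connected layer has a bias term and that the blocks contain no other trainable layers (for example, no batch normalisation). Neither assumption is confirmed by the text, and the helper `block_params` is hypothetical.

```python
# (in_channels, out_channels) of each block, taken from the table
blocks = [(3, 16), (16, 32), (32, 64), (64, 128)]
kernels = [1, 3, 5]                # kernel sizes of the three parallel layers

def block_params(c_in, c_out, kernels):
    # Each k x k convolution has c_in * c_out * k * k weights plus c_out biases.
    conv = sum(c_in * c_out * k * k + c_out for k in kernels)
    # The fully connected layer maps c_in channel averages to len(kernels) coefficients.
    fc = c_in * len(kernels) + len(kernels)
    return conv + fc

per_block = [block_params(ci, co, kernels) for ci, co in blocks]
total = sum(per_block)

# The output channels of each block must match the input channels of the next.
chained = all(blocks[i][1] == blocks[i + 1][0] for i in range(len(blocks) - 1))

print(per_block)   # [1740, 18067, 71971, 287299]
print(total)       # 379077
```

Under these assumptions, most of the parameters sit in the last block, whose 5×5 convolution alone accounts for 204,928 of them.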
The output block is implemented in a class called “FinalModel”. This class also creates the intermediate blocks, using the class defined for them. The blocks are chained with nn.Sequential so that the output of each block is the input of the next.
After the intermediate blocks, the average value of each channel is processed by two fully connected layers. The output of the last layer should have as many units as there are classes in the dataset, which in this case is 10.
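Putting the pieces together, the overall model might look roughly like the sketch below. Only the name `FinalModel`, the use of nn.Sequential, the channel sizes, the kernel sizes and the two fully connected layers come from the text. The `IntermediateBlock` name, the hidden width of 64, the ReLU between the two fully connected layers and the 32×32 input are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class IntermediateBlock(nn.Module):
    """Parallel convolutions combined with input-dependent coefficients (sketch)."""

    def __init__(self, c_in, c_out, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(c_in, c_out, k, stride=1, padding="same") for k in kernel_sizes
        )
        self.fc = nn.Linear(c_in, len(kernel_sizes))

    def forward(self, x):
        a = self.fc(x.mean(dim=(2, 3)))                              # (N, L)
        outs = torch.stack([conv(x) for conv in self.convs], dim=1)  # (N, L, C, H, W)
        return (a[:, :, None, None, None] * outs).sum(dim=1)         # (N, C, H, W)

class FinalModel(nn.Module):
    def __init__(self, num_classes=10, hidden=64):   # hidden width is an assumption
        super().__init__()
        channels = [3, 16, 32, 64, 128]              # from the table of blocks
        self.blocks = nn.Sequential(
            *(IntermediateBlock(ci, co) for ci, co in zip(channels[:-1], channels[1:]))
        )
        self.head = nn.Sequential(                   # two fully connected layers
            nn.Linear(channels[-1], hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, x):
        x = self.blocks(x)                   # output of each block feeds the next
        return self.head(x.mean(dim=(2, 3)))  # per-channel averages -> logits

model = FinalModel()
logits = model(torch.randn(2, 3, 32, 32))    # shape (2, 10): one logit per class
```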
The best training accuracy achieved by this model is 96.708%, and the test accuracy corresponding to that training accuracy is 85.31%. The figures show, respectively, the cross-entropy loss for each training batch and the training and test accuracy for each epoch.
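For reference, accuracy figures of this kind are typically obtained by comparing the class with the largest logit against the true label. The helper below is a hypothetical sketch of that computation on toy values. It is not the evaluation code used in the project, and the numbers in it are unrelated to the reported results.

```python
import torch

def accuracy(logits, labels):
    """Percentage of samples whose largest logit matches the label."""
    preds = logits.argmax(dim=1)
    return (preds == labels).float().mean().item() * 100.0

# Toy example: 4 samples, 3 classes; the third sample is misclassified.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.8, 0.1],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 1, 1, 2])
acc = accuracy(logits, labels)   # 75.0
```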


