Odin Hoff Gardå, April 2023
In this project, we will compare the performance of the following three models:
- a standard off-the-shelf convolutional neural network (CNN),
- a convolutional neural network with averaged/smoothed output (SmoothCNN) and
- a group equivariant convolutional neural network (GCNN).
The learning objective for all models is to predict the weather condition (cloudy or sunny) from a stereo image. In other words, we are solving a binary classification problem. We report cross entropy loss (also known as log loss) and classification accuracy on the training, validation and test datasets.
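As a small illustration of the two reported metrics, the following sketch computes log loss and accuracy for a handful of made-up predictions. The probabilities, the helper names and the 0.5 decision threshold are assumptions for the example and are not taken from the project code.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean log loss for predicted probabilities of the positive class."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for p, y in zip(probs, labels))
    return hits / len(labels)

probs = [0.9, 0.2, 0.6, 0.4]  # hypothetical model outputs, P(sunny)
labels = [1, 0, 0, 1]         # 0 = cloudy, 1 = sunny
loss = binary_cross_entropy(probs, labels)
acc = accuracy(probs, labels)
print(f"log loss = {loss:.4f}, accuracy = {acc:.2f}")
```

Two of the four predictions fall on the wrong side of the threshold, so the accuracy is one half, while the log loss additionally penalizes how confident the wrong predictions are.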
Given a set of symmetries acting on our input image, it is often reasonable to wish for invariance under this action. For example, a rotated version of an image should give the same prediction as the original image. Often, we try to solve this by using data augmentation. However, training with augmented data does not guarantee invariance. Also, it is costly both in terms of time and the amount of data needed to approximate invariance with augmentation. Therefore, if we a priori know which symmetries we want our network to be invariant under, it is a better idea to design the network to have this property built-in. We aim to achieve this by using group equivariant convolutional layers followed by some permutation invariant operation (e.g., min, max, sum or arithmetic mean).
The dataset consists of 1000 stereo images each consisting of one left and one right image. Each image has 3 channels (RGB) with resolution 879x400 (WxH). The possible label values are 'cloudy' (0) and 'sunny' (1). The dataset is perfectly balanced with 500 samples of each label.
Figure: Two images (index 2 and 3) from the dataset (left and right view) with labels 'sunny' and 'cloudy'.
Note: The two cameras used were mounted on top of a car, with one in the center and one on the right-hand side, 54 cm apart. Still, we will call the center and right images the left view and the right view, respectively.
In practice, each view is resized to 400x200 (WxH) to reduce memory requirements and to speed up loading data to GPU. The data loader returns tensors of shape utils/dataloader.py
.
We now describe the symmetries that we want the model to be invariant under.
The symmetry group of a (non-square) rectangle is the dihedral group
- $e$: rotation by 0 (identity),
- $r$: rotation by $\pi$,
- $m_h$: mirroring along the horizontal axis, and
- $m_v$: mirroring along the vertical axis.
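The four symmetries can be tried out on a small grid. The sketch below uses nested Python lists instead of tensors and derives the multiplication table by composing the actions on a probe image. It assumes that $m_h$ reverses the order of the rows and $m_v$ reverses the order of the columns; the project may use the opposite naming.

```python
def identity(img):
    return [row[:] for row in img]

def mirror_h(img):
    """Assumed meaning of m_h: reverse the order of the rows."""
    return [row[:] for row in reversed(img)]

def mirror_v(img):
    """Assumed meaning of m_v: reverse the order of the columns."""
    return [row[::-1] for row in img]

def rotate_pi(img):
    """Rotation by pi: reverse both rows and columns."""
    return [row[::-1] for row in reversed(img)]

names = ["e", "r", "m_h", "m_v"]
actions = [identity, rotate_pi, mirror_h, mirror_v]

# A 2x3 probe with distinct entries, so distinct actions give distinct results.
probe = [[1, 2, 3], [4, 5, 6]]
images = [g(probe) for g in actions]

# cayley[i][j] names the element obtained by applying actions[j], then actions[i].
cayley = [[names[images.index(gi(gj(probe)))] for gj in actions]
          for gi in actions]

for name, row in zip(names, cayley):
    print(f"{name:>3} | " + " ".join(f"{entry:>3}" for entry in row))
```

Every element is its own inverse and the table is symmetric, so the group is abelian.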
Table: The Cayley table of
We could let two copies of
Figure: We have three non-trivial
The following table shows the different symmetries applied to an example image:
The actual implementation of the group actions can be found in `utils/group.py`. The functions and the relations between them are then given to the `Group` class constructor as a list of functions and a Cayley table, respectively. The group equivariant layers then take an instance of the `Group` class as the first argument.
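To make the idea of "a list of functions plus a Cayley table" concrete, here is a minimal sketch of such a container. The class name `SketchGroup`, its methods and the index convention (0 = $e$, 1 = $r$, 2 = $m_h$, 3 = $m_v$) are hypothetical; the real `Group` class in `utils/group.py` may look quite different.

```python
class SketchGroup:
    """Hypothetical container: group actions plus an index-valued Cayley table."""

    def __init__(self, functions, cayley_table):
        n = len(functions)
        if len(cayley_table) != n or any(len(row) != n for row in cayley_table):
            raise ValueError("Cayley table must be n x n")
        self.functions = list(functions)
        self.table = [list(row) for row in cayley_table]
        # The identity is the element whose row and column leave every index fixed.
        self.identity = next(
            e for e in range(n)
            if all(self.table[e][j] == j == self.table[j][e] for j in range(n))
        )
        # The inverse of i is the element j with i * j equal to the identity.
        self.inverses = [self.table[i].index(self.identity) for i in range(n)]

    def __len__(self):
        return len(self.functions)

    def product(self, i, j):
        """Index of the element 'apply j, then i'."""
        return self.table[i][j]

    def act(self, i, x):
        return self.functions[i](x)


def flip(img, rows, cols):
    """Reverse the row and/or column order of a nested list."""
    img = img[::-1] if rows else img
    return [row[::-1] if cols else row[:] for row in img]

functions = [
    lambda x: flip(x, False, False),  # e
    lambda x: flip(x, True, True),    # r
    lambda x: flip(x, True, False),   # m_h (assumed to reverse the rows)
    lambda x: flip(x, False, True),   # m_v (assumed to reverse the columns)
]
table = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 3, 0, 1], [3, 2, 1, 0]]
group = SketchGroup(functions, table)

# The table should agree with actually composing the functions.
probe = [[1, 2, 3], [4, 5, 6]]
consistent = all(
    group.act(group.product(i, j), probe) == group.act(i, group.act(j, probe))
    for i in range(len(group)) for j in range(len(group))
)
print(group.identity, group.inverses, consistent)
```

Keeping the table as data lets equivariant layers look up products and inverses by index, without composing functions at run time.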
Remark: In the first implementation, the two views were stacked side-by-side and treated as a single rectangle with the natural
We now give a more theoretical perspective on the action of
Let
Let
on the generators
We lift the
Since in
Note: The map
In practice, a stereo image is represented by a tensor of shape torch.flip()
function. The group acts in exactly the same way on stereo feature maps and weights (kernels) since these can also be viewed as signals on the domain
The CNN model is more or less a standard CNN. The only difference is that we treat the two views, left and right, separately when performing convolution, pooling and batch normalization. In other words, the right and left views have their own sets of weights and biases. Convolution, max pooling and batch normalization layers for stereo images are implemented in `models/stereoconv.py`. We use circular padding on each view (wrapping around in both directions). The implementation of the CNN model can be found in `models/cnn.py`.
It is well known that convolution is equivariant under translations. The idea for the SmoothCNN model is to force the CNN model to be invariant under symmetries. In the GCNN model, we build
The CNN model has 28637 learnable parameters and consists of the following layers:
Layer (type:depth-idx) Param #
======================================================================
CNNModel --
├─Sequential: 1-1 --
│ └─StereoConvBlock: 2-1 --
│ │ └─Sequential: 3-1 1,920
│ └─StereoMaxPool2d: 2-2 --
│ └─StereoConvBlock: 2-3 --
│ │ └─Sequential: 3-2 9,312
│ └─StereoMaxPool2d: 2-4 --
│ └─StereoConvBlock: 2-5 --
│ │ └─Sequential: 3-3 4,704
│ └─StereoConvBlock: 2-6 --
│ │ └─Sequential: 3-4 2,352
│ └─StereoConvBlock: 2-7 --
│ │ └─Sequential: 3-5 1,200
│ └─StereoMaxPool2d: 2-8 --
│ └─StereoConvBlock: 2-9 --
│ │ └─Sequential: 3-6 600
│ └─StereoMaxPool2d: 2-10 --
│ └─StereoConvBlock: 2-11 --
│ │ └─Sequential: 3-7 312
│ └─StereoConvBlock: 2-12 --
│ │ └─Sequential: 3-8 156
├─Sequential: 1-2 --
│ └─Linear: 2-13 8,040
│ └─Linear: 2-14 41
│ └─Sigmoid: 2-15 --
======================================================================
We now describe the most naive approach to obtain a (non-trivial)
Figure: A diagram showing the SmoothCNN model. The function
The SmoothCNN model is just the CNN model with a modified `forward()` method that averages the output probabilities over all transformed versions of a stereo image. The implementation can be found in `models/smoothcnn.py`.
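The averaging step can be sketched without any deep learning framework. In the example below, `smooth_predict`, the toy `model` and the list-based flips are illustrative assumptions; the real `forward()` operates on batched stereo tensors.

```python
def flip(img, rows, cols):
    """Reverse the row and/or column order of a nested list."""
    img = img[::-1] if rows else img
    return [row[::-1] if cols else row[:] for row in img]

# The four symmetries as functions: identity, rotation by pi and two mirrorings.
transforms = [
    lambda x: flip(x, False, False),
    lambda x: flip(x, True, True),
    lambda x: flip(x, True, False),
    lambda x: flip(x, False, True),
]

def smooth_predict(model, x, transforms):
    """Average the model output over all transformed versions of the input."""
    outputs = [model(t(x)) for t in transforms]
    return sum(outputs) / len(outputs)

# A toy "model" that is clearly not invariant: it only looks at one corner.
model = lambda img: img[0][0]

x = [[0.125, 0.25], [0.5, 0.75]]
plain = [model(t(x)) for t in transforms]
smoothed = [smooth_predict(model, t(x), transforms) for t in transforms]
print(plain)     # differs between transformed inputs
print(smoothed)  # identical for every transformed input
```

Because the transformations form a group, transforming the input only reorders the terms of the average, which is why the smoothed output does not change.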
We will now discuss the group equivariant convolutional neural network. Implementations of the different layers mentioned below can be found in `models/groupconv.py` and the final model can be found in `models/gcnn.py`.
The first layer in our group equivariant convolutional neural network is often called the lifting convolution and takes a signal on
where
In practice, we have more than one feature map (out channel). The definition is the same, but with an additional variable indexing the out channels.
The lifting convolutional layer is
In practice, the lifting convolution is implemented by performing normal convolution with all transformed versions of the filter
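The following sketch illustrates this on a single-channel image stored as nested lists. It uses cross-correlation (as deep learning libraries do under the name convolution) with circular wrap-around and an odd-sized kernel. The names, the flip conventions and the single-view setting are assumptions; the stereo layers in `models/groupconv.py` are more involved.

```python
def flip(a, rows, cols):
    """Reverse the row and/or column order of a nested list."""
    a = a[::-1] if rows else a
    return [r[::-1] if cols else r[:] for r in a]

# Group elements as (flip rows?, flip cols?): e, r, m_h, m_v.
# Which mirroring is called m_h is an assumption.
GROUP = [(False, False), (True, True), (True, False), (False, True)]

def compose(g, h):
    """Group product; flips commute, so composing them is an elementwise XOR."""
    return (g[0] != h[0], g[1] != h[1])

def circular_correlate(f, k):
    """Cross-correlate image f with an odd-sized kernel k, wrapping around."""
    H, W = len(f), len(f[0])
    kh, kw = len(k), len(k[0])
    ch, cw = kh // 2, kw // 2
    return [[sum(f[(i + a - ch) % H][(j + b - cw) % W] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(W)] for i in range(H)]

def lifting_correlation(f, k):
    """One output map per group element, computed with the transformed kernel."""
    return [circular_correlate(f, flip(k, *g)) for g in GROUP]

f = [[(3 * i + 7 * j + i * j) % 11 for j in range(5)] for i in range(4)]
k = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
out = lifting_correlation(f, k)

# Equivariance: transforming the input by h permutes the group axis
# and applies h spatially to each output map.
equivariant = all(
    lifting_correlation(flip(f, *h), k)[GROUP.index(g)]
    == flip(out[GROUP.index(compose(h, g))], *h)
    for h in GROUP for g in GROUP
)
print(equivariant)
```

The check compares integer arrays exactly, so no numerical tolerance is needed.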
Once our stereo images (signals on
Note that since
Batch normalization is done in a similar way to the case of stereo images. However, we only have one scale and one bias parameter for each
We also have a group pooling layer which computes the average (or sum/min/max) over the group dimension. This layer can also be set to average over all transformed versions of the input to force a
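A minimal sketch of pooling over the group dimension is given below, using one nested-list feature map per group element. It assumes that a group element acts on such a stack by permuting the group axis and flipping each map spatially (the regular action); the function names are hypothetical.

```python
def flip(a, rows, cols):
    """Reverse the row and/or column order of a nested list."""
    a = a[::-1] if rows else a
    return [r[::-1] if cols else r[:] for r in a]

# Group elements as (flip rows?, flip cols?): e, r, m_h, m_v.
GROUP = [(False, False), (True, True), (True, False), (False, True)]

def compose(g, h):
    return (g[0] != h[0], g[1] != h[1])

def act_on_stack(h, stack):
    """Assumed regular action: permute the group axis and flip each map."""
    return [flip(stack[GROUP.index(compose(h, g))], *h) for g in GROUP]

def group_pool(stack, reduce=sum):
    """Reduce over the group axis, position by position."""
    H, W = len(stack[0]), len(stack[0][0])
    return [[reduce(m[i][j] for m in stack) for j in range(W)]
            for i in range(H)]

# One 2x3 feature map per group element, filled with arbitrary integers.
stack = [[[(g + 1) * (3 * i + j) + g for j in range(3)] for i in range(2)]
         for g in range(4)]

# Pooling over the group axis commutes with the spatial action ...
pool_equivariant = all(
    group_pool(act_on_stack(h, stack), reduce=op)
    == flip(group_pool(stack, reduce=op), *h)
    for h in GROUP for op in (sum, max, min)
)

# ... and a further reduction over all spatial positions is fully invariant.
totals = {sum(map(sum, group_pool(act_on_stack(h, stack)))) for h in GROUP}
print(pool_equivariant, totals)
```

Any permutation invariant reduction works here, since the group action only reorders the values being reduced at each position.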
We create a small GCNN consisting of a lifting convolutional layer, followed by two models/groupconv.py
, this is very easily accomplished:
```python
self.conv1 = nn.Sequential(StereoZ2ConvG(group, 3, 8, 3, 1), StereoGBatchNorm2d(group, 8), nn.ReLU())
self.conv2 = StereoGConvBlock(group, 8, 8, 3, 1)
self.conv3 = StereoGConvBlock(group, 8, 3, 5, 2)
self.g_pool = StereoGAveragePool(reduction="sum")
```
Here we use three channels in the last convolutional layer so that we can visualize the activations as RGB images. The following illustrative examples were created by running `equivariance_testing.py`.
Symmetry | Input | Activation |
---|---|---|
Table: Letting a symmetry act on a stereo image before applying the GCNN
Now, to force invariance under the
```python
self.g_pool = StereoGAveragePool(group, reduction="sum")
```
When the group is passed to this layer, we also reduce over all transformed versions of the input, which forces the output to be invariant. In the larger GCNN model, we do this right before the last two fully connected layers, preserving equivariance as deep in the network as possible. The output activation then looks as follows (regardless of which symmetry is applied to the input):
Figure: Output activation when we ask the group pooling layer to force invariance.
The GCNN model has 28121 learnable parameters (slightly fewer than the CNN and SmoothCNN models) and consists of the following layers:
Layer (type:depth-idx) Param #
======================================================================
GCNNModel --
├─Sequential: 1-1 --
│ └─StereoZ2ConvG: 2-1 872
│ └─StereoGBatchNorm2d: 2-2 16
│ │ └─GroupNorm: 3-1 --
│ │ └─GroupNorm: 3-2 --
│ └─ReLU: 2-3 --
├─Sequential: 1-2 --
│ └─StereoGMaxPool2d: 2-4 --
│ └─StereoGConvBlock: 2-5 --
│ │ └─Sequential: 3-3 9,240
│ └─StereoGConvBlock: 2-6 --
│ │ └─Sequential: 3-4 4,632
│ └─StereoGMaxPool2d: 2-7 --
│ └─StereoGConvBlock: 2-8 --
│ │ └─Sequential: 3-5 2,328
│ └─StereoGConvBlock: 2-9 --
│ │ └─Sequential: 3-6 1,176
│ └─StereoGMaxPool2d: 2-10 --
│ └─StereoGConvBlock: 2-11 --
│ │ └─Sequential: 3-7 1,176
│ └─StereoGConvBlock: 2-12 --
│ │ └─Sequential: 3-8 600
│ └─StereoGAveragePool: 2-13 --
├─Sequential: 1-3 --
│ └─Linear: 2-14 8,040
│ └─ReLU: 2-15 --
│ └─Linear: 2-16 41
│ └─Sigmoid: 2-17 --
======================================================================
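The parameter totals quoted for the CNN and GCNN models can be checked directly against the per-layer counts in the two summaries. The sketch below also shows that the counts of the two fully connected layers are consistent with 200 input features and a hidden width of 40; these sizes are inferred from the numbers and not confirmed by the source.

```python
# Per-layer parameter counts copied from the two model summaries.
cnn_params = [1920, 9312, 4704, 2352, 1200, 600, 312, 156, 8040, 41]
gcnn_params = [872, 16, 9240, 4632, 2328, 1176, 1176, 600, 8040, 41]

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

cnn_total = sum(cnn_params)
gcnn_total = sum(gcnn_params)
print(cnn_total, gcnn_total, cnn_total - gcnn_total)

# Inferred (not confirmed) sizes of the classifier head.
print(linear_params(200, 40), linear_params(40, 1))
```

Both models share the same classifier head, so the difference in size comes entirely from the convolutional parts.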
All models were trained with the Adam optimizer (learning rate 1e-4 and weight decay 1e-2) and a batch size of 16. Validation was performed four times every epoch, and training was stopped early if there was no improvement (hysteresis set to 0.01) in validation accuracy for 20 validation steps (5 epochs).
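The early stopping rule can be sketched as a small helper. The class name `EarlyStopper` and its interface are hypothetical, and the sketch assumes that the hysteresis means a validation accuracy only counts as an improvement if it exceeds the best value so far by more than 0.01.

```python
class EarlyStopper:
    """Hypothetical early stopper acting on validation accuracy."""

    def __init__(self, patience=20, min_delta=0.01):
        self.patience = patience    # validation steps allowed without improvement
        self.min_delta = min_delta  # hysteresis: required margin over the best value
        self.best = float("-inf")
        self.bad_steps = 0

    def step(self, val_accuracy):
        """Record one validation result; return True when training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience


# Usage with a short patience so that the effect is visible.
stopper = EarlyStopper(patience=3, min_delta=0.01)
history = [0.50, 0.60, 0.605, 0.60, 0.609, 0.70]
stopped_at = None
for step_index, acc in enumerate(history):
    if stopper.step(acc):
        stopped_at = step_index
        break
print(stopped_at, stopper.best)
```

In this run the values 0.605, 0.60 and 0.609 all stay within the margin of the best accuracy 0.60, so training stops before the later jump to 0.70 is ever seen. This is the usual trade-off when choosing the patience.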
Figure: Loss and accuracy for training and validation data for the CNN model. The model trained for 46 epochs before the early stopper terminated the training. The mean time used for each epoch (including validation steps) was 7.19 seconds.
Figure: Loss and accuracy for training and validation data for the SmoothCNN model. The model trained for 34 epochs before the early stopper terminated the training. The mean time used for each epoch (including validation steps) was 8.86 seconds.
Figure: Loss and accuracy for training and validation data for the GCNN model. The model trained for the maximum of 50 epochs and was not terminated by the early stopper. The mean time used for each epoch (including validation steps) was 12.27 seconds.
The following table shows the performance of each model on the test dataset and the (mean) time spent on each epoch during training:
Model | Test Accuracy | Mean epoch time |
---|---|---|
CNN | 0.9265 | 7.19s |
SmoothCNN | 0.9638 | 8.86s |
GCNN | 0.9737 | 12.27s |
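To put the numbers in the table into perspective, the following sketch expresses each model relative to the CNN baseline: the accuracy gain in percentage points, the remaining test error as a fraction of the CNN's error, and the relative epoch time. Only the values from the table are used; the variable names are illustrative.

```python
# (test accuracy, mean epoch time in seconds) from the results table.
results = {
    "CNN": (0.9265, 7.19),
    "SmoothCNN": (0.9638, 8.86),
    "GCNN": (0.9737, 12.27),
}

base_acc, base_time = results["CNN"]
summary = {}
for name, (acc, seconds) in results.items():
    gain_pp = 100 * (acc - base_acc)          # accuracy gain in percentage points
    error_ratio = (1 - acc) / (1 - base_acc)  # remaining error relative to the CNN
    slowdown = seconds / base_time            # epoch time relative to the CNN
    summary[name] = (gain_pp, error_ratio, slowdown)
    print(f"{name:10s} +{gain_pp:.2f} pp  error x{error_ratio:.2f}  time x{slowdown:.2f}")
```

Viewed as error rates rather than accuracies, the gap between the models is more pronounced than the raw accuracy differences suggest.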
When using more training data, the standard CNN easily reaches >99% test accuracy, leaving little room for improvement. To make the task more challenging, we use only 400 images in our training dataset. Even so, the gains are modest: the SmoothCNN and GCNN models improve test accuracy over the CNN by about 3.7 and 4.7 percentage points, respectively. The CNN is also the fastest of the three models, both in training and evaluation. The SmoothCNN is a sort of all-rounder, performing better than the standard CNN while being faster than the GCNN. It should be noted that there likely is room for optimizing the implementation of the group equivariant layers.
It would be interesting to compare the models on an even more challenging task such as predicting depth maps from the stereo images. Another task we could test is the rotated (and/or mirrored) MNIST dataset. Even though the handwritten digit six is a transformed version of the digit nine, it would be interesting to see if the GCNN could differentiate them based on subtle differences in how we (humans) write the different digits.
Of course, there is room for experimentation when it comes to architecture and hyper-parameters. Since the sum of two equivariant layers is equivariant, we can also create residual blocks allowing for deeper networks by reducing the risk of vanishing gradients. We could also try to train the standard CNN model with data augmentation and compare to the GCNN model.
The code provided for the GCNN network can easily be modified to work with other finite groups and regular images.