This repo contains two Jupyter notebooks with a simple implementation of the ideas presented in the paper Learning Deep Features for Discriminative Localization. A CNN model is trained on the original MNIST dataset to classify the digits, and its class activation maps (CAMs) are then generated for some examples. These activation maps are a kind of post-hoc attention that identifies the regions of the image that were most relevant to a particular prediction made by a pre-trained model.
- `CAM - class activation maps` contains the model training and testing. The model is named c3 because it is the third iteration; the folders `model_cX`
contain more information on this and the other models. Model c3 is a convolutional NN with 5 stacked conv layers of 16 channels each, except the last one, which has 256 channels. The last conv layer is connected to a Global Average Pooling layer, which transforms the (h, w, 256) volume into a vector with 256 entries. This vector is then multiplied by a matrix W and the result is softmaxed, so the output is a vector with 10 entries corresponding to the probabilities of each class (see the architecture sketch after this list).
- `Visualize CAMs` loads the model c3 with the weights learnt in the other notebook and uses the function `get_CAM` to get the class activation maps. To get the activation map for the digit i, we take the 256 channels output by the last convolutional layer and sum them, weighted by the corresponding entries of the i-th column of the matrix W (see the CAM sketch after this list). Below are some examples of handwritten digits with their respective activation maps shown above them.
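As a rough illustration of the architecture described above, here is a minimal sketch in PyTorch. The class name `C3`, the kernel sizes, the padding, the activations, and the use of PyTorch itself are all assumptions; the actual definition lives in the notebooks.

```python
import torch
import torch.nn as nn

class C3(nn.Module):
    """Sketch of the c3 architecture described above. Kernel sizes,
    padding and activations are assumptions; the notebooks hold the
    actual definition."""

    def __init__(self, num_classes=10):
        super().__init__()
        # 5 stacked conv layers: 16 channels each, except the last with 256.
        channels = [1, 16, 16, 16, 16, 256]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU()]
        self.features = nn.Sequential(*layers)
        self.gap = nn.AdaptiveAvgPool2d(1)  # (256, h, w) -> (256, 1, 1)
        # The matrix W: maps the 256-entry GAP vector to 10 class scores.
        self.W = nn.Linear(256, num_classes, bias=False)

    def forward(self, x):
        fmaps = self.features(x)             # (N, 256, h, w)
        pooled = self.gap(fmaps).flatten(1)  # (N, 256)
        return self.W(pooled)                # logits; softmax applied at loss time
```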
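The weighted sum computed by `get_CAM` could then look like the following sketch; `get_cam`, its signature, and the normalisation step are assumptions, not the notebook's actual code.

```python
def get_cam(model, image, class_idx):
    """Hypothetical stand-in for the notebook's get_CAM: weight the 256
    feature maps of the last conv layer by the class_idx-th column of W
    and sum them into a single (h, w) activation map."""
    with torch.no_grad():
        fmaps = model.features(image.unsqueeze(0))[0]  # (256, h, w)
        weights = model.W.weight[class_idx]            # (256,), i-th column of W
        cam = (weights[:, None, None] * fmaps).sum(dim=0)
        # Normalise to [0, 1] so the map can be rendered as a heatmap.
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam
```

`get_cam(model, image, i)` then returns a low-resolution map that can be upsampled to the input size and overlaid on the digit, as in the examples below.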
Notice that the areas where the model pays more attention (the areas in red) tend to be areas of high curvature (this is particularly evident in the last two handwritten threes). These findings are consistent with the observations of Attneave (see Attneave's cat), who argued that visual information tends to be concentrated at points of extreme curvature.