Paper: Explain and improve: LRP-inference fine-tuning for image captioning models (Link)
This is a PyTorch implementation of the latest version of Understanding Image Captioning Model beyond Visualizing Attention. With this repo, we can:
- Train image captioning models with two kinds of attention mechanisms: adaptive attention and multi-head attention.
- Get both image explanations and linguistic explanations for a predicted word using LRP, Grad-CAM, Guided Grad-CAM, and Guided Backpropagation.
- Fine-tune a pre-trained image captioning model with LRP-inference fine-tuning to improve the mAP of frequent object words.
Requirements: Python >= 3.6 and PyTorch 1.4.0.
We prepare Flickr30K following the Karpathy split.
For MSCOCO2017, we select 110000 images from the training set for training and 5000 images from the training set for validation. The original validation set is used for testing.
The vocabulary is built on the training set for both datasets. Each caption is encoded with a <start> token at the beginning and an <end> token at the end.
Words that appear fewer than 3 times in Flickr30K, or fewer than 4 times in MSCOCO2017, are encoded with an <unk> token.
To build the vocabulary and encode the reference captions, please refer to preparedataset.py.
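As a rough illustration of the encoding described above, here is a minimal sketch in plain Python. It is not the code in preparedataset.py: the function names `build_vocab` and `encode_caption`, the `min_count` parameter, and the index order of the special tokens are assumptions made for this example.

```python
from collections import Counter

def build_vocab(captions, min_count):
    """Map each word seen at least `min_count` times to an integer index.

    `captions` is a list of tokenized captions (lists of strings).
    The special-token order below is an assumption for this sketch.
    """
    counts = Counter(tok for cap in captions for tok in cap)
    specials = ["<start>", "<end>", "<unk>"]
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(specials + words)}

def encode_caption(tokens, vocab):
    """Wrap a caption in <start>/<end> and replace rare words with <unk>."""
    unk = vocab["<unk>"]
    body = [vocab.get(t, unk) for t in tokens]
    return [vocab["<start>"]] + body + [vocab["<end>"]]

# Toy training captions; "runs" and "cat" appear only once each.
captions = [["a", "dog", "runs"], ["a", "dog", "sits"], ["a", "cat", "sits"]]
vocab = build_vocab(captions, min_count=2)
encoded = encode_caption(["a", "cat", "runs"], vocab)
print(encoded)  # [0, 3, 2, 2, 1]
```

Under this sketch, `min_count` would be 3 for Flickr30K and 4 for MSCOCO2017.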
This repo experiments with both CNN features and bottom-up features. The CNN features are extracted from a VGG16 pre-trained on ImageNet. We follow py-bottom-up-attention to extract 36 bottom-up features per image for training.
We train the image captioning models with two attention mechanisms: adaptive attention with an LSTM layer as the predictor, and multi-head attention with an FC layer as the predictor.
The two models are defined in gridTDmodel.py and aoamodel.py respectively.
Our pre-trained models can be downloaded here. Please email sunjiamei.hit@gmail.com if you cannot access them.
We evaluate the image captioning models using the BLEU, SPICE, ROUGE, METEOR, and CIDEr metrics. We also use BERTScore. To generate these evaluations, download the pycocoevalcap tools and copy the folders of the different metrics under ./pycocoevalcap.
The bert folder is already provided.
We provide three decoding methods:
- greedy search
- beam search
- diverse beam search
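To make the difference between the first two decoding methods concrete, the following is a generic beam search sketch in plain Python, not the repo's decoder. The `next_log_probs` callback, the toy transition table, and all token names are hypothetical. Setting `beam_size=1` reduces it to greedy search. Diverse beam search, which additionally penalizes similarity between groups of beams, is not covered here.

```python
import math

def beam_search(next_log_probs, start, end, beam_size=3, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    `next_log_probs(seq)` is a hypothetical callback returning a dict
    {token: log-probability} for the next token given the prefix `seq`.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses ending in `end` leave the beam and are kept aside.
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])[0]

# Toy model: the next-token distribution depends only on the last token.
TABLE = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}

def toy_model(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

greedy = beam_search(toy_model, "<start>", "<end>", beam_size=1)
beam = beam_search(toy_model, "<start>", "<end>", beam_size=2)
print(greedy)  # ['<start>', 'a', 'dog', '<end>']
print(beam)    # ['<start>', 'the', 'dog', '<end>']
```

In the toy model, greedy search commits to "a" (probability 0.6) and ends with a caption of probability 0.33, while a beam of size 2 keeps "the" alive and finds the caption with probability 0.36.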
We provide LRP, Grad-CAM, Guided Grad-CAM, and Guided Backpropagation to explain the image captioning models. These explanation methods are defined under the corresponding model files.
There are two stages of explanation. We first explain the decoder to get the explanation of each preceding word and of the encoded image features. We then explain the image encoder to obtain the image explanations.
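For readers unfamiliar with LRP, the sketch below shows the generic epsilon rule for a single bias-free linear layer in plain Python. It only illustrates how relevance is redistributed from outputs to inputs. The propagation rules actually used for each layer type are those in the model files and may differ; the function name `lrp_epsilon_linear` and the toy numbers are assumptions.

```python
def lrp_epsilon_linear(x, weights, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs of a linear layer.

    x: list of n input activations.
    weights: n x m nested list, weights[i][j] connects input i to output j.
    relevance_out: list of m relevance scores assigned to the outputs.
    Implements R_i = sum_j x_i * w_ij / (z_j + eps * sign(z_j)) * R_j,
    with z_j = sum_i x_i * w_ij (no bias term in this sketch).
    """
    n, m = len(x), len(relevance_out)
    z = [sum(x[i] * weights[i][j] for i in range(n)) for j in range(m)]
    relevance_in = [0.0] * n
    for j in range(m):
        # The stabilizer keeps the denominator away from zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n):
            relevance_in[i] += x[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in

x = [1.0, 2.0]
weights = [[0.5, -1.0], [0.25, 1.0]]
relevance_out = [1.0, 0.0]  # explain the first output unit only
relevance_in = lrp_epsilon_linear(x, weights, relevance_out)
print(relevance_in)
```

With a small `eps` and no bias, the relevance is approximately conserved: the input relevances sum to roughly the total output relevance. Applying such a rule layer by layer, first through the decoder and then through the image encoder, corresponds to the two stages described above.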
We provide three optimization methods to optimize image captioning models trained with cross-entropy loss:
- --cider_tune: the SCST optimization on a pre-trained model
- --lrp_cider_tune: the LRP-inference SCST optimization
- --lrp_tune: the LRP-inference fine-tuning with cross-entropy loss
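As background for the SCST options, here is a minimal sketch of the self-critical loss for one caption in plain Python. It assumes the reward is a caption-level score such as CIDEr, that the baseline is the reward of the greedily decoded caption, and that per-token log-probabilities of the sampled caption are available. The function name `scst_loss` and the numbers are hypothetical, and the LRP-inference variants are not shown.

```python
def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled caption.

    sample_log_probs: log-probabilities of the sampled tokens.
    sample_reward: reward (e.g. CIDEr) of the sampled caption.
    greedy_reward: reward of the greedy caption, used as the baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall below it.
    return -advantage * sum(sample_log_probs)

loss = scst_loss([-0.1, -0.2, -0.3], sample_reward=1.2, greedy_reward=0.9)
print(round(loss, 4))  # 0.18
```

In a real training loop the log-probabilities would be framework tensors so that the loss can be backpropagated; plain floats are used here only to show the arithmetic.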
Please refer to the examples in evaluatioin.py. Running them generates the results of our ablation experiment and the correctness scores across the various explanation methods. Download the COCOvalEntities.json file first, as it is needed to calculate the correctness scores.
Many thanks to the works:


