"a future in the past" -- Assassin's Creed
This repository contains the reference code for the paper Efficient Modeling of Future Context for Image Captioning. In this paper, we aim to use a mask-based non-autoregressive image captioning (NAIC) model to improve a conventional image captioning model through dynamic distribution calibration. Since the NAIC model is only applied to calibrate the generated sentence, its length predictor is dropped.
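As a rough illustration of the idea (not the repository's actual implementation), the calibration can be thought of as a KL term that pulls the autoregressive captioner's token distributions toward those of the NAIC teacher, which has access to future context. All function and variable names, and the exact loss weighting, are assumptions for this sketch:

```python
# Minimal sketch (assumed names, not the repository's actual code):
# calibrating the autoregressive captioner with a mask-based NAIC teacher.
import torch
import torch.nn.functional as F

def calibration_loss(student_logits, teacher_logits, captions, pad_id, alpha=0.5, tau=1.0):
    """student_logits, teacher_logits: (batch, seq_len, vocab); captions: (batch, seq_len)."""
    # Standard cross-entropy against the ground-truth caption.
    ce = F.cross_entropy(student_logits.transpose(1, 2), captions, ignore_index=pad_id)
    # KL term pulling the autoregressive distribution toward the NAIC teacher's,
    # whose predictions are conditioned on future context.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="none",
    ).sum(-1)
    mask = (captions != pad_id).float()
    kl = (kl * mask).sum() / mask.sum()
    return (1 - alpha) * ce + alpha * (tau ** 2) * kl
```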
torch==1.10.1
transformers==4.11.3
clip
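These dependencies can be installed, for example, with pip install torch==1.10.1 transformers==4.11.3 followed by pip install git+https://github.com/openai/CLIP.git (the official CLIP package from OpenAI).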
To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file annotations.zip and extract it. Image representations are first computed with the pre-trained model provided by CLIP.
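As a minimal sketch of this feature-extraction step (the CLIP backbone and output handling below are assumptions; the repository may use a different variant), images can be encoded as follows:

```python
# Sketch of extracting image representations with a pre-trained CLIP model.
# The ViT-B/32 backbone is an assumption made for this example.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_image(path):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image)  # (1, 512) for ViT-B/32
    return features.float().cpu()
```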
First, run python train_NAIC.py
to obtain the non-autoregressive image captioning model, which serves as the teacher model. Then, run python train_combine.py
to perform distribution calibration of the conventional transformer image captioning model.
Training arguments are as follows:
Argument | Possible values |
---|---|
--batch_size | Batch size (default: 10) |
--workers | Number of workers (default: 0) |
--warmup | Warmup value for learning rate scheduling (default: 10000) |
--resume_last | If used, the training will be resumed from the last checkpoint. |
--data_path | Path to COCO dataset file |
--annotation_folder | Path to folder with COCO annotations |
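For example, a training run might look like the following, where the paths are placeholders for your local setup:
python train_combine.py --batch_size 10 --workers 0 --data_path /path/to/coco_features --annotation_folder /path/to/annotations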
To reproduce the results reported in our paper, download the pretrained model file from Google Drive and place it in the ckpt folder.
Run python inference.py
using the following arguments:
Argument | Possible values |
---|---|
--batch_size | Batch size (default: 10) |
--workers | Number of workers (default: 0) |
--data_path | Path to COCO dataset file |
--annotation_folder | Path to folder with COCO annotations |
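For example, with placeholder paths:
python inference.py --batch_size 10 --data_path /path/to/coco_features --annotation_folder /path/to/annotations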
This repository is based on M2T and Huggingface, and you may refer to them for more details about the code.