Our Paper VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Download the pretrained GPT-2 weights:
curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
Clone the repository and create the visualgpt conda environment:
conda env create -f environment.yml
conda activate visualgpt
Then download the spaCy data:
python -m spacy download en
We provide the COCO dataset for download. Please download the annotations file annotations.zip and extract it.
Also download coco_detections.hdf5, in which the data is stored as <key, value> pairs,
where the key is the image id and the value is a tensor of shape (N, 2048). N is the number of detections.
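To make the layout above concrete, here is a minimal sketch of a sanity check over the feature file. The helper names `summarize_detections` and `summarize_file` are hypothetical and not part of this repository. The sketch assumes only what is stated above (each value has shape (N, 2048)) and makes no assumption about how keys are spelled; any entry with a different shape is reported rather than treated as an error.

```python
def summarize_detections(store, feat_dim=2048):
    """Return ({key: N}, [other_keys]) for a mapping of key -> array-like.

    `store` can be any mapping whose values expose `.shape`,
    for example an open h5py.File.
    """
    counts = {}
    others = []
    for key in store:
        shape = tuple(store[key].shape)
        if len(shape) == 2 and shape[1] == feat_dim:
            counts[key] = shape[0]  # N = number of detections for this entry
        else:
            others.append(key)  # layout differs from (N, feat_dim)
    return counts, others


def summarize_file(path, feat_dim=2048):
    """Open an HDF5 file read-only and summarize it (requires h5py)."""
    import h5py  # third-party; imported lazily so the helper above has no dependencies

    with h5py.File(path, "r") as f:
        return summarize_detections(f, feat_dim)
```

Calling `summarize_file("coco_detections.hdf5")` after the download should return the per-entry detection counts, which is a quick way to confirm that the file is complete and readable before training.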
Create the log folder:
mkdir logs
and start the training:
python train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
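As a rough guide to what these settings imply, the sketch below works through the arithmetic. It rests on assumptions that are not confirmed here: that --gradient_accumulation_steps follows the usual convention of accumulating gradients over that many mini-batches before each optimizer step, that --train_percentage is a fraction of the training set (0.001 = 0.1%), and that the subset size is obtained by rounding. The function `training_budget` and the dataset size of 100,000 are hypothetical, chosen only for illustration.

```python
import math


def training_budget(num_examples, train_fraction, batch_size, accumulation_steps):
    """Return (subset_size, effective_batch_size, optimizer_updates_per_epoch)."""
    # Number of training examples kept (assumed: fraction of the full set, rounded).
    subset = max(1, round(num_examples * train_fraction))
    # Examples contributing to one optimizer step under gradient accumulation.
    effective_batch = batch_size * accumulation_steps
    # Optimizer steps needed to see the subset once.
    updates_per_epoch = math.ceil(subset / effective_batch)
    return subset, effective_batch, updates_per_epoch


# Hypothetical dataset of 100,000 examples with the flags from the command above.
print(training_budget(100_000, 0.001, 50, 2))  # prints (100, 100, 1)
```

Under these assumptions, a batch size of 50 with 2 accumulation steps behaves like a batch of 100 examples per update.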
This code uses resources from Meshed Memory Transformer and Transformers.
Please cite our paper using the following BibTeX entry:
@InProceedings{Chen_2022_CVPR,
author = {Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
title = {VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {18030-18040}
}