ClipCap++: Efficient Image Captioning with CLIP

Disclaimer: This is a project developed for an NLP Capstone class at the University of Washington.

We propose an efficient image captioning model that utilizes pretrained image and language models. Our approach, based on prior work (ClipCap: CLIP Prefix for Image Captioning), improves the utilization of CLIP and GPT-2, showing competitive results on COCO Captions without fine-tuning any of the pretrained models.

This code is based on the official implementation of "ClipCap: CLIP Prefix for Image Captioning".

We thank the authors for their work and for sharing their implementation.

Setup

For evaluation, we use the COCO caption evaluation tool. We suggest installing it via:

pip install git+https://github.com/flauted/coco-caption.git@python23

For specific packages, see our conda environment file, environment.yml:

git clone https://github.com/quocthai9120/UW-NLP-Capstone-SP22.git && cd UW-NLP-Capstone-SP22
conda env create -f environment.yml
conda activate clip_prefix_caption

COCO training

Download train_captions to data/coco/annotations. Download the training images and validation images and unzip them (we use the Karpathy et al. split). Additionally, we suggest downloading our copy of the validation captions. Place the data into a directory named data/ within the base directory of this repo.

Extract CLIP features

The output is data/coco/oscar_split_<model_type>_<run_type>.pkl. We support the CLIP model types ViT-B_32 and RN50x4.

# for training data
python parse_coco.py --clip_model_type <model_type> --run_type train
# for validation data
python parse_coco.py --clip_model_type <model_type> --run_type val
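
Before moving on to training, it can help to confirm that both feature files were written. The following is a minimal sketch, not part of this repository: the helper names `feature_path` and `missing_features` are hypothetical, and the only things taken from this README are the output naming pattern and the two supported model types.

```python
from pathlib import Path

# Model types and run types as listed in this README.
SUPPORTED_MODEL_TYPES = ("ViT-B_32", "RN50x4")
RUN_TYPES = ("train", "val")


def feature_path(model_type: str, run_type: str, root: str = "data/coco") -> Path:
    """Build the documented output path for one model type and run type.

    Hypothetical helper: it only mirrors the naming pattern
    data/coco/oscar_split_<model_type>_<run_type>.pkl stated above.
    """
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model type: {model_type!r}")
    if run_type not in RUN_TYPES:
        raise ValueError(f"unsupported run type: {run_type!r}")
    return Path(root) / f"oscar_split_{model_type}_{run_type}.pkl"


def missing_features(model_type: str, root: str = "data/coco") -> list:
    """Return the expected feature files that do not exist on disk yet."""
    paths = [feature_path(model_type, run_type, root) for run_type in RUN_TYPES]
    return [path for path in paths if not path.is_file()]


if __name__ == "__main__":
    # Report any feature file that still needs to be extracted.
    for path in missing_features("ViT-B_32"):
        print(f"missing: {path}")
```

If the script prints nothing, both the train and val feature files are in place for that model type.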

Training the mapping network

The original ClipCap framework has two variants: an MLP mapping network with a fine-tuned GPT-2, and a transformer mapping network with no fine-tuning of GPT-2. We focus on the latter. To train the transformer mapping network:

python train.py --only_prefix --data ./data/coco/oscar_split_<model_type>_train.pkl --out_dir ./coco_train/ --mapping_type transformer  --num_layers 8 --prefix_length 40 --prefix_length_clip 40

Training the Spatial Feature Extraction model:

TODO

Evaluation

To evaluate the model, we first need to save its predictions:

CUDA_VISIBLE_DEVICES=1 python predict.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_val.pkl --text_data ./data/coco/oscar_split_clipcap_base_val.pkl --out_dir ./refinement_v2-concat/ --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40 --weights ./refinement_v2-concat/coco-prefix_refinment-v2-concat_best.pt --tag best

Finally, run evaluation on the predictions by running:

python eval.py --preds_captions refinement_v2-concat/pred_val_caption_best.json
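
The two commands above are linked by the output directory and the tag. The sketch below makes that link explicit so the evaluation command can be derived for another run. It rests on an assumption: the file name pattern `pred_val_caption_<tag>.json` inside `--out_dir` is inferred from the single example in this README and is not confirmed against predict.py. The helper names `predictions_path` and `eval_command` are hypothetical.

```python
from pathlib import Path


def predictions_path(out_dir: str, tag: str) -> Path:
    """Guess where predict.py saved its captions for a given run.

    ASSUMPTION: the pattern <out_dir>/pred_val_caption_<tag>.json is inferred
    from the example commands in this README, not read from predict.py.
    """
    return Path(out_dir) / f"pred_val_caption_{tag}.json"


def eval_command(out_dir: str, tag: str) -> list:
    """Assemble the eval.py invocation shown above as an argument list."""
    return [
        "python",
        "eval.py",
        "--preds_captions",
        predictions_path(out_dir, tag).as_posix(),
    ]


if __name__ == "__main__":
    # Print the command for the run used in the example above.
    print(" ".join(eval_command("./refinement_v2-concat/", "best")))
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the repository root once the predictions file exists.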

Guided Decoding

TODO

Citation

If you use our code for your research, please cite it (along with the original ClipCap work):

# TODO: let's add our report here as well
@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}

