This is the official PyTorch implementation for the WWW 2023 paper:
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
We provide the code for our plug-and-play framework CapEnrich, taking VinVL (Oscar+) as the Vision-Language Pre-training (VLP) backbone. Our code is built on the VinVL repo.
First, install the requirements that VinVL needs by following its INSTALL.md.
Then install the other requirements and CLIP:
$ conda activate oscar
$ pip install ftfy regex tqdm spacy
$ pip install git+https://github.com/openai/CLIP.git
Install the coco_caption evaluation code:
pip install git+https://github.com/jmhessel/pycocoevalcap.git
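To quickly verify that CLIP is installed correctly, a one-line check like the following (the model name and device are only illustrative) should print the shape of the tokenized caption, e.g. torch.Size([1, 77]):
python -c "import torch, clip; model, preprocess = clip.load('ViT-B/32', device='cpu'); print(clip.tokenize(['a test caption']).shape)"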
Download the image features and text annotations of the MSCOCO dataset, as well as the released pre-trained VinVL model, all available from the VinVL repo page.
The raw images, region features, and annotations of the MSCOCO dataset should be put in ./oscar/datasets/
The officially released VinVL_base model (after two-stage CE and RL fine-tuning on the MSCOCO dataset) should be put in ./oscar/pretrained_model/
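For reference, a possible directory layout (inferred from the command-line arguments used below; exact file names depend on what you download and generate) is:
./oscar/
    datasets/
        {}_prefix_prompts.json        # new-format caption files produced by the data construction step below (--caption_file)
        coco_caption/                 # region features, captions, and split yaml files, e.g. test.yaml, test_caption_coco_format.json
    pretrained_model/
        coco_captioning_base_scst/
            checkpoint-15-66405/      # released VinVL_base checkpoint (CE + RL fine-tuned)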
Construct new-format data of the form "generic caption, details" from the MSCOCO dataset:
- Extract scene graphs from all annotations using the SceneGraphParser tool (a minimal usage sketch follows this list):
# install scene graph parser tool
pip install SceneGraphParser
python -m spacy download en
# get scene graphs
cd process_data/
python get_scenegraphs.py
- Aggregate multiple annotations into a single, more detailed one based on the scene graphs:
python newdata_construct.py
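As a minimal sketch of what the parser returns for a single sentence (get_scenegraphs.py runs it over all annotation files; the example sentence is just for illustration), assuming the sng_parser API:
import sng_parser

graph = sng_parser.parse('A woman is playing the piano in the room.')
print(graph['entities'])     # detected objects with their modifiers
print(graph['relations'])    # subject-relation-object triples between entities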
Refer to run.sh; the specific commands are as follows:
cd ..
python setup.py build develop
cd oscar
CUDA_VISIBLE_DEVICES=3 python run_captioning.py \
--model_name_or_path ./pretrained_model/coco_captioning_base_scst/checkpoint-15-66405 \
--do_train \
--do_lower_case \
--add_od_labels \
--learning_rate 3e-4 \
--per_gpu_train_batch_size 48 \
--num_train_epochs 30 \
--tie_weights \
--freeze_embedding \
--label_smoothing 0.1 \
--drop_worst_ratio 0.2 \
--drop_worst_after 20000 \
--caption_file './datasets/{}_prefix_prompts.json' \
--data_dir './datasets/coco_caption' \
--evaluate_during_training \
--save_epochs 1 \
--n_ctx 2 \
--ctx_init "" \
--output_dir experiments/output_3e-4_nctx2_random
The number of learnable prompt tokens can be set with --n_ctx, e.g., 2, 4, 6, or 8; the default is 2.
The initialization of the prompt tokens can be set with --ctx_init: 1) random initialization from a zero-mean Gaussian distribution with --ctx_init '', or 2) initialization from the embeddings of specified words, e.g., --ctx_init 'the man'.
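As a rough, illustrative sketch of what these two options correspond to (the function and variable names here are made up, not the repo's actual implementation):
import torch
import torch.nn as nn

def init_prompt_ctx(n_ctx, hidden_size, ctx_init='', tokenizer=None, word_embeddings=None):
    if ctx_init:
        # option 2: start from the embeddings of the given words, e.g. ctx_init='the man'
        token_ids = tokenizer.encode(ctx_init, add_special_tokens=False)
        ctx_vectors = word_embeddings(torch.tensor(token_ids)).clone().detach()
    else:
        # option 1: random initialization from a zero-mean Gaussian
        ctx_vectors = torch.empty(n_ctx, hidden_size)
        nn.init.normal_(ctx_vectors, mean=0.0, std=0.02)
    return nn.Parameter(ctx_vectors)    # learnable prompt embeddings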
Refer to inference.sh and set the checkpoint path with --eval_model_dir:
# generate more details on the test set
CUDA_VISIBLE_DEVICES=4 python end_uni_predict.py \
--do_predict \
--predict_yaml test.yaml \
--per_gpu_eval_batch_size 1 \
--num_beams 5 \
--max_gen_length 40 \
--data_dir ./datasets/coco_caption \
--output_dir eval_results \
--output_file output_3e-4_nctx2_random.json \
--eval_model_dir experiments/output_3e-4_nctx2_random/best_checkpoint \
--caption_file './eval_results/vinvl_result.json'
# aggregate multiple generated captions
cd process_data/
python post_process.py
The generated captions are available in ./oscar/eval_results/
Run the captioning accuracy metrics, including SPICE, CLIPScore, and RefCLIPScore, as follows:
cd metrics/clipscore
python eval.py --testfile your_test_file --annofile your_gt_file
An example is:
python eval.py --testfile '../../eval_results/vinvl_result.json' --annofile '../../datasets/coco_caption/test_caption_coco_format.json'
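For reference, CLIPScore (Hessel et al., 2021) is w * max(cos(E_image, E_caption), 0) with w = 2.5, and RefCLIPScore is the harmonic mean of CLIPScore and the maximum cosine similarity between the candidate and the reference captions. A minimal sketch of the core computation with the CLIP package (the image path and caption are placeholders; eval.py handles batching and the JSON files):
import torch, clip
from PIL import Image

model, preprocess = clip.load('ViT-B/32', device='cpu')
image = preprocess(Image.open('example.jpg')).unsqueeze(0)        # placeholder image path
text = clip.tokenize(['a man riding a wave on a surfboard'])      # placeholder caption
with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)         # L2-normalize both features
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
clip_score = 2.5 * torch.clamp((img_feat * txt_feat).sum(-1), min=0)
print(clip_score.item())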
We also provide the code to calculate the refined CLIP R@K score on the Hard Retrieval Pool:
cd metrics/clip_Self_retrieve
python coco_process_t2i_sim.py --testfile ../../eval_results/vinvl_result.json --retrieve_set hard
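The R@K computation itself reduces to ranking the pool images by CLIP text-to-image similarity and checking whether each caption's paired image appears in the top K. A minimal sketch (feature extraction and the hard-pool construction are handled by coco_process_t2i_sim.py and omitted here):
import torch

def recall_at_k(text_feats, image_feats, k=1):
    # text_feats, image_feats: L2-normalized [N, D] tensors; row i of each is a matched caption-image pair
    sims = text_feats @ image_feats.t()                    # [N, N] cosine similarity matrix
    topk = sims.topk(k, dim=1).indices                     # top-k retrieved image indices per caption
    gt = torch.arange(text_feats.size(0)).unsqueeze(1)     # ground-truth image index for each caption
    return (topk == gt).any(dim=1).float().mean().item()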
@inproceedings{Yao2022CapEnrichEC,
  title={CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge},
  author={Linli Yao and Weijing Chen and Qin Jin},
  booktitle={{TheWebConf}},
  year={2023}
}