This is the official repository for Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning [ECCV24], mainly for the proposed framework PCM-Net.
- Salient Visual Concept Detection: For each input image, salient visual concepts are detected based on image-text similarity in CLIP space.
- Patch-wise Feature Fusion: Selectively fuses patch-wise visual features with textual features of salient concepts, creating a mixed-up feature map with reduced defects.
- Visual-Semantic Encoding: A visual-semantic encoder refines the feature map, which is then used by the sentence decoder for generating captions.
- CLIP-weighted Cross-Entropy Loss: A novel loss function prioritizes high-quality image-text pairs over low-quality ones, enhancing model training with synthetic data.
- SynthImgCap Dataset is available.
- We use OpenAI-CLIP-Feature to extract the visual CLIP features of synthetic images at training and GT real images at inference.
- META ANNO DATA and RELEASED MODELS are released in Google Driver.
Please refer to scripts/train.sh
.
Please refer to scripts/final_eval_for_paper.sh
.
If you use the SynthImgCap dataset or code or models for your research, please cite:
@inproceedings{luo2025unleashing,
title={Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning},
author={Luo, Jianjie and Chen, Jingwen and Li, Yehao and Pan, Yingwei and Feng, Jianlin and Chao, Hongyang and Yao, Ting},
booktitle={European Conference on Computer Vision (ECCV)},
pages={237--254},
year={2024},
organization={Springer}
}
This code used resources from X-Modaler Codebase and DenseCLIP code. We thank the authors for open-sourcing their awesome projects.
MIT