CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation (CVPR 2023)
-
2023/12/09
Our new paper TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training is accepted by AAAI 2024. It can generate image-level labels based on frozen CLIP and can realize annotation-free semantic segmentation without any training when combining with CLIP-ES. -
2023/2/28
Our paper is accepted by CVPR 2023.
# create conda env
conda create -n clip-es python=3.9
conda activate clip-es
# install packages
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install opencv-python ftfy regex tqdm ttach tensorboard lxml cython
# install pydensecrf from source
git clone https://github.com/lucasb-eyer/pydensecrf
cd pydensecrf
python setup.py install
Download images in PASCAL VOC2012 dataset at here and the train_aug groundtruth at here.
The structure of /your_home_dir/datasets/VOC2012
should be organized as follows:
---VOC2012/
--Annotations
--ImageSets
--JPEGImages
--SegmentationClass
--SegmentationClassAug
Download MS COCO images from the official website.
Download semantic segmentation annotations for the MS COCO dataset at here.
The structure of /your_home_dir/datasets/COCO2014
are suggested to be organized as follows:
---COCO2014/
--Annotations
--JPEGImages
-train2014
-val2014
--SegmentationClass
Download CLIP pre-trained [ViT-B/16] at here and put it to /your_home_dir/pretrained_models/clip
.
# For VOC12
CUDA_VISIBLE_DEVICES=0 python generate_cams_voc12.py --img_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --model /your_home_dir/pretrained_models/clip/ViT-B-16.pt --num_workers 1 --cam_out_dir ./output/voc12/cams
# For COCO14
CUDA_VISIBLE_DEVICES=0 python generate_cams_coco14.py --img_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --model /your_home_dir/pretrained_models/clip/ViT-B-16.pt --num_workers 1 --cam_out_dir ./output/coco14/cams
# (optional) evaluate generated CAMs
## for VOC12
python eval_cam.py --cam_out_dir ./output/voc12/cams --cam_type attn_highres --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --split_file ./voc12/train.txt
## for COCO14
python eval_cam.py --cam_out_dir ./output/coco14/cams --cam_type attn_highres --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --split_file ./coco14/train.txt
# use CRF process to generate pseudo masks
(realize confidence-guided loss by setting pixels with low confidence to 255)
## for VOC12
python eval_cam_with_crf.py --cam_out_dir ./output/voc12/cams --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --image_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --pseudo_mask_save_path ./output/voc12/pseudo_masks
## for COCO14
python eval_cam_with_crf.py --cam_out_dir ./output/coco14/cams --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --image_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --pseudo_mask_save_path ./output/coco2014/pseudo_masks
# eval CRF processed pseudo masks
## for VOC12
python eval_cam_with_crf.py --cam_out_dir ./output/voc12/cams --gt_root /your_home_dir/datasets/VOC2012/SegmentationClassAug --image_root /your_home_dir/datasets/VOC2012/JPEGImages --split_file ./voc12/train_aug.txt --eval_only
## for COCO14
python eval_cam_with_crf.py --cam_out_dir ./output/coco14/cams --gt_root /your_home_dir/datasets/COCO2014/SegmentationClass --image_root /your_home_dir/datasets/COCO2014/JPEGImages/train2014 --split_file ./coco14/train.txt --eval_only
The generated pseudo masks of VOC12 and COCO14 can be found at Google Drive.
To train DeepLab-v2, we refer to deeplab-pytorch. The ImageNet pre-trained model can be found in AdvCAM.
Method | CAMs | +CRF |
---|---|---|
CLIP-ES | 70.8 | 75.0 |
Method | Network | Pretrained | val | test |
---|---|---|---|---|
CLIP-ES | DeepLabV2 | ImageNet | 71.1 | 71.4 |
CLIP-ES | DeepLabV2 | COCO | 73.8 | 73.9 |
Method | Network | Pretrained | val |
---|---|---|---|
CLIP-ES | DeepLabV2 | ImageNet | 45.4 |
We borrowed the code from CLIP and pytorch_grad_cam. Thanks for their wonderful works.
If you find this project helpful for your research, please consider citing the following BibTeX entry.
@InProceedings{Lin_2023_CVPR,
author = {Lin, Yuqi and Chen, Minghao and Wang, Wenxiao and Wu, Boxi and Li, Ke and Lin, Binbin and Liu, Haifeng and He, Xiaofei},
title = {CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {15305-15314}
}