ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning
This repository contains the official code for "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" (CVPR 2025 Highlight).
In this paper:
- We frame synthetic images as standalone knowledge repositories and present a CLIP adaptation methodology that pretrains on purely synthetic images before fine-tuning for few-shot tasks. This marks a clear departure from existing one-stage fine-tuning methods that simply treat synthetic images as complements to real images.
- We propose an improved Self-SL method based on DINO, specifically tailored for FSL. It introduces higher-order moments for image representation and employs synthetic augmentation for effective view construction (see the illustrative sketch after this list).
- We develop a systematic and scalable pipeline for synthesizing both captions and images, enabling the generation of large-scale base sets for pretraining as well as task-specific datasets. Distinct from existing works, we leverage chain-of-thought and in-context learning techniques for diverse, realistic image generation.
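As a rough illustration of what "higher-order moments" can mean in this context, the sketch below computes a second-order (covariance) representation from ViT patch tokens. It is a minimal, self-contained example with hypothetical names and does not reproduce the method implemented in this repository.

```python
# Illustrative sketch only: second-order (covariance) pooling is one common
# instance of a higher-order moment representation. Names are hypothetical
# and do not mirror this repository's code.
import torch

def second_order_pooling(patch_tokens: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (B, N, D) patch features from a ViT backbone.
    Returns: (B, D * D) flattened second-order statistics."""
    mean = patch_tokens.mean(dim=1, keepdim=True)                    # (B, 1, D)
    centered = patch_tokens - mean                                   # (B, N, D)
    cov = centered.transpose(1, 2) @ centered / centered.shape[1]    # (B, D, D)
    return cov.flatten(1)                                            # (B, D*D)

# Random features standing in for CLIP ViT-B/16 patch tokens
feats = torch.randn(4, 196, 768)
print(second_order_pooling(feats).shape)  # torch.Size([4, 589824])
```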
git clone https://github.com/HaoyuanYang-2023/ImagineFSL.git
cd ImagineFSL
⚠️ To ensure stable and reproducible code execution, we strongly recommend setting up the following environment for experiments.
We conduct experiments using PyTorch 2.2.2 and Python 3.10. The CUDA version is 12.1. Install the corresponding PyTorch environment using:
pip install torch==2.2.2 torchvision==0.17.2 --index-url https://download.pytorch.org/whl/cu121
Install other dependencies using:
pip install -r requirements.txt
Note: We use Meta's xformers library to accelerate attention computation. Different hardware environments may require different versions of xformers. The installation command provided in requirements.txt has been validated on RTX 4090 and 3090 GPUs. If installation fails, try a different version. For more information, refer to the official website of xformers.
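To quickly confirm that xformers works in your environment, you can run a small sanity check such as the one below; the tensor shapes are arbitrary and chosen only for illustration.

```python
# Sanity check that xformers' memory-efficient attention runs on your GPU.
# Shapes are arbitrary; this only verifies the installation.
import torch
import xformers.ops as xops

q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([1, 128, 8, 64])
```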
Alternatively, you can use our provided Docker image, which contains all the required environments and dependencies for running the program.
To get the image, run the following command:
docker pull haoyuanyang2001/imaginefsl:v1
- iBase Dataset:
  The iBase dataset used for pretraining can be downloaded from the following links:
- 10 Downstream Datasets (Real Images):
  We provide download links for the 10 downstream datasets used in our experiments (ImageNet excluded). These datasets are identical to those provided by CoOp, but with standardized file organization for PyTorch compatibility (see the loading sketch below).
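Assuming the standardized organization follows the usual class-per-subfolder layout understood by torchvision (the path below is a placeholder), a downstream dataset can be loaded roughly as follows:

```python
# Minimal loading sketch. Assumes a class-per-subfolder layout; the path and
# preprocessing are placeholders, not the exact pipeline used in this repository.
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("path/to/downstream_dataset/train", transform=preprocess)
print(len(dataset), dataset.classes[:5])
```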
Run the following command to enter the directory for synthesizing captions and images:
cd synthesizing
We query GPT-4 to analyze key factors for different datasets. You need to register an account on OpenAI and obtain an API key for GPT-4. For more details, refer to the OpenAI API documentation.
Run the following command to analyze attributes:
python syn_attribute.py \
    --api_key YOUR_API_KEY \
    --model gpt-4 \
    --dataset DATASET_NAME
Run the following command to analyze background (BG):
python syn_background.py \
    --api_key YOUR_API_KEY \
    --model gpt-4 \
    --dataset DATASET_NAME
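Both scripts above query GPT-4 through the OpenAI API. Purely for illustration, a minimal chat-completions call with the official openai Python client looks roughly like the sketch below; the prompt is a stand-in and does not reproduce the prompts used in syn_attribute.py or syn_background.py.

```python
# Illustration only: a bare-bones GPT-4 query with the official `openai` client.
# The prompt is a placeholder, not this repository's actual prompt.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")
dataset_name = "DATASET_NAME"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"List visual attributes that help distinguish classes in the {dataset_name} dataset.",
    }],
    temperature=0.7,
)
print(response.choices[0].message.content)
```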
We also provide the factors of viewpoint, lighting condition (LC), and cause of degradation of photos (CD) in the synthesizing/utils folder.
Run the following command to synthesize high-quality exemplary captions for different datasets:
python syn_examples.py \
    --api_key YOUR_API_KEY \
    --model gpt-4 \
    --dataset DATASET_NAME
We use Llama 3 8B to synthesize extensive captions. The weight files of Llama 3 8B can be downloaded here.
You need to install additional dependencies required for Llama 3:
fire==0.3.0
fairscale==0.4.13
tiktoken==0.7.0
blobfile==0.3.0
Run the following command to synthesize extensive captions for different datasets:
LLAMA_FOLDER=YOUR_LLAMA_WEIGHT_FILE_FOLDER
torchrun --nproc_per_node 1 --master_port 12388 \
syn_captions.py --ckpt_dir ${LLAMA_FOLDER} --tokenizer_path ${LLAMA_FOLDER}/tokenizer.model \
--max_batch_size 16 --max_seq_len 400 --max_gen_len 100 \
--total_captions 300 --seed 0 --category DATASET_NAME --temperature 0.8
We use Stable Diffusion 3 Medium, accelerated by TensorRT, to synthesize images. Refer to the code provided by NVIDIA for details.
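For a quick, unaccelerated illustration of turning a synthesized caption into an image, the plain diffusers pipeline for Stable Diffusion 3 Medium can be used as sketched below. This is not the TensorRT-accelerated setup referenced above, and the caption is only an example.

```python
# Unaccelerated illustration only: generating one image from a synthesized caption
# with the plain `diffusers` SD3 pipeline (not the TensorRT-accelerated code).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

caption = "a photo of a golden retriever lying on grass, natural lighting"  # example caption
image = pipe(caption, num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("synthetic_sample.png")
```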
Run the following command for pretraining:
sh run_pretrain.sh
You need to specify the hyperparameters for pretraining in the config files in the dinov2/config/train folder.
We provide download links for the pretrained model weights of CLIP ViT-B/16 and CLIP ViT-L/14:
- CLIP ViT-B/16: Baidu Yun | Google Drive
- CLIP ViT-L/14: Baidu Yun | Google Drive
ImagineFSL:
Run the following command for ImagineFSL fine-tuning:
sh run_imaginefsl.sh
You need to set the hyperparameters for fine-tuning in dinov2/eval/imgainefsl_tuning_pipline.py and the dataset path in dinov2/eval/ct_tuning_mixing.py first.
For evaluation, run the following command:
sh run_imaginefsl_eval.sh
ImagineFSL_LoRA:
Run the following command for ImagineFSL_LoRA fine-tuning:
sh run_imaginefsl_lora.sh
You need to set the hyperparameters for fine-tuning in dinov2/eval/imgainefsl_lora_tuning_pipline.py and the dataset path in dinov2/eval/ct_tuning_mixing_lora.py first.
For evaluation, run the following command:
sh run_imaginefsl_lora_eval.sh
Note: Due to randomness during training, results on individual datasets may differ slightly from those in the paper. We recommend evaluating all methods across all 11 datasets and comparing the average performance.
Models:
We provide download links for fine-tuned models on 1-/16-shot settings for ViT-B/16 across 11 datasets:
| Method | 1-shot | 16-shot |
|---|---|---|
| ImagineFSL | 76.1 (Baidu Yun \| Google Drive) | 86.4 (Baidu Yun \| Google Drive) |
| ImagineFSL_LoRA | 77.6 (Baidu Yun \| Google Drive) | 87.6 (Baidu Yun \| Google Drive) |
Detailed results for all K-shot settings can be found here.
- We thank the authors of CLIP and DINOv2. This repository is built upon the official implementations of CLIP and DINOv2.
- We are also grateful to the authors of CoOp for providing dataset instructions, DISEF for their codebase, and SynCLR for their codebase.
- We further acknowledge the contributions of other researchers who have made their code publicly available.
If this repository or the paper "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" is helpful for your research, please consider citing the paper:
@InProceedings{ImagineFSL_CVPR25,
author = {Yang, Haoyuan and Li, Xiaoou and Lv, Jiaming and Cheng, Xianjun and Wang, Qilong and Li, Peihua},
title = {{ImagineFSL}: Self-Supervised Pretraining Matters on Imagined Base Set for {VLM}-based Few-shot Learning},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2025},
}
If you have any questions or suggestions, please contact us:
- Haoyuan Yang (yanghaoyuan@mail.dlut.edu.cn)