ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning

Introduction

This repository contains the official code for "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" (CVPR 2025 Highlight).

In this paper:

  • We frame synthetic images as standalone knowledge repositories and present a CLIP adaptation methodology that pretrains on purely synthetic images before fine-tuning for few-shot tasks. This marks a clear departure from existing one-stage fine-tuning methods that simply treat synthetic images as complements to real images.

  • We propose an improved self-supervised learning (Self-SL) method based on DINO, specifically tailored for FSL. It introduces higher-order moments for image representation and employs synthetic augmentation for effective view construction (a minimal illustrative sketch of moment-based pooling follows this list).

  • We develop a systematic and scalable pipeline for synthesizing both captions and images, enabling generation of large-scale base sets for pretraining as well as task-specific datasets. Distinct from existing approaches, we leverage chain-of-thought and in-context learning techniques for diverse, realistic image generation.
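
For intuition, here is a minimal sketch of pooling ViT patch-token features with higher-order moment statistics instead of plain mean pooling. This is our own illustration, not the paper's exact formulation:

# Illustrative sketch: augmenting mean pooling with second- and third-order
# moment statistics of patch-token features (not the paper's exact formulation).
import torch

def moment_pooling(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (N, D) patch-token features from a ViT backbone."""
    mean = tokens.mean(dim=0)                                        # 1st-order moment
    centered = tokens - mean
    var = centered.pow(2).mean(dim=0)                                # 2nd-order moment (diagonal)
    skew = centered.pow(3).mean(dim=0) / (var.sqrt().pow(3) + 1e-6)  # 3rd-order moment (skewness)
    return torch.cat([mean, var, skew], dim=0)                       # (3 * D,) image representation

tokens = torch.randn(196, 768)       # e.g., 14x14 patches from a ViT-B/16 backbone
print(moment_pooling(tokens).shape)  # torch.Size([2304])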

Installation

1. Clone this repository:

git clone https://github.com/HaoyuanYang-2023/ImagineFSL.git
cd ImagineFSL

2. Install dependencies:

⚠️ To ensure stable and reproducible code execution, we strongly recommend setting up the following environment for experiments.

We conduct experiments using PyTorch 2.2.2 and Python 3.10. The CUDA version is 12.1. Install the corresponding PyTorch environment using:

pip install torch==2.2.2 torchvision==0.17.2 --index-url https://download.pytorch.org/whl/cu121

Install other dependencies using:

pip install -r requirements.txt

Note: We use Meta's xformers library to accelerate attention computation. Different hardware environments may require different versions of xformers. The installation command is provided in requirements.txt and is validated on RTX 4090 and 3090 GPUs. If installation fails, try a different version. For more information, refer to the official website of xformers.
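
As a quick sanity check (our suggestion, not part of the original setup steps), you can verify the installed versions and GPU visibility from Python:

# Quick sanity check: verify PyTorch / torchvision / xformers versions and GPU visibility.
import torch
import torchvision
import xformers

print("torch:", torch.__version__)              # expected: 2.2.2+cu121
print("torchvision:", torchvision.__version__)  # expected: 0.17.2+cu121
print("xformers:", xformers.__version__)
print("CUDA available:", torch.cuda.is_available())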

Alternatively, you can use our Docker image, which contains all required dependencies for running the code.

To get the image, run the following command:

docker pull haoyuanyang2001/imaginefsl:v1

Dataset

  • iBase Dataset:

    The iBase dataset used for pretraining can be downloaded from the following links:

    Baidu Yun | Microsoft OneDrive

  • 10 Downstream Datasets (Real Images):

    We provide download links for the 10 downstream datasets used in our experiments (all except ImageNet). These datasets are identical to those provided by CoOp, but with standardized file organization for PyTorch compatibility (see the loading sketch after this list).

    Baidu Yun | Microsoft OneDrive
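
As an illustration of the standardized organization, each split can be loaded with torchvision's ImageFolder, assuming the usual one-sub-folder-per-class layout (the path below is a placeholder; adjust it to your download location):

# Loading sketch, assuming one sub-folder per class (the standard ImageFolder layout).
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Placeholder path: point this at the train split of any downloaded dataset.
dataset = datasets.ImageFolder(root="data/oxford_pets/train", transform=transform)
print(len(dataset), "images,", len(dataset.classes), "classes")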

Getting started

1. Synthesizing Captions & Images

Run the following command to enter the directory for synthesizing captions and images:

cd synthesizing

Querying GPT-4 to Analyze Key Factors

We query GPT-4 to analyze key factors for different datasets. You need to register an OpenAI account and obtain an API key for GPT-4. For more details, refer to the OpenAI API documentation.

Run the following command to analyze attributes:

python syn_attribute.py \
--api_key YOUR_API_KEY \
--model gpt-4 \
--dataset DATASET_NAME

Run the following command to analyze background (BG):

python syn_background.py \
--api_key YOUR_API_KEY \
--model gpt-4 \
--dataset DATASET_NAME

We also provide the factors of viewpoint, lighting condition (LC), and cause of photo degradation (CD) in the synthesizing/utils folder.
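
For reference, the kind of query these scripts issue looks roughly like the following (a minimal sketch using the openai Python client; the prompt wording is our own illustration, and the actual prompts are defined in the scripts above):

# Minimal sketch of a GPT-4 key-factor query (illustrative prompt; the actual
# prompts are defined in syn_attribute.py / syn_background.py).
from openai import OpenAI  # requires the openai package (>= 1.0)

client = OpenAI(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user",
         "content": "List 20 visual attributes that help distinguish pet breeds in photographs."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)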

Synthesize Exemplary Captions by GPT-4

Run the following command to synthesize high-quality exemplary captions for different datasets:

python syn_examples.py \
--api_key YOUR_API_KEY \
--model gpt-4 \
--dataset DATASET_NAME

Synthesize Extensive Captions by Llama 3

We use Llama 3 8B to synthesize extensive captions. The weight files of Llama 3 8B can be downloaded here.

You need to install additional dependencies required for Llama 3:

fire==0.3.0
fairscale==0.4.13
tiktoken==0.7.0
blobfile==0.3.0

Run the following command to synthesize extensive captions for different datasets:

LLAMA_FOLDER=YOUR_LLAMA_WEIGHT_FILE_FOLDER

torchrun --nproc_per_node 1 --master_port 12388 \
    syn_captions.py --ckpt_dir ${LLAMA_FOLDER} --tokenizer_path ${LLAMA_FOLDER}/tokenizer.model \
    --max_batch_size 16 --max_seq_len 400 --max_gen_len 100 \
    --total_captions 300 --seed 0 --category DATASET_NAME --temperature 0.8

Synthesize Images

We use Stable Diffusion 3 Medium accelerated by TensorRT to synthesize images. Refer to the code provided by NVIDIA for details.
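
If a TensorRT setup is not available, a plain diffusers pipeline can serve as a slower reference for turning synthesized captions into images. This is a minimal sketch assuming the diffusers package and access to the SD3 Medium weights on Hugging Face; it is not the TensorRT-accelerated pipeline used in our experiments:

# Reference sketch with the standard diffusers SD3 pipeline (slower than the
# TensorRT-accelerated code used in our experiments).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

caption = "a photo of a golden retriever lying on grass, soft morning light"  # example caption
image = pipe(caption, num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("synthetic_sample.png")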


2. Pretraining

Run the following command for pretraining:

sh run_pretrain.sh

You need to specify the hyperparameters for pretraining in the config files in the dinov2/config/train folder.

We provide download links for the pretrained model weights of CLIP ViT-B/16 and CLIP ViT-L/14:


3. Few-shot Fine-tuning

ImagineFSL:

Run the following command for ImagineFSL fine-tuning:

sh run_imaginefsl.sh

You need to set the fine-tuning hyperparameters in dinov2/eval/imgainefsl_tuning_pipline.py and the dataset path in dinov2/eval/ct_tuning_mixing.py first.

For evaluation, run the following command:

sh run_imaginefsl_eval.sh

ImagineFSL_LoRA:

Run the following command for ImagineFSL_LoRA fine-tuning:

sh run_imaginefsl_lora.sh

You need to set the fine-tuning hyperparameters in dinov2/eval/imgainefsl_lora_tuning_pipline.py and the dataset path in dinov2/eval/ct_tuning_mixing_lora.py first.

For evaluation, run the following command:

sh run_imaginefsl_lora_eval.sh

Note: Due to randomness during training, results on individual datasets may differ slightly from those reported in the paper. We recommend evaluating all methods across all 11 datasets and comparing the average performance.

Models:

We provide download links for ViT-B/16 models fine-tuned under the 1-shot and 16-shot settings across 11 datasets:

Method            1-shot                            16-shot
ImagineFSL        76.1 (Baidu Yun | Google Drive)   86.4 (Baidu Yun | Google Drive)
ImagineFSL_LoRA   77.6 (Baidu Yun | Google Drive)   87.6 (Baidu Yun | Google Drive)
See readme.txt at the above links for more details on the models and the hyperparameters for inference.

Detailed results for all K-shot settings can be found here.

Acknowledgement

  • We thank the authors of CLIP and DINOv2. This repository is built upon the official implementations of CLIP and DINOv2.

  • We are also grateful to the authors of CoOp for providing dataset instructions, and to the authors of DISEF and SynCLR for their codebases.

  • We further acknowledge the contributions of other researchers who have made their code publicly available.

Citation

If this repository or the paper "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" is helpful for your research, please consider citing the paper:

@InProceedings{ImagineFSL_CVPR25,
    author    = {Yang, Haoyuan and Li, Xiaoou and Lv, Jiaming and Cheng, Xianjun and Wang, Qilong and Li, Peihua},
    title     = {{ImagineFSL}: Self-Supervised Pretraining Matters on Imagined Base Set for {VLM}-based Few-shot Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year      = {2025},
}

Contact

If you have any questions or suggestions, please contact us:
