ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning

Introduction

This repository contains the official code for "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" (CVPR 2025 Highlight).

In this paper:

  • We frame synthetic images as standalone knowledge repositories and present a CLIP adaptation methodology that pretrains on purely synthetic images before fine-tuning for few-shot tasks. This marks a clear departure from existing one-stage fine-tuning methods that simply treat synthetic images as complements to real images.

  • We propose an improved self-supervised learning (Self-SL) method based on DINO, specifically tailored for FSL. It introduces higher-order moments for image representation and employs synthetic augmentation for effective view construction (a minimal illustrative sketch of moment-based pooling follows this list).

  • We develop a systematic and scalable pipeline for synthesizing both captions and images, enabling generation of large-scale base sets for pretraining as well as task-specific datasets. Distinct from existing approaches, we leverage chain-of-thought and in-context learning techniques for diverse, realistic image generation.
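
For intuition, here is a minimal sketch of pooling ViT patch-token features with higher-order moment statistics instead of plain mean pooling. This is our own illustration, not the paper's exact formulation:

# Illustrative sketch: augmenting mean pooling with second- and third-order
# moment statistics of patch-token features (not the paper's exact formulation).
import torch

def moment_pooling(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (N, D) patch-token features from a ViT backbone."""
    mean = tokens.mean(dim=0)                                        # 1st-order moment
    centered = tokens - mean
    var = centered.pow(2).mean(dim=0)                                # 2nd-order moment (diagonal)
    skew = centered.pow(3).mean(dim=0) / (var.sqrt().pow(3) + 1e-6)  # 3rd-order moment (skewness)
    return torch.cat([mean, var, skew], dim=0)                       # (3 * D,) image representation

tokens = torch.randn(196, 768)       # e.g., 14x14 patches from a ViT-B/16 backbone
print(moment_pooling(tokens).shape)  # torch.Size([2304])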

Installation

1. Clone this repository:

git clone https://github.com/HaoyuanYang-2023/ImagineFSL.git
cd ImagineFSL

2. Install dependencies:

⚠️ To ensure stable and reproducible code execution, we strongly recommend setting up the following environment for experiments.

We conduct experiments using PyTorch 2.2.2 and Python 3.10. The CUDA version is 12.1. Install the corresponding PyTorch environment using:

pip install torch==2.2.2 torchvision==0.17.2 --index-url https://download.pytorch.org/whl/cu121

Install other dependencies using:

pip install -r requirements.txt

Note: We use Meta's xformers library to accelerate attention computation. Different hardware environments may require different versions of xformers. The installation command is provided in requirements.txt and is validated on RTX 4090 and 3090 GPUs. If installation fails, try a different version. For more information, refer to the official website of xformers.
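
As a quick sanity check (our suggestion, not part of the original setup steps), you can verify the installed versions and GPU visibility from Python:

# Quick sanity check: verify PyTorch / torchvision / xformers versions and GPU visibility.
import torch
import torchvision
import xformers

print("torch:", torch.__version__)              # expected: 2.2.2+cu121
print("torchvision:", torchvision.__version__)  # expected: 0.17.2+cu121
print("xformers:", xformers.__version__)
print("CUDA available:", torch.cuda.is_available())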

Alternatively, you can use our Docker image, which contains all required dependencies for running the code.

To get the image, run the following command:

docker pull haoyuanyang2001/imaginefsl:v1

Dataset

  • iBase Dataset:

    The iBase dataset used for pretraining can be downloaded from the following links:

    Baidu Yun | Microsoft OneDrive

  • 10 Downstream Datasets (Real Images):

    We provide download links for the 10 downstream datasets used in our experiments (all except ImageNet). These datasets are identical to those provided by CoOp, but with standardized file organization for PyTorch compatibility (see the loading sketch after this list).

    Baidu Yun | Microsoft OneDrive
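
As an illustration of the standardized organization, each split can be loaded with torchvision's ImageFolder, assuming the usual one-sub-folder-per-class layout (the path below is a placeholder; adjust it to your download location):

# Loading sketch, assuming one sub-folder per class (the standard ImageFolder layout).
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Placeholder path: point this at the train split of any downloaded dataset.
dataset = datasets.ImageFolder(root="data/oxford_pets/train", transform=transform)
print(len(dataset), "images,", len(dataset.classes), "classes")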

Getting started

1. Synthesizing Captions & Images

Run the following command to enter the directory for synthesizing captions and images:

cd synthesizing

Querying GPT-4 to Analyze Key Factors

We query GPT-4 to analyze key factors for different datasets. You need to register an OpenAI account and obtain an API key for GPT-4. For more details, refer to the OpenAI API documentation.

Run the following command to analyze attributes:

python syn_attribute.py \
--api_key YOUR_API_KEY \
--model gpt-4 \
--dataset DATASET_NAME

Run the following command to analyze background (BG):

python syn_background.py \
--api_key YOUR_API_KEY \
--model gpt-4 \
--dataset DATASET_NAME

We also provide the factors of viewpoint, lighting condition (LC), and cause of photo degradation (CD) in the synthesizing/utils folder.
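
For reference, the kind of query these scripts issue looks roughly like the following (a minimal sketch using the openai Python client; the prompt wording is our own illustration, and the actual prompts are defined in the scripts above):

# Minimal sketch of a GPT-4 key-factor query (illustrative prompt; the actual
# prompts are defined in syn_attribute.py / syn_background.py).
from openai import OpenAI  # requires the openai package (>= 1.0)

client = OpenAI(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user",
         "content": "List 20 visual attributes that help distinguish pet breeds in photographs."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)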

Synthesize Exemplary Captions by GPT-4

Run the following command to synthesize high-quality exemplary captions for different datasets:

python syn_examples.py \
--api_key YOUR_API_KEY \
--model gpt-4 \
--dataset DATASET_NAME

Synthesize Extensive Captions by Llama 3

We use Llama 3 8B to synthesize extensive captions. The weight files of Llama 3 8B can be downloaded here.

You need to install additional dependencies required for Llama 3:

fire==0.3.0
fairscale==0.4.13
tiktoken==0.7.0
blobfile==0.3.0

Run the following command to synthesize extensive captions for different datasets:

LLAMA_FOLDER=YOUR_LLAMA_WEIGHT_FILE_FOLDER

torchrun --nproc_per_node 1 --master_port 12388 \
    syn_captions.py --ckpt_dir ${LLAMA_FOLDER} --tokenizer_path ${LLAMA_FOLDER}/tokenizer.model \
    --max_batch_size 16 --max_seq_len 400 --max_gen_len 100 \
    --total_captions 300 --seed 0 --category DATASET_NAME --temperature 0.8

Synthesize Images

We use Stable Diffusion 3 Medium accelerated by TensorRT to synthesize images. Refer to the code provided by NVIDIA for details.
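
If a TensorRT setup is not available, a plain diffusers pipeline can serve as a slower reference for turning synthesized captions into images. This is a minimal sketch assuming the diffusers package and access to the SD3 Medium weights on Hugging Face; it is not the TensorRT-accelerated pipeline used in our experiments:

# Reference sketch with the standard diffusers SD3 pipeline (slower than the
# TensorRT-accelerated code used in our experiments).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

caption = "a photo of a golden retriever lying on grass, soft morning light"  # example caption
image = pipe(caption, num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("synthetic_sample.png")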


2. Pretraining

Run the following command for pretraining:

sh run_pretrain.sh

You need to specify the hyperparameters for pretraining in the config files in the dinov2/config/train folder.

We provide download links for the pretrained model weights of CLIP ViT-B/16 and CLIP ViT-L/14:


3. Few-shot Fine-tuning

ImagineFSL:

Run the following command for ImagineFSL fine-tuning:

sh run_imaginefsl.sh

You need to set the fine-tuning hyperparameters in dinov2/eval/imgainefsl_tuning_pipline.py and the dataset path in dinov2/eval/ct_tuning_mixing.py first.

For evaluation, run the following command:

sh run_imaginefsl_eval.sh

ImagineFSL_LoRA:

Run the following command for ImagineFSL_LoRA fine-tuning:

sh run_imaginefsl_lora.sh

You need to set the fine-tuning hyperparameters in dinov2/eval/imgainefsl_lora_tuning_pipline.py and the dataset path in dinov2/eval/ct_tuning_mixing_lora.py first.

For evaluation, run the following command:

sh run_imaginefsl_lora_eval.sh

Note: Due to randomness during training, results on individual datasets may differ slightly from those reported in the paper. We recommend evaluating all methods across all 11 datasets and comparing the average performance.

Models:

We provide download links for ViT-B/16 models fine-tuned under the 1-shot and 16-shot settings across 11 datasets:

Method            1-shot                            16-shot
ImagineFSL        76.1 (Baidu Yun | Google Drive)   86.4 (Baidu Yun | Google Drive)
ImagineFSL_LoRA   77.6 (Baidu Yun | Google Drive)   87.6 (Baidu Yun | Google Drive)
See readme.txt at the above links for more details on the models and the hyperparameters for inference.

Detailed results for all K-shot settings can be found here.

Acknowledgement

  • We thank the authors of CLIP and DINOv2. This repository is built upon the official implementations of CLIP and DINOv2.

  • We are also grateful to the authors of CoOp for providing dataset instructions, and to the authors of DISEF and SynCLR for their codebases.

  • We further acknowledge the contributions of other researchers who have made their code publicly available.

Citation

If this repository or the paper "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" is helpful for your research, please consider citing the paper:

@InProceedings{ImagineFSL_CVPR25,
    author    = {Yang, Haoyuan and Li, Xiaoou and Lv, Jiaming and Cheng, Xianjun and Wang, Qilong and Li, Peihua},
    title     = {{ImagineFSL}: Self-Supervised Pretraining Matters on Imagined Base Set for {VLM}-based Few-shot Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year      = {2025},
}

Contact

If you have any questions or suggestions, please contact us:
