
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Xiaoxing Hu¹,²*, Kaicheng Yang³*, Ziyang Gong¹, Qi Ming⁴, Zonghao Guo⁵, Xiang An³, Ziyong Feng³, Junchi Yan¹, Xue Yang¹†

¹ Shanghai Jiao Tong University  ² Beijing Institute of Technology  ³ DeepGlint
⁴ Beijing University of Technology  ⁵ Tsinghua University
* Equal contribution  † Corresponding author

If you find our work helpful, please consider giving us a ⭐!

Official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder.

Notice

This repository is still being organized and refined. If you encounter any issues while using it, please contact us (Email: xiaoxinghhh@gmail.com, WeChat: 15111480307) or submit an issue. Thank you for your attention.

TODO

  • Training and validation instruction
  • Paper Link
  • Model Weights

📢 News

  • [2025-10-20] We have released the model weights; please check the Model Zoo for details.
  • [2025-10-22] We have updated the paper; please check arXiv for details.

📖 Introduction

This repository contains the official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder. We introduce a progressive vision-language alignment approach that aligns an LLM-based embedder with the CLIP image encoder in a curriculum-learning manner, enhancing long-text, multilingual, and fine-grained understanding.

Paper Link: arXiv

Model Zoo: HuggingFace

👁️ Methodology

  • Stage 1: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
  • Stage 2: Align the LLM-based embedder with the CLIP image encoder under Self-Distillation Regularization (a schematic sketch of both stages is shown below).
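
The snippet below is a minimal schematic sketch of the two-stage objective, not the repository's actual training code: all function and argument names (llm_embedder, clip_text_encoder, stage1_embedder, etc.) are illustrative assumptions, and the exact loss formulations may differ from the released implementation.

import torch
import torch.nn.functional as F

def stage1_cross_architecture_distillation(llm_embedder, clip_text_encoder, texts_llm, texts_clip):
    # Stage 1 (sketch): pull LLM-based text embeddings toward the frozen CLIP text embeddings.
    with torch.no_grad():
        teacher = F.normalize(clip_text_encoder(texts_clip), dim=-1)   # frozen teacher
    student = F.normalize(llm_embedder(texts_llm), dim=-1)             # trainable student
    return 1.0 - (student * teacher).sum(dim=-1).mean()                # cosine distillation loss

def stage2_image_alignment(llm_embedder, stage1_embedder, clip_image_encoder, texts, images, tau=0.07):
    # Stage 2 (sketch): contrastive image-text alignment plus a self-distillation
    # term that keeps the student close to its frozen Stage-1 checkpoint.
    txt = F.normalize(llm_embedder(texts), dim=-1)
    img = F.normalize(clip_image_encoder(images), dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    with torch.no_grad():
        ref = F.normalize(stage1_embedder(texts), dim=-1)              # frozen Stage-1 reference
    self_distill = 1.0 - (txt * ref).sum(dim=-1).mean()
    return contrastive + self_distill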

🛠️ Requirements

  • Python >= 3.9
  • CUDA >= 11.8 (if using GPU)
  • Other dependencies in requirements.txt

🚀 Installation

  • Clone this repository and install dependencies:
# Clone the repo
git clone https://github.com/VisionXLab/ProCLIP.git
cd ProCLIP

# Create virtual environment
conda create -n proclip python=3.9 -y
conda activate proclip
# Install dependencies
pip install -r requirements.txt
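
As a quick sanity check after installation (an illustrative helper, not part of this repository), you can confirm that PyTorch sees your GPU:

# verify_env.py -- illustrative post-install check
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))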

Training

Coming soon.

Evaluation

Coming soon.

📊 Results

Retrieval Results


Classification Results


Multilingual Retrieval Results


Comparison with other LLM-embedder-based CLIP models


  • More results can be found in the paper.

📜 Citation

If you find our work helpful, please cite our paper:

@article{hu2025proclip,
  title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
  author={Hu, Xiaoxing and Yang, Kaicheng and Feng, Ziyong and Ming, Qi and Guo, Zonghao and An, Xiang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2510.18795},
  year={2025}
}
@inproceedings{hu2025decoupled,
  title={Decoupled global-local alignment for improving compositional understanding},
  author={Hu, Xiaoxing and Yang, Kaicheng and Wang, Jun and Xu, Haoran and Feng, Ziyong and Wang, Yupei},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={3251--3260},
  year={2025}
}

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙌 Acknowledgments

Our work is inspired by LLM2CLIP and CLIP. We are grateful for their outstanding work and code.
