
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Xiaoxing Hu¹,²*, Kaicheng Yang³*, Ziyang Gong¹, Qi Ming⁴, Zonghao Guo⁵, Xiang An³, Ziyong Feng³, Junchi Yan¹, Xue Yang¹†

¹ Shanghai Jiao Tong University  ² Beijing Institute of Technology  ³ DeepGlint
⁴ Beijing University of Technology  ⁵ Tsinghua University
* Equal contribution  † Corresponding author

If you find our work helpful, please consider giving us a ⭐!

Official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder.

Notice

This repository is still being organized and refined. If you encounter any issues while using it, please contact us (Email: xiaoxinghhh@gmail.com, WeChat: 15111480307) or submit an issue. Thank you for your attention.

TODO

  • Training and validation instruction
  • Paper Link
  • Model Weights

📢 News

  • [2025-10-20] We have released the model weights; please check the Model Zoo for details.
  • [2025-10-22] We have updated the paper; please check arXiv for details.

📖 Introduction

This repository contains the official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder. We introduce a progressive vision-language alignment approach that aligns an LLM-based embedder with the CLIP image encoder in a curriculum-learning manner, enhancing long-text, multilingual, and fine-grained understanding.

Paper Link: arXiv

Model Zoo: HuggingFace

👁️ Methodology

  • Stage 1: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
  • Stage 2: Align the LLM-based embedder with the CLIP image encoder under Self-Distillation Regularization (a schematic sketch of both stages is shown below).
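
The snippet below is a minimal schematic sketch of the two-stage objective, not the repository's actual training code: all function and argument names (llm_embedder, clip_text_encoder, stage1_embedder, etc.) are illustrative assumptions, and the exact loss formulations may differ from the released implementation.

import torch
import torch.nn.functional as F

def stage1_cross_architecture_distillation(llm_embedder, clip_text_encoder, texts_llm, texts_clip):
    # Stage 1 (sketch): pull LLM-based text embeddings toward the frozen CLIP text embeddings.
    with torch.no_grad():
        teacher = F.normalize(clip_text_encoder(texts_clip), dim=-1)   # frozen teacher
    student = F.normalize(llm_embedder(texts_llm), dim=-1)             # trainable student
    return 1.0 - (student * teacher).sum(dim=-1).mean()                # cosine distillation loss

def stage2_image_alignment(llm_embedder, stage1_embedder, clip_image_encoder, texts, images, tau=0.07):
    # Stage 2 (sketch): contrastive image-text alignment plus a self-distillation
    # term that keeps the student close to its frozen Stage-1 checkpoint.
    txt = F.normalize(llm_embedder(texts), dim=-1)
    img = F.normalize(clip_image_encoder(images), dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    with torch.no_grad():
        ref = F.normalize(stage1_embedder(texts), dim=-1)              # frozen Stage-1 reference
    self_distill = 1.0 - (txt * ref).sum(dim=-1).mean()
    return contrastive + self_distill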

🛠️ Requirements

  • Python >= 3.9
  • CUDA >= 11.8 (if using GPU)
  • Other dependencies in requirements.txt

🚀 Installation

  • Clone this repository and install dependencies:
# Clone the repo
git clone https://github.com/VisionXLab/ProCLIP.git
cd ProCLIP

# Create virtual environment
conda create -n proclip python=3.9 -y
conda activate proclip
# Install dependencies
pip install -r requirements.txt
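
As a quick sanity check after installation (an illustrative helper, not part of this repository), you can confirm that PyTorch sees your GPU:

# verify_env.py -- illustrative post-install check
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))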

Training

Coming soon.

Evaluation

Coming soon.

📊 Results

Retrieval Results


Classification Results


Multilingual Retrieval Results


Comparison with other LLM-embedder-based CLIP models


  • More results can be found in the paper.

📜 Citation

If you find our work helpful, please cite our paper:

@article{hu2025proclip,
  title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
  author={Hu, Xiaoxing and Yang, Kaicheng and Feng, Ziyong and Ming, Qi and Guo, Zonghao and An, Xiang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2510.18795},
  year={2025}
}
@inproceedings{hu2025decoupled,
  title={Decoupled global-local alignment for improving compositional understanding},
  author={Hu, Xiaoxing and Yang, Kaicheng and Wang, Jun and Xu, Haoran and Feng, Ziyong and Wang, Yupei},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={3251--3260},
  year={2025}
}

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙌 Acknowledgments

Our work is inspired by LLM2CLIP and CLIP. We are grateful for their outstanding work and code.
