Xiaoxing Hu1,2* Kaicheng Yang3* Ziyang Gong1 Qi Ming4 Zonghao Guo5 Xiang An3 Ziyong Feng3 Junchi Yan1 Xue Yang1†
4 Beijing University of Technology 5 Tsinghua University
* Equal contribution † Corresponding author
If you find our work helpful, please consider giving us a ⭐!
Official PyTorch implementation of [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://arxiv.org/abs/2510.18795).
This repository is still being organized and refined. If you encounter any issues while using it, please contact us (Email: xiaoxinghhh@gmail.com | WeChat: 15111480307) or submit an issue. Thank you for your attention.
- Training and validation instructions
- Paper Link
- Model Weights
- [2025-10-20] We have released the model weights; please check the Model Zoo for details (a hypothetical loading sketch follows this list).
- [2025-10-22] We have updated the paper; please check arXiv for details.
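As an illustration only: if the released weights are standard PyTorch checkpoints (an assumption on our part; the actual file names and checkpoint layout are defined by the Model Zoo entries), loading one could look like this:

```python
# Hypothetical loading sketch -- "proclip_stage2.pt" and the checkpoint
# layout are assumptions; consult the Model Zoo for the real artifacts.
import torch

state_dict = torch.load("proclip_stage2.pt", map_location="cpu")
print(f"checkpoint contains {len(state_dict)} tensors")
```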
This repository contains the official PyTorch implementation of [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://arxiv.org/abs/2510.18795). We introduce a progressive vision-language alignment approach that aligns the LLM-based embedder with the CLIP image encoder in a curriculum-learning manner to enhance long-text, multilingual, and fine-grained understanding. The two stages are listed below, followed by a minimal illustrative sketch.
- Stage 1: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
- Stage 2: Align the LLM-based embedder with the CLIP image encoder with Self-Distillation Regularization.
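The following is a minimal, hypothetical sketch of the two training stages. All module names, loss formulations, and the use of a frozen stage-1 reference copy are our assumptions for illustration, not the repository's actual API; please refer to the paper and the released code for the real implementation.

```python
# Illustrative sketch of the two-stage progressive alignment
# (all names and loss choices here are assumptions, not the repo's API).
import torch
import torch.nn.functional as F

def stage1_distillation_loss(llm_embedder, clip_text_encoder, texts):
    """Stage 1: distill the frozen CLIP text encoder into the LLM-based embedder."""
    with torch.no_grad():
        teacher = F.normalize(clip_text_encoder(texts), dim=-1)  # frozen teacher
    student = F.normalize(llm_embedder(texts), dim=-1)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()  # cosine-distance loss

def stage2_loss(llm_embedder, reference_embedder, clip_image_encoder,
                images, texts, tau=0.07, reg_weight=1.0):
    """Stage 2: image-text contrastive alignment plus a self-distillation
    regularizer toward a frozen snapshot of the stage-1 embedder."""
    txt = F.normalize(llm_embedder(texts), dim=-1)
    with torch.no_grad():
        img = F.normalize(clip_image_encoder(images), dim=-1)  # frozen image tower
        ref = F.normalize(reference_embedder(texts), dim=-1)   # stage-1 snapshot
    logits = img @ txt.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    self_distill = (1.0 - (txt * ref).sum(dim=-1)).mean()  # stay close to reference
    return contrastive + reg_weight * self_distill
```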
- Python >= 3.9
- CUDA >= 11.8 (if using GPU)
- Other dependencies listed in `requirements.txt`
- Clone this repository and install dependencies:

```bash
# Clone the repo
git clone https://github.com/VisionXLab/ProCLIP.git
cd ProCLIP

# Create virtual environment
conda create -n proclip python=3.9 -y
conda activate proclip

# Install dependencies
pip install -r requirements.txt
```
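After installation, a quick sanity check of the environment (this snippet only assumes PyTorch was installed via `requirements.txt`):

```python
# Verify the environment meets the stated requirements.
import sys
import torch

assert sys.version_info >= (3, 9), "Python >= 3.9 is required"
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # needs a CUDA >= 11.8 build for GPU use
```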
Training and evaluation instructions are coming soon.
- More results can be found in the paper.
If you find our work helpful, please cite our paper:
```bibtex
@article{hu2025proclip,
  title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
  author={Hu, Xiaoxing and Yang, Kaicheng and Feng, Ziyong and Ming, Qi and Guo, Zonghao and An, Xiang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2510.18795},
  year={2025}
}
```
```bibtex
@inproceedings{hu2025decoupled,
  title={Decoupled global-local alignment for improving compositional understanding},
  author={Hu, Xiaoxing and Yang, Kaicheng and Wang, Jun and Xu, Haoran and Feng, Ziyong and Wang, Yupei},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={3251--3260},
  year={2025}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
Our work is inspired by LLM2CLIP and CLIP. We are grateful for their outstanding work and code.