[EMNLP 2025 Oral] ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning
Rui Wang, Bohao Li, Xiyang Dai, Jianwei Yang, Yi-Ling Chen, Zhen Xing, Yifan Yang, Dongdong Chen, Xipeng Qiu, Zuxuan Wu and Yu-Gang Jiang

Datasets

The ProLongVid data (annotations and videos) have been uploaded to Hugging Face.
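
For a quick start, the data can be fetched with the Hugging Face CLI. A minimal sketch, assuming the CLI is installed and using a placeholder repo id (take the exact dataset id from the Hugging Face page):

# Placeholder repo id -- substitute the actual ProLongVid dataset repo on Hugging Face.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/ProLongVid-data --repo-type dataset --local-dir ./data/prolongvid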

Models

| Model | Frames (Train) | Frames (Test) | Video-MME (w/o sub) | Huggingface |
|---|---|---|---|---|
| Image-SFT-7B | - | 32 | 57.6 | prolongvid_image_sft_7B |
| ProLongVid-Stage-1-7B | 32 | 32 | 60.1 | prolongvid_stage1_7B |
| ProLongVid-Stage-2-7B | 128 | 128 | 63.6 | prolongvid_stage2_7B |
| ProLongVid-Stage-3-7B | 192 | 192 | 63.8 | prolongvid_7B |
| ProLongVid-Stage-3-7B | 192 | 256 | 64.7 | prolongvid_7B |
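
The checkpoints above can be pulled the same way; a sketch with a placeholder org name (use the exact Hugging Face repo ids linked in the table):

# Placeholder org name -- use the exact repo id from the table above.
huggingface-cli download <org>/prolongvid_7B --local-dir ./checkpoints/prolongvid_7B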

Installation

For training:

conda create -n llava python==3.10 -y
conda activate llava
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -e ".[train]"
pip install bitsandbytes
pip install tensorboardX
pip install transformers==4.43.4
pip install flash-attn==2.6.3 --no-build-isolation

Training

For example, to launch stage-2 training, run the following script:

bash scripts/train/train_stage2.sh 

Eval

Please install the evaluation environment with the lmms-eval code included in this repo, and follow the instructions for this version to evaluate on video benchmarks.
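
As a minimal sketch of a Video-MME run, assuming the model name llava_vid, the task name videomme, and the max_frames_num argument from the bundled lmms-eval version (check that version for the exact identifiers):

# Assumed identifiers (llava_vid, videomme, max_frames_num) -- verify against the bundled lmms-eval.
accelerate launch --num_processes=8 -m lmms_eval \
    --model llava_vid \
    --model_args pretrained=./checkpoints/prolongvid_7B,max_frames_num=256 \
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/

Setting max_frames_num=256 at test time mirrors the last row of the model table, where the 192-frame Stage-3 model is evaluated with 256 frames.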

ToDo

  • More comprehensive training and testing tutorials.
  • A more efficient training framework with sequence parallelism support.
  • The original Dense Video Caption data.
  • New models trained from a stronger image-LMM baseline.

Contact

For questions, feedback, or collaboration opportunities, feel free to reach out: wangrui21@m.fudan.edu.cn

Acknowledgement

This repo is built on LLaVA-Next. Thanks for their wonderful work.

Citation

If you find our work useful for your research, please consider citing:

@inproceedings{wang2025prolongvid,
  title={ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning},
  author={Rui Wang and Bohao Li and Xiyang Dai and Jianwei Yang and Yi-Ling Chen and Zhen Xing and Yifan Yang and Dongdong Chen and Xipeng Qiu and Zuxuan Wu and Yu-Gang Jiang},
  booktitle={EMNLP},
  year={2025}
}
