
# T2Vid: Efficient Video Fine-tuning Scheme for MLLMs

📑 [Paper](https://arxiv.org/abs/2411.19951) | 🤗 Hugging Face

**TL;DR:** We propose a data augmentation method that synthesizes "video" samples from long-text QA data to enrich the instruction diversity of video training data, enabling more efficient fine-tuning with comparable or better performance.

## ✨ Highlights

**🤔 Main findings:** the importance of instruction diversity in video fine-tuning, and how to improve it efficiently.

- We observed limited instruction diversity in the datasets developed for Video-LLMs, which leads to low learning efficiency (more details and findings are available in our paper).
- Since long text is a rich and economical data source, we leverage it by converting it into a format consistent with video instruction data; a minimal sketch of the idea follows this list.
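To make the second point concrete, here is a minimal, hypothetical sketch of the synthesis idea (illustrative only, not the repo's actual t2vid.py; the chunking width, image size, and font are assumptions): a long text is split into chunks and each chunk is rendered as an image, so the resulting image sequence can stand in for video frames.

```python
# Minimal sketch (illustrative, not the repo's t2vid.py): render a long text
# into a sequence of images that stands in for video frames.
import textwrap
from PIL import Image, ImageDraw

def text_to_frames(text, lines_per_frame=10, size=(448, 448)):
    """Split `text` into wrapped lines and render them frame by frame."""
    lines = textwrap.wrap(text, width=60)  # assumed wrap width
    frames = []
    for i in range(0, len(lines), lines_per_frame):
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        # default bitmap font for simplicity; a real pipeline would pick a readable TTF
        draw.multiline_text((12, 12), "\n".join(lines[i:i + lines_per_frame]),
                            fill="black")
        frames.append(img)
    return frames

frames = text_to_frames("A long QA passage from a text instruction dataset. " * 40)
print(f"Synthesized {len(frames)} pseudo-video frames")
```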


**🚀 Train less, achieve more:** by mixing in our synthetic data, one can achieve comparable or better performance while training on only 15% of the samples.

| Model | Training data | Video-MME | MVBench | TempCompass |
| --- | --- | --- | --- | --- |
| MiniCPM-V-2.5-8B | zero-shot | 48.2 | 42.9 | 49.1 |
| MiniCPM-V-2.5-8B | 200K video data | 50.8 | 48.0 | 54.7 |
| MiniCPM-V-2.5-8B | 20K video data + 10K synthetic data | 53.0 | 48.4 | 56.8 |
| Idefics3-8B | zero-shot | 51.2 | 49.6 | 55.9 |
| Idefics3-8B | 200K video data | 53.3 | 50.7 | 62.9 |
| Idefics3-8B | 20K video data + 10K synthetic data | 56.3 | 51.6 | 62.3 |

## 🛠️ Quick Setup

1. Create a conda virtual environment and install the required packages.

   ```bash
   conda create -n t2vid python=3.9
   conda activate t2vid
   pip install -r requirements.txt
   ```

2. Install Flash Attention 2 (for efficient training and inference).

   ```bash
   pip install -U flash-attn --no-build-isolation
   ```

## 💡 Training & Evaluation

Instructions for training and evaluation (including pre-trained weights) are in TRAIN.md and EVAL.md.

## 📖 Misc

For those interested in the implementation details of our paper:

- How to translate text into images? Check t2vid.py.
- How to visualize the distribution of instructions? A hedged sketch of one common approach follows this list.
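For the second point, a common recipe is to embed the instruction strings and project them to 2-D. The sketch below is illustrative: the embedding model and t-SNE settings are assumptions, not necessarily what our scripts use.

```python
# Illustrative sketch: embed instructions and project to 2-D with t-SNE.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

instructions = [  # replace with instructions parsed from your training JSON
    "Describe the main events in the video.",
    "What color is the car that appears first?",
    "Summarize the speaker's argument.",
    "How many people enter the room?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb = model.encode(instructions)

# perplexity must be smaller than the number of samples
pts = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(emb)
plt.scatter(pts[:, 0], pts[:, 1], s=10)
plt.title("Instruction embedding distribution (t-SNE)")
plt.savefig("instruction_distribution.png")
```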

## 🙌 Related Projects

- Video-MME: a comprehensive video benchmark that we mainly use in our study.
- Awesome-MLLM: a project tracking new papers and the latest developments in the field of MLLMs.

## 🌻 Acknowledgement

## 🖋️ Citation

If you find our project useful, please consider citing our paper:

```bibtex
@article{yin2024t2vid,
  title={T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Shen, Yunhang and Ge, Chunjiang and Yang, Yan and Long, Zuwei and Dai, Yuhan and Xu, Tong and Sun, Xing and others},
  journal={arXiv preprint arXiv:2411.19951},
  year={2024}
}
```
