Our work proposes Temporal Preference Optimization (TPO), a comprehensive self-training pipeline for temporal preference optimization in cutting-edge video large multimodal models (video-LMMs). TPO enhances video comprehension by modeling temporal preferences at two levels of granularity: localized and comprehensive TPO. In localized TPO (upper-left), we generate queries focused on short segments, with contrastive responses that retain or exclude the target segment. In comprehensive TPO (lower-left), queries are designed for broader understanding, contrasting responses generated from the intact video against those from a sparsely downsampled video. After post-filtering, the contrastive response pairs serve as a preference dataset to train a video-LMM, guiding the model to prioritize preferred responses for improved video understanding.
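For context, the contrastive pairs described above are used for preference optimization; a DPO-style objective is a common choice for training on such (preferred, dispreferred) pairs. The sketch below is only illustrative and is not the repository's training code (which is not yet released); all names are hypothetical.

```python
# Illustrative only: a generic DPO-style preference loss for (preferred,
# dispreferred) response pairs. This is NOT the released TPO training code;
# all variable names are hypothetical.
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each tensor holds per-example summed log-probabilities of a response
    under the trainable policy or the frozen reference model, shape (batch,)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    reference_margin = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to prefer the chosen response by a wider margin
    # than the frozen reference model does.
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```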
| Model | Huggingface Link |
|---|---|
| LongVA-7B-TPO | Download |
| LLaVA-Video-7B-TPO | Download |
For LongVA-TPO:
```bash
git clone https://github.com/ruili33/TPO
cd TPO
conda create -n TPOLongVA python=3.10
conda activate TPOLongVA
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "longva/.[train]"
pip install packaging && pip install ninja && pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
pip install -r requirements_longva.txt
```
For LLaVA-Video-TPO:
```bash
conda create -n TPOllava python=3.10 -y
conda activate TPOllava
pip install --upgrade pip
pip install -e "LLaVA/.[train]"
pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
```
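After setting up either environment, a quick sanity check (a minimal sketch, not part of the repository) can confirm that the CUDA build of PyTorch and flash-attn are importable:

```python
# Quick environment sanity check (illustrative, not part of the repository).
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```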
For LongVA-TPO, please follow the inference demo in `longva/inference_longva.py`.
For LLaVA-Video-TPO, please follow the inference demo in `LLaVA/inference_llava.py`.
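The exact generation call differs between the two models, so the demo scripts above remain the authoritative reference; the snippet below only sketches the common preprocessing step of uniformly sampling frames from a video with decord. The file path and frame budget are illustrative assumptions.

```python
# Illustrative preprocessing only: uniformly sample frames from a video with
# decord before handing them to the model (see the demo scripts for the full
# model-specific pipeline). Path and frame count are hypothetical.
import numpy as np
from decord import VideoReader, cpu

video_path = "example.mp4"   # hypothetical input video
num_frames = 32              # illustrative frame budget

vr = VideoReader(video_path, ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, num_frames, dtype=int)   # uniform sampling
frames = vr.get_batch(indices).asnumpy()   # (num_frames, H, W, 3) uint8 array
```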
For evaluation, we utilize lmms-eval to enable evaluation consistent with previous works. Please refer to their instructions for the evaluation setup.
For LongVA-TPO, please refer to `longva/eval.sh` for the evaluation script.
For LLaVA-Video-TPO, please refer to `LLaVA/eval.sh` for the evaluation script.
The TPO dataset for LongVA is available at Huggingface Dataset.
To run the web demo for our TPO model (LLaVA-Video-7B-TPO), run the following commands:
```bash
conda activate TPOllava
python local_demo/multimodal_chat.py
```
The training code is coming soon!
- Release the temporal preference data curation pipeline by March.
- Release the training code by March.
If you find this repository useful in your research or work, please consider citing our paper:
```bibtex
@misc{li2025temporalpreferenceoptimizationlongform,
      title={Temporal Preference Optimization for Long-Form Video Understanding},
      author={Rui Li and Xiaohan Wang and Yuhui Zhang and Zeyu Wang and Serena Yeung-Levy},
      year={2025},
      eprint={2501.13919},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.13919},
}
```
This work is based on the original LongVA and LLaVA-Video repositories. We extend our gratitude to the maintainers and contributors of these repositories for their incredible work, which greatly facilitated the development of our project.