Our work proposes Temporal Preference Optimization (TPO), a comprehensive self-training pipeline for temporal preference optimization in cutting-edge video large multimodal models (video-LMMs). TPO enhances video comprehension by modeling temporal preferences at two levels of granularity: localized and comprehensive TPO. In localized TPO (upper-left), we generate queries focused on short segments, with contrastive responses that retain or exclude the target segment. In comprehensive TPO (lower-left), queries are designed for broader understanding, contrasting responses generated from the intact video against those from a sparsely downsampled video. After post-filtering, the contrastive response pairs serve as a preference dataset to train a video-LMM, guiding the model to prioritize preferred responses for improved video understanding.
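For context, the contrastive pairs described above are used for preference optimization; a DPO-style objective is a common choice for training on such (preferred, dispreferred) pairs. The sketch below is only illustrative and is not the repository's training code (which is not yet released); all names are hypothetical.

```python
# Illustrative only: a generic DPO-style preference loss for (preferred,
# dispreferred) response pairs. This is NOT the released TPO training code;
# all variable names are hypothetical.
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each tensor holds per-example summed log-probabilities of a response
    under the trainable policy or the frozen reference model, shape (batch,)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    reference_margin = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to prefer the chosen response by a wider margin
    # than the frozen reference model does.
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```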
| Model | Huggingface Link |
|---|---|
| LongVA-7B-TPO | Download |
| LLaVA-Video-7B-TPO | Download |
For LongVA-TPO:
```bash
git clone https://github.com/ruili33/TPO
cd TPO
conda create -n TPOLongVA python=3.10
conda activate TPOLongVA
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "longva/.[train]"
pip install packaging && pip install ninja && pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
pip install -r requirements_longva.txt
```
For LLaVA-Video-TPO:
```bash
conda create -n TPOllava python=3.10 -y
conda activate TPOllava
pip install --upgrade pip
pip install -e "LLaVA/.[train]"
pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir
```
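After setting up either environment, a quick sanity check (a minimal sketch, not part of the repository) can confirm that the CUDA build of PyTorch and flash-attn are importable:

```python
# Quick environment sanity check (illustrative, not part of the repository).
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```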
For LongVA-TPO, please follow the inference demo in `longva/inference_longva.py`.
For LLaVA-Video-TPO, please follow the inference demo in `LLaVA/inference_llava.py`.
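The exact generation call differs between the two models, so the demo scripts above remain the authoritative reference; the snippet below only sketches the common preprocessing step of uniformly sampling frames from a video with decord. The file path and frame budget are illustrative assumptions.

```python
# Illustrative preprocessing only: uniformly sample frames from a video with
# decord before handing them to the model (see the demo scripts for the full
# model-specific pipeline). Path and frame count are hypothetical.
import numpy as np
from decord import VideoReader, cpu

video_path = "example.mp4"   # hypothetical input video
num_frames = 32              # illustrative frame budget

vr = VideoReader(video_path, ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, num_frames, dtype=int)   # uniform sampling
frames = vr.get_batch(indices).asnumpy()   # (num_frames, H, W, 3) uint8 array
```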
For evaluation, we utilize lmms-eval to enable evaluation consistent with previous works. Please refer to their instructions for the evaluation setup.
For LongVA-TPO, please refer to `longva/eval.sh` for the evaluation script.
For LLaVA-Video-TPO, please refer to `LLaVA/eval.sh` for the evaluation script.
The TPO dataset for LongVA is available at Huggingface Dataset.
To run the web demo for our TPO model (LLaVA-Video-7B-TPO), run the following commands:
```bash
conda activate TPOllava
python local_demo/multimodal_chat.py
```
The training code is coming soon!
- Release the temporal preference data curation pipeline by March.
- Release the training code by March.
If you find this repository useful in your research or work, please consider citing our paper:
```bibtex
@misc{li2025temporalpreferenceoptimizationlongform,
      title={Temporal Preference Optimization for Long-Form Video Understanding},
      author={Rui Li and Xiaohan Wang and Yuhui Zhang and Zeyu Wang and Serena Yeung-Levy},
      year={2025},
      eprint={2501.13919},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.13919},
}
```
This work is based on the original LongVA and LLaVA-Video repositories. We extend our gratitude to the maintainers and contributors of these repositories for their incredible work, which greatly facilitated the development of our project.