
TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

[📖 Paper] [🤗 TSPO-model] [🤗 TSPO-train-data]

👀 Overview

To address the challenges of unsupervised and non-differentiable sparse frame sampling in Video-MLLMs, we propose Temporal Sampling Policy Optimization (TSPO), a reinforcement learning framework that advances long-form video understanding.

🏆 Performance

  • Our method achieves 63.9% accuracy on LongVideoBench and 76.3% on MLVU, setting a new state of the art among 7B video-MLLMs.

  • Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of 4.3% across four benchmarks; with Qwen2.5-VL-7B, the gain reaches 6.1%. Transferability to other backbones is further analyzed in Table 2 of our paper.

🧸 Toy example

We present a toy example to show how TSPO works. We follow the intuition that Video-MLLMs can only give correct answers if the temporal agent samples the right keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language response accuracy reward $R_A$ derived from multiple-choice QA to supervise the temporal agent without any frame-level annotation; a minimal reward sketch follows the bullets below.

  • As shown in the GIF, through TSPO training, the temporal agent learns to select frames that lead to the correct answer for the question "What is the scene at the beginning of the video?". As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.

  • To reproduce this example, first set up the environment as described in the following section, then download LLaVA-Video-Qwen, CLIP-Large, and 208.mp4, and modify the model_name_or_path and clip_path in toy_example.sh. The script can be run on a single GPU with at least 28 GB of memory.
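
As a minimal sketch of the reward (a hypothetical helper, not the project's exact code), $R_A$ only checks whether the option letter in the model's response matches the ground truth, so no frame-level labels are needed:

import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    # Hypothetical sketch of the accuracy reward R_A for multiple-choice QA:
    # 1.0 if the option letter extracted from the response matches the
    # ground-truth letter, else 0.0. The reward depends only on the final
    # language answer, never on frame-level annotation.
    match = re.search(r"\b([A-D])\b", response.strip())
    predicted = match.group(1) if match else ""
    return 1.0 if predicted == ground_truth.strip().upper() else 0.0

# The temporal agent is rewarded exactly when its sampled frames let the MLLM answer correctly.
print(accuracy_reward("The answer is A.", "A"))  # 1.0
print(accuracy_reward("I think it is C.", "A"))  # 0.0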

📐 Set up

conda create -n TSPO python=3.10
conda activate TSPO

pip install -r requirement.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install qwen-vl-utils
pip install math_verify

cd lmms-eval
pip install -e .
cd ../
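
Optionally, sanity-check the environment (a minimal script that only verifies the key packages import and a GPU is visible):

# Quick environment check: key imports and GPU visibility.
import torch
import flash_attn  # installed above with --no-build-isolation

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())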

🎥 Demo

# using llava_video as backbone
CUDA_VISIBLE_DEVICES=0 python demo/llava_video_tspo.py

# using Qwen2.5vl as backbone
CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py

💾 Dataset

  • Training: download the LLaVA-Video-178K videos and the TSPO-10K annotations from [🤗 TSPO-train-data] (see the Training section below for how the paths are configured).

  • Evaluation

    • Download LongVideoBench, MLVU, VideoMME, LVBench

    • For LongVideoBench and LVBench, we use the original JSON files. For MLVU and VideoMME, we convert their Parquet files into JSON format (a conversion sketch follows the directory tree below). These JSON files are stored in script/jsons.

    • To adapt the data to our evaluation pipeline, we further organize them into TSV format and place them under evaluation/data.

    • The final directory structure is as follows:

      - evaluation
        - data
          - *.tsv
          - videos
            - LongVideoBench
              - video
                - data
                  - *.mp4
            - MLVU
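
    • The Parquet-to-JSON conversion mentioned above takes only a few lines of pandas (a sketch with hypothetical file names; the released files under script/jsons are the reference):

    # Convert a benchmark's Parquet annotations into a JSON list of records.
    # Paths are hypothetical; pd.read_parquet needs pyarrow or fastparquet.
    import pandas as pd

    df = pd.read_parquet("MLVU/test.parquet")
    df.to_json("script/jsons/MLVU.json", orient="records", indent=2, force_ascii=False)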
      

🚀 Training

First, download LLaVA-Video-Qwen and CLIP-Large, and modify the model_name_or_path and clip_path in train_deepspeed.sh. For the data paths, set video_folder to the path of LLaVA-Video-178K and jsonl_path to the path of TSPO-10K.jsonl.
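
Optionally, verify that the data paths resolve before launching (a minimal sketch; it assumes nothing about the JSONL schema beyond one JSON object per line):

# Sanity-check the training data paths referenced in train_deepspeed.sh.
import json
import os

video_folder = "/path/to/LLaVA-Video-178K"  # set to your local path
jsonl_path = "/path/to/TSPO-10K.jsonl"      # set to your local path

assert os.path.isdir(video_folder), f"missing video folder: {video_folder}"
with open(jsonl_path) as f:
    records = [json.loads(line) for line in f if line.strip()]
print(f"{len(records)} training records in {jsonl_path}")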

Then, you can run the following command:

bash train_deepspeed.sh

To obtain your trained TSPO-0.4B weights, run the merge script:

python scripts/merge_weights.py

🔮 Evaluation

  • Extract CLIP features and select frame indices

    • You need to edit the model_path, root, and save_root in mp_tools/vlmeval/config.py.
    • The first run saves the features locally; subsequent runs load them directly, making the process much faster (a generic sketch of this caching pattern follows the commands below).
    cd mp_tools
    bash get_frame_idx.sh LongVideoBench TSPO  # dataset_name method_name 
    cd ../
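
    • A generic sketch of the caching pattern (hypothetical names, not the project's exact code):

    # Generic feature-caching pattern: compute and save CLIP features on the
    # first run, load them from save_root on every later run.
    import os
    import torch

    def get_clip_features(video_id: str, save_root: str, compute_fn):
        cache_path = os.path.join(save_root, f"{video_id}.pt")
        if os.path.exists(cache_path):
            return torch.load(cache_path)  # fast path: reuse saved features
        feats = compute_fn(video_id)       # slow path: run the CLIP encoder
        os.makedirs(save_root, exist_ok=True)
        torch.save(feats, cache_path)
        return feats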
    
  • Run Lmms-eval

    • For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to the current project by adding files such as llava_vid_tspo.py and qwen_2_5_vl_tspo.py.
    • Run:
    # For LLaVA-Video
    bash evaluation/TSPO_llava_video.sh LongVideoBench TSPO  # dataset_name method_name 
    
    # For Qwen2.5-VL+TSPO
    bash evaluation/TSPO_qwen25_vl.sh LongVideoBench TSPO
    
  • You can evaluate the original models without TSPO:

    # For Original Qwen2.5-VL 
    bash evaluation/original_qwen25_vl.sh LongVideoBench xxx  # dataset_name method_name 
    
    # For Original LLaVA-Video
    bash evaluation/original_llava_video.sh LongVideoBench xxx # dataset_name method_name 
    
  • For LVBench, we use its own evaluation protocol; the corresponding code will be released soon.

Acknowledgements

Open-LLaVA-Video-R1, Lmms-eval, VLMEvalKit, AKS

Citations

If you find our work helpful for your research, please consider citing our work.

@article{tang2025tspo,
  title={TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding},
  author={Tang, Canhui and Han, Zifan and Sun, Hongbo and Zhou, Sanping and Zhang, Xuchong and Wei, Xin and Yuan, Ye and Xu, Jinglin and Sun, Hao},
  journal={arXiv preprint arXiv:2508.04369},
  year={2025}
}

✨ Star History

Star History Chart

No packages published