Ziqi Huang1, Ning Yu2✉†, Gordon Chen1, Haonan Qiu1, Paul Debevec2, Ziwei Liu1✉
1 Nanyang Technological University 2 Eyeline Labs
✉ corresponding authors † project lead
Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide sparse inference-time tuning of a pre-trained video generator at these key moments only. Our approach is tuning-efficient, introduces minimal overhead, and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
We introduce VChain, an inference-time tuning framework for reasoning in video generation. Given a user-provided prompt (e.g., “A rock and a feather are falling from the sky towards the ground.”), VChain leverages large multimodal models to generate a Chain of Visual Thoughts, a sparse set of causally important keyframes that guide the video generator via Sparse Inference-Time Tuning. VChain effectively improves reasoning in video generation without extensive re-training.
An overview of our three-stage inference-time pipeline for reasoning in video generation.
(a) Visual Thought Reasoning: Given a user-provided text prompt, a large multimodal model (GPT-4o) infers a causal chain of events and generates a sequence of keyframes, termed the Chain of Visual Thoughts, via iterative reasoning and image synthesis.
(b) Sparse Inference-Time Tuning: These visual thoughts (paired with their corresponding textual thoughts) serve as sparse supervision for fine-tuning a pre-trained video generator via LoRA.
(c) Video Sampling: The full sequence of textual thoughts is concatenated into a single prompt, which is used to prompt the fine-tuned model to generate the final video. A minimal sketch of the full pipeline is given below.
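
The three stages can be summarized in a few lines of Python. The sketch below is a hypothetical outline only: the `MultimodalModel` and `VideoGenerator` interfaces, the method names (`infer_causal_chain`, `synthesize_image`, `lora_finetune`, `sample`), and the hyperparameters are assumptions made for illustration, not the actual API of this repository; please refer to the code link below for the real implementation.

```python
"""Minimal sketch of the three-stage VChain pipeline.

All interface names, method signatures, and hyperparameters here are
illustrative assumptions, not the repository's actual API.
"""

from typing import Any, List, Protocol, Tuple

Image = Any        # a synthesized keyframe (e.g., a PIL image or tensor)
LoraWeights = Any  # the LoRA adapter produced by sparse tuning
Video = Any        # the final sampled video clip


class MultimodalModel(Protocol):
    """Assumed interface for a large multimodal model such as GPT-4o."""

    def infer_causal_chain(self, prompt: str, num_thoughts: int) -> List[str]: ...
    def synthesize_image(self, textual_thought: str) -> Image: ...


class VideoGenerator(Protocol):
    """Assumed interface for a pre-trained text-to-video generator."""

    def lora_finetune(self, images: List[Image], captions: List[str],
                      rank: int, steps: int) -> LoraWeights: ...
    def sample(self, prompt: str, lora: LoraWeights) -> Video: ...


def visual_thought_reasoning(prompt: str, mm: MultimodalModel,
                             num_thoughts: int = 4) -> Tuple[List[str], List[Image]]:
    # (a) Infer a causal chain of textual thoughts, then synthesize one
    #     keyframe image per thought (the Chain of Visual Thoughts).
    thoughts = mm.infer_causal_chain(prompt, num_thoughts)
    keyframes = [mm.synthesize_image(t) for t in thoughts]
    return thoughts, keyframes


def sparse_inference_time_tuning(gen: VideoGenerator, thoughts: List[str],
                                 keyframes: List[Image]) -> LoraWeights:
    # (b) LoRA-tune the video generator only on the sparse
    #     (keyframe, textual thought) pairs; no dense per-frame supervision.
    return gen.lora_finetune(images=keyframes, captions=thoughts, rank=16, steps=200)


def video_sampling(gen: VideoGenerator, lora: LoraWeights,
                   thoughts: List[str]) -> Video:
    # (c) Concatenate all textual thoughts into one prompt and sample
    #     the final video from the tuned generator.
    return gen.sample(" ".join(thoughts), lora=lora)


def vchain(prompt: str, mm: MultimodalModel, gen: VideoGenerator) -> Video:
    thoughts, keyframes = visual_thought_reasoning(prompt, mm)
    lora = sparse_inference_time_tuning(gen, thoughts, keyframes)
    return video_sampling(gen, lora, thoughts)
```

Note that, in this reading, the tuning happens at inference time on a per-prompt basis: the LoRA adapter is fitted only on the handful of keyframes produced for that prompt before the final video is sampled.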
- 📄 Paper (arXiv): https://arxiv.org/abs/2510.05094
- 🌐 Project Page: https://eyeline-labs.github.io/VChain/
- 💻 Code: https://github.com/Eyeline-Labs/VChain
- 🎬 Video: https://www.youtube.com/watch?v=HV4uAHJwt1k
If you find our work useful, please consider citing:
@article{huang2025vchain,
  title={{VChain}: Chain-of-Visual-Thought for Reasoning in Video Generation},
  author={Huang, Ziqi and Yu, Ning and Chen, Gordon and Qiu, Haonan and Debevec, Paul and Liu, Ziwei},
  journal={arXiv preprint arXiv:2510.05094},
  year={2025}
}

