
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Ziqi Huang1, Ning Yu2✉†, Gordon Chen1, Haonan Qiu1, Paul Debevec2, Ziwei Liu1✉

1 Nanyang Technological University     2 Eyeline Labs
✉ corresponding authors     † project lead

Abstract

Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

🎬 VChain Demo

VChain Demo

Overview of VChain

We introduce VChain, an inference-time tuning framework for reasoning in video generation. Given a user-provided prompt (e.g., “A rock and a feather are falling from the sky towards the ground.”), VChain leverages large multimodal models to generate a Chain of Visual Thoughts, a sparse set of causally important keyframes that guides the video generator via Sparse Inference-Time Tuning. VChain effectively improves reasoning in video generation without extensive re-training.

VChain Overview

📽️ VChain Framework

An overview of our three-stage inference-time pipeline for reasoning in video generation.
(a) Visual Thought Reasoning: Given a user-provided text prompt, a large multimodal model (GPT-4o) infers a causal chain of events and generates a sequence of keyframes, termed the Chain of Visual Thoughts, via iterative reasoning and image synthesis.
(b) Sparse Inference-Time Tuning: These visual thoughts (paired with their corresponding textual thoughts) serve as sparse supervision for fine-tuning a pre-trained video generator via LoRA.
(c) Video Sampling: The full sequence of textual thoughts is concatenated into a single prompt, which the fine-tuned model uses to generate the final video.

VChain Framework
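To make the three stages concrete, below is a minimal Python sketch of the pipeline as described above. All names in it (VisualThought, reason_visual_thoughts, sparse_inference_time_tuning, sample_video, and the generate call) are hypothetical placeholders for illustration, not this repository's actual API.

# A minimal sketch of the three-stage VChain pipeline described above.
# All names below are hypothetical placeholders, not the repository's actual API.

from dataclasses import dataclass
from typing import Any, List


@dataclass
class VisualThought:
    """One step of the Chain of Visual Thoughts."""
    text: str        # textual thought: short description of the key state
    keyframe: Any    # visual thought: keyframe image synthesized for that state


def reason_visual_thoughts(prompt: str, mllm: Any) -> List[VisualThought]:
    """(a) Visual Thought Reasoning.

    Ask a large multimodal model (e.g. GPT-4o) to infer the causal chain of
    events behind `prompt`, then synthesize one keyframe per inferred state.
    Placeholder body: the actual prompting and iteration are model-specific.
    """
    raise NotImplementedError


def sparse_inference_time_tuning(video_model: Any,
                                 thoughts: List[VisualThought]) -> Any:
    """(b) Sparse Inference-Time Tuning.

    Fit LoRA adapters on the pre-trained video generator using only the
    sparse (keyframe, textual thought) pairs as supervision; no dense
    frame-level targets or large-scale retraining are involved.
    """
    raise NotImplementedError


def sample_video(tuned_model: Any, thoughts: List[VisualThought]) -> Any:
    """(c) Video Sampling.

    Concatenate the textual thoughts into a single prompt and sample the
    final video from the tuned generator.
    """
    full_prompt = " ".join(t.text for t in thoughts)
    return tuned_model.generate(full_prompt)  # hypothetical generator call


def vchain(prompt: str, mllm: Any, video_model: Any) -> Any:
    thoughts = reason_visual_thoughts(prompt, mllm)
    tuned_model = sparse_inference_time_tuning(video_model, thoughts)
    return sample_video(tuned_model, thoughts)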

🔗 Links

🪶 Citation

If you find our work useful, please consider citing:

@article{huang2025vchain,
  title={{VChain}: Chain-of-Visual-Thought for Reasoning in Video Generation},
  author={Huang, Ziqi and Yu, Ning and Chen, Gordon and Qiu, Haonan and Debevec, Paul and Liu, Ziwei},
  journal={arXiv preprint arXiv:2510.05094},
  year={2025}
}

About

The official implementation of the paper “VChain: Chain-of-Visual-Thought for Reasoning in Video Generation”
