
UniVid: The Open-Source Unified Video Model

This is the official repository for the paper:

UniVid: The Open-Source Unified Video Model

Jiabin Luo*, Junhui Lin*, Zeyu Zhang*, Biao Wu*, Meng Fang, Ling Chen, and Hao Tang

*Equal contribution. Project lead. Corresponding author.

Demo video: output2.mp4

✏️ Citation

If you find our code or paper helpful, please consider starring ⭐ us and citing:

@article{luo2025univid,
  title={UniVid: The Open-Source Unified Video Model},
  author={Luo, Jiabin and Lin, Junhui and Zhang, Zeyu and Wu, Biao and Fang, Meng and Chen, Ling and Tang, Hao},
  journal={arXiv preprint arXiv:2509.24200},
  year={2025}
}

TODO List

  • ⬜️ Upload our paper to arXiv and build project pages.
  • ⬜️ Upload the code.

🏃 Intro to UniVid

UniVid is an open-source unified model that supports both video generation and video understanding.

Unified video modeling that combines generation and understanding is increasingly important, yet it faces two key challenges: (1) maintaining semantic faithfulness during flow-based generation, where text-visual token imbalance and uniform cross-modal attention across the flow trajectory are suboptimal, and (2) efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance: UniVid improves the VBench-Long total score by 2.2% over the previous SOTA method EasyAnimateV5.1, and gains 1.0% and 3.3% accuracy on MSVD-QA and ActivityNet-QA, respectively, over the best prior 7B baselines.
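The adapter coupling mentioned above is described in detail in the paper; purely to illustrate the idea, a minimal adapter could be a small projection that maps MLLM hidden states into the diffusion decoder's conditioning space, as in the sketch below (all dimensions, layer choices, and names here are assumptions, not the paper's implementation):

import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    # Illustrative only: projects MLLM hidden states into the diffusion
    # decoder's conditioning space. Dimensions are assumptions.
    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(mllm_dim),
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, mllm_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, seq, mllm_dim) -> (batch, seq, cond_dim)
        return self.proj(mllm_tokens)

adapter = LightweightAdapter()
cond = adapter(torch.randn(1, 77, 4096))  # conditioning tokens for the decoder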


🔧 Run Your UniVid

1. Install & Requirements

conda env create -f environment.yaml
conda activate univid

2. Understanding

Runs the Reflection pipeline on a subset of videos and saves results + traces.

Input format:

  • video_dir: contains files like video{video_id}.mp4
  • gt_file: a JSON list whose entries contain at least video_id, question, and answer (id is optional), for example:
    [{"video_id": 1203, "question": "What color is the car?", "answer": "Red"}]

Quick Start

python eval_understanding.py \
  --video_dir /path/to/videos \
  --gt_file /path/to/gt.json \
  --output_dir /path/to/out \
  --output_name subset_run \
  --model_path /path/to/MODEL_DIR \
  --no_ddp_ranker \
  --siglip_ckpt google/siglip2-base-patch16-naflex

Outputs

  • Batch summary: /path/to/out/subset_run.json (fields: id, video_id, question, answer, pred, trace_path)
  • Per-sample trace JSONs: /path/to/out/video{video_id}_reflexion.json
  • Keyframes (if enabled): sample_frames/video{video_id}/...
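
To get a quick feel for a finished run, the sketch below loads the batch summary (assuming it is a JSON list of records with the fields above) and reports naive exact-match accuracy between pred and answer; the benchmarks' official scoring differs, so treat this only as a sanity check:

import json

with open("/path/to/out/subset_run.json") as f:
    results = json.load(f)

# Naive exact-match scoring, purely for a quick sanity check.
correct = sum(
    str(r["pred"]).strip().lower() == str(r["answer"]).strip().lower()
    for r in results
)
print(f"exact match: {correct}/{len(results)} = {correct / len(results):.3f}")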

Note:

If all three rounds fail:

  • Static: fallback uses global-caption answer; if insufficient, use the last round.
  • Dynamic: fallback uses global-caption answer; if insufficient, use the first round.
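
Purely as an illustration, the fallback rule above can be mirrored in a few lines; the function names and the sufficiency check here are hypothetical, not the actual implementation:

def is_sufficient(answer):
    # Placeholder check; the real criterion lives inside the pipeline.
    return bool(answer and answer.strip())

def pick_fallback(question_type, global_caption_answer, round_answers):
    # round_answers holds the three rounds' answers in order.
    if is_sufficient(global_caption_answer):
        return global_caption_answer
    # Static questions fall back to the last round, dynamic ones to the first.
    return round_answers[-1] if question_type == "static" else round_answers[0]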

For DDP frame ranking, omit --no_ddp_ranker and add --ddp_ranker clip_rank_video_ddp.py --nproc 4.
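
For example, the Quick Start command with DDP ranking enabled might look like this (paths are placeholders; only the ranking-related flags differ from the Quick Start command above):

python eval_understanding.py \
  --video_dir /path/to/videos \
  --gt_file /path/to/gt.json \
  --output_dir /path/to/out \
  --output_name subset_run \
  --model_path /path/to/MODEL_DIR \
  --ddp_ranker clip_rank_video_ddp.py --nproc 4 \
  --siglip_ckpt google/siglip2-base-patch16-naflex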
