Sharath Girish1*
Tsai-Shien Chen1,2*
Zhikang Dong1
Mukesh Singhal2
Hao Chen1
Sergey Tulyakov1
Aliaksandr Siarohin1
1Snap Inc. 2UC Merced *Equal contribution
CineOrchestra is a unified video diffusion model for cinematic video generation that jointly controls subjects, events, camera, and shot transitions in a single forward pass — the first framework to do so.
The core insight: every cinematic element — a character acting, a camera pan, a hard cut — is an entity acting over a temporal interval. We represent all of them with one shared primitive: (start_time, end_time, prompt, [reference_image]), attaching a special {camera} or {transition} tag where needed. This reduces the architectural problem to positional encoding, solved by two coordinated RoPEs:
- Interval-sampled temporal RoPE — consistent attention across events ranging from 0.1s cuts to 10s camera moves
- 2D entity-temporal cross-attention RoPE — disambiguates per-entity conditions and routes each to its spatiotemporal target
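As a rough illustration of the shared primitive described above, the sketch below models one condition as a small Python dataclass. The class name `Condition`, the helper methods, and the choice to carry the `{camera}` / `{transition}` tag as a prefix of the prompt string are assumptions made for this example; the released code may represent conditions differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    """Hypothetical container for (start_time, end_time, prompt, [reference_image])."""
    start_time: float                      # seconds
    end_time: float                        # seconds
    prompt: str                            # may begin with "{camera}" or "{transition}"
    reference_image: Optional[str] = None  # optional path to a reference image

    def __post_init__(self):
        # Every condition must cover a non-empty interval.
        if not 0 <= self.start_time < self.end_time:
            raise ValueError("expected 0 <= start_time < end_time")

    def kind(self) -> str:
        # Assumption: the special tag is written at the start of the prompt.
        for tag in ("camera", "transition"):
            if self.prompt.startswith("{" + tag + "}"):
                return tag
        return "entity"

    def overlaps(self, other: "Condition") -> bool:
        # Two half-open intervals overlap when each starts before the other ends.
        return self.start_time < other.end_time and other.start_time < self.end_time


# One subject, one camera move, and one cut, all expressed with the same primitive.
scene = [
    Condition(0.0, 4.0, "a woman walks through a market", "woman.png"),
    Condition(0.0, 10.0, "{camera} slow pan left"),
    Condition(4.0, 4.1, "{transition} hard cut"),
]
```

Because all three entries share one type, downstream code only needs to reason about intervals and prompts, which is the point of reducing the problem to positional encoding.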
For qualitative results and interactive demos, see the project page.
- [Jun 2026] CineBenchSyn benchmark data and evaluation code released.
- [Jun 2026] Project page released.
- CineBenchSyn benchmark data: sharathgirish/CineBenchSyn
- Benchmark evaluation code: see Benchmark Evaluation (eval/)
The eval/ directory contains a self-contained pipeline that scores generated videos against the
CineBenchSyn conditioning along three axes: subject identity, dense-caption following, and
shot-transition timing.
```bash
conda env create -f eval/environment.yml
conda activate cinebenchsyn
```

ffmpeg is included in the environment. All metric models (Grounding DINO, SAM 2, DINOv2, CLIP, ViCLIP, and Qwen2.5-VL) download automatically from the Hugging Face Hub on first use.
```python
from huggingface_hub import snapshot_download
data = snapshot_download("sharathgirish/CineBenchSyn", repo_type="dataset")
# data/annotations/<id>_ultra_dense.json: per-scenario conditioning
# data/reference_images/<id>_ref_image_NN_<entity>.png
```

Generate one video per scenario with your model. The pipeline expects a directory of <NNNNN>.mp4
files, 1-indexed, where video <NNNNN>.mp4 corresponds to annotation index <NNNNN − 1> (so
00001.mp4 ↔ 00000_ultra_dense.json):
```
my_videos/
  00001.mp4
  00002.mp4
  ...
  00512.mp4
```
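The off-by-one naming between videos and annotations is easy to get wrong, so a small helper can make the mapping explicit. This is a minimal sketch: the function name `annotation_for_video` is hypothetical, and the five-digit zero padding of the annotation id is inferred from the `00000_ultra_dense.json` example above.

```python
from pathlib import Path


def annotation_for_video(video_path, prompts_dir):
    """Map a 1-indexed video file to its 0-indexed annotation file."""
    index = int(Path(video_path).stem)  # "00001" -> 1
    if index < 1:
        raise ValueError("video files are 1-indexed")
    # Assumption: annotation ids use the same five-digit zero padding.
    return Path(prompts_dir) / f"{index - 1:05d}_ultra_dense.json"
```

For example, `annotation_for_video("my_videos/00512.mp4", "annotations")` points at `annotations/00511_ultra_dense.json`.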
```bash
bash eval/run_eval.sh \
  --videos-dir my_videos \
  --prompts-dir "$data/annotations" \
  --refs-dir "$data/reference_images" \
  --output-dir results_my_model \
  --name MyModel \
  --stages extract_masks,compute_grounding,compute_vlm,compute_viclip,aggregate
```

Stages run in the order below; select a subset with --stages:
| Stage | Metric |
|---|---|
| `extract_masks` | Grounds + segments each entity in every video (Grounding DINO + SAM 2) |
| `compute_grounding` | DINO subject identity, CLIP / masked-CLIP caption alignment |
| `compute_vlm` | Qwen2.5-VL shot-transition-timing recall |
| `compute_viclip` | ViCLIP dense-caption following (scene / camera / transition) |
| `aggregate` | Collects per-run scores into aggregate_metrics.json |
Final scores are written to <output-dir>/aggregate_metrics.json.
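To inspect the results programmatically, the file can be read with the standard `json` module. The sketch below makes no assumption about the schema of `aggregate_metrics.json`: it simply flattens whatever nesting it finds into dotted keys. The function names `load_metrics` and `flatten` are hypothetical and not part of eval/.

```python
import json
from pathlib import Path


def load_metrics(output_dir):
    """Read <output-dir>/aggregate_metrics.json into a Python object."""
    with (Path(output_dir) / "aggregate_metrics.json").open() as f:
        return json.load(f)


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"outer.inner": value} pairs."""
    if isinstance(obj, dict):
        flat = {}
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
        return flat
    # Leaf value: drop the trailing dot left by the recursive call.
    return {prefix[:-1] if prefix else "": obj}


def print_metrics(output_dir):
    """Print one flattened metric per line."""
    for name, value in sorted(flatten(load_metrics(output_dir)).items()):
        print(f"{name}: {value}")
```

Calling `print_metrics("results_my_model")` after the `aggregate` stage lists every score the pipeline wrote.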
The pipeline runs on a single GPU (or CPU) by default. For multi-GPU, launch the per-stage scripts in
eval/ with torchrun --nproc_per_node=N (work is split across ranks via a filesystem barrier; no
NCCL required). SAM 2 mask propagation is slow and is skipped by default (--skip_mask_tracking)
— enable it only if you need the masked-region metric variants.
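For readers unfamiliar with filesystem-based synchronization, the sketch below shows one common way to split work across ranks and wait for all of them without NCCL. It is an illustration only, not the code in eval/: the function names, the marker-file naming, and the polling scheme are assumptions. The `RANK` and `WORLD_SIZE` environment variables are the ones torchrun sets for each process.

```python
import os
import time
from pathlib import Path


def rank_and_world_size():
    """Read the rank info that torchrun exports; default to a single process."""
    return int(os.environ.get("RANK", 0)), int(os.environ.get("WORLD_SIZE", 1))


def shard(items, rank, world_size):
    """Give each rank every world_size-th item, starting at its own rank."""
    return items[rank::world_size]


def fs_barrier(sync_dir, name, rank, world_size, timeout=600.0, poll=0.5):
    """Block until every rank has written its marker file for this barrier."""
    sync_dir = Path(sync_dir)
    sync_dir.mkdir(parents=True, exist_ok=True)
    (sync_dir / f"{name}.rank{rank}.done").touch()  # announce that this rank is done
    deadline = time.monotonic() + timeout
    while True:
        finished = len(list(sync_dir.glob(f"{name}.rank*.done")))
        if finished >= world_size:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"barrier '{name}': {finished}/{world_size} ranks arrived")
        time.sleep(poll)
```

A stage script following this pattern would process `shard(videos, rank, world_size)`, call `fs_barrier(...)` on a directory shared by all ranks, and then let rank 0 merge the per-rank outputs. Marker files from a previous run must be cleared first, otherwise the barrier releases immediately.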
This repository is released under the MIT License.