CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Sharath Girish¹* Tsai-Shien Chen¹,²* Zhikang Dong¹ Mukesh Singhal² Hao Chen¹ Sergey Tulyakov¹ Aliaksandr Siarohin¹
¹Snap Inc. ²UC Merced *Equal contribution

Overview

CineOrchestra is a unified video diffusion model for cinematic video generation that jointly controls subjects, events, camera, and shot transitions in a single forward pass — the first framework to do so.

The core insight: every cinematic element (a character acting, a camera pan, a hard cut) is an entity acting over a temporal interval. We represent all of them with one shared primitive, (start_time, end_time, prompt, [reference_image]), attaching a special {camera} or {transition} tag where needed. This reduces the architectural problem to positional encoding, which is solved by two coordinated rotary position embeddings (RoPEs):

Interval-sampled temporal RoPE — consistent attention across events ranging from 0.1s cuts to 10s camera moves
2D entity-temporal cross-attention RoPE — disambiguates per-entity conditions and routes each to its spatiotemporal target
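
To make the shared primitive concrete, here is a minimal sketch of how such entities could be held in memory. The field names come from the tuple above; the `Entity` class, the `tag` field, the `active_at` helper, and the example prompts and file names are hypothetical illustrations, not the model's actual input format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """One conditioning entry: something that happens over [start_time, end_time)."""
    start_time: float                      # seconds
    end_time: float                        # seconds
    prompt: str
    reference_image: Optional[str] = None  # optional path to a reference image
    tag: Optional[str] = None              # None, "camera", or "transition" (assumed encoding)

    def __post_init__(self):
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be greater than start_time")
        if self.tag not in (None, "camera", "transition"):
            raise ValueError(f"unknown tag: {self.tag!r}")


def active_at(entities, t):
    """Return the entities whose interval contains time t (start inclusive, end exclusive)."""
    return [e for e in entities if e.start_time <= t < e.end_time]


# A subject, a camera move, a cut, and a second shot, all expressed the same way.
scene = [
    Entity(0.0, 4.0, "a woman walks into the cafe", reference_image="woman.png"),
    Entity(0.0, 4.0, "slow pan left", tag="camera"),
    Entity(4.0, 4.1, "hard cut", tag="transition"),
    Entity(4.1, 8.0, "close-up of a coffee cup on the table"),
]
print([e.prompt for e in active_at(scene, 2.0)])
```

Because subjects, camera moves, and transitions share one record type, a single query such as `active_at` covers all of them, which is the uniformity the overview describes.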

For qualitative results and interactive demos, see the project page.

Updates

[Jun 2026] CineBenchSyn benchmark data and evaluation code released.
[Jun 2026] Project page released.

Release Plan

CineBenchSyn benchmark data — sharathgirish/CineBenchSyn
Benchmark evaluation code — see Benchmark Evaluation (eval/)

Benchmark Evaluation

The eval/ directory contains a self-contained pipeline that scores generated videos against the CineBenchSyn conditioning along three axes: subject identity, dense-caption following, and shot-transition timing.

1. Environment

conda env create -f eval/environment.yml
conda activate cinebenchsyn

ffmpeg is included in the environment. All metric models — Grounding DINO, SAM 2, DINOv2, CLIP, ViCLIP, and Qwen2.5-VL — download automatically from the Hugging Face Hub on first use.

2. Get the benchmark data

from huggingface_hub import snapshot_download

# Downloads the dataset (or reuses the local cache) and returns the snapshot path.
data = snapshot_download("sharathgirish/CineBenchSyn", repo_type="dataset")
# <data>/annotations/<id>_ultra_dense.json                 per-scenario conditioning
# <data>/reference_images/<id>_ref_image_NN_<entity>.png   per-entity reference images
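
As a quick sanity check on the download, the sketch below pairs each annotation with its reference images using only the two filename patterns listed above. The `index_benchmark` helper is hypothetical and is not part of `eval/`.

```python
from pathlib import Path


def index_benchmark(data_dir):
    """Map each scenario id to its annotation file and its reference images."""
    root = Path(data_dir)
    index = {}
    suffix = "_ultra_dense.json"
    for ann in sorted((root / "annotations").glob("*" + suffix)):
        scenario_id = ann.name[: -len(suffix)]
        index[scenario_id] = {"annotation": ann, "references": []}
    for ref in sorted((root / "reference_images").glob("*_ref_image_*.png")):
        # Everything before "_ref_image_" is the scenario id.
        scenario_id = ref.name.split("_ref_image_", 1)[0]
        if scenario_id in index:
            index[scenario_id]["references"].append(ref)
    return index
```

Calling `index_benchmark(data)` with the path returned by `snapshot_download` gives a dictionary keyed by scenario id. Reference images with no matching annotation are ignored here.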

3. Generate your videos

Generate one video per scenario with your model. The pipeline expects a directory of <NNNNN>.mp4 files, 1-indexed, where video <NNNNN>.mp4 corresponds to annotation index <NNNNN − 1> (so 00001.mp4 ↔ 00000_ultra_dense.json):

my_videos/
  00001.mp4
  00002.mp4
  ...
  00512.mp4
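
The off-by-one between video names and annotation indices is easy to get wrong, so a small check before running the metrics can help. This is a sketch based only on the naming rule stated above; `annotation_name_for` and `missing_annotations` are hypothetical helpers, and the code assumes every `.mp4` in the directory has a purely numeric name.

```python
from pathlib import Path


def annotation_name_for(video_name):
    """Map a 1-indexed video name to its 0-indexed annotation file name."""
    stem = Path(video_name).stem
    index = int(stem) - 1
    if index < 0:
        raise ValueError(f"video names start at 1, got {video_name!r}")
    # Keep the same zero-padding width as the video name.
    return f"{index:0{len(stem)}d}_ultra_dense.json"


def missing_annotations(videos_dir, annotations_dir):
    """List the videos in videos_dir that have no matching annotation file."""
    ann_dir = Path(annotations_dir)
    return [
        v.name
        for v in sorted(Path(videos_dir).glob("*.mp4"))
        if not (ann_dir / annotation_name_for(v.name)).exists()
    ]


print(annotation_name_for("00001.mp4"))  # 00000_ultra_dense.json
```

An empty list from `missing_annotations("my_videos", annotations_path)` means every generated video lines up with a scenario.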

4. Run the metrics

bash eval/run_eval.sh \
  --videos-dir   my_videos \
  --prompts-dir  "$data/annotations" \
  --refs-dir     "$data/reference_images" \
  --output-dir   results_my_model \
  --name         MyModel \
  --stages       extract_masks,compute_grounding,compute_vlm,compute_viclip,aggregate

Stages run in the order below; select a subset with --stages:

Stage	Metric
`extract_masks`	Grounds + segments each entity in every video (Grounding DINO + SAM 2)
`compute_grounding`	DINO subject identity, CLIP / masked-CLIP caption alignment
`compute_vlm`	Qwen2.5-VL shot-transition-timing recall
`compute_viclip`	ViCLIP dense-caption following (scene / camera / transition)
`aggregate`	Collects per-run scores into `aggregate_metrics.json`
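
The following sketch shows one way a comma-separated `--stages` value can be validated and put into the order of the table above. The stage names are taken from the table; `resolve_stages` is a hypothetical helper, and whether `run_eval.sh` reorders or validates its input this way is an assumption.

```python
STAGE_ORDER = [
    "extract_masks",
    "compute_grounding",
    "compute_vlm",
    "compute_viclip",
    "aggregate",
]


def resolve_stages(stages_arg):
    """Parse a comma-separated stage list into a validated list in table order."""
    requested = {s.strip() for s in stages_arg.split(",") if s.strip()}
    unknown = requested - set(STAGE_ORDER)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    # Iterate over STAGE_ORDER so the result follows the table, not the input order.
    return [s for s in STAGE_ORDER if s in requested]


print(resolve_stages("aggregate,compute_vlm"))  # ['compute_vlm', 'aggregate']
```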

Final scores are written to <output-dir>/aggregate_metrics.json.

The pipeline runs on a single GPU (or CPU) by default. For multi-GPU, launch the per-stage scripts in eval/ with torchrun --nproc_per_node=N (work is split across ranks via a filesystem barrier; no NCCL required). SAM 2 mask propagation is slow and is skipped by default (--skip_mask_tracking) — enable it only if you need the masked-region metric variants.
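
For intuition about how work can be split across ranks without NCCL, here is a minimal sketch of rank-strided sharding with a marker-file barrier. It relies only on the `RANK` and `WORLD_SIZE` environment variables that `torchrun` sets; the `shard` and `filesystem_barrier` helpers and the marker-file naming are assumptions and may differ from what the scripts in eval/ actually do.

```python
import os
import time
from pathlib import Path


def shard(items, rank, world_size):
    """Give each rank every world_size-th item, starting at its own rank."""
    return items[rank::world_size]


def filesystem_barrier(barrier_dir, rank, world_size, poll_s=0.5, timeout_s=3600.0):
    """Block until every rank has written its marker file into barrier_dir."""
    barrier = Path(barrier_dir)
    barrier.mkdir(parents=True, exist_ok=True)
    (barrier / f"rank_{rank}.done").touch()
    deadline = time.monotonic() + timeout_s
    while len(list(barrier.glob("rank_*.done"))) < world_size:
        if time.monotonic() > deadline:
            raise TimeoutError("not all ranks reached the barrier")
        time.sleep(poll_s)


# torchrun exports RANK and WORLD_SIZE; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
videos = [f"{i:05d}.mp4" for i in range(1, 9)]
print(rank, shard(videos, rank, world_size))
```

A barrier like this needs a directory on a filesystem that all ranks can see, and the directory must be empty at the start of a run, since marker files left over from a previous run would let ranks pass too early.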

License

This repository is released under the MIT License.
