snap-research/CineOrchestra


CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

📄 Paper  |  🌐 Project Page

Sharath Girish¹*, Tsai-Shien Chen¹,²*, Zhikang Dong¹, Mukesh Singhal², Hao Chen¹, Sergey Tulyakov¹, Aliaksandr Siarohin¹
¹Snap Inc.   ²UC Merced   *Equal contribution


[Teaser video: CineOrchestra teaser, available on the project page]

Overview

CineOrchestra is a unified video diffusion model for cinematic video generation that jointly controls subjects, events, camera, and shot transitions in a single forward pass — the first framework to do so.

The core insight: every cinematic element — a character acting, a camera pan, a hard cut — is an entity acting over a temporal interval. We represent all of them with one shared primitive: (start_time, end_time, prompt, [reference_image]), attaching a special {camera} or {transition} tag where needed. This reduces the architectural problem to positional encoding, solved by two coordinated RoPEs:

  • Interval-sampled temporal RoPE — consistent attention across events ranging from 0.1s cuts to 10s camera moves
  • 2D entity-temporal cross-attention RoPE — disambiguates per-entity conditions and routes each to its spatiotemporal target
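
As an illustration of the shared primitive described above, here is a minimal Python sketch of one conditioning entity. It is not the repository's data format: the class name `Entity`, the `tag` field, the use of seconds, and the `active_at` helper are all assumptions made for this example, and the sample prompts and image names are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    """One conditioning entity: (start_time, end_time, prompt, [reference_image])."""
    start_time: float                       # assumed to be in seconds
    end_time: float                         # assumed to be in seconds
    prompt: str
    reference_image: Optional[str] = None   # optional path to a reference image
    tag: Optional[str] = None               # hypothetical field: "camera", "transition", or None

    def __post_init__(self):
        if self.end_time < self.start_time:
            raise ValueError("end_time must not precede start_time")


def active_at(entities: List[Entity], t: float) -> List[Entity]:
    """Return the entities whose interval [start_time, end_time] contains t."""
    return [e for e in entities if e.start_time <= t <= e.end_time]


# Invented example: a subject, a camera move, and a hard cut share one representation.
scene = [
    Entity(0.0, 5.0, "a knight walks across the courtyard", "knight.png"),
    Entity(0.0, 3.0, "slow pan left", tag="camera"),
    Entity(3.0, 3.1, "hard cut", tag="transition"),
    Entity(3.1, 8.0, "the knight draws a sword", "knight.png"),
]
```

Calling `active_at(scene, 3.05)` returns the first subject entity and the hard cut, which shows how a subject, a camera move, and a transition can be queried through the same interval interface.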

For qualitative results and interactive demos, see the project page.

Updates

  • [Jun 2026] CineBenchSyn benchmark data and evaluation code released.
  • [Jun 2026] Project page released.

Release Plan

Benchmark Evaluation

The eval/ directory contains a self-contained pipeline that scores generated videos against the CineBenchSyn conditioning along three axes: subject identity, dense-caption following, and shot-transition timing.

1. Environment

conda env create -f eval/environment.yml
conda activate cinebenchsyn

ffmpeg is included in the environment. All metric models — Grounding DINO, SAM 2, DINOv2, CLIP, ViCLIP, and Qwen2.5-VL — download automatically from the Hugging Face Hub on first use.

2. Get the benchmark data

from huggingface_hub import snapshot_download
data = snapshot_download("sharathgirish/CineBenchSyn", repo_type="dataset")
# data/annotations/<id>_ultra_dense.json   — per-scenario conditioning
# data/reference_images/<id>_ref_image_NN_<entity>.png
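
To pair reference images with their scenarios, the file-name pattern in the comment above can be parsed directly. The sketch below assumes that `<id>` and `NN` are numeric and that `<entity>` may itself contain underscores; the helper names `parse_ref_name` and `refs_by_scenario` are hypothetical and not part of the released code.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern: <id>_ref_image_<NN>_<entity>.png, with numeric <id> and <NN>.
REF_RE = re.compile(r"^(?P<id>\d+)_ref_image_(?P<idx>\d+)_(?P<entity>.+)\.png$")


def parse_ref_name(name):
    """Split a reference-image file name into (scenario_id, index, entity), or None."""
    m = REF_RE.match(name)
    if m is None:
        return None
    return m.group("id"), int(m.group("idx")), m.group("entity")


def refs_by_scenario(ref_dir):
    """Group reference-image paths by scenario id, skipping names that do not match."""
    groups = defaultdict(list)
    for path in sorted(Path(ref_dir).glob("*.png")):
        parsed = parse_ref_name(path.name)
        if parsed is not None:
            groups[parsed[0]].append(path)
    return dict(groups)
```

With the snapshot path from the download step, `refs_by_scenario(Path(data) / "reference_images")` would return a dictionary keyed by scenario id.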

3. Generate your videos

Generate one video per scenario with your model. The pipeline expects a directory of <NNNNN>.mp4 files, 1-indexed, where video <NNNNN>.mp4 corresponds to annotation index <NNNNN − 1> (so 00001.mp4 pairs with 00000_ultra_dense.json):

my_videos/
  00001.mp4
  00002.mp4
  ...
  00512.mp4
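
Because videos are 1-indexed and annotations are 0-indexed, an off-by-one mistake is easy to make. The following sketch checks the pairing before running the metrics. The function names are hypothetical, and the five-digit zero padding of annotation ids is assumed from the 00000_ultra_dense.json example above.

```python
from pathlib import Path


def annotation_name(video_name):
    """Map a 1-indexed video file name to its 0-indexed annotation file name."""
    index = int(Path(video_name).stem)          # "00001.mp4" -> 1
    if index < 1:
        raise ValueError("video files are 1-indexed")
    return f"{index - 1:05d}_ultra_dense.json"  # 1 -> "00000_ultra_dense.json"


def missing_annotations(videos_dir, annotations_dir):
    """List videos in videos_dir that have no matching annotation file."""
    annotations_dir = Path(annotations_dir)
    return [
        v.name
        for v in sorted(Path(videos_dir).glob("*.mp4"))
        if not (annotations_dir / annotation_name(v.name)).exists()
    ]
```

An empty list from `missing_annotations("my_videos", Path(data) / "annotations")` suggests every video has a counterpart. A video whose stem is not a number raises `ValueError`, which flags misnamed files early.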

4. Run the metrics

bash eval/run_eval.sh \
  --videos-dir   my_videos \
  --prompts-dir  "$data/annotations" \
  --refs-dir     "$data/reference_images" \
  --output-dir   results_my_model \
  --name         MyModel \
  --stages       extract_masks,compute_grounding,compute_vlm,compute_viclip,aggregate

Stages run in the order below; select a subset with --stages:

Stage              Metric
extract_masks      Grounds and segments each entity in every video (Grounding DINO + SAM 2)
compute_grounding  DINO subject identity, CLIP / masked-CLIP caption alignment
compute_vlm        Qwen2.5-VL shot-transition-timing recall
compute_viclip     ViCLIP dense-caption following (scene / camera / transition)
aggregate          Collects per-run scores into aggregate_metrics.json

Final scores are written to <output-dir>/aggregate_metrics.json.
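
The layout of aggregate_metrics.json is not documented here, so the sketch below makes no assumption about its keys. It flattens whatever nested dictionaries the file contains and prints each leaf value. The helpers `flatten` and `print_scores` are hypothetical conveniences, not part of the eval/ pipeline.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs; non-dict values are leaves."""
    if not isinstance(obj, dict):
        return {prefix: obj}
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, name))
    return flat


def print_scores(path):
    """Print every leaf value in a metrics JSON file, one per line, sorted by name."""
    with open(path) as f:
        scores = json.load(f)
    for name, value in sorted(flatten(scores).items()):
        print(f"{name}: {value}")
```

For the command shown earlier, `print_scores("results_my_model/aggregate_metrics.json")` would list the scores once the aggregate stage has finished.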

The pipeline runs on a single GPU (or CPU) by default. For multi-GPU, launch the per-stage scripts in eval/ with torchrun --nproc_per_node=N (work is split across ranks via a filesystem barrier; no NCCL required). SAM 2 mask propagation is slow and is skipped by default (--skip_mask_tracking) — enable it only if you need the masked-region metric variants.
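
To make the idea of splitting work across ranks with a filesystem barrier concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the code in eval/: the function names, the marker-file naming, and the polling interval are all hypothetical. The only external fact it relies on is that torchrun exports the `RANK` and `WORLD_SIZE` environment variables to each process.

```python
import os
import time
from pathlib import Path


def shard(items, rank, world_size):
    """Give each rank every world_size-th item, starting at its own rank."""
    return items[rank::world_size]


def filesystem_barrier(sync_dir, rank, world_size, poll_seconds=1.0):
    """Block until every rank has written its marker file into sync_dir."""
    sync_dir = Path(sync_dir)
    sync_dir.mkdir(parents=True, exist_ok=True)
    (sync_dir / f"rank_{rank}.done").touch()
    # Poll the directory until one marker per rank is present.
    while len(list(sync_dir.glob("rank_*.done"))) < world_size:
        time.sleep(poll_seconds)


# torchrun exports RANK and WORLD_SIZE; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
```

Each process would handle `shard(all_videos, rank, world_size)` and then call `filesystem_barrier(...)` before any step that needs all results. Marker files left over from an earlier run would release the barrier too early, so a scheme like this needs a fresh sync directory for each run.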

License

This repository is released under the MIT License.
