MoChaBench

This repo contains the benchmark, the standard evaluation codebase, and MoCha's generation results for MoCha: Towards Movie-Grade Talking Character Synthesis.




Many thanks to the community for sharing: an emotional narrative, created with light manual editing of clips generated by MoCha, has surpassed 1 million views on X.



πŸ† MoChaBench Leaderboard

🧑 Single-Character Monologue (English)

Including categories: 1p_camera_movement, 1p_closeup_facingcamera, 1p_emotion, 1p_mediumshot_actioncontrol, 1p_protrait, 2p_1clip_1talk

| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|---|---|---|
| MoCha | 6.333 | 8.185 |
| Hallo3 | 4.866 | 8.963 |
| SadTalker | 4.727 | 9.239 |
| AniPortrait | 1.740 | 11.383 |

👥 Multi-Character Turn-based Dialogue (English)

Including categories: 2p_2clip_2talk

| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|---|---|---|
| MoCha | 4.951 | 8.601 |

Per-Category Averages

| Category | Model | Sync-Dist. ↓ | Sync-Conf. ↑ | Examples (n) |
|---|---|---|---|---|
| 1p_camera_movement | MoCha | 8.455 | 5.432 | 18 |
| 1p_closeup_facingcamera | MoCha | 7.958 | 6.298 | 27 |
| 1p_emotion | MoCha | 8.073 | 6.214 | 34 |
| 1p_generalize_chinese | MoCha | 8.273 | 4.398 | 4 |
| 1p_mediumshot_actioncontrol | MoCha | 8.386 | 6.241 | 52 |
| 1p_protrait | MoCha | 8.125 | 6.892 | 38 |
| 2p_1clip_1talk | MoCha | 8.082 | 6.493 | 30 |
| 2p_2clip_2talk | MoCha | 8.601 | 4.951 | 15 |

▶️ Evaluating Lip Sync Scores

Overview

We use SyncNet for evaluation. The codebase is adapted from joonson/syncnet_python, with an improved code structure and a unified API to facilitate evaluation for the community.

The implementation follows a Hugging Face Diffusers-style structure. We provide a SyncNetPipeline class, located at eval-lipsync/script/syncnet_pipeline.py.

You can initialize SyncNetPipeline by providing the weights and configs:

pipe = SyncNetPipeline(
    {
        "s3fd_weights":  "path to sfd_face.pth",
        "syncnet_weights": "path to syncnet_v2.model",
    },
    device="cuda",          # or "cpu"
)

The pipeline offers an inference function to score a single pair of video and speech. For a fair comparison, the input speech should be a denoised vocal source extracted from your audio. You can use a separator like Kim_Vocal_2 for general noise removal and Demucs_mdx_extra for music removal (see the sketch after the code block below).

av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
    video_path="path to video.mp4",    # RGB video
    audio_path="path to speech.wav",   # speech track (denoised from audio; any ffmpeg-readable format)
    cache_dir="path to store intermediate output",  # optional; omit to auto-clean intermediates
)
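
One way to obtain such a denoised speech track is the Demucs CLI. A minimal sketch, assuming demucs is installed (pip install demucs); the input path is a placeholder:

# --two-stems=vocals keeps only the vocals/no_vocals split;
# -n mdx_extra selects the mdx_extra model.
# Extracted vocals land under separated/mdx_extra/<track>/vocals.wav
python -m demucs --two-stems=vocals -n mdx_extra "path/to/audio.wav"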

Benchmark files

We provide the benchmark files in the benchmark/ directory, organized by data type and category.

Each file follows the structure:
benchmark/<data_type>/<category>/<context_id>.<ext>

Directory Structure
├─benchmark
│  ├─audios
│  │  ├─1p_camera_movement
│  │  │  ├─ 10_man_basketball_camera_push_in.wav
│  │  │  ...
│  │  ├─1p_closeup_facingcamera
│  │  ├─1p_emotion
│  │  ├─1p_generalize_chinese
│  │  ├─1p_mediumshot_actioncontrol
│  │  ├─1p_protrait
│  │  ├─2p_1clip_1talk
│  │  └─2p_2clip_2talk
│  ├─first-frames-from-mocha-generation
│  │  ├─1p_camera_movement
│  │  │  ├─ 10_man_basketball_camera_push_in.png
│  │  │  ...
│  │  ├─1p_closeup_facingcamera
│  │  ├─1p_emotion
│  │  ├─1p_generalize_chinese
│  │  ├─1p_mediumshot_actioncontrol
│  │  ├─1p_protrait
│  │  ├─2p_1clip_1talk
│  │  └─2p_2clip_2talk
│  └─speeches
│      ├─1p_camera_movement
│      │  ├─ 10_man_basketball_camera_push_in_speech.wav
│      │  ...
│      ├─1p_closeup_facingcamera
│      ├─1p_emotion
│      ├─1p_generalize_chinese
│      ├─1p_mediumshot_actioncontrol
│      ├─1p_protrait
│      ├─2p_1clip_1talk
│      └─2p_2clip_2talk
└─benchmark.csv
  • benchmark.csv contains metadata for each sample, with each row specifying:
    idx_in_category, category, context_id, prompt.

    We use benchmark.csv to connect files. Any file in the benchmark can be located via the combination:
    /benchmark/<data_type>/<category>/<context_id>.<ext>
    (see the path-resolution sketch after this list).

  • The speeches files are generated from the audios files using Demucs_mdx_extra. For a fair comparison, speeches (rather than audios) should also be used as the input to your own model.

  • We also provide first-frames-from-mocha-generation to facilitate fair comparison for (image + text + audio → video) models.
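
For illustration, a minimal path-resolution sketch in Python. It assumes benchmark.csv sits next to the benchmark/ directory as in the tree above, and that speech files carry the _speech suffix shown there:

import csv
from pathlib import Path

root = Path("benchmark")
with open("benchmark.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        category, context_id = row["category"], row["context_id"]
        speech_fp = root / "speeches" / category / f"{context_id}_speech.wav"  # speech input (denoised)
        frame_fp = root / "first-frames-from-mocha-generation" / category / f"{context_id}.png"  # image input
        prompt = row["prompt"]  # text input
        # ... generate a video from (frame_fp, prompt, speech_fp) with your model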

How to Use

Download this repo

The SyncNet weights, the benchmark, and MoCha's generation results are embedded in this Git repo:

git clone https://github.com/congwei1230/MoChaBench.git

Dependencies

conda create -n mochabench_eval python=3.8
conda activate mochabench_eval
cd eval-lipsync
pip install -r requirements.txt
# require ffmpeg installed

Example script to run SyncNetPipeline on a single pair of (video, speech)

cd script
python run_syncnet_pipeline_on_1example.py

You should get values close to the following (±0.1, depending on your ffmpeg build; we used ffmpeg version 7.1.1-essentials_build-www.gyan.dev):

AV offset:      1
Min dist:       9.255
Confidence:     4.497
best-confidence   : 4.4973907470703125
lowest distance   : 9.255396842956543
per-crop offsets  : [1]

Running SyncNetPipeline on MoCha-Generated Videos for MoChaBench Evaluation

We provide the MoCha-generated videos in the mocha-generation/ directory.

Each file follows the structure:
mocha-generation/<category>/<context_id>.<ext>

mocha-generation
    ├─1p_camera_movement
    │  ├─ 10_man_basketball_camera_push_in.mp4
    │  ...
    ├─1p_closeup_facingcamera
    ├─1p_emotion
    ├─1p_generalize_chinese
    ├─1p_mediumshot_actioncontrol
    ├─1p_protrait
    ├─2p_1clip_1talk
    └─2p_2clip_2talk

To evaluate the results, simply run the pipeline below.
This script will print the score for each category, as well as the average scores for Monologue and Dialogue. It will also output a CSV file at eval-lipsync/mocha-eval-results/sync_scores.csv, recording each example’s score.

cd eval-lipsync/script
python run_syncnet_pipeline_on_mocha_generation_on_mocha_bench.py

Running SyncNetPipeline on Your Model’s Outputs for MoChaBench

To evaluate your own model’s outputs with MoChaBench, first use the following inputs to generate videos:

  • Speech input: benchmark/speeches
  • Text input: prompt from benchmark.csv
  • Image input: benchmark/first-frames-from-mocha-generation (if your model requires an image condition)

You can also use our Hugging Face version to generate videos.

Then organize your generated videos in a folder that matches the structure of mocha-generation/:

<your_outputs_dir>/
    ├─ 1p_camera_movement/
    │   ├─ 10_man_basketball_camera_push_in.mp4
    │   ...
    ├─ 1p_closeup_facingcamera/
    ├─ 1p_emotion/
    ├─ 1p_generalize_chinese/
    ├─ 1p_mediumshot_actioncontrol/
    ├─ 1p_protrait/
    ├─ 2p_1clip_1talk/
    └─ 2p_2clip_2talk/

Each video should be named <context_id>.mp4 within the corresponding category folder. You don't need to provide an mp4 for every category; the script will skip any missing videos and report scores for the rest. A quick sanity-check sketch follows below.
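
To sanity-check the layout before evaluating, a minimal sketch (your_outputs_dir is a placeholder; benchmark.csv as above):

import csv
from pathlib import Path

outputs = Path("your_outputs_dir")  # placeholder: root of your generated videos
with open("benchmark.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        video_fp = outputs / row["category"] / f"{row['context_id']}.mp4"
        if not video_fp.exists():
            print(f"missing (will be skipped): {video_fp}")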

Next, modify the script run_syncnet_pipeline_on_your_own_model_results.py to point to your video folder.

Then, run:

cd eval-lipsync/script
python run_syncnet_pipeline_on_your_own_model_results.py

The script will output a CSV file at eval-lipsync/your-own-model-eval-results/sync_scores.csv with the evaluation scores for each example.

🧩 Custom Benchmark Evaluation

Since our pipeline provides an API to score a pair of (video, audio), you can easily adapt it for other benchmark datasets by looping through your examples:

# Loop through your dataset (pipe initialized as shown above)
for example in dataset:
    video_fp = example["video_path"]
    audio_fp = example["audio_path"]
    context_id = example["context_id"]

    av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
        video_path=str(video_fp),
        audio_path=str(audio_fp),
        cache_dir="YOUR INPUT",
    )
    # Store or process the results as needed

# After processing all samples, compute average results.
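
A minimal aggregation sketch for that last step, assuming you collected best_conf and min_dist into two lists inside the loop:

# Assumes inside the loop you did: confs.append(best_conf); dists.append(min_dist)
avg_conf = sum(confs) / len(confs)  # Sync-Conf.: higher is better
avg_dist = sum(dists) / len(dists)  # Sync-Dist.: lower is better
print(f"Sync-Conf. (avg): {avg_conf:.3f} | Sync-Dist. (avg): {avg_dist:.3f}")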

▶️ Evaluating VIEScore

Evaluating with GPT-4o

We provide example scripts for running GPT-4o-based evaluation on 20 examples from MoChaBench, covering 4 models and 4 evaluation aspects.

conda activate mochabench_eval
pip install openai opencv-python
cd eval-viescore
python eval_gpt_viescore.py
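
For orientation only, here is a minimal sketch of the kind of request such a script issues, scoring one sampled frame with GPT-4o through the openai client. The prompt text, frame sampling, and file path are illustrative assumptions, not the actual logic of eval_gpt_viescore.py:

import base64
import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Grab one frame from a generated video and encode it as a data URI.
cap = cv2.VideoCapture("generated_video.mp4")  # placeholder path
ok, frame = cap.read()
cap.release()
_, buf = cv2.imencode(".jpg", frame)
b64 = base64.b64encode(buf.tobytes()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Rate the visual quality of this frame from 0 to 10. Reply with a number."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)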

Evaluating Alignment with Human Ratings

We also provide a script to compute the agreement between GPT-4o scores and human majority vote ratings:

conda activate mochabench_eval
pip install scikit-learn
cd eval-viescore
python compute_alignment.py

This script outputs alignment metrics (QWK, Spearman ρ, Footrule, MAE) for each aspect and overall.
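
For reference, a minimal sketch of how these four metrics can be computed between GPT-4o scores and human majority votes, using scikit-learn and scipy; the score arrays are illustrative, and compute_alignment.py may differ in details:

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

gpt = np.array([3, 4, 2, 5, 4])    # illustrative GPT-4o scores
human = np.array([3, 5, 2, 4, 4])  # illustrative human majority votes

qwk = cohen_kappa_score(gpt, human, weights="quadratic")  # quadratic-weighted kappa
rho, _ = spearmanr(gpt, human)                            # Spearman rank correlation
ranks_g = np.argsort(np.argsort(gpt))                     # 0-based ranks
ranks_h = np.argsort(np.argsort(human))
footrule = np.abs(ranks_g - ranks_h).sum()                # Spearman footrule on ranks
mae = mean_absolute_error(human, gpt)
print(qwk, rho, footrule, mae)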

📚 Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{wei2025mocha,
  title={MoCha: Towards Movie-Grade Talking Character Synthesis},
  author={Wei, Cong and Sun, Bo and Ma, Haoyu and Hou, Ji and Juefei-Xu, Felix and He, Zecheng and Dai, Xiaoliang and Zhang, Luxin and Li, Kunpeng and Hou, Tingbo and others},
  journal={arXiv preprint arXiv:2503.23307},
  year={2025}
}
