MoChaBench

This repo contains the benchmark, the standard evaluation codebase, and MoCha's generation results for MoCha: Towards Movie-Grade Talking Character Synthesis.




Many thanks to the community for sharing: an emotional narrative, created with light manual editing of clips generated by MoCha, has surpassed 1 million views on X.



πŸ† MoChaBench Leaderboard

🧑 Single-Character Monologue (English)

Including categories: 1p_camera_movement, 1p_closeup_facingcamera, 1p_emotion, 1p_mediumshot_actioncontrol, 1p_protrait, 2p_1clip_1talk

| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|---|---|---|
| MoCha | 6.333 | 8.185 |
| Hallo3 | 4.866 | 8.963 |
| SadTalker | 4.727 | 9.239 |
| AniPortrait | 1.740 | 11.383 |

👥 Multi-Character Turn-based Dialogue (English)

Including categories: 2p_2clip_2talk

| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|---|---|---|
| MoCha | 4.951 | 8.601 |

Per-Category Averages

| Category | Model | Sync-Dist. ↓ | Sync-Conf. ↑ | Examples (n) |
|---|---|---|---|---|
| 1p_camera_movement | MoCha | 8.455 | 5.432 | 18 |
| 1p_closeup_facingcamera | MoCha | 7.958 | 6.298 | 27 |
| 1p_emotion | MoCha | 8.073 | 6.214 | 34 |
| 1p_generalize_chinese | MoCha | 8.273 | 4.398 | 4 |
| 1p_mediumshot_actioncontrol | MoCha | 8.386 | 6.241 | 52 |
| 1p_protrait | MoCha | 8.125 | 6.892 | 38 |
| 2p_1clip_1talk | MoCha | 8.082 | 6.493 | 30 |
| 2p_2clip_2talk | MoCha | 8.601 | 4.951 | 15 |

▶️ Evaluating Lip Sync Scores

Overview

We use SyncNet for evaluation. The codebase is adapted from joonson/syncnet_python, with an improved code structure and a unified API to facilitate evaluation for the community.

The implementation follows a Hugging Face Diffusers-style structure. We provide a SyncNetPipeline class, located at eval-lipsync/script/syncnet_pipeline.py.

You can initialize SyncNetPipeline by providing the weights and configs:

pipe = SyncNetPipeline(
    {
        "s3fd_weights":  "path to sfd_face.pth",
        "syncnet_weights": "path to syncnet_v2.model",
    },
    device="cuda",          # or "cpu"
)

The pipeline offers an inference function to score a single pair of video and speech. For a fair comparison, the input speech should be a denoised vocal source extracted from your audio. You can use a separator like Kim_Vocal_2 for general noise removal and Demucs_mdx_extra for music removal (see the sketch after the code block below).

av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
    video_path="path to video.mp4",    # RGB video
    audio_path="path to speech.wav",   # speech track (denoised from audio; any ffmpeg-readable format)
    cache_dir="path to store intermediate output",  # optional; omit to auto-clean intermediates
)
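
One way to obtain such a denoised speech track is the Demucs CLI. A minimal sketch, assuming demucs is installed (pip install demucs); the input path is a placeholder:

# --two-stems=vocals keeps only the vocals/no_vocals split;
# -n mdx_extra selects the mdx_extra model.
# Extracted vocals land under separated/mdx_extra/<track>/vocals.wav
python -m demucs --two-stems=vocals -n mdx_extra "path/to/audio.wav"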

Benchmark files

We provide the benchmark files in the benchmark/ directory, organized by data type and category.

Each file follows the structure:
benchmark/<data_type>/<category>/<context_id>.<ext>

Directory Structure
├─benchmark
│  ├─audios
│  │  ├─1p_camera_movement
│  │  │  ├─ 10_man_basketball_camera_push_in.wav
│  │  │  ...
│  │  ├─1p_closeup_facingcamera
│  │  ├─1p_emotion
│  │  ├─1p_generalize_chinese
│  │  ├─1p_mediumshot_actioncontrol
│  │  ├─1p_protrait
│  │  ├─2p_1clip_1talk
│  │  └─2p_2clip_2talk
│  ├─first-frames-from-mocha-generation
│  │  ├─1p_camera_movement
│  │  │  ├─ 10_man_basketball_camera_push_in.png
│  │  │  ...
│  │  ├─1p_closeup_facingcamera
│  │  ├─1p_emotion
│  │  ├─1p_generalize_chinese
│  │  ├─1p_mediumshot_actioncontrol
│  │  ├─1p_protrait
│  │  ├─2p_1clip_1talk
│  │  └─2p_2clip_2talk
│  └─speeches
│      ├─1p_camera_movement
│      │  ├─ 10_man_basketball_camera_push_in_speech.wav
│      │  ...
│      ├─1p_closeup_facingcamera
│      ├─1p_emotion
│      ├─1p_generalize_chinese
│      ├─1p_mediumshot_actioncontrol
│      ├─1p_protrait
│      ├─2p_1clip_1talk
│      └─2p_2clip_2talk
└─benchmark.csv
  • benchmark.csv contains metadata for each sample, with each row specifying:
    idx_in_category, category, context_id, prompt.

    We use benchmark.csv to connect files. Any file in the benchmark can be located via the combination:
    /benchmark/<data_type>/<category>/<context_id>.<ext>
    (see the path-resolution sketch after this list).

  • The speeches files are generated from the audios files using Demucs_mdx_extra. For a fair comparison, speeches (rather than audios) should also be used as the input to your own model.

  • We also provide first-frames-from-mocha-generation to facilitate fair comparison for (image + text + audio → video) models.
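
For illustration, a minimal path-resolution sketch in Python. It assumes benchmark.csv sits next to the benchmark/ directory as in the tree above, and that speech files carry the _speech suffix shown there:

import csv
from pathlib import Path

root = Path("benchmark")
with open("benchmark.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        category, context_id = row["category"], row["context_id"]
        speech_fp = root / "speeches" / category / f"{context_id}_speech.wav"  # speech input (denoised)
        frame_fp = root / "first-frames-from-mocha-generation" / category / f"{context_id}.png"  # image input
        prompt = row["prompt"]  # text input
        # ... generate a video from (frame_fp, prompt, speech_fp) with your model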

How to Use

Download this repo

The SyncNet weights, the benchmark, and MoCha's generation results are embedded in this Git repo:

git clone https://github.com/congwei1230/MoChaBench.git

Dependencies

conda create -n mochabench_eval python=3.8
conda activate mochabench_eval
cd eval-lipsync
pip install -r requirements.txt
# require ffmpeg installed

Example script to run SyncNetPipeline on a single pair of (video, speech)

cd script
python run_syncnet_pipeline_on_1example.py

You should get values close to the following (±0.1, depending on your ffmpeg build; we used ffmpeg version 7.1.1-essentials_build-www.gyan.dev):

AV offset:      1
Min dist:       9.255
Confidence:     4.497
best-confidence   : 4.4973907470703125
lowest distance   : 9.255396842956543
per-crop offsets  : [1]

Running SyncNetPipeline on MoCha-Generated Videos for MoChaBench Evaluation

We provide the MoCha-generated videos in the mocha-generation/ directory.

Each file follows the structure:
mocha-generation/<category>/<context_id>.<ext>

mocha-generation
    ├─1p_camera_movement
    │  ├─ 10_man_basketball_camera_push_in.mp4
    │  ...
    ├─1p_closeup_facingcamera
    ├─1p_emotion
    ├─1p_generalize_chinese
    ├─1p_mediumshot_actioncontrol
    ├─1p_protrait
    ├─2p_1clip_1talk
    └─2p_2clip_2talk

To evaluate the results, simply run the pipeline below.
This script will print the score for each category, as well as the average scores for Monologue and Dialogue. It will also output a CSV file at eval-lipsync/mocha-eval-results/sync_scores.csv, recording each example’s score.

cd eval-lipsync/script
python run_syncnet_pipeline_on_mocha_generation_on_mocha_bench.py

Running SyncNetPipeline on Your Model’s Outputs for MoChaBench

To evaluate your own model’s outputs with MoChaBench, first use the following inputs to generate videos:

  • Speech input: benchmark/speeches
  • Text input: prompt from benchmark.csv
  • Image input: benchmark/first-frames-from-mocha-generation (if your model requires an image condition)

You can also use our Hugging Face version to generate videos.

Then organize your generated videos in a folder that matches the structure of mocha-generation/:

<your_outputs_dir>/
    ├─ 1p_camera_movement/
    │   ├─ 10_man_basketball_camera_push_in.mp4
    │   ...
    ├─ 1p_closeup_facingcamera/
    ├─ 1p_emotion/
    ├─ 1p_generalize_chinese/
    ├─ 1p_mediumshot_actioncontrol/
    ├─ 1p_protrait/
    ├─ 2p_1clip_1talk/
    └─ 2p_2clip_2talk/

Each video should be named <context_id>.mp4 within the corresponding category folder. You don't need to provide an mp4 for every category; the script will skip any missing videos and report scores for the rest. A quick sanity-check sketch follows below.
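
To sanity-check the layout before evaluating, a minimal sketch (your_outputs_dir is a placeholder; benchmark.csv as above):

import csv
from pathlib import Path

outputs = Path("your_outputs_dir")  # placeholder: root of your generated videos
with open("benchmark.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        video_fp = outputs / row["category"] / f"{row['context_id']}.mp4"
        if not video_fp.exists():
            print(f"missing (will be skipped): {video_fp}")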

Next, modify the script run_syncnet_pipeline_on_your_own_model_results.py to point to your video folder.

Then, run:

cd eval-lipsync/script
python run_syncnet_pipeline_on_your_own_model_results.py

The script will output a CSV file at eval-lipsync/your-own-model-eval-results/sync_scores.csv with the evaluation scores for each example.

🧩 Custom Benchmark Evaluation

Since our pipeline provides an API to score a pair of (video, audio), you can easily adapt it for other benchmark datasets by looping through your examples:

# Loop through your dataset (pipe initialized as shown above)
for example in dataset:
    video_fp = example["video_path"]
    audio_fp = example["audio_path"]
    context_id = example["context_id"]

    av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
        video_path=str(video_fp),
        audio_path=str(audio_fp),
        cache_dir="YOUR INPUT",
    )
    # Store or process the results as needed

# After processing all samples, compute average results.
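
A minimal aggregation sketch for that last step, assuming you collected best_conf and min_dist into two lists inside the loop:

# Assumes inside the loop you did: confs.append(best_conf); dists.append(min_dist)
avg_conf = sum(confs) / len(confs)  # Sync-Conf.: higher is better
avg_dist = sum(dists) / len(dists)  # Sync-Dist.: lower is better
print(f"Sync-Conf. (avg): {avg_conf:.3f} | Sync-Dist. (avg): {avg_dist:.3f}")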

▶️ Evaluating VIEScore

Evaluating with GPT-4o

We provide example scripts for running GPT-4o-based evaluation on 20 examples from MoChaBench, covering 4 models and 4 evaluation aspects.

conda activate mochabench_eval
pip install openai opencv-python
cd eval-viescore
python eval_gpt_viescore.py
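
For orientation only, here is a minimal sketch of the kind of request such a script issues, scoring one sampled frame with GPT-4o through the openai client. The prompt text, frame sampling, and file path are illustrative assumptions, not the actual logic of eval_gpt_viescore.py:

import base64
import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Grab one frame from a generated video and encode it as a data URI.
cap = cv2.VideoCapture("generated_video.mp4")  # placeholder path
ok, frame = cap.read()
cap.release()
_, buf = cv2.imencode(".jpg", frame)
b64 = base64.b64encode(buf.tobytes()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Rate the visual quality of this frame from 0 to 10. Reply with a number."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)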

Evaluating Alignment with Human Ratings

We also provide a script to compute the agreement between GPT-4o scores and human majority vote ratings:

conda activate mochabench_eval
pip install scikit-learn
cd eval-viescore
python compute_alignment.py

This script outputs alignment metrics (QWK, Spearman ρ, Footrule, MAE) for each aspect and overall.
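
For reference, a minimal sketch of how these four metrics can be computed between GPT-4o scores and human majority votes, using scikit-learn and scipy; the score arrays are illustrative, and compute_alignment.py may differ in details:

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

gpt = np.array([3, 4, 2, 5, 4])    # illustrative GPT-4o scores
human = np.array([3, 5, 2, 4, 4])  # illustrative human majority votes

qwk = cohen_kappa_score(gpt, human, weights="quadratic")  # quadratic-weighted kappa
rho, _ = spearmanr(gpt, human)                            # Spearman rank correlation
ranks_g = np.argsort(np.argsort(gpt))                     # 0-based ranks
ranks_h = np.argsort(np.argsort(human))
footrule = np.abs(ranks_g - ranks_h).sum()                # Spearman footrule on ranks
mae = mean_absolute_error(human, gpt)
print(qwk, rho, footrule, mae)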

📚 Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{wei2025mocha,
  title={MoCha: Towards Movie-Grade Talking Character Synthesis},
  author={Wei, Cong and Sun, Bo and Ma, Haoyu and Hou, Ji and Juefei-Xu, Felix and He, Zecheng and Dai, Xiaoliang and Zhang, Luxin and Li, Kunpeng and Hou, Tingbo and others},
  journal={arXiv preprint arXiv:2503.23307},
  year={2025}
}
