This repo contains the benchmark, standard evaluation codebase, and MoCha's generation results for *MoCha: Towards Movie-Grade Talking Character Synthesis*.
Many thanks to the community for sharing!
An emotional narrative, created with light manual editing on clips generated by MoCha, has surpassed 1 million views on X.
- [2025-12-27]: Released a demo implementation built on HunyuanVideo (Checkpoints and Code)
- MoChaBench Leaderboard
- Evaluating Lip Sync Scores
- Evaluating VIEScore
- Citation
Including categories: 1p_camera_movement, 1p_closeup_facingcamera, 1p_emotion, 1p_mediumshot_actioncontrol, 1p_portrait, 2p_1clip_1talk
| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|---|---|---|
| MoCha | 6.333 | 8.185 |
| Hallo3 | 4.866 | 8.963 |
| SadTalker | 4.727 | 9.239 |
| AniPortrait | 1.740 | 11.383 |
Including categories: 2p_2clip_2talk
| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|---|---|---|
| MoCha | 4.951 | 8.601 |
| Category | Model | Sync-Dist. ↓ | Sync-Conf. ↑ | Examples (n) |
|---|---|---|---|---|
| 1p_camera_movement | MoCha | 8.455 | 5.432 | 18 |
| 1p_closeup_facingcamera | MoCha | 7.958 | 6.298 | 27 |
| 1p_emotion | MoCha | 8.073 | 6.214 | 34 |
| 1p_generalize_chinese | MoCha | 8.273 | 4.398 | 4 |
| 1p_mediumshot_actioncontrol | MoCha | 8.386 | 6.241 | 52 |
| 1p_protrait | MoCha | 8.125 | 6.892 | 38 |
| 2p_1clip_1talk | MoCha | 8.082 | 6.493 | 30 |
| 2p_2clip_2talk | MoCha | 8.601 | 4.951 | 15 |
We use SyncNet for evaluation. The codebase is adapted from joonson/syncnet_python with improved code structure and a unified API to facilitate evaluation for the community.
The implementation follows a Hugging Face Diffusers-style structure. We provide a `SyncNetPipeline` class, located at `eval-lipsync/script/syncnet_pipeline.py`.
You can initialize `SyncNetPipeline` by providing the weights and configs:
```python
pipe = SyncNetPipeline(
    {
        "s3fd_weights": "path to sfd_face.pth",
        "syncnet_weights": "path to syncnet_v2.model",
    },
    device="cuda",  # or "cpu"
)
```

The pipeline offers an `inference` function to score a single pair of video and speech. For fair comparison, the input speech should be a denoised vocal source extracted from your audio. You can use separators such as Kim_Vocal_2 for general noise removal and Demucs_mdx_extra for music removal.
```python
av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
    video_path="path to video.mp4",   # RGB video
    audio_path="path to speech.wav",  # speech track (denoised from audio; ffmpeg-readable format)
    cache_dir="path to store intermediate output",  # optional; omit to auto-cleanup intermediates
)
```

We provide the benchmark files in the `benchmark/` directory, organized by data type and category.
Each file follows the structure:
benchmark/<data_type>/<category>/<context_id>.<ext>
Directory structure:

```
benchmark
├── audios
│   ├── 1p_camera_movement
│   │   ├── 10_man_basketball_camera_push_in.wav
│   │   └── ...
│   ├── 1p_closeup_facingcamera
│   ├── 1p_emotion
│   ├── 1p_generalize_chinese
│   ├── 1p_mediumshot_actioncontrol
│   ├── 1p_protrait
│   ├── 2p_1clip_1talk
│   └── 2p_2clip_2talk
├── first-frames-from-mocha-generation
│   ├── 1p_camera_movement
│   │   ├── 10_man_basketball_camera_push_in.png
│   │   └── ...
│   ├── 1p_closeup_facingcamera
│   ├── 1p_emotion
│   ├── 1p_generalize_chinese
│   ├── 1p_mediumshot_actioncontrol
│   ├── 1p_protrait
│   ├── 2p_1clip_1talk
│   └── 2p_2clip_2talk
├── speeches
│   ├── 1p_camera_movement
│   │   ├── 10_man_basketball_camera_push_in_speech.wav
│   │   └── ...
│   ├── 1p_closeup_facingcamera
│   ├── 1p_emotion
│   ├── 1p_generalize_chinese
│   ├── 1p_mediumshot_actioncontrol
│   ├── 1p_protrait
│   ├── 2p_1clip_1talk
│   └── 2p_2clip_2talk
└── benchmark.csv
```

- `benchmark.csv` contains metadata for each sample, with each row specifying: `idx_in_category,category,context_id,prompt`. We use `benchmark.csv` to connect files: any file in the benchmark can be located using the combination `/benchmark/<data_type>/<category>/<context_id>.<ext>`.
- `speeches` files are generated from `audios` files using Demucs_mdx_extra. For fair comparison, `speeches` (not `audios`) should also be used as the input to your own model.
- We also provide `first-frames-from-mocha-generation` to facilitate fair comparison for (image + text + audio → video) models.
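As an illustrative sketch (not part of the released codebase), the path convention above can be turned into a small resolver that maps each `benchmark.csv` row to its audio, speech, and first-frame files; the `_speech` suffix follows the directory listing above:

```python
import csv
from pathlib import Path

BENCH = Path("benchmark")  # root of the benchmark/ directory

def resolve_sample_files(row):
    """Build the file paths for one benchmark.csv row (category, context_id)."""
    category, context_id = row["category"], row["context_id"]
    return {
        "prompt": row["prompt"],
        "audio": BENCH / "audios" / category / f"{context_id}.wav",
        # speeches carry a `_speech` suffix, per the directory listing above
        "speech": BENCH / "speeches" / category / f"{context_id}_speech.wav",
        "first_frame": BENCH / "first-frames-from-mocha-generation" / category / f"{context_id}.png",
    }

# Usage: iterate benchmark.csv and feed speech/prompt/first_frame to your model
if (BENCH / "benchmark.csv").exists():
    with open(BENCH / "benchmark.csv", newline="") as f:
        for row in csv.DictReader(f):
            files = resolve_sample_files(row)
```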
The SyncNet weights, the benchmark, and MoCha's generation results are embedded in this git repo.

```shell
git clone https://github.com/congwei1230/MoChaBench.git
conda create -n mochabench_eval python=3.8
conda activate mochabench_eval
cd eval-lipsync
pip install -r requirements.txt  # requires ffmpeg to be installed
cd script
python run_syncnet_pipeline_on_1example.py
```

You are expected to get values close to the following (±0.1, depending on your ffmpeg version; we used ffmpeg 7.1.1-essentials_build from gyan.dev):
```
AV offset: 1
Min dist: 9.255
Confidence: 4.497
best-confidence : 4.4973907470703125
lowest distance : 9.255396842956543
per-crop offsets : [1]
```
We provide the MoCha-generated videos in the `mocha-generation/` directory.
Each file follows the structure:
`mocha-generation/<category>/<context_id>.<ext>`
```
mocha-generation
├── 1p_camera_movement
│   ├── 10_man_basketball_camera_push_in.mp4
│   └── ...
├── 1p_closeup_facingcamera
├── 1p_emotion
├── 1p_generalize_chinese
├── 1p_mediumshot_actioncontrol
├── 1p_protrait
├── 2p_1clip_1talk
└── 2p_2clip_2talk
```

To evaluate the results, simply run the pipeline below.
This script will print the score for each category, as well as the average scores for Monologue and Dialogue.
It will also output a CSV file at `eval-lipsync/mocha-eval-results/sync_scores.csv`, recording each example's score.
```shell
cd eval-lipsync/script
python run_syncnet_pipeline_on_mocha_generation_on_mocha_bench.py
```

To evaluate your own model's outputs with MoChaBench, first use the following inputs to generate videos:
- Speech input: `benchmark/speeches`
- Text input: `prompt` from `benchmark.csv`
- Image input: `benchmark/first-frames-from-mocha-generation` (if your model requires an image condition)

You can also use our HF version to generate videos.
Then organize your generated videos in a folder that matches the structure of mocha-generation/:
```
<your_outputs_dir>/
├── 1p_camera_movement/
│   ├── 10_man_basketball_camera_push_in.mp4
│   └── ...
├── 1p_closeup_facingcamera/
├── 1p_emotion/
├── 1p_generalize_chinese/
├── 1p_mediumshot_actioncontrol/
├── 1p_protrait/
├── 2p_1clip_1talk/
└── 2p_2clip_2talk/
```

Each video should be named `<context_id>.mp4` within the corresponding category folder. You don't need to provide an mp4 for every category; the script will skip any missing videos and report scores for the rest.
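Before running the evaluation, it can help to confirm your output folder mirrors the reference layout. The helper below is an illustrative sketch (`missing_videos` is not part of the repo): it compares your folder against `mocha-generation/` and lists the clips the evaluation script would skip:

```python
from pathlib import Path

# the eight MoChaBench category folder names (spelled as in the repo)
CATEGORIES = [
    "1p_camera_movement", "1p_closeup_facingcamera", "1p_emotion",
    "1p_generalize_chinese", "1p_mediumshot_actioncontrol", "1p_protrait",
    "2p_1clip_1talk", "2p_2clip_2talk",
]

def missing_videos(outputs_dir, reference_dir="mocha-generation"):
    """Return category/<context_id>.mp4 entries present in the reference
    layout but absent from your outputs folder (these would be skipped)."""
    outputs, reference = Path(outputs_dir), Path(reference_dir)
    missing = []
    for category in CATEGORIES:
        for ref_mp4 in sorted((reference / category).glob("*.mp4")):
            if not (outputs / category / ref_mp4.name).exists():
                missing.append(f"{category}/{ref_mp4.name}")
    return missing
```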
Next, modify the script run_syncnet_pipeline_on_your_own_model_results.py to point to your video folder.
Then, run:
```shell
cd eval-lipsync/script
python run_syncnet_pipeline_on_your_own_model_results.py
```

The script will output a CSV file at `eval-lipsync/your own model-eval-results/sync_scores.csv` with the evaluation scores for each example.
Since our pipeline provides an API to score a pair of (video, audio), you can easily adapt it for other benchmark datasets by looping through your examples:
```python
# Loop through your dataset
for example in dataset:
    video_fp = example["video_path"]
    audio_fp = example["audio_path"]
    context_id = example["context_id"]
    av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
        video_path=str(video_fp),
        audio_path=str(audio_fp),
        cache_dir="YOUR INPUT",
    )
    # Store or process the results as needed

# After processing all samples, compute average results.
```

We provide example scripts for running GPT-4o-based evaluation on 20 examples from MoChaBench, covering 4 models and 4 evaluation aspects.
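The final aggregation step mentioned above might look like the following sketch, assuming you collected each example's `best_conf` and `min_dist` into a list of dicts (the field names here are just the pipeline's return values; the grouping helper itself is hypothetical):

```python
from collections import defaultdict

def average_scores(results):
    """results: iterable of dicts with keys 'category', 'best_conf', 'min_dist'.
    Returns per-category mean Sync-Conf., mean Sync-Dist., and sample count."""
    by_cat = defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r)
    return {
        cat: {
            "sync_conf": sum(r["best_conf"] for r in rows) / len(rows),
            "sync_dist": sum(r["min_dist"] for r in rows) / len(rows),
            "n": len(rows),
        }
        for cat, rows in by_cat.items()
    }
```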
```shell
conda activate mochabench
pip install openai opencv-python
cd eval-viescore
python eval_gpt_viescore.py
```
We also provide a script to compute the agreement between GPT-4o scores and human majority vote ratings:
```shell
conda activate mochabench
pip install scikit-learn
cd eval-viescore
python compute_alignment.py
```
This script outputs alignment metrics (QWK, Spearman ρ, Footrule, MAE) for each aspect and overall.
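For reference, these agreement metrics could be computed roughly as follows. This is an illustrative sketch, not the repo's `compute_alignment.py`: it assumes integer ratings, and the footrule here is a simple rank-displacement sum that ignores ties:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def alignment_metrics(gpt_scores, human_scores):
    """Agreement between two integer rating lists (illustrative sketch)."""
    gpt, human = np.asarray(gpt_scores), np.asarray(human_scores)
    # Quadratically weighted Cohen's kappa (QWK)
    qwk = cohen_kappa_score(gpt, human, weights="quadratic")
    rho, _ = spearmanr(gpt, human)
    # Spearman footrule: L1 distance between the two rank orderings
    # (ties not handled here; real ratings may need average ranks)
    footrule = int(np.abs(gpt.argsort().argsort() - human.argsort().argsort()).sum())
    mae = mean_absolute_error(gpt, human)
    return {"QWK": qwk, "Spearman": rho, "Footrule": footrule, "MAE": mae}
```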
If you find our work helpful, please leave us a star and cite our paper.
```bibtex
@article{wei2025mocha,
  title={MoCha: Towards Movie-Grade Talking Character Synthesis},
  author={Wei, Cong and Sun, Bo and Ma, Haoyu and Hou, Ji and Juefei-Xu, Felix and He, Zecheng and Dai, Xiaoliang and Zhang, Luxin and Li, Kunpeng and Hou, Tingbo and others},
  journal={arXiv preprint arXiv:2503.23307},
  year={2025}
}
```