[CVPR2025] Official implementation of the paper "Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing"
We use Python 3.8.18 and CUDA 11.8; other compatible versions may also work.
Training and inference are implemented in PyTorch and were run on a GeForce RTX 4090 GPU.
conda create -n dubbing python=3.8.18
conda activate dubbing
pip install -r requirements.txt
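After installation, a quick sanity check can confirm that the interpreter and the PyTorch build match the versions mentioned above. The following is a minimal sketch, not part of the repository: the helper name `describe_environment` is hypothetical, and it only assumes that PyTorch is among the installed requirements (it falls back gracefully if it is not).

```python
import sys


def describe_environment():
    """Collect interpreter and, if installed, PyTorch/CUDA details."""
    info = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    try:
        import torch  # assumed to be installed via requirements.txt
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False`, the installed PyTorch build or the GPU driver probably does not match the CUDA version on the machine.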
python train_first.py -p Configs/config_v2c_stage1.yml # V2C-Animation benchmark
python train_first.py -p Configs/config_grid_stage1.yml # GRID benchmark
python train_second.py -p Configs/config_v2c.yml # V2C-Animation benchmark
python train_second_grid.py -p Configs/config_grid.yml # GRID benchmark
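The two training stages above can also be launched back to back from a small driver script. This is only a sketch under assumptions: the script and config names are copied from the commands above, while `STAGES`, `build_training_commands`, and `run_training` are hypothetical helpers. It also assumes the second-stage config already points at the first-stage checkpoint, which may need to be edited by hand.

```python
import subprocess
import sys

# Script/config pairs copied from the training commands above.
STAGES = {
    "v2c": [
        ("train_first.py", "Configs/config_v2c_stage1.yml"),
        ("train_second.py", "Configs/config_v2c.yml"),
    ],
    "grid": [
        ("train_first.py", "Configs/config_grid_stage1.yml"),
        ("train_second_grid.py", "Configs/config_grid.yml"),
    ],
}


def build_training_commands(benchmark, python=sys.executable):
    """Return the stage-1 and stage-2 commands for one benchmark, in order."""
    return [[python, script, "-p", config] for script, config in STAGES[benchmark]]


def run_training(benchmark, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_training_commands(benchmark):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a stage exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_training("v2c", dry_run=True)
```

Keeping `dry_run=True` by default makes it easy to inspect the commands before committing GPU time.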
We provide pre-trained first-stage and second-stage checkpoints, respectively, for the V2C-Animation and GRID benchmarks:
- V2C-Animation benchmark: Baidu Drive (b5wy), Google Drive.
- GRID benchmark: Baidu Drive (wj25), Google Drive.
- V2C-Animation benchmark: Baidu Drive (3k4h), Google Drive.
- GRID benchmark: Baidu Drive (23vd), Google Drive.
There are three generation settings for the V2C-Animation benchmark:
python inference_v2c.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 1
python inference_v2c.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 2
python inference_v2c.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 3
There are two generation settings for the GRID benchmark:
python inference_grid.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 1
python inference_grid.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 2
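To evaluate every generation setting of a benchmark in one go, the inference commands above can be generated in a loop. The sketch below only rebuilds the command lines shown above; `SETTINGS` and `build_inference_commands` are hypothetical names, and the experiment name and epoch remain placeholders to be filled in.

```python
import shlex

# Inference script and available settings per benchmark, as listed above.
SETTINGS = {
    "v2c": ("inference_v2c.py", (1, 2, 3)),
    "grid": ("inference_grid.py", (1, 2)),
}


def build_inference_commands(benchmark, exp_name, epoch):
    """Return one inference command per generation setting of the benchmark."""
    script, settings = SETTINGS[benchmark]
    return [
        ["python", script, "-n", str(exp_name), "--epoch", str(epoch),
         "--setting", str(setting)]
        for setting in settings
    ]


if __name__ == "__main__":
    for cmd in build_inference_commands("v2c", "YOUR_EXP_NAME", "YOUR_EPOCH"):
        # shlex.quote keeps names with spaces safe when pasted into a shell.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each returned list can be passed directly to `subprocess.run(cmd, check=True)` once the placeholders are replaced with a real experiment name and epoch.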
- GRID (BaiduDrive (code: GRID) / GoogleDrive)
- V2C-Animation dataset (chenqi-Denoise2) (BaiduDrive (code: k9mb) / GoogleDrive)
We would like to thank the authors of previous related projects for generously sharing their code and insights: StyleTTS, StyleTTS2, StyleDubber, PL-BERT, and HiFi-GAN.
If you find our work useful, please consider citing:
@InProceedings{zhang_2025_produbber,
author = {Zhang, Zhedong and Li, Liang and Yan, Chenggang and Liu, Chunshan and van den Hengel, Anton and Qi, Yuankai},
title = {Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {172-182}
}
