This package contains the accompanying code for the following paper:
"StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing", which has appeared as long paper in the Findings of the ACL, 2024.
- Release StyleDubber's training and inference code.
- Release pretrained weights.
- Release the raw data and preprocessed data features of the GRID dataset.
- Metrics Testing Scripts (SECS, WER_Whisper).
- Release Demo Pages.
- Release the preprocessed data features of the V2C-Animation dataset.
- Update README.md.
- Upload the dataset to Google Drive.
- GRID BaiduDrive (code: GRID) / GoogleDrive
├── Lip_Grid_Gray
│ └── [GRID's Lip Region Images in Gray-scale]
├── Lip_Grid_Color
│ └── [GRID's Lip Region Images in RGB]
├── Grid_resample_ABS (GoogleDrive ✅)
│ └── [22050 Hz Ground Truth Audio Files in .wav] (the original GRID audio is 25 kHz)
├── Grid_lip_Feature
│ └── [Lip Feature extracted from ```Lip_Grid_Gray``` via Lipreading_using_Temporal_Convolutional_Networks]
├── Grid_Face_Image
│ └── [GRID's Face Region Images]
├── Grid_dataset_Raw
│ └── [GRID's raw data from Website]
├── Grad_eachframe
│ └── [Per-frame image files of the GRID dataset]
├── Gird_FaceVAFeature
│ └── [Face Feature extracted from ```Grid_Face_Image``` via EmoFAN]
├── 0_Grid_Wav_22050_Abs_Feature (GoogleDrive ✅)
└── [Contains all the data features for train and inference in the GRID dataset]
Note: If you just want to train StyleDubber on the GRID dataset, you only need to download the files in 0_Grid_Wav_22050_Abs_Feature (preprocessed data features) and Grid_resample_ABS (ground-truth waveforms used for testing). If you want to plot or visualize the data, use it for other tasks (lip reading, ASV, etc.), or re-run the preprocessing your own way, you can download whichever of the remaining files you need 😊.
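As a quick sanity check after downloading, the sketch below (with a hypothetical local root path) verifies that the two required GRID folders are in place:

```python
import os

# Hypothetical local root where the GRID downloads were extracted.
GRID_ROOT = "/path/to/GRID"

# Only these two folders are required to train and test StyleDubber on GRID.
REQUIRED = ["0_Grid_Wav_22050_Abs_Feature", "Grid_resample_ABS"]

for name in REQUIRED:
    status = "found" if os.path.isdir(os.path.join(GRID_ROOT, name)) else "MISSING"
    print(f"{name}: {status}")
```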
- V2C-Animation dataset (chenqi-Denoise2) BaiduDrive (code: k9mb) / GoogleDrive
├── Phoneme_level_Feature (GoogleDrive ✅)
│ └── [Contains all the data features for train and inference in the V2C-Animation dataset]
├── GT_Wav (GoogleDrive ✅)
└── [22050 Hz ground truth Audio Files in .wav]
Note: For training on V2C-Animation, you need to download the files in Phoneme_level_Feature
(Preprocessed data features) and GT_Wav
(Ground truth waveform used for testing).
Other visual images (e.g., face and lip regions) in intermediate processes can be accessed at HPMDubbing.
Quick Q&A: HPMDubbing also has pre-processed features. Are they the same? Can I use it to train StyleDubber?
No, you need to re-download the features to train StyleDubber. HPMDubbing requires frame-level features extracted with a hop length of 220 and a window length of 880 to match its upsampling scheme. StyleDubber currently only supports phoneme-level features, and we use a hop length of 256 and a window length of 1024 during preprocessing.
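For reference, a mel-spectrogram with these STFT parameters could be extracted at 22050 Hz roughly as follows (a minimal sketch using librosa; the number of mel bins and the frequency range are assumptions, not necessarily the exact values used in the repo's preprocessing):

```python
import librosa
import numpy as np

# Load any 22050 Hz GRID/V2C waveform (placeholder path).
wav, sr = librosa.load("example.wav", sr=22050)

# STFT parameters matching the StyleDubber preprocessing described above:
# hop length 256, window length 1024 (n_mels / fmax are assumptions).
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmax=8000,
)
log_mel = np.log(np.clip(mel, 1e-5, None))
print(log_mel.shape)  # (n_mels, n_frames)
```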
We provide the pre-trained checkpoints on GRID and V2C-Animation datasets as follows, respectively:
- GRID: https://pan.baidu.com/s/1Mj3MN4TuAEc7baHYNqwbYQ (code: y8kb) / Google Drive
- V2C-Animation dataset (chenqi-Denoise2): https://pan.baidu.com/s/1hZBUszTaxCTNuHM82ljYWg (code: n8p5) / Google Drive
Our Python version is 3.8.18 and our CUDA version is 11.5; other compatible versions may also work. Both training and inference are implemented with PyTorch on a GeForce RTX 4090 GPU.
conda create -n style_dubber python=3.8.18
conda activate style_dubber
pip install -r requirements.txt
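After installing the requirements, a quick check (a minimal sketch) confirms that PyTorch can see the GPU:

```python
import torch

# Verify that PyTorch was installed with CUDA support and can see the GPU.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```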
You need to replace the paths in preprocess_config (see "./ModelConfig_V2C/model_config/MovieAnimation/config_all.txt") with your own paths.
To train on the V2C-Animation dataset (153 cartoon speakers), please run:
python train_StyleDubber_V2C.py
You need to replace the paths in preprocess_config (see "./ModelConfig_GRID/model_config/GRID/config_all.txt") with your own paths.
To train on the GRID dataset (33 real-world speakers), please run:
python train_StyleDubber_GRID.py
There are three dubbing settings in this paper. The first setting is the same as in V2C-Net (Chen et al., 2022a), which uses the target audio from the test set as reference audio. However, this is impractical in real-world applications. Thus, we design two new and more reasonable settings: “Dub 2.0” uses non-ground-truth audio of the same speaker as reference audio; “Dub 3.0” uses the audio of unseen characters (from another dataset) as reference audio.
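To make the three settings concrete, the sketch below illustrates how a reference utterance could be picked in each case (the file names and pools here are hypothetical; the evaluation scripts below handle this selection for you):

```python
import random

# Hypothetical test item: a target speaker and the ground-truth target audio.
item = {"speaker": "spk01", "wav": "spk01_utt003.wav"}
# Other utterances from the same speaker (not the target itself).
same_speaker_pool = {"spk01": ["spk01_utt001.wav", "spk01_utt002.wav"]}
# Utterances from unseen characters in another dataset.
unseen_pool = ["otherset_spkA_utt001.wav", "otherset_spkB_utt007.wav"]

ref_setting1 = item["wav"]                                        # Setting 1: target audio itself
ref_setting2 = random.choice(same_speaker_pool[item["speaker"]])  # Dub 2.0: same speaker, non-GT
ref_setting3 = random.choice(unseen_pool)                         # Dub 3.0: unseen character
print(ref_setting1, ref_setting2, ref_setting3)
```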
Inference Setting1: V2C & GRID
python 0_evaluate_V2C_Setting1.py --restore_step <checkpoint_step>
or
python 0_evaluate_GRID_Setting1.py --restore_step <checkpoint_step>
Inference Setting2: V2C
python 0_evaluate_V2C_Setting2.py --restore_step <checkpoint_step>
Inference Setting3: V2C
python 0_evaluate_V2C_Setting3.py --restore_step <checkpoint_step>
- 👉 Word Error Rate (WER)
Please download the pre-trained whisper-large-v3 model (for evaluating the V2C-Animation dataset) and whisper-base (for evaluating the GRID dataset), and run `pip install jiwer`. For Setting1 and Setting2, please run:
python Dub_Metric/WER_Whisper/Setting_test.py -p <Generated_wav_path> -t <GT_Wav_Path>
Note: If you need to test the GRID dataset, please replace `model = whisper.load_model("large-v3")` with `model = whisper.load_model("base")` (see line 102 in ./Dub_Metric/WER_Whisper/Setting_test.py). For Setting3 (only for V2C), please run:
python Dub_Metric/WER_Whisper/Setting3_test.py -p <Generated_wav_path> -t <GT_Wav_Path>
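For reference, WER for a single pair of files could be computed along these lines (a minimal sketch, not the repo's Setting_test.py; the model name and audio paths are placeholders):

```python
import jiwer
import whisper

# Use "large-v3" for V2C-Animation and "base" for GRID, as described above.
model = whisper.load_model("large-v3")

# Transcribe the generated audio and its ground-truth counterpart (placeholder paths).
hyp_text = model.transcribe("generated.wav")["text"]
ref_text = model.transcribe("ground_truth.wav")["text"]

def normalize(text: str) -> str:
    # Drop casing and punctuation so WER reflects word content only.
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

print(f"WER: {jiwer.wer(normalize(ref_text), normalize(hyp_text)):.3f}")
```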
❓ Quick Q&A: Why does V2C use whisper-large-v3, while GRID uses whisper-base?
Considering the challenges of the V2C-Animation dataset, an ACL ARR reviewer suggested using whisper_large to make the evaluation more convincing. After comparison, we finally chose whisper-large-v3 as the WER testing benchmark. Considering inference speed and memory, the GRID dataset still retains whisper-base as the test benchmark to calculate WER (22%), which is similar to the VDTTS (Hassid et al., 2022) result (26%) in Table 2 (GRID evaluation), so this is sufficient to ensure a fair comparison.
- 👉 SPK-SIM / SECS (Speaker Encoder Cosine Similarity)
Please download `wav2mel.pt` and `dvector.pt` and save them in ./ckpts. For Setting1, please run:
python Dub_Metric/SECS/Setting1.py -p <Generated_wav_path> -t <GT_Wav_Path>
For Setting2: Please run:
python Dub_Metric/SECS/Setting2_V2C.py -p <Generated_wav_path> -t <GT_Wav_Path>
or:
python Dub_Metric/SECS/Setting2_GRID.py -p <Generated_wav_path> -t <GT_Wav_Path>
For Setting3 (only for V2C): Please run:
python Dub_Metric/SECS/Setting3.py -p <Generated_wav_path> -t <GT_Wav_Path>
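The similarity itself can be computed with the downloaded checkpoints along these lines (a sketch assuming wav2mel.pt and dvector.pt are the TorchScript modules from the d-vector project; audio paths are placeholders):

```python
import torch
import torch.nn.functional as F
import torchaudio

# Load the TorchScript speaker-embedding modules saved in ./ckpts.
wav2mel = torch.jit.load("ckpts/wav2mel.pt")
dvector = torch.jit.load("ckpts/dvector.pt").eval()

def embed(path: str) -> torch.Tensor:
    """Return the d-vector speaker embedding of one utterance."""
    wav, sample_rate = torchaudio.load(path)
    with torch.no_grad():
        mel = wav2mel(wav, sample_rate)        # (n_frames, n_mels)
        return dvector.embed_utterance(mel)    # (emb_dim,)

# Cosine similarity between a generated utterance and its ground-truth reference.
sim = F.cosine_similarity(embed("generated.wav"), embed("ground_truth.wav"), dim=-1)
print(f"SECS: {sim.item():.3f}")
```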
- 👉 MCD-DTW and MCD-DTW-SL
The MCD-DTW and MCD-DTW-SL are calculated by running 0_evaluate_V2C_Setting*.py and 0_evaluate_GRID_Setting*.py; see ⭕ Inference Wav above.
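For intuition, a bare-bones MCD-DTW could be computed roughly as follows (a simplified sketch using librosa MFCCs and fastdtw; the evaluation scripts above implement the exact metric, including the length-weighted MCD-DTW-SL variant):

```python
import librosa
import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def mcd_dtw(ref_path: str, gen_path: str, n_mcep: int = 13) -> float:
    """Mel cepstral distortion between two waveforms after DTW alignment."""
    def mceps(path):
        wav, _ = librosa.load(path, sr=22050)
        # MFCCs stand in for mel cepstra here; c0 (energy) is dropped.
        return librosa.feature.mfcc(y=wav, sr=22050, n_mfcc=n_mcep + 1)[1:].T

    ref, gen = mceps(ref_path), mceps(gen_path)
    _, path = fastdtw(ref, gen, dist=euclidean)
    diffs = np.array([ref[i] - gen[j] for i, j in path])
    # Standard MCD scaling constant: (10 / ln 10) * sqrt(2).
    return (10.0 / np.log(10.0)) * np.sqrt(2.0) * float(np.mean(np.linalg.norm(diffs, axis=1)))

print(mcd_dtw("ground_truth.wav", "generated.wav"))  # placeholder paths
```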
- 👉 Sim-O & Sim-R by WavLM-TDNN
- 👉 EMO-ACC
If you find our work useful, please consider citing:
@inproceedings{cong-etal-2024-styledubber,
title = "{S}tyle{D}ubber: Towards Multi-Scale Style Learning for Movie Dubbing",
author = "Cong, Gaoxiang and
Qi, Yuankai and
Li, Liang and
Beheshti, Amin and
Zhang, Zhedong and
Hengel, Anton and
Yang, Ming-Hsuan and
Yan, Chenggang and
Huang, Qingming",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
pages = "6767--6779",
}
We would like to thank the authors of previous related projects for generously sharing their code and insights: CDFSE_FastSpeech2, Multimodal Transformer, SMA, Meta-StyleSpeech, and FastSpeech2.