StyleDubber

This package contains the accompanying code for the following paper:

"StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing", which has appeared as long paper in the Findings of the ACL, 2024.


📣 News

🗒 TODOs

  • Release StyleDubber's training and inference code.
  • Release pretrained weights.
  • Release the raw data and preprocessed data features of the GRID dataset.
  • Release metric testing scripts (SECS, WER_Whisper).
  • Release Demo Pages.
  • Release the preprocessed data features of the V2C-Animation dataset.
  • Update README.md.
  • Upload the dataset to Google Drive.

📊 Dataset

```
├── Lip_Grid_Gray
│       └── [GRID's lip-region images in gray-scale]
├── Lip_Grid_Color
│       └── [GRID's lip-region images in RGB]
├── Grid_resample_ABS (GoogleDrive ✅)
│       └── [22050 Hz ground-truth audio files in .wav] (the original GRID audio is 25 kHz)
├── Grid_lip_Feature
│       └── [Lip features extracted from Lip_Grid_Gray via Lipreading_using_Temporal_Convolutional_Networks]
├── Grid_Face_Image
│       └── [GRID's face-region images]
├── Grid_dataset_Raw
│       └── [GRID's raw data from the official website]
├── Grad_eachframe
│       └── [Per-frame image files of the GRID dataset]
├── Gird_FaceVAFeature
│       └── [Face features extracted from Grid_Face_Image via EmoFAN]
└── 0_Grid_Wav_22050_Abs_Feature (GoogleDrive ✅)
        └── [All data features for training and inference on the GRID dataset]
```

Note: If you just want to train StyleDubber on the GRID dataset, you only need to download 0_Grid_Wav_22050_Abs_Feature (the preprocessed data features) and Grid_resample_ABS (the ground-truth waveforms used for testing). If you want to plot or visualize results, use the data for other tasks (lip reading, ASV, etc.), or re-run the preprocessing yourself, download whichever of the remaining files you need 😊.

```
├── Phoneme_level_Feature (GoogleDrive ✅)
│       └── [All data features for training and inference on the V2C-Animation dataset]
└── GT_Wav (GoogleDrive ✅)
        └── [22050 Hz ground-truth audio files in .wav]
```

Note: To train on the V2C-Animation dataset, you need to download Phoneme_level_Feature (the preprocessed data features) and GT_Wav (the ground-truth waveforms used for testing). Other visual data (e.g., face and lip regions) from intermediate processing steps can be accessed at HPMDubbing.

Quick Q&A: HPMDubbing also has pre-processed features. Are they the same? Can I use it to train StyleDubber?

No, you need to re-download to train StyleDubber. HPMDubbing needs frame frame-level feature with 220 hop length and 880 window length for the desired upsampling manner. StyleDubber currently only supports phoneme-level features and we adjust the hop length (256) and window length (1024) during pre-processing.
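For reference, the sketch below shows one way mel-spectrograms with these settings (22050 Hz audio, hop length 256, window/FFT length 1024) could be extracted with librosa. It is not the repository's preprocessing pipeline: the number of mel bins, normalization, and any phoneme-level pooling are assumptions.

```python
# extract_mel.py -- illustrative mel extraction (not the repo's preprocessing code).
import librosa
import numpy as np

def extract_mel(wav_path,
                sr=22050,        # StyleDubber audio is resampled to 22050 Hz
                n_fft=1024,      # window length used in pre-processing
                hop_length=256,  # hop length used in pre-processing
                n_mels=80):      # assumed number of mel bins
    """Load a waveform and return a log-mel spectrogram of shape (frames, n_mels)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
        win_length=n_fft, n_mels=n_mels,
    )
    return np.log(np.clip(mel, 1e-5, None)).T

# Example (hypothetical path):
# mel = extract_mel("Grid_resample_ABS/s1/bbaf2n.wav")  # -> (num_frames, 80)
```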

💡 Checkpoints

We provide pre-trained checkpoints for the GRID and V2C-Animation datasets, respectively, as follows:

⚒️ Environment

We use Python 3.8.18 and CUDA 11.5; other compatible versions may also work. Both training and inference are implemented with PyTorch on a GeForce RTX 4090 GPU.

conda create -n style_dubber python=3.8.18
conda activate style_dubber
pip install -r requirements.txt
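After installation, a quick generic sanity check (not part of the repository) can confirm that PyTorch and CUDA are visible:

```python
# check_env.py -- quick environment sanity check (generic, not part of the repo).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```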

🔥 Train Your Own Model

You need repalce tha path in preprocess_config (see "./ModelConfig_V2C/model_config/MovieAnimation/config_all.txt") to you own path. Training V2C-Animation dataset (153 cartoon speakers), please run:

python train_StyleDubber_V2C.py

You need repalce tha path in preprocess_config (see "./ModelConfig_GRID/model_config/GRID/config_all.txt") to you own path. Training GRID dataset (33 real-world speakers), please run:

python train_StyleDubber_GRID.py
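If you prefer to rewrite the config paths programmatically rather than by hand, the sketch below simply substitutes an old data root with your own, assuming the paths appear as plain text in config_all.txt; the example roots are hypothetical placeholders.

```python
# update_config_paths.py -- swap the dataset root in a config file (sketch).
from pathlib import Path

def replace_root(config_path, old_root, new_root):
    """Rewrite every occurrence of old_root with new_root in the given config file."""
    path = Path(config_path)
    path.write_text(path.read_text().replace(old_root, new_root))

if __name__ == "__main__":
    replace_root(
        "./ModelConfig_V2C/model_config/MovieAnimation/config_all.txt",
        "/data/original/author/path",   # hypothetical original root in the config
        "/your/own/data/path",          # your local data root
    )
```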

⭕ Inference Wav

There are three dubbing settings in this paper. The first setting is the same as in V2C-Net (Chen et al., 2022a): the target audio from the test set is used as the reference audio. However, this is impractical in real-world applications. Thus, we design two new and more reasonable settings: “Dub 2.0” uses non-ground-truth audio of the same speaker as the reference audio; “Dub 3.0” uses the audio of unseen characters (from another dataset) as the reference audio.
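To make the settings concrete, the sketch below shows one way the reference audio could be selected in each setting; the metadata layout is hypothetical, and the released 0_evaluate_*_Setting*.py scripts handle this internally.

```python
# choose_reference.py -- illustrative reference-audio selection (hypothetical layout).
import random

def pick_reference(target, utterances_by_speaker, other_dataset_utts, setting):
    """Return a reference wav path for the given target utterance.

    target: dict with hypothetical keys {"speaker": str, "wav": str}
    utterances_by_speaker: {speaker_id: [wav paths]} for the current dataset
    other_dataset_utts: wav paths from another dataset (unseen characters)
    """
    if setting == 1:   # Setting 1: the target (ground-truth) audio itself
        return target["wav"]
    if setting == 2:   # Dub 2.0: a different utterance from the same speaker
        candidates = [w for w in utterances_by_speaker[target["speaker"]]
                      if w != target["wav"]]
        return random.choice(candidates)
    if setting == 3:   # Dub 3.0: audio of an unseen character from another dataset
        return random.choice(other_dataset_utts)
    raise ValueError(f"unknown setting: {setting}")
```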


Inference Setting1: V2C & GRID

python 0_evaluate_V2C_Setting1.py --restore_step <checkpoint_step>

or

python 0_evaluate_GRID_Setting1.py --restore_step <checkpoint_step>

Inference Setting2: V2C

python 0_evaluate_V2C_Setting2.py --restore_step <checkpoint_step>

Inference Setting3: V2C

python 0_evaluate_V2C_Setting3.py --restore_step <checkpoint_step>

🤖️ Output Result

  • 👉 Word Error Rate (WER)

    Please download the pre-trained whisper-large-v3 model (used to evaluate the V2C-Animation dataset) and whisper-base (used to evaluate the GRID dataset), and pip install jiwer. A minimal sketch of the WER computation is given after this list.


    For Setting1 and Setting2: Please run:

    python Dub_Metric/WER_Whisper/Setting_test.py  -p <Generated_wav_path> -t <GT_Wav_Path>

    Note: To test on the GRID dataset, replace model = whisper.load_model("large-v3") with model = whisper.load_model("base") (see line 102 of ./Dub_Metric/WER_Whisper/Setting_test.py).

    For Setting3 (only for V2C): Please run:

    python Dub_Metric/WER_Whisper/Setting3_test.py  -p <Generated_wav_path> -t <GT_Wav_Path>

    ❓ Quick Q&A: Why does V2C use whisper-large-v3, while GRID uses whisper-base?

    Considering the challenges of the V2C-Animation dataset, the reviewer of ACL ARR suggested using whisper_large to enhance convincing. Through comparison, we finally choose whisper-large-v3 as the WER testing benchmark. Considering the inference speed and memory, the GRID dataset still retains the “Whisper-base” as the test benchmark to calculate WER (22%), which is similar to the VDTTS (Hassid et al., 2022) result (26%) in Table 2 (GRID evaluation), so this is sufficient to ensure a fair comparison. Illustration

  • 👉 SPK-SIM / SECS (Speaker Encoder Cosine Similarity)

    Please download wav2mel.pt and dvector.pt and save them in ./ckpts. A minimal sketch of the similarity computation is given after this list.

    For Setting1: Please run:

    python Dub_Metric/SECS/Setting1.py  -p <Generated_wav_path> -t <GT_Wav_Path>

    For Setting2: Please run:

    python Dub_Metric/SECS/Setting2_V2C.py  -p <Generated_wav_path> -t <GT_Wav_Path>

    or:

    python Dub_Metric/SECS/Setting2_GRID.py  -p <Generated_wav_path> -t <GT_Wav_Path>

    For Setting3 (only for V2C): Please run:

    python Dub_Metric/SECS/Setting3.py  -p <Generated_wav_path> -t <GT_Wav_Path>
  • 👉 MCD-DTW and MCD-DTW-SL

    MCD-DTW and MCD-DTW-SL are computed automatically when running 0_evaluate_V2C_Setting*.py and 0_evaluate_GRID_Setting*.py; see ⭕ Inference Wav.

  • 👉 Sim-O & Sim-R by WavLM-TDNN

  • 👉 EMO-ACC
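For reference, the snippet below is a simplified, generic sketch of how WER can be computed with Whisper and jiwer. It is not the repository's Setting_test.py: the directory layout and file-name matching are assumptions, and the official scripts may take the reference text from the ground-truth transcripts rather than re-transcribing the ground-truth audio.

```python
# wer_sketch.py -- simplified WER computation with Whisper + jiwer (not the repo's script).
import os
import jiwer
import whisper

def compute_wer(generated_dir, gt_dir, model_name="large-v3"):
    """Transcribe generated and ground-truth wavs and compute the overall WER."""
    model = whisper.load_model(model_name)  # use "base" for the GRID dataset
    refs, hyps = [], []
    for name in sorted(os.listdir(generated_dir)):
        if not name.endswith(".wav"):
            continue
        # Assumes generated and ground-truth files share the same file name.
        hyps.append(model.transcribe(os.path.join(generated_dir, name))["text"].lower())
        refs.append(model.transcribe(os.path.join(gt_dir, name))["text"].lower())
    return jiwer.wer(refs, hyps)

# Example (hypothetical paths):
# print(compute_wer("output/V2C_Setting1", "GT_Wav"))
```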

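Similarly, here is a minimal sketch of the speaker-similarity (SECS) computation, following the standard TorchScript usage of wav2mel.pt and dvector.pt. The file paths and the pairing of generated/reference audio are assumptions, not the repository's Setting*.py logic.

```python
# secs_sketch.py -- simplified speaker similarity with d-vector (not the repo's script).
import torch
import torchaudio

wav2mel = torch.jit.load("ckpts/wav2mel.pt")
dvector = torch.jit.load("ckpts/dvector.pt").eval()

def embed(wav_path):
    """Return the d-vector speaker embedding of a wav file."""
    wav, sample_rate = torchaudio.load(wav_path)
    mel = wav2mel(wav, sample_rate)      # (frames, n_mels)
    return dvector.embed_utterance(mel)  # (emb_dim,)

def secs(generated_wav, reference_wav):
    """Cosine similarity between generated and reference speaker embeddings."""
    a, b = embed(generated_wav), embed(reference_wav)
    return torch.nn.functional.cosine_similarity(a, b, dim=-1).item()

# Example (hypothetical paths):
# print(secs("output/sample_generated.wav", "GT_Wav/sample.wav"))
```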
✏️ Citing

If you find our work useful, please consider citing:

@inproceedings{cong-etal-2024-styledubber,
    title = "{S}tyle{D}ubber: Towards Multi-Scale Style Learning for Movie Dubbing",
    author = "Cong, Gaoxiang  and
      Qi, Yuankai  and
      Li, Liang  and
      Beheshti, Amin  and
      Zhang, Zhedong  and
      Hengel, Anton  and
      Yang, Ming-Hsuan  and
      Yan, Chenggang  and
      Huang, Qingming",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    pages = "6767--6779",
}

🙏 Acknowledgments

We would like to thank the authors of previous related projects for generously sharing their code and insights: CDFSE_FastSpeech2, Multimodal Transformer, SMA, Meta-StyleSpeech, and FastSpeech2.
