ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Introduction

This is the code for ZMM-TTS, submitted to IEEE TASLP. The paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper is the first to incorporate the representations from text-based and speech-based self-supervised learning models into multilingual speech synthesis tasks. We conducted comprehensive subjective and objective evaluations through a series of experiments. Our model proved effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetical low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.



Overview

You are welcome to try our code and pre-trained model on different languages!

Release

  • [20/01] 🔥 We released the code and a model pre-trained on public datasets in 6 languages (English, French, German, Portuguese, Spanish, and Swedish).

Samples

Samples are provided on our demo page.

Installation

ZMM-TTS requires Python>=3.8 and a reasonably recent version of PyTorch. To install ZMM-TTS and run a quick synthesis, you can run the following commands:

git clone https://github.com/nii-yamagishilab-visitors/ZMM-TTS.git

cd ZMM-TTS
pip3 install -r requirements.txt
# In addition, you may need to install these libraries to support full functionality.
pip install transformers  # Supports the XLSR-53 and XPhoneBERT models.
pip install speechbrain   # For extracting speaker embeddings.

If you want to try IPA representations, you need to install Epitran.
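
Before running the scripts, it can help to confirm that the interpreter and the optional libraries are in place. The following is a minimal sketch and not part of the repository: the function name `check_environment` is hypothetical, and the import names `torch`, `transformers`, `speechbrain`, and `epitran` are assumed to match the packages named above.

```python
import importlib.util
import sys

# Import names assumed to correspond to the packages mentioned above.
OPTIONAL_MODULES = ["torch", "transformers", "speechbrain", "epitran"]


def check_environment(modules=OPTIONAL_MODULES, min_python=(3, 8)):
    """Return a dict saying whether Python and each module are available."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

A missing entry only means the module could not be found in the current environment; it does not check version compatibility.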

Pre-trained self-supervised model

Model        Modality   Lang   Training data
XLSR-53      Audio      53     56K hours
ECAPA-TDNN   Audio      > 5    2794 hours
XPhoneBERT   Text       94     330M sentences

Usage

Multilingual multispeaker dataset MM6

In our paper, the training data we used contained GlobalPhone, which unfortunately is not open-source data. Considering the scarcity of publicly available multilingual and multispeaker databases for speech synthesis, I designed the following training database based on the MLS and NST Swedish databases and called it MM6. (It seems that the NST Swedish data is no longer open for download, in which case you should request this data from The Norwegian Language Bank.) If you have the GlobalPhone dataset, you can try the same training data, Dataset/train_paper.txt, as in our paper.

Language     Gender   Speakers   Sentences   Durations (h)   Database
English      Female   20         4000        13.9            MLS
English      Male     20         4000        13.9            MLS
French       Female   20         4000        13.9            MLS
French       Male     20         4000        13.9            MLS
German       Female   20         4000        13.9            MLS
German       Male     20         4000        13.9            MLS
Portuguese   Female   16         3741        13.0            MLS
Portuguese   Male     20         4175        14.5            MLS
Spanish      Female   20         3519        12.2            MLS
Spanish      Male     20         3786        13.1            MLS
Swedish      Female   0          0           0               -
Swedish      Male     20         4000        13.9            NST
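
To get a feel for the overall size of MM6, the rows of the table can be summed. This is a small illustrative sketch: the numbers are copied from the table above, while the names `MM6_ROWS`, `totals`, and `hours_by_language` are hypothetical and not part of the repository.

```python
# Rows copied from the MM6 table: (language, gender, speakers, sentences, hours).
MM6_ROWS = [
    ("English", "Female", 20, 4000, 13.9),
    ("English", "Male", 20, 4000, 13.9),
    ("French", "Female", 20, 4000, 13.9),
    ("French", "Male", 20, 4000, 13.9),
    ("German", "Female", 20, 4000, 13.9),
    ("German", "Male", 20, 4000, 13.9),
    ("Portuguese", "Female", 16, 3741, 13.0),
    ("Portuguese", "Male", 20, 4175, 14.5),
    ("Spanish", "Female", 20, 3519, 12.2),
    ("Spanish", "Male", 20, 3786, 13.1),
    ("Swedish", "Female", 0, 0, 0.0),
    ("Swedish", "Male", 20, 4000, 13.9),
]


def totals(rows):
    """Return (speakers, sentences, hours) summed over all rows."""
    speakers = sum(r[2] for r in rows)
    sentences = sum(r[3] for r in rows)
    hours = round(sum(r[4] for r in rows), 1)
    return speakers, sentences, hours


def hours_by_language(rows):
    """Return a dict mapping each language to its total duration in hours."""
    result = {}
    for language, _gender, _speakers, _sentences, hours in rows:
        result[language] = round(result.get(language, 0.0) + hours, 1)
    return result


if __name__ == "__main__":
    print(totals(MM6_ROWS))  # (216, 43221, 150.1)
    print(hours_by_language(MM6_ROWS))
```

The per-language view makes the one imbalance visible: Swedish has male speakers only, so it contributes about half the hours of the other languages.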

Download and normalize data

You can generate the MM6 dataset with the following download and normalization scripts:

bash scripts/download.sh   # Download the MLS data.
python prepare_data/creat_meta_data_mls.py # Generate speaker-, gender-, and language-balanced data.
# We recommend that you use sv56 to normalize the MLS audio.
bash scripts/norm_wav.sh

Please contact The Norwegian Language Bank if you want to get the NST Swedish data, and extract it to Dataset/origin_data/. Alternatively, you could simply exclude the Swedish language.

# The Swedish audio is already normalized.
python prepare_data/creat_meta_data_swe.py

This MM6 is a multilingual dataset with a largely balanced mix of speakers and genders, and we encourage you to experiment with other tasks as well.

Preprocess

After you download and normalize the wav files, the Dataset folder should look like this:

|--Dataset
     |--MM6
         |--wavs          #Store audio files
     |--preprocessed_data #Store preprocessed data: text, features,...
         |--MM6
             |--train.txt      

You can find the wav files in Dataset/MM6/wavs/ and the meta file in Dataset/preprocessed_data/MM6/train.txt. The train.txt looks like:

Name|Database|Language|Speaker|text
7756_9025_000004|MM6|English|7756|on tiptoe also i followed him and just as his hands were on the wardrobe door my hands were on his throat he was a little man and no match for me
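
The pipe-separated layout shown above is easy to read back in Python. The sketch below is illustrative only: `parse_meta_line` and `parse_meta` are hypothetical helpers, not functions from the repository, and skipping a `Name|...` header line is an assumption, since it is not stated whether train.txt actually contains that header.

```python
from typing import Dict, Iterable, List

# Field names taken from the header line shown above.
FIELDS = ["Name", "Database", "Language", "Speaker", "text"]


def parse_meta_line(line: str) -> Dict[str, str]:
    """Split one meta line into a dict keyed by field name."""
    # Split on the first four pipes only, so a pipe inside the text survives.
    parts = line.rstrip("\n").split("|", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    return dict(zip(FIELDS, parts))


def parse_meta(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse many lines, skipping blanks and an optional header (assumption)."""
    entries = []
    for line in lines:
        if not line.strip() or line.startswith("Name|"):
            continue
        entries.append(parse_meta_line(line))
    return entries


if __name__ == "__main__":
    sample = "7756_9025_000004|MM6|English|7756|on tiptoe also i followed him"
    print(parse_meta_line(sample)["Speaker"])  # 7756
```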
    1. Extract discrete code index and representations:
bash scripts/extract_discrete.sh
    2. Extract speaker embeddings:
bash scripts/extract_spk.sh
    3. Extract text sequences:
python prepare_data/extract_text_seq_from_raw_text.py
    4. Extract mel spectrograms:
python prepare_data/compute_mel.py
    5. Compute a priori alignment probabilities:
python prepare_data/compute_attention_prior.py
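
The preprocessing commands above can be chained from a single driver. The following is a hedged sketch, not a script shipped with the repository: `run_steps` is a hypothetical helper, the commands are copied from the list above, and running them in the listed order from the repository root is an assumption, since dependencies between steps are not stated.

```python
import subprocess

# Commands copied from the preprocessing list above, in the order shown.
PREPROCESS_STEPS = [
    ["bash", "scripts/extract_discrete.sh"],
    ["bash", "scripts/extract_spk.sh"],
    ["python", "prepare_data/extract_text_seq_from_raw_text.py"],
    ["python", "prepare_data/compute_mel.py"],
    ["python", "prepare_data/compute_attention_prior.py"],
]


def run_steps(steps=PREPROCESS_STEPS, dry_run=True):
    """Run each command in order; with dry_run=True only print them."""
    handled = []
    for cmd in steps:
        printable = " ".join(cmd)
        if dry_run:
            print("[dry run]", printable)
        else:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run(cmd, check=True)
        handled.append(printable)
    return handled


if __name__ == "__main__":
    run_steps(dry_run=True)
```

Calling `run_steps(dry_run=False)` would execute the commands for real, so the default stays on the safe side.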

Train model

    1. Train txt2vec model:
# Using XPhoneBERT:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT
# Using characters (letters):
python txt2vec/train.py --dataset MM6 --config MM6_Letters
# Using IPA:
python txt2vec/train.py --dataset MM6 --config MM6_IPA
# To train a model without a language layer, use an xxx_wo config, for example:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT_wo

NOTE: When you use XPhoneBERT, please set needUpdate: True in model.yaml after 1/4 of the training iterations.
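
The flag change described in the note can be made by hand or with a small helper. The sketch below rests on several assumptions: `set_need_update` and `quarter_point` are hypothetical helpers, the key is assumed to appear in model.yaml as a plain `needUpdate: <value>` line, and the note is read as meaning one quarter of the total training steps. A regular expression is used instead of a YAML library so that comments and layout in the file are left untouched.

```python
import re

# Matches a "needUpdate: <value>" line and captures everything up to the value.
_NEED_UPDATE = re.compile(r"^([ \t]*needUpdate[ \t]*:[ \t]*)\S+", re.MULTILINE)


def set_need_update(yaml_text: str, value: bool = True) -> str:
    """Return yaml_text with every needUpdate value replaced by `value`."""
    new_text, count = _NEED_UPDATE.subn(lambda m: m.group(1) + str(value), yaml_text)
    if count == 0:
        raise ValueError("no 'needUpdate' key found in the given text")
    return new_text


def quarter_point(total_steps: int) -> int:
    """Step at which 1/4 of the training is done (assumed reading of the note)."""
    return total_steps // 4


if __name__ == "__main__":
    example = "model:\n  needUpdate: False  # freeze at first\n"
    print(set_need_update(example))
    print(quarter_point(1_200_000))  # 300000
```

With the 1.2M steps reported for txt2vec training, this reading of the note would place the switch at step 300,000.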

    2. Train vec2mel model:
python vec2mel/train.py --dataset MM6 --config MM6

To train the txt2vec and vec2mel models, we used a batch_size of 16 and trained for 1.2M steps. It took about 3 days on one Tesla A100 GPU.

    3. Train vec2wav model:
python prepare_data/creat_lists.py
python vec2wav/train.py -c Config/vec2wav/vec2wav.yaml
# To train a model without a language layer:
python vec2wav/train.py -c Config/vec2wav/vec2wav_wo.yaml

To train vec2wav, we used a batch_size of 16 and trained for 1M steps. It took about 3 days on one Tesla A100 GPU.

    4. Train HifiGAN model:
python Vocoder_HifiGAN_Model/train.py --config Config/config_16k_mel.json

To train HifiGAN, we used a batch_size of 16 and trained for 1M steps. It took about 3 days on one Tesla A100 GPU.
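
The reported step counts and durations give a rough throughput that can help when budgeting a shorter or longer run. This is back-of-the-envelope arithmetic only: the helper names are hypothetical, "about 3 days" is treated as exactly 3 days, and `batch_size` is assumed to count utterances per step.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def steps_per_second(total_steps: int, days: float) -> float:
    """Average training steps per second over the whole run."""
    return total_steps / (days * SECONDS_PER_DAY)


def utterances_seen(total_steps: int, batch_size: int) -> int:
    """Number of utterances processed, assuming one batch per step."""
    return total_steps * batch_size


def estimated_days(steps: int, rate: float) -> float:
    """Days needed for `steps` at `rate` steps per second."""
    return steps / rate / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = steps_per_second(1_200_000, 3)       # txt2vec / vec2mel figures
    print(round(rate, 2))                       # 4.63
    print(utterances_seen(1_200_000, 16))       # 19200000
    print(round(estimated_days(400_000, rate), 2))  # 1.0
```

Throughput on other hardware or with other batch sizes will differ, so these figures are only a starting point.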

Test model

    1. Prepare test data:
    • a. test meta file Dataset/MM6/test.txt.
    • b. ref speaker embedding in Dataset/MM6/test_spk_emb/.
    2. Generate samples:
    bash test_scripts/quick_test.sh

    Alternatively, you can download our pre-trained model from Google Drive and put it in the corresponding Train_log directory. The training log can be found in the corresponding Train_log files.

    3. The results can be found in the test_result files.
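
Before launching the test script, a quick pre-flight check can confirm that the two inputs from step 1 exist. This is a minimal sketch under stated assumptions: `missing_test_inputs` is a hypothetical helper, the paths are the ones listed above relative to the repository root, and nothing is assumed about the file names or format of the speaker embeddings.

```python
from pathlib import Path


def missing_test_inputs(root="Dataset/MM6"):
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(root)
    meta = root / "test.txt"
    emb_dir = root / "test_spk_emb"
    problems = []
    if not meta.is_file():
        problems.append(f"missing test meta file: {meta}")
    if not emb_dir.is_dir():
        problems.append(f"missing speaker embedding directory: {emb_dir}")
    elif not any(emb_dir.iterdir()):
        problems.append(f"speaker embedding directory is empty: {emb_dir}")
    return problems


if __name__ == "__main__":
    for problem in missing_test_inputs():
        print(problem)
```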

To do

  • Scripts for few-shot training.
  • Scripts for zero-shot inference on any language.

Citation

If you use this code, our results, or the MM6 dataset in your paper, please cite our work as:

@article{gong2023zmm,
  title={ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations},
  author={Gong, Cheng and Wang, Xin and Cooper, Erica and Wells, Dan and Wang, Longbiao and Dang, Jianwu and Richmond, Korin and Yamagishi, Junichi},
  journal={arXiv preprint arXiv:2312.14398},
  year={2023}
}

License

The code in this repository is released under the BSD-3-Clause license as found in the LICENSE file. The txt2vec, vec2mel, and vec2wav subfolders are under the MIT License. The sv56scripts subfolder is under the GPL License.
