@gzfffff has provided an updated version of this repo with multiple bugs corrected (thanks!).
Thanks to everyone who has pointed out bugs in this repo. I was surprised that so many of you were interested in this project. Since I did not expect people to run my training code, I am sorry that detailed steps for training the model were not provided and that some common bugs were left unfixed. I have now fixed a bug in this repo.
Another common replication problem is that training the model from scratch does not produce natural synthesized speech. I encountered the same issue. So, before training, I initialized the model with the weights of a pre-trained English model (link). With this pre-trained initialization, the Chinese model converged very quickly and was able to produce natural speech.
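If you want to replicate this warm start, the sketch below shows one way to load English Tacotron2 weights into a Chinese model with PyTorch, copying only the tensors whose names and shapes match (the text-embedding table is skipped automatically because the English and Chinese symbol sets differ). The checkpoint path is illustrative, and `load_model`/`create_hparams` are names from Nvidia's Tacotron2 codebase, not necessarily this repo's exact API.

```python
import torch
from hparams import create_hparams  # from Nvidia's Tacotron2 codebase
from train import load_model        # from Nvidia's Tacotron2 codebase

# Build the Tacotron2 model for the Chinese symbol set first.
hparams = create_hparams()
model = load_model(hparams)

# Load the pre-trained English checkpoint (path is illustrative).
pretrained = torch.load("tacotron2_english.pt", map_location="cpu")["state_dict"]

# Keep only weights whose names and shapes match the Chinese model;
# mismatched tensors (e.g. the embedding table) keep their fresh init.
model_state = model.state_dict()
compatible = {k: v for k, v in pretrained.items()
              if k in model_state and v.shape == model_state[k].shape}
model_state.update(compatible)
model.load_state_dict(model_state)
```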
The training steps are also updated (see below).
Audio samples can be found here: online demo
All synthesized stimuli can be accessed here.
Training data can be found here.
You can directly run the TTS models (Tacotron2 and WaveGlow) on Google Colab (with a powerful GPU).
torch == 1.1.0 (newer versions will not work!)
- Download the pre-trained Mandarin models from this folder.
- Download the pre-trained Chinese BERT (BERT-wwm-ext, Chinese).
- Run `inference_bert.ipynb`.
Or use the following command line:

python synthesize.py --text ./stimuli/tone3_stimuli --use_bert --bert_folder path_to_bert_folder \
    --tacotron_path path_to_pre-trained_tacotron2 --waveglow_path path_to_pre-trained_waveglow \
    --out_dir path_output_dir
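For context, the synthesis pipeline follows the standard Tacotron2 + WaveGlow recipe: the acoustic model predicts a mel spectrogram from the input symbols, and the vocoder turns that spectrogram into a waveform. Below is a minimal sketch of that pipeline under the conventions of Nvidia's published code (checkpoint layouts, `text_to_sequence` frontend, sigma value); the repo's `synthesize.py`, which also feeds in BERT features when `--use_bert` is set, may differ in its details.

```python
import torch
from hparams import create_hparams   # Nvidia Tacotron2 codebase
from train import load_model
from text import text_to_sequence

hparams = create_hparams()

# Load the pre-trained acoustic model and vocoder (paths illustrative).
tacotron2 = load_model(hparams)
tacotron2.load_state_dict(torch.load("tacotron2_mandarin.pt")["state_dict"])
tacotron2.eval()
waveglow = torch.load("waveglow.pt")["model"].eval()

# Encode the input text into a symbol-ID sequence (the input format
# shown here, pinyin with tone numbers, is an assumption).
sequence = torch.LongTensor(
    text_to_sequence("ni3 hao3 shi4 jie4", ["basic_cleaners"])).unsqueeze(0)

with torch.no_grad():
    # Tacotron2 predicts a mel spectrogram from the symbols...
    _, mel_postnet, _, _ = tacotron2.inference(sequence)
    # ...and WaveGlow vocodes the spectrogram into a waveform.
    audio = waveglow.infer(mel_postnet, sigma=0.666)
```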
Note: the current implementation is based on Nvidia's public implementations of Tacotron2 and WaveGlow.
torch == 1.1.0 (newer versions will not work!)
- Download the dataset.
- Download the pre-trained Chinese BERT (BERT-wwm-ext, Chinese).
- Run the scripts in the preprocessing folder:
- partition.py
- preprocess_audio.py
- preprocess_text.py
- extract_bert.py (see the sketch after this list)
- Run the training script (detailed descriptions of each argument can be found in the source code).
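As a rough illustration of what the extract_bert.py step computes, here is a minimal sketch of extracting per-character contextual embeddings from BERT-wwm-ext with the Hugging Face transformers library. This is an assumption about the step's intent rather than the repo's exact code, and note that recent transformers releases require a newer torch than the 1.1.0 pinned above.

```python
import torch
from transformers import BertModel, BertTokenizer

# Load Chinese BERT-wwm-ext from a local folder (path is illustrative).
tokenizer = BertTokenizer.from_pretrained("path_to_bert_folder")
bert = BertModel.from_pretrained("path_to_bert_folder").eval()

def extract_bert_features(text: str) -> torch.Tensor:
    """Return one contextual embedding per input character."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Drop the [CLS] and [SEP] positions, keeping a
    # (num_chars, hidden_size) matrix of character embeddings.
    return outputs.last_hidden_state[0, 1:-1]

features = extract_bert_features("今天天气很好")  # "The weather is nice today"
print(features.shape)  # torch.Size([6, 768])
```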
This project has benefited immensely from the following works.
Pre-Trained Chinese BERT with Whole Word Masking
Tacotron 2 - PyTorch implementation with faster-than-realtime inference
WaveGlow: a Flow-based Generative Network for Speech Synthesis
A Demo of MTTS Mandarin/Chinese Text to Speech FrontEnd
Open-source mandarin speech synthesis data
只用同一声调的字可以造出哪些句子? (What sentences can be composed using only characters of the same tone?)