A Pytorch implementation of Google's Tacotron speech synthesis network.
This implementation also includes the Location-Sensitive Attention and the Stop Token features from Tacotron 2.
Furthermore, the model is trained on the LJ Speech dataset, with trained model provided.
Audio samples can be found in the result directory.
This implementation is based on r9y9/tacotron_pytorch, the main differences are:
- Adds Location-Sensitive Attention and the Stop Token from the Tacotron 2 paper. This can greatly reduce the amount of time and data required to train a model.
- Remove all TensorFlow dependencies that r9y9 uses, now it runs on PyTorch and PyTorch only.
- Adds a loss module, and use L2 (MSE) loss instead of L1 loss.
- Adds a data loader module.
- Incorporate the LJ Speech data preprocessing script from keithito.
- Code factoring and optimization for easier debug and extend in the furture.
Furthermore, some differences from the original Tacotron paper are:
- Predict r=5 non-overlapping consecutive out-put frames at each decoder step instead of r=2.
- Feed all r frames to the next decoder input step instead of just the last frame of r frames.
- Scale the loss on predicted linear spectrograms so that lower frequencies that corresponds to human speech (0 to 3000 Hz) weighs more.
- Did not use a loss mask in sequence-to-sequence learning, this forces the model to learn when to stop synthesis.
- Disable bias for the 1-Dimensional convolution unit in the CBHG modulehas. These implementation details helps the model's convergence.
Audio quality isn't as good as Google's demo yet, but hopefully it will improve eventually. Pull requests are welcome!
- Clone this repo:
git clone git@github.com:andi611/Tacotron-Pytorch.git
- CD into this repo:
cd Tacotron-Pytorch
-
Install Python 3.
-
Install the latest version of Pytorch according to your platform. For better performance, install with GPU support (CUDA) if viable. This code works with Pytorch 0.4 and later.
-
Install requirements:
pip3 install -r requirements.txt
Warning: you need to install torch depending on your platform. Here list the Pytorch version used when built this project was built.
-
Download the LJ Speech dataset.
You can use other datasets if you convert them to the right format. See TRAINING_DATA.md for more info.
-
Unpack the dataset into
~/Tacotron-Pytorch/data
After unpacking, your tree should look like this for LJ Speech:
|- Tacotron-Pytorch |- data |- LJSpeech-1.1 |- metadata.csv |- wavs
-
Preprocess the LJ Speech dataset and make model-ready meta files using preprocess.py:
python3 preprocess.py --mode make
After preprocessing, your tree will look like this:
|- Tacotron-Pytorch |- data |- LJSpeech-1.1 (The downloaded dataset) |- metadata.csv |- wavs |- meta (generate by preprocessing) |- meta_text.txt |- meta_mel_xxxxx.npy ... |- meta_spec_xxxxx.npy ... |- test_transcripts.txt (provided)
-
Train a model using train.py
python3 train.py --ckpt_dir ckpt/ --log_dir log/
Restore training from a previous checkpoint:
python3 train.py --ckpt_dir ckpt/ --log_dir log/ --model_name 500000
Tunable hyperparameters are found in config.py.
You can adjust these parameters and setting by editing the file, the default hyperparameters are recommended for LJ Speech.
-
Monitor with Tensorboard (OPTIONAL)
tensorboard --logdir 'path to log_dir'
The trainer dumps audio and alignments every 2000 steps by default. You can find these in
tacotron/ckpt/
.
Testing: Using a pre-trained model and test.py
- Run the testing environment with interactive mode:
python3 test.py --interactive --plot --model_name 500000
- Run the testing algorithm on a set of transcripts (Results can be found in the result/500000 directory) :
python3 test.py --plot --model_name 500000 --test_file_path ./data/test_transcripts.txt
Credits to Ryuichi Yamamoto for a wonderful Pytorch implementation of Tacotron, which this work is mainly based on. This work is also inspired by NVIDIA's Tacotron 2 PyTorch implementation.
- Add more configurable hparams