This is GrapheneOS's fork of 🍵 [Matcha-TTS](https://arxiv.org/abs/2309.03199) (ICASSP 2024), with major speed and efficiency improvements compared to the original.
The original code and README.md can be found at https://github.com/shivammehta25/Matcha-TTS.
Please use Python 3.11 for this repository.
🍵 Matcha-TTS is a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis:
- Is probabilistic
- Has compact memory footprint
- Sounds highly natural
- Is very fast to synthesise from
This fork adds major speed and efficiency improvements to training and inference.
We added torch.compile() to boost training speed.
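For reference, enabling torch.compile() amounts to wrapping the module once before the training loop. The sketch below is illustrative only: it uses a stand-in `nn.Sequential` model and a toy loop rather than the fork's actual training code, just to show where the call fits.

```python
import torch
import torch.nn as nn

# Stand-in module; in the real training code the Matcha-TTS model is wrapped the same way.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# torch.compile() (PyTorch 2.x) traces the module and generates optimized kernels.
# The first step is slower while compilation happens; later steps reuse the compiled graph.
model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
features = torch.randn(16, 80)
target = torch.randn(16, 80)

for _ in range(3):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(features), target)
    loss.backward()
    optimizer.step()
```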
We set out_size to 64, which we've found lets a smaller model attain much higher quality than when it's left unset. We haven't compared other values for out_size yet, but we chose 64 because it enables a very fast time-to-first-audio when running inference in chunks.
During inference, the decoder should be run in chunks of out_size frames. This lets you decode in a streaming fashion and achieve a fast time-to-first-audio, which is especially useful for low-latency edge use cases such as an on-device text-to-speech engine. The exported ONNX model's decoder only accepts input in chunks of out_size.
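As a rough illustration of the chunked decoding idea (not the actual decoder API: the `decode_chunk` and `vocode` callables below are stand-ins for the Matcha-TTS decoder and the vocoder, and the samples-per-frame ratio is arbitrary):

```python
import torch

OUT_SIZE = 64  # decoder chunk length in mel frames, matching the out_size used for training


def stream_decode(latent: torch.Tensor, decode_chunk, vocode):
    """Yield audio chunk by chunk instead of waiting for the whole utterance.

    latent       -- encoder output aligned to mel frames, shape (1, channels, n_frames)
    decode_chunk -- callable turning one latent chunk into a mel chunk (stand-in for the decoder)
    vocode       -- callable turning a mel chunk into a waveform chunk (stand-in for the vocoder)
    """
    n_frames = latent.shape[-1]
    for start in range(0, n_frames, OUT_SIZE):
        chunk = latent[..., start:start + OUT_SIZE]  # the last chunk may need padding to OUT_SIZE for the ONNX decoder
        mel = decode_chunk(chunk)   # decode only OUT_SIZE frames at a time
        yield vocode(mel)           # audio for this chunk is available immediately


# Toy stand-ins so the sketch runs end to end.
decode_chunk = lambda z: z                               # identity "decoder"
vocode = lambda mel: torch.zeros(mel.shape[-1] * 256)    # arbitrary samples-per-frame ratio

latent = torch.randn(1, 80, 200)
audio = torch.cat(list(stream_decode(latent, decode_chunk, vocode)))
```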
We train with precomputed durations and prior_loss set to False, as this seems to prevent the duration predictor part of the model from overfitting for some reason. To compute the durations, we use the first-stage model checkpoint with the lowest duration prediction validation loss. Training with precomputed durations is also faster.
The parameter count has been significantly reduced from 18.2 million to 4.7 million, only around 1/4 the parameters of the original! This further increases training and inference speed and is achieved through training with precomputed durations and the out_size setting mentioned above.
Currently, only training on LJSpeech has been tested and used. Follow the directions below to train on LJSpeech.
- Clone our fork of Misaki and install it using

  ```bash
  pip install -e misaki
  ```

- Also clone our graphemes_to_phonemes repository and make sure it contains the models, as our fork of Misaki depends on it.
- Clone and enter this repository

  ```bash
  cd Matcha-TTS
  ```

- Download the dataset from here, extract it to `data/LJSpeech-1.1`, and prepare the file lists to point to the extracted data like for item 5 in the setup of the NVIDIA Tacotron 2 repo.
- Install this package from source

  ```bash
  pip install -e .
  ```

- Go to `configs/data/ljspeech.yaml` and change to the paths of your train and validation filelists.

  The defaults are the names of the files used by the NVIDIA Tacotron 2 repo; you just need to download them from https://github.com/NVIDIA/tacotron2/tree/master/filelists, rename them by removing "_filelist" at the end before ".txt", and place them at the corresponding paths.

  ```yaml
  train_filelist_path: data/filelists/ljs_audio_text_train.txt
  valid_filelist_path: data/filelists/ljs_audio_text_val.txt
  test_filelist_path: data/filelists/ljs_audio_text_test.txt
  ```

- Generate normalisation statistics with the yaml file of dataset configuration
  ```bash
  matcha-data-stats -i ljspeech.yaml
  # Output:
  # {'mel_mean': -5.51702880859375, 'mel_std': 2.064393997192383}
  ```

  Update these values in `configs/data/ljspeech.yaml` under the `data_statistics` key.

  ```yaml
  data_statistics:  # Computed for ljspeech dataset
    mel_mean: -5.51702880859375
    mel_std: 2.064393997192383
  ```

- Run initial training to compute durations
  ```bash
  python -m matcha.train experiment=ljspeech
  ```

  or for multi-GPU training, run

  ```bash
  python -m matcha.train experiment=ljspeech trainer.devices=[0,1]
  ```

  Around 499 epochs seems to be a good stopping point. At that point, train_dur_loss was ~0.3722, val_dur_loss was ~0.3627, and val_dur_loss had stabilized. Please make sure the checkpoint you use does not have a loss spike. Past that point, the model seems to start overfitting on duration prediction: train_dur_loss keeps decreasing at a slow pace while val_dur_loss slowly goes up.
- Synthesise from the initial custom trained model

  Make sure that the initial model works at least reasonably well; the quality will get better in the final model.

  ```bash
  matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --vocoder hifi-gan/upstream-trained-models/LJ_V3/generator_v3
  ```

- Generate durations

  Follow the instructions for extracting phoneme alignments from Matcha-TTS below and put the durations inside the `data/LJSpeech-1.1/durations` directory.
- Run final training with precomputed durations

  ```bash
  python -m matcha.train experiment=ljspeech_from_durations
  ```

  or for multi-GPU training, run

  ```bash
  python -m matcha.train experiment=ljspeech_from_durations trainer.devices=[0,1]
  ```

  We stopped at 1899 epochs for the model currently deployed in GrapheneOS Speech Services. At that point, train_epoch loss was ~0.6706, val_epoch loss was ~0.6823 and had stabilized. Please make sure the checkpoint you use does not have a loss spike.
- Synthesise from the final custom trained model

  ```bash
  matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --vocoder hifi-gan/upstream-trained-models/LJ_V3/generator_v3
  ```

To export a checkpoint to ONNX, first install ONNX Runtime with

```bash
pip install onnxruntime
```

then run the following:

```bash
python3 -m matcha.onnx.export matcha.ckpt onnx_model_folder --n-timesteps 5
```

Note that n_timesteps is treated as a hyper-parameter rather than a model input. This means you should specify it during export (not during inference). If not specified, n_timesteps defaults to 5.
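Because n_timesteps is baked in at export time, it will not show up as a graph input. If you want to check what the exported model actually expects, a quick inspection with onnxruntime works; the file path below is only an assumption based on the export example above.

```python
import onnxruntime as ort

# Path is assumed from the export example above; adjust it to your exported file.
session = ort.InferenceSession("onnx_model_folder/model.onnx")

# n_timesteps should not appear among the inputs, since it was fixed at export time.
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```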
Additionally, the exported decoder includes the pretrained HiFi-GAN LJ_V3 model, so it's usable for speech synthesis out of the box. It doesn't quite match our model, so there is some low-pitch static noise present in the synthesized speech. We plan to train a vocoder fine-tuned to our model to improve speech synthesis fidelity and eliminate the static noise.
To run inference on the exported model, first install onnxruntime using

```bash
pip install onnxruntime
pip install onnxruntime-gpu  # for GPU inference
```

then use the following:

```bash
python3 -m matcha.onnx.infer onnx_model_folder --text "hey" --output-dir ./outputs
```

This will write .wav audio files to the output directory.

You can also control synthesis parameters:

```bash
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --temperature 0.4 --speaking_rate 0.9 --spk 0
```

To run inference on GPU, make sure to install the onnxruntime-gpu package, and then pass --gpu to the inference command:

```bash
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --gpu
```

If the dataset is structured as (minimum example)
```
data/
└── LJSpeech-1.1
    └── wavs
```

then you can extract the phoneme-level alignments from a trained Matcha-TTS model using:
```bash
python matcha/utils/get_durations_from_trained_model.py -i dataset_yaml -c <checkpoint>
```

Example:

```bash
python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt
```

or simply:

```bash
matcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt
```

In the dataset config, turn on load_durations.

Example: `ljspeech.yaml`

```yaml
load_durations: True
```

or see an example in `configs/experiment/ljspeech_from_durations.yaml`.