Model Release: Tacotron2 with Forward Attention - LJSpeech #345

erogol · 2020-02-06T16:17:07Z

Model Link: https://drive.google.com/open?id=10ymOlWHutqTtfDYhIbHULn2IKDKP0O9m
Colab example: https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR

This model was trained with Forward Attention enabled until ~400K iterations and then finetuned with a Batch Norm prenet until the end. It is the best model trained so far.

I observe once again that using a BN-based prenet improves the spectrogram quality considerably, but if you train it from scratch, the model does not learn the attention.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN

You can see the TB figures below:

m-toman · 2020-02-07T11:41:50Z

For testing the model, this worked for me:

git clone https://github.com/erogol/WaveRNN.git
git clone https://github.com/mozilla/TTS.git
cd TTS
git checkout dev
mkdir demo_models
cd demo_models
mkdir -p wavernn_models tts_models
wavernn_pretrained_model=wavernn_models/checkpoint_433000.pth.tar
gdown -O ${wavernn_pretrained_model} https://drive.google.com/uc?id=12GRFk5mcTDXqAdO5mR81E-DpTk8v2YS9
wavernn_pretrained_model_config=wavernn_models/config.json
gdown -O ${wavernn_pretrained_model_config} https://drive.google.com/uc?id=1kiAGjq83wM3POG736GoyWOOcqwXhBulv
tts_pretrained_model=tts_models/checkpoint_670000.pth.tar
gdown -O ${tts_pretrained_model} https://drive.google.com/uc?id=1_mbQDLHekiearftLaraJaPl-FuNgOzKV
tts_pretrained_model_config=tts_models/config.json
gdown -O ${tts_pretrained_model_config} https://drive.google.com/uc?id=19FQscticcxQIFH4MwnxQ950LyxcN8kli
cd ../..
mkdir TTS/demo_output
python -m TTS.synthesize --use_cuda true --vocoder_config_path TTS/demo_models/wavernn_models/config.json --vocoder_path TTS/demo_models/wavernn_models/checkpoint_433000.pth.tar "Evil is Evil. Lesser, greater, middling… Makes no difference. The degree is arbitary. The definition’s blurred. If I’m to choose between one evil and another… I’d rather not choose at all." TTS/demo_models/tts_models/config.json TTS/demo_models/tts_models/checkpoint_670000.pth.tar TTS/demo_output/

but requires #349

Interesting result for that long Witcher quote:
evil.zip
Seems dot is mapped to the breathing sound ;)

erogol · 2020-02-07T12:03:19Z

You can now test this model with PWGAN using:
https://github.com/mozilla/TTS/blob/dev/notebooks/Benchmark-PWGAN.ipynb

erogol · 2020-02-10T12:00:33Z

I added a colab example running this model with PWGAN vocoder
https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR

reuben · 2020-02-12T17:58:23Z

I released a new server package with this model embedded in it: https://github.com/mozilla/TTS/wiki/Released-Models#simple-packaging---self-contained-package-that-runs-an-http-api-for-a-pre-trained-tts-model

erogol · 2020-02-13T15:46:31Z

I also created an example colab using MelGAN as a vocoder. It's been trained by changing the PWGAN generator with MelGAN's architecture. It performs a bit better and slightly faster.

https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

nmstoker · 2020-02-14T00:16:55Z

The quality with this latest colab is amazing and it does well even with longer sentences 🙂

Btw there was a reference to pwgan model that needed to be switched to MelGAN, but otherwise this is so straightforward to use.

erogol · 2020-02-14T12:35:58Z

@nmstoker good to hear that :)

What do you mean by "reference to pwgan"? Do you mean the server release?

nmstoker · 2020-02-14T12:44:15Z

> What do you mean by "reference to pwgan"? Do you mean the server release?

Sorry, I wasn't clear last night. It's a tiny thing, but in the last but one cell of the MelGAN Colab above, there's this line

vocoder_model.load_state_dict(torch.load(PWGAN_MODEL, map_location="cpu")["model"]["generator"])

And PWGAN_MODEL isn't defined (it's a simple matter of updating it to MELGAN_MODEL)

nmstoker · 2020-02-16T13:21:15Z

Also, I noticed that the checkout given in the Colab appears not to exist. In the cell with this:

%cd ../ParallelWaveGAN/
! git checkout 22018e6

it didn't seem to cause a problem, it just quietly failed with:

error: pathspec '22018e6' did not match any file(s) known to git.

Presumably, right now, nothing has changed to break it since then.

It looks like it's probably meant to be git checkout a22018e, given there's this hash from a commit around the right time: a22018e6e6be1f9381b003496cc285bdd5a4a284 and it's just offset by one character.
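
The off-by-one reading can be sanity-checked by comparing both abbreviations against the full hash quoted above. This is only a string check under the assumption that the quoted hash is the intended commit; it is not a substitute for running git, and the names used here are illustrative.

```python
# Full commit hash quoted in the comment above.
FULL_HASH = "a22018e6e6be1f9381b003496cc285bdd5a4a284"


def is_valid_abbreviation(short: str, full: str = FULL_HASH) -> bool:
    """An abbreviated hash can only resolve to this commit if it is a prefix of it."""
    return full.startswith(short)


colab_ref = "22018e6"      # what the Colab cell tries to check out
suggested_ref = "a22018e"  # the proposed correction

print(is_valid_abbreviation(colab_ref))      # False
print(is_valid_abbreviation(suggested_ref))  # True
print(FULL_HASH.find(colab_ref))             # 1, i.e. shifted by one character
```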

nmstoker · 2020-02-16T18:28:52Z

I've got what may be a silly question (if so, sorry! 🙂 )

Comparing the training stats charts above with the values set in the config.json for the released model, I see that for the orange line the stats change as if they're undergoing gradual training (ie they move at 50k, 130k, 290k) and then you've switched to BN fine-tuning with the blue line at 400k.

That orange gradual training pattern is consistent with the "gradual_training" values in the config file released with the model, but I see the comment mentions that gradual training is only for Tacotron, and yet this is Tacotron 2. Perhaps the comment simply hasn't been updated? (it's like that in all the configs I've seen since it was introduced)

Q. Does gradual training work for both models now? Or am I missing something about how you set the config file up for the initial orange training run? (eg that was changed when you switched to BN fine-tuning)

Thanks!

Edit: actually it looks like maybe the comment has been removed from the config.json here: https://github.com/mozilla/TTS/blob/master/config.json#L41 so presumably it does now work for both
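
For readers unfamiliar with the setting, here is a minimal sketch of how a step-indexed gradual-training schedule could be resolved. It assumes each entry has the form `[start_step, r, batch_size]`, where `r` is the decoder reduction factor. The step boundaries 50k, 130k and 290k come from the charts discussed above; the `r` and batch-size values and the function name `schedule_at` are illustrative assumptions, not values read from the released config.

```python
from typing import List, Tuple

# Hypothetical schedule: [start_step, r, batch_size], sorted by start_step.
GRADUAL_TRAINING: List[List[int]] = [
    [0, 7, 64],
    [50_000, 5, 64],
    [130_000, 3, 32],
    [290_000, 2, 32],
]


def schedule_at(step: int, schedule: List[List[int]]) -> Tuple[int, int]:
    """Return (r, batch_size) from the last entry whose start_step <= step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, entry_r, entry_bs in schedule:
        if step >= start_step:
            r, batch_size = entry_r, entry_bs
    return r, batch_size


for step in (10_000, 50_000, 200_000, 400_000):
    print(step, schedule_at(step, GRADUAL_TRAINING))
```

With a schedule like this, the training curves would show a visible change each time a boundary step is crossed, which is the pattern described for the orange line.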

erogol · 2020-02-17T01:04:40Z

@nmstoker yeah, it works for both now :)

george-roussos · 2020-02-21T09:43:59Z

The model produces great results, and I shall try to adapt it to a new speaker with an average-sized dataset (6hrs), male voice, no silences, clean audio. Will report results.

I was able to test the model yesterday, however I keep getting an AttributeError: 'AttrDict' object has no attribute 'mulaw' error, even though I have defined mulaw in both config files I use (do I define it as true or what). I might be doing something wrong. Anybody care enough to chime in?

erogol · 2020-02-21T09:59:12Z

@george-roussos which model exactly are you training?

If something is missing in the config file, just add it. In the worst case, you can try what seems logical, but the mu-law setting belongs to the WaveRNN vocoder, which is not related.
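
The reported error is what a dict-backed attribute wrapper raises when code reads a key that the loaded config lacks. The sketch below uses a stand-in `AttrDict` class and a hypothetical helper `missing_keys`; the real class and the set of keys a given vocoder version expects may differ, so treat both as assumptions.

```python
class AttrDict(dict):
    """Stand-in for a config wrapper that exposes dict keys as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(
                f"'{type(self).__name__}' object has no attribute '{name}'"
            ) from None


def missing_keys(config: dict, required: list) -> list:
    """Return the required keys that are absent from the top level of config."""
    return [key for key in required if key not in config]


# Hypothetical vocoder config that lacks the key the code later reads.
vocoder_config = AttrDict({"sample_rate": 16000})

print(missing_keys(vocoder_config, ["sample_rate", "mulaw"]))

try:
    vocoder_config.mulaw
except AttributeError as err:
    print(err)
```

If a key is present in the file but nested under another section, a top-level lookup like this still fails, which would be one possible explanation for the error persisting after the config was edited.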

george-roussos · 2020-02-21T10:21:01Z

> @george-roussos which model exactly are you training?
>
> If something is missing in the config file, just add it. In the worst case, you can try what seems logical, but the mu-law setting belongs to the WaveRNN vocoder, which is not related.

I am not training anything right now, I am testing the model. The implementation I have is the TTS model trained on forward attention and batch normalization and the WaveRNN vocoder, which I am guessing is universal. My thought was I could first try and finetune the TTS model and see how it performs when adapted on a new voice when the data is clean and not sparse. Do you think it would be possible and, if so, what would your expectation be with a good quality dataset of 12 hours?

george-roussos · 2020-02-21T12:52:14Z

Back again. What branch/commit should we use to retrain the TTS model? I am trying to run distribute.py with the config that comes with checkpoint_670000.pth.tar, but is that the correct way to do it?

erogol · 2020-02-21T14:06:58Z

@george-roussos this is not the right place for this topic. You'd better post it on discord.

Try the commit version given with the model (model table) and yes it is the right way.

george-roussos · 2020-02-25T09:35:20Z

Is there any way we can make this model compatible with a universal WaveRNN vocoder after fine-tuning to a new voice? I tried to plug in the universal checkpoint from the git repo, but I get RuntimeError: Error(s) in loading state_dict for Model: size mismatch for upsample.up_layers.5.weight: copying a param with shape torch.Size([1, 1, 1, 17]) from checkpoint, the shape in current model is torch.Size([1, 1, 1, 23]). The model @m-toman links above only works with LJSpeech.

By the way, the results I am getting after fine-tuning on a 5-hour dataset with transcription errors are pretty good...
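
A size mismatch like the one quoted above can be located before calling `load_state_dict` by comparing parameter shapes by name. The sketch below is plain Python over name-to-shape mappings; the helper name `find_shape_mismatches` is hypothetical, and the two shapes are copied from the error message rather than read from real checkpoints.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape], model_shapes: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """List (name, checkpoint_shape, model_shape) for parameters that differ."""
    return [
        (name, ckpt_shape, model_shapes[name])
        for name, ckpt_shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != ckpt_shape
    ]


# Shapes taken from the error message. With PyTorch they could be collected
# as {k: tuple(v.shape) for k, v in state_dict.items()}.
checkpoint_shapes = {"upsample.up_layers.5.weight": (1, 1, 1, 17)}
model_shapes = {"upsample.up_layers.5.weight": (1, 1, 1, 23)}

for name, ckpt, model in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```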

erogol · 2020-02-25T11:13:16Z

The difference between the universal WaveRNN and the TTS model you are using is the sampling rate. The WaveRNN model uses 16 kHz and the TTS model uses 22050 Hz. So you may need to finetune WaveRNN at this rate too, or you can reduce the sampling rate when you finetune TTS on your dataset.

You also need to check out the right version of WaveRNN given with the model checkpoint.
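
This kind of mismatch can be caught before loading any weights by comparing the sampling rates declared in the two configs. The sketch below assumes each config stores the rate under `audio.sample_rate`; that key path and the function name `check_sample_rates` are assumptions, so adjust them to the config files you actually have.

```python
def check_sample_rates(tts_config: dict, vocoder_config: dict) -> bool:
    """Return True when both configs declare the same audio sampling rate."""
    tts_sr = tts_config["audio"]["sample_rate"]
    voc_sr = vocoder_config["audio"]["sample_rate"]
    if tts_sr != voc_sr:
        print(f"Sampling rate mismatch: TTS {tts_sr} Hz vs vocoder {voc_sr} Hz")
        return False
    return True


# Rates quoted in the thread: 22050 for the TTS model, 16K for universal WaveRNN.
tts_config = {"audio": {"sample_rate": 22050}}
universal_wavernn_config = {"audio": {"sample_rate": 16000}}

check_sample_rates(tts_config, universal_wavernn_config)
```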

george-roussos · 2020-02-25T11:44:19Z

Thanks. Do I check out the commit given? I imagine fine-tuning to 22050 is not as simple as editing the rate in config.json and restoring the checkpoint?

Jackiexiao · 2020-08-13T01:57:41Z

> @george-roussos this is not the right place for this topic. You'd better post it on discord.
>
> Try the commit version given with the model (model table) and yes it is the right way.

Is there a Discord server for TTS topics?

nmstoker · 2020-08-13T09:57:47Z

@Jackiexiao yes, please have a look at the main page of this repo https://github.com/mozilla/TTS and you'll see the link to the Discourse forum there

erogol added the model-release explanation for new model releases label Feb 6, 2020

erogol closed this as completed Mar 11, 2020
