Model Release: Tacotron2 with Forward Attention - LJSpeech #345

Closed
erogol opened this issue Feb 6, 2020 · 21 comments
Labels
model-release explanation for new model releases

Comments

@erogol
Contributor

erogol commented Feb 6, 2020

Model Link: https://drive.google.com/open?id=10ymOlWHutqTtfDYhIbHULn2IKDKP0O9m
Colab example: https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR

This model was trained with Forward Attention enabled until ~400K iterations and then fine-tuned with a Batch Norm prenet until the end. It is the best model trained so far.

I observe once again that a BN-based prenet improves the spectrogram quality considerably, but if you train with it from scratch, the model does not learn the attention.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN

You can see the TB figures below:

image

image

@erogol erogol added the model-release explanation for new model releases label Feb 6, 2020
@m-toman
Contributor

m-toman commented Feb 7, 2020

For testing the model, this worked for me:

git clone https://github.com/erogol/WaveRNN.git
git clone https://github.com/mozilla/TTS.git
cd TTS
git checkout dev
mkdir demo_models
cd demo_models
mkdir -p wavernn_models tts_models
wavernn_pretrained_model=wavernn_models/checkpoint_433000.pth.tar
gdown -O ${wavernn_pretrained_model} https://drive.google.com/uc?id=12GRFk5mcTDXqAdO5mR81E-DpTk8v2YS9
wavernn_pretrained_model_config=wavernn_models/config.json
gdown -O ${wavernn_pretrained_model_config} https://drive.google.com/uc?id=1kiAGjq83wM3POG736GoyWOOcqwXhBulv
tts_pretrained_model=tts_models/checkpoint_670000.pth.tar
gdown -O ${tts_pretrained_model} https://drive.google.com/uc?id=1_mbQDLHekiearftLaraJaPl-FuNgOzKV
tts_pretrained_model_config=tts_models/config.json
gdown -O ${tts_pretrained_model_config} https://drive.google.com/uc?id=19FQscticcxQIFH4MwnxQ950LyxcN8kli
cd ../..
mkdir TTS/demo_output
python -m TTS.synthesize --use_cuda true --vocoder_config_path TTS/demo_models/wavernn_models/config.json --vocoder_path TTS/demo_models/wavernn_models/checkpoint_433000.pth.tar "Evil is Evil. Lesser, greater, middling… Makes no difference. The degree is arbitary. The definition’s blurred. If I’m to choose between one evil and another… I’d rather not choose at all." TTS/demo_models/tts_models/config.json TTS/demo_models/tts_models/checkpoint_670000.pth.tar TTS/demo_output/

but requires #349
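The final command above embeds a long quoted sentence, which is easy to break with shell quoting. As a minimal sketch, the same argument list can be assembled in Python. The helper name `build_synthesize_cmd` is hypothetical; the flags, file names, and paths are copied from the command above and are not otherwise verified against the TTS command line:

```python
import shlex


def build_synthesize_cmd(text,
                         tts_dir="TTS/demo_models/tts_models",
                         vocoder_dir="TTS/demo_models/wavernn_models",
                         out_dir="TTS/demo_output/"):
    """Return the argv list for the synthesize call shown above.

    Hypothetical helper: flags and positional order mirror the command
    posted in this thread, nothing more.
    """
    return [
        "python", "-m", "TTS.synthesize",
        "--use_cuda", "true",
        "--vocoder_config_path", f"{vocoder_dir}/config.json",
        "--vocoder_path", f"{vocoder_dir}/checkpoint_433000.pth.tar",
        text,                                    # sentence to synthesize
        f"{tts_dir}/config.json",                # TTS config
        f"{tts_dir}/checkpoint_670000.pth.tar",  # TTS checkpoint
        out_dir,                                 # output folder
    ]


text = "Evil is Evil. Lesser, greater, middling."
cmd = build_synthesize_cmd(text)

# shlex.join renders a shell-safe string for copy/paste.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run it without any shell quoting, assuming the repositories and checkpoints are laid out as in the steps above.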

Interesting result for that long Witcher quote:
evil.zip
Seems dot is mapped to the breathing sound ;)

@erogol
Contributor Author

erogol commented Feb 7, 2020

You can now test this model with PWGAN using:
https://github.com/mozilla/TTS/blob/dev/notebooks/Benchmark-PWGAN.ipynb

@erogol
Contributor Author

erogol commented Feb 10, 2020

I added a colab example running this model with PWGAN vocoder
https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR

@reuben
Contributor

reuben commented Feb 12, 2020

@erogol
Contributor Author

erogol commented Feb 13, 2020

I also created an example colab using MelGAN as a vocoder. It was trained by replacing the PWGAN generator with MelGAN's architecture. It performs a bit better and is slightly faster.

https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

@nmstoker
Contributor

The quality with this latest colab is amazing and it does well even with longer sentences 🙂

Btw there was a reference to the PWGAN model that needed to be switched to MelGAN, but otherwise this is so straightforward to use.

@erogol
Contributor Author

erogol commented Feb 14, 2020

@nmstoker good to hear that :)

What do you mean by "reference to pwgan"? Do you mean the server release?

@nmstoker
Contributor

What do you mean by "reference to pwgan"? Do you mean the server release?

Sorry, I wasn't clear last night. It's a tiny thing, but in the last but one cell of the MelGAN Colab above, there's this line

vocoder_model.load_state_dict(torch.load(PWGAN_MODEL, map_location="cpu")["model"]["generator"])

And PWGAN_MODEL isn't defined (it's a simple matter of updating it to MELGAN_MODEL)

@nmstoker
Contributor

Also, I noticed that the checkout given in the Colab appears not to exist. In the cell with this:

%cd ../ParallelWaveGAN/
! git checkout 22018e6

it didn't seem to cause a problem, it just quietly failed with:

error: pathspec '22018e6' did not match any file(s) known to git.

Presumably, right now, nothing has changed to break it since then.

It looks like it's probably meant to be git checkout a22018e, given there's this hash from a commit around the right time: a22018e6e6be1f9381b003496cc285bdd5a4a284 and it's just offset by one character.
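The off-by-one-character diagnosis above can be checked mechanically. A minimal sketch, assuming you have a list of full commit hashes to compare against (the function name `diagnose_short_hash` is hypothetical; the only hash used is the one quoted above):

```python
def diagnose_short_hash(short, full_hashes):
    """Classify an abbreviated hash against known full commit hashes.

    Returns ("match", full) if `short` is a valid prefix,
    ("offset", full, n) if it only appears n characters into a hash
    (a likely copy/paste slip), and ("unknown",) otherwise.
    """
    for full in full_hashes:
        if full.startswith(short):
            return ("match", full)
    for full in full_hashes:
        offset = full.find(short)
        if offset > 0:
            return ("offset", full, offset)
    return ("unknown",)


# Full hash quoted in the comment above.
FULL = "a22018e6e6be1f9381b003496cc285bdd5a4a284"

print(diagnose_short_hash("22018e6", [FULL]))  # the ref used in the Colab
print(diagnose_short_hash("a22018e", [FULL]))  # the suggested correction
```

Git resolves abbreviated hashes by prefix only, so a ref that matches a commit from its second character onward is rejected exactly as in the error message above.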

@nmstoker
Contributor

nmstoker commented Feb 16, 2020

I've got what may be a silly question (if so, sorry! 🙂 )

Comparing the training stats charts above with the values set in the config.json for the released model, I see that for the orange line the stats change as if they're undergoing gradual training (ie they move at 50k, 130k, 290k) and then you've switched to BN fine-tuning with the blue line at 400k.

That orange gradual training pattern is consistent with the "gradual_training" values in the config file released with the model, but I see the comment mentions that gradual training is only for Tacotron, and yet this is Tacotron 2. Perhaps the comment simply hasn't been updated? (it's like that in all the configs I've seen since it was introduced)

Q. Does gradual training work for both models now? Or am I missing something about how you set the config file up for the initial orange training run? (eg that was changed when you switched to BN fine-tuning)

Thanks!

Edit: actually it looks like maybe the comment has been removed from the config.json here: https://github.com/mozilla/TTS/blob/master/config.json#L41 so presumably it does now work for both
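To make the step changes discussed above concrete, here is a minimal sketch of how a stepwise "gradual_training" schedule could be looked up during training. It assumes each entry has the form [start_step, r, batch_size], where r is the reduction factor; the start steps 50k, 130k, and 290k come from the charts discussed above, while the r and batch-size values are purely illustrative and are not taken from the released config.json:

```python
def gradual_schedule(global_step, schedule):
    """Return (r, batch_size) for the latest stage already reached.

    `schedule` is a list of [start_step, r, batch_size] entries sorted by
    start_step (assumed layout, see the note above).
    """
    current = schedule[0]
    for stage in schedule:
        if global_step >= stage[0]:
            current = stage
    return current[1], current[2]


# Illustrative values only; the boundaries mirror the 50k/130k/290k
# transitions visible in the training charts.
SCHEDULE = [
    [0, 7, 32],
    [50000, 3, 32],
    [130000, 2, 16],
    [290000, 1, 8],
]

for step in (0, 49999, 50000, 130000, 300000):
    print(step, gradual_schedule(step, SCHEDULE))
```

Each boundary switches the model to a smaller reduction factor, which is why the training curves jump at those steps rather than changing smoothly.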

@erogol
Contributor Author

erogol commented Feb 17, 2020

@nmstoker yeah it works for both now :)

@george-roussos
Contributor

The model produces great results, and I shall try to adapt it to a new speaker with an average-sized dataset (6hrs), male voice, no silences, clean audio. Will report results.

I was able to test the model yesterday; however, I keep getting an AttributeError: 'AttrDict' object has no attribute 'mulaw' error, even though I have defined mulaw in both config files I use (should I define it as true, or as something else?). I might be doing something wrong. Anybody care enough to chime in?

@erogol
Contributor Author

erogol commented Feb 21, 2020

@george-roussos which model are you training exactly?

If something is missing in the config file, just add it. In the worst case, you can try what seems logical, but the mu-law setting belongs to the WaveRNN vocoder, which is not related.

@george-roussos
Contributor

@george-roussos which model are you training exactly?

If something is missing in the config file, just add it. In the worst case, you can try what seems logical, but the mu-law setting belongs to the WaveRNN vocoder, which is not related.

I am not training anything right now, I am testing the model. The implementation I have is the TTS model trained on forward attention and batch normalization and the WaveRNN vocoder, which I am guessing is universal. My thought was I could first try and finetune the TTS model and see how it performs when adapted on a new voice when the data is clean and not sparse. Do you think it would be possible and, if so, what would your expectation be with a good quality dataset of 12 hours?

@george-roussos
Contributor

george-roussos commented Feb 21, 2020

Back again. What branch/commit should we use to retrain the TTS model? I am trying to run distribute.py and use the config checkpoint_670000.pth.tar has, but is that the correct way to do it?

@erogol
Contributor Author

erogol commented Feb 21, 2020

@george-roussos it is not the right place to go with this topic. You better post it on discord.

Try the commit version given with the model (model table) and yes it is the right way.

@george-roussos
Contributor

george-roussos commented Feb 25, 2020

Is there any way we can make this model compatible with a universal WaveRNN vocoder after fine-tuning to a new voice? I tried to plug in the universal checkpoint from the git repo, but I get RuntimeError: Error(s) in loading state_dict for Model: size mismatch for upsample.up_layers.5.weight: copying a param with shape torch.Size([1, 1, 1, 17]) from checkpoint, the shape in current model is torch.Size([1, 1, 1, 23]). The model @m-toman links above only works with LJSpeech.
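The two shapes in the error above differ only in the kernel width (17 vs 23), which points at different upsampling factors rather than a corrupted checkpoint. A minimal sketch, under the assumption (not confirmed in this thread) that the vocoder's upsampling layers use a kernel width of 2 * scale + 1; the function name is hypothetical:

```python
def scale_from_kernel_width(width):
    """Recover the upsampling factor from a conv kernel width.

    Assumes kernel_width == 2 * scale + 1, so the width must be odd.
    """
    if width < 3 or width % 2 == 0:
        raise ValueError(f"unexpected kernel width: {width}")
    return (width - 1) // 2


checkpoint_scale = scale_from_kernel_width(17)  # from the checkpoint
model_scale = scale_from_kernel_width(23)       # from the current model

print("checkpoint upsample factor:", checkpoint_scale)
print("current model upsample factor:", model_scale)
```

If that assumption holds, the checkpoint and the model built from the current config disagree on how many audio samples each spectrogram frame expands to, so the audio settings of the two configs are the first thing to compare.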

By the way, the results I am getting after fine-tuning on a 5-hour dataset with transcription errors are pretty good...

@erogol
Contributor Author

erogol commented Feb 25, 2020

The difference between the universal WaveRNN and the TTS model you are using is the sampling rate. The WaveRNN model uses 16 kHz and the TTS model uses 22050 Hz. So you may need to fine-tune WaveRNN at this rate too, or you can reduce the sampling rate when you fine-tune TTS on your dataset.

You also need to check out the right version of WaveRNN given with the model checkpoint.
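A quick way to catch this kind of mismatch before loading any checkpoint is to compare the sampling rates in the two config files. A minimal sketch, assuming both configs keep the rate under an "audio" section with a "sample_rate" key (that layout is an assumption here, and the inline JSON strings stand in for the real config.json files):

```python
import json


def sample_rate_of(config):
    """Read the sampling rate from a parsed config (assumed key layout)."""
    return config["audio"]["sample_rate"]


def check_rates(tts_config, vocoder_config):
    """Return (rates_match, tts_rate, vocoder_rate)."""
    tts_sr = sample_rate_of(tts_config)
    voc_sr = sample_rate_of(vocoder_config)
    return tts_sr == voc_sr, tts_sr, voc_sr


# Stand-ins for the two config.json files; the rates are the ones
# mentioned in the reply above.
tts_config = json.loads('{"audio": {"sample_rate": 22050}}')
vocoder_config = json.loads('{"audio": {"sample_rate": 16000}}')

ok, tts_sr, voc_sr = check_rates(tts_config, vocoder_config)
if not ok:
    print(f"sampling rate mismatch: TTS={tts_sr} Hz, vocoder={voc_sr} Hz")
```

With real files, the two `json.loads` calls would be replaced by `json.load` on the opened config.json of each model.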

@george-roussos
Contributor

george-roussos commented Feb 25, 2020

Thanks. Do I check out the commit given? I imagine fine-tuning to 22050 is not as simple as editing the rate in config.json and restoring the checkpoint?

@erogol erogol closed this as completed Mar 11, 2020
@Jackiexiao

@george-roussos it is not the right place to go with this topic. You better post it on discord.

Try the commit version given with the model (model table) and yes it is the right way.

is there any discord server for tts topic?

@nmstoker
Contributor

@Jackiexiao yes, please have a look at the main page of this repo https://github.com/mozilla/TTS and you'll see the link to the Discourse forum there
