-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Model Release: Tacotron2 with Forward Attention - LJSpeech #345
Comments
For testing the model, this worked for me:
but requires #349 Interesting result for that long Witcher quote: |
You can now test this model with PWGAN using: |
I added a colab example running this model with PWGAN vocoder |
I released a new server package with this model embedded in it: https://github.com/mozilla/TTS/wiki/Released-Models#simple-packaging---self-contained-package-that-runs-an-http-api-for-a-pre-trained-tts-model |
I also created an example colab using MelGAN as a vocoder. It's been trained by changing the PWGAN generator with MelGAN's architecture. It performs a bit better and slightly faster. https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b |
The quality with this latest colab is amazing and it does well even with longer sentences:slightly_smiling_face: Btw there was a reference to pwgan model that needed to be switched to Megan, but otherwise this is so straightforward to use. |
@nmstoker good to hear that :) What do you mean by "reference to pwgan"? Do you mean the server release? |
Sorry, I wasn't clear last night. It's a tiny thing, but in the last but one cell of the MelGAN Colab above, there's this line vocoder_model.load_state_dict(torch.load(PWGAN_MODEL,` map_location="cpu")["model"]["generator"]) And PWGAN_MODEL isn't defined (it's a simple matter of updating it to MELGAN_MODEL) |
Also, I noticed that the checkout given in the Colab appears not to exist. In the cell with this:
it didn't seem to cause a problem, it just quietly failed with:
Presumably, right now, nothing has changed to break it since then. It looks like it's probably meant to be git checkout a22018e, given there's this hash from a commit around the right time: a22018e6e6be1f9381b003496cc285bdd5a4a284 and it's just offset by one character. |
I've got what may be a silly question (if so, sorry! 🙂 ) Comparing the training stats charts above with the values set in the config.json for the released model, I see that for the orange line the stats change as if they're undergoing gradual training (ie they move at 50k, 130k, 290k) and then you've switched to BN fine-tuning with the blue line at 400k. That orange gradual training pattern is consistent the "gradual_training" values in the config file released with the model, but I see the comment mention that gradual training is only for Tacotron, and yet this is Tacotron 2. Perhaps the comment simply hasn't been updated? (it's like that in all the configs I've seen since it was introduced) Q. Does gradual training work for both models now? Or am I missing something about how you set the config file up for the initial orange training run? (eg that was changed when you switched to BN fine-tuning) Thanks! Edit: actually it looks like maybe the comment has been removed from the config.json here: https://github.com/mozilla/TTS/blob/master/config.json#L41 so presumably it does now work for both |
@nmstoker yeah it works for the both now :) |
Model produces great results and shall try to adapt it to a new speaker with an average-sized dataset (6hrs), male voice, no silences, clean audio. Will report results. I was able to test the model yesterday, however I keep getting an |
@george-roussos you are training which model exactly. If something is missing in the config file just add it. In the worse case, you can try what seems logical, but for mu-law thing, it is about WaveRNN vocoder which is not related. |
I am not training anything right now, I am testing the model. The implementation I have is the TTS model trained on forward attention and batch normalization and the WaveRNN vocoder, which I am guessing is universal. My thought was I could first try and finetune the TTS model and see how it performs when adapted on a new voice when the data is clean and not sparse. Do you think it would be possible and, if so, what would your expectation be with a good quality dataset of 12 hours? |
Back again. What branch/commit should we use to retrain the TTS model? I am trying to run distribute.py and use the config checkpoint_670000.pth.tar has, but is that the correct way to do it? |
@george-roussos it is not the right place to go with this topic. You better post it on discord. Try the commit version given with the model (model table) and yes it is the right way. |
Is there any way we can make this model compatible with a universal WaveRNN vocoder after fine-tuning to a new voice? I tried to plug in the universal checkpoint from the git repo, but I get By the way, the results I am getting after fine-tuning on a 5 hour long dataset with transcription errors, is pretty good... |
The difference between universal WaveRNN and the TTS model you are using is the sampling rate. WaveRNN model uses 16K and TTS model uses 22050. So maybe you need to finetune WaveRNN too with this rate. Or you can reduce the sampling rate as you finetune TTS with your dataset. You also need to check out the right version of WaveRNN given with the model checkpoint. |
Thanks. Do I checkout in the commit given? I imagine fine-tuning to 22050 is not as simple as editing the rate in config.json and restoring the checkpoint? |
is there any discord server for tts topic? |
@Jackiexiao yes, please have a look at the main page of this repo https://github.com/mozilla/TTS and you'll see the link to the Discourse forum there |
Model Link: https://drive.google.com/open?id=10ymOlWHutqTtfDYhIbHULn2IKDKP0O9m
Colab example: https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR
This model is trained with Forward Attention enabled until ~400K iters and then finetuned with Batch Norm prenet until the end. It is the best model so far trained.
I observe once again that using BN based prenet improves the spectrogram quality considerablly but if you train it from scratch, model does not learn the attention.
You can also use this TTS model with PWGAN or WaveRNN vocoders. PWGAn provides real-time voice synthesis and WaveRNN is slower but provides better quality.
https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN
You can see the TB figures below:
The text was updated successfully, but these errors were encountered: