Model Release: Tacotron2 with Discrete Graves Attention - LJSpeech #346
Cool, overall do you prefer the forward attention model over this one?
He mentioned "It is the best model so far trained." on the forward model post.
Yes, perhaps I should be more specific. I assume this might be more because of the specific training regimen (switching to batch norm, training longer...) and hand-holding, and not necessarily because of the attention mechanism itself. Similarly, in practice I got many more good WaveRNN 10-bit mu-law models, although overall I think MoL leads to better results. But I assume that cannot really be answered before lots of experiments with different datasets etc. Also, the "more natural-sounding" seemed to be a comparison to the forward attention model.
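For readers unfamiliar with the "10 bit mulaw" setup mentioned above, here is a minimal sketch of mu-law companding to 1024 classes. It is only an illustration of the general technique; the function names are hypothetical and it is not taken from the WaveRNN repo.

```python
import math

MU = 2 ** 10 - 1  # 10-bit mu-law: 1024 classes, mu = 1023


def mulaw_encode(x: float, mu: int = MU) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid input range
    # Logarithmic compression keeps more resolution near zero.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest class.
    return int(round((compressed + 1.0) / 2.0 * mu))


def mulaw_decode(q: int, mu: int = MU) -> float:
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2.0 * q / mu - 1.0
    # Inverse of the logarithmic compression above.
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)
```

A vocoder trained this way predicts the integer class with a softmax, whereas a MoL output models the waveform value with a continuous mixture instead.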
My two cents: Graves is easier to train on different datasets, and it sounds more natural with better prosody. Disclaimer: I was about to release the Graves model, but then I deleted the whole model by mistake.
@ergol, why do you prefer PWGAN over MelGAN? MelGAN is faster, while the quality seems fine. Btw, https://github.com/kan-bayashi/ParallelWaveGAN now provides MelGAN as well. Any plans to try it and adapt it for TTS?
@vcjob interesting. I find that even the official PWGAN samples of just vocoded recordings already exhibit some artefacts. r9y9's taco-wavenet (MoL) samples definitely sound better. I think the difference in the PWGAN paper is just because they used the ESPnet Gaussian WaveNet. I tried all their models and they are definitely not as good as r9y9's WaveNet. It is also interesting that more or less nobody uses the original WaveRNN formulation. Even the Amazon papers use a simple GRU followed by FCs predicting quantized output via softmax. Well, in the end they're all annoying for some different reason ;) EDIT: just realized the main author of PWGAN is r9y9. Even stranger that he didn't use his own WaveNet implementation for comparison.
@vcjob PWGAN is easier to adapt to TTS and the model is smaller. I am now also training a MelGAN-type generator, as the official repo suggests. But it would be nice to try the original MelGAN with TTS if you are interested. A paper is a paper :).
I'm trying your implementation of Graves attention with my fork of NVIDIA Tacotron2.
@hadaev8 is it the latest implementation?
@erogol |
@hadaev8 try the one in the dev branch
@erogol |
Because you are normalizing it. Actually, I guess this reduces the quality at inference time. If you have a solution for this, I'd like to know.
@erogol |
@hadaev8 it is not an explicit normalization. The values are bounded in [0, 1] even without discretization, and with discretization they stay bounded in the same range. And because we do a subtraction between time steps, the effective range ends up close to zero. In our case it is [0, ~0.4]. So we could look for a trick to expand this range.
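To illustrate the range argument, here is a minimal single-component sketch of discretized logistic attention weights. It assumes the common formulation where each weight is the difference of a sigmoid CDF at the two edges of a unit-width bin; the function names and the example values of `mu` and `sigma` are hypothetical, and this is not the repo's implementation.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def discretized_logistic_weights(mu: float, sigma: float, num_steps: int) -> List[float]:
    """Probability mass of a logistic distribution on each unit-width bin.

    Bin j covers [j - 0.5, j + 0.5], so its weight is F(j + 0.5) - F(j - 0.5)
    with F(u) = sigmoid((u - mu) / sigma).
    """
    def cdf(u: float) -> float:
        return sigmoid((u - mu) / sigma)

    return [cdf(j + 0.5) - cdf(j - 0.5) for j in range(num_steps)]


# Example values are assumptions chosen only for illustration.
weights = discretized_logistic_weights(mu=10.0, sigma=1.0, num_steps=30)
peak = max(weights)
```

Because every weight is a difference of two values in [0, 1], it also lies in [0, 1], but the peak only approaches 1 when `sigma` is very small, so in practice the weights occupy a narrow band near zero.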
I finally released the model with a couple of changes. This model uses a Batch Norm prenet from the beginning.
One interesting problem with Graves attention is that after the model converges, only one of the attention heads is actively used and it suppresses the other heads. This indicates that using only one head would also work fine, with a faster run-time. Alternatively, dropout might be used to randomize the behavior of the heads during training, on the assumption that the model would then learn to use the other heads.
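The dropout idea could be sketched roughly as follows. This is a hypothetical illustration that was not tried in this thread: `drop_heads` is an invented helper that zeroes whole heads in the mixture weights during training and renormalizes the survivors.

```python
import random
from typing import List, Optional


def drop_heads(mix_weights: List[float], p: float, training: bool = True,
               rng: Optional[random.Random] = None) -> List[float]:
    """Randomly zero whole heads in training and renormalize the survivors."""
    if not training or p <= 0.0:
        return list(mix_weights)  # inference: leave the weights untouched
    if rng is None:
        rng = random.Random()
    keep = [rng.random() >= p for _ in mix_weights]
    if not any(keep):
        # Always keep at least one head so the weights stay a distribution.
        keep[rng.randrange(len(keep))] = True
    kept = [w if k else 0.0 for w, k in zip(mix_weights, keep)]
    total = sum(kept)
    if total == 0.0:
        # Surviving heads all had zero weight: fall back to uniform over them.
        n = sum(keep)
        return [1.0 / n if k else 0.0 for k in keep]
    return [w / total for w in kept]
```

With a dominant head such as `[0.9, 0.05, 0.03, 0.02]`, dropping it forces the remaining heads to carry the alignment for that step, which is the behavior the comment above hopes would train them.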
Awesome work! Following https://arxiv.org/pdf/1906.01083.pdf, shouldn't it be
That is actually true. Yet it worked? Thanks for the catch. I'll fix it and try again.
@erogol An unexpected but welcome surprise!
`def call(self, query, state):`
Not sure, maybe you can try the broken version as in my code.
If I use your version, the attention weights come out negative. It is weird.
I think I know what's happening. Your earlier implementation used a distribution that was monotonically decreasing, but your (mu_t - j) was flipped (possibly because you thought you were using exp instead of sigmoid), so it worked out just fine.
Yeah, that's a great return. I totally missed that.
@Shikherneo2 I changed the implementation as you said and had the same problem. After 10K iterations, all the alignments turn to zero.
@erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.
In my case, the network goes to zero sometimes after 10K and sometimes after 60K iterations. I checked the layer statistics throughout training, but I could not see anything that explains it.
It is interesting. The function I used previously is a reverse sigmoid with a squashed range around 2/3. So mathematically it makes no sense, but it worked.
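A small numeric check of that description, under an assumption: the earlier code is not quoted in this thread, so the sketch guesses the form 1 / (1 + sigmoid(x)), which does sit at 2/3 for x = 0 and stays inside (0.5, 1). All names here are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reverse_sigmoid(x: float) -> float:
    """Assumed form of the earlier expression; decreasing in x, range (0.5, 1)."""
    return 1.0 / (1.0 + sigmoid(x))


def cumulative_values(mu: float, sigma: float, num_steps: int, flipped: bool) -> List[float]:
    """Evaluate the function at every position j, with (mu - j) or (j - mu)."""
    values = []
    for j in range(num_steps):
        arg = (mu - j) if flipped else (j - mu)
        values.append(reverse_sigmoid(arg / sigma))
    return values


def step_differences(values: List[float]) -> List[float]:
    """Discretization step: subtract neighbouring positions."""
    return [b - a for a, b in zip(values, values[1:])]


flipped_diffs = step_differences(cumulative_values(10.0, 1.0, 21, flipped=True))
unflipped_diffs = step_differences(cumulative_values(10.0, 1.0, 21, flipped=False))
```

Under this assumption the flipped argument (mu - j) gives positive differences that peak around mu, while the unflipped argument (j - mu) gives negative ones, which would match both the "worked out just fine" and the "computed negative" observations above. The differences can sum to at most 0.5, which is one way to read the "squashed range".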
What's the benefit of discretizing the attention weights? Why not use the original version directly?
It mathematically makes more sense to me and it works better.
@WhiteFu No, I wasn't able to. When I looked at the statistics, I realized that the encoder gradients were going to zero after a few thousand iterations. So I added a highway network (like in Tacotron-1), which stabilized the training. But the weights still all go to zero.
@Shikherneo2 this is weird, I will follow up and let you know if there is any progress!
Should I reopen the issue if anyone is working on it?
@erogol This version is a good one, right? I tried Graves attention in my own TTS work (I only added a while loop to process all time steps), but the alignment failed. I am trying to figure out the problem.
In Mozilla/TTS, Graves attention is discrete. Now you can use the code in this repo to implement DCA or GMM attention.
Model Link: https://drive.google.com/drive/folders/12Ct0ztVWHpL7SrEbUammGMmDopOKL9X_?usp=sharing
This model is trained with Discrete Graves attention and a BatchNorm prenet. It produces good examples with robust attention alignment without any inference-time tricks. You can even hear breathing effects with this model in between pauses.
You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis, while WaveRNN is slower but provides better quality.
https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN
(Ignore the small jiggle on the figures caused by TB)