WaveGAN training on Tacotron outputs. #178
The use of GTA may improve the quality, but I think using natural features is enough. For example:
I'm not sure about the feature extraction settings in Rayhane's repository. The following issues may help you.
@kan-bayashi thank you very much for your reply! I changed the vocoder parameters to the same ones used in Tacotron 2. I will train on natural features, but still, if I used GTA mels, would I need to first convert them to the same format that `./run.sh --stage 1` produces? Is there a function in your repository that does this?
Please carefully check the feature extraction function definitions in both repositories. It is a bit complicated, but you can follow this procedure:
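For reference, ParallelWaveGAN's side of that comparison is the `logmelfilterbank` function in `parallel_wavegan/bin/preprocess.py`. Below is a minimal sketch of it, with parameter values assumed from a typical config rather than copied verbatim, so you can compare it against Rayhane's extraction:

```python
import librosa
import numpy as np

def logmelfilterbank(audio, sampling_rate=22050, fft_size=1024,
                     hop_size=256, win_length=None, window="hann",
                     num_mels=80, fmin=80, fmax=7600, eps=1e-10):
    """Log10 mel filterbank features (sketch of preprocess.py's version)."""
    # magnitude spectrogram, transposed to (num_frames, fft_size // 2 + 1)
    x_stft = librosa.stft(audio, n_fft=fft_size, hop_length=hop_size,
                          win_length=win_length, window=window)
    spc = np.abs(x_stft).T
    # mel filterbank projection followed by a floored log10
    mel_basis = librosa.filters.mel(sr=sampling_rate, n_fft=fft_size,
                                    n_mels=num_mels, fmin=fmin, fmax=fmax)
    return np.log10(np.maximum(eps, np.dot(spc, mel_basis.T)))
```

Beyond the STFT parameters, also compare the normalization: Rayhane's repository applies its own mel normalization on top, so the value ranges of the two feature sets can differ even when the frame parameters match.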
Thanks!
@kan-bayashi, I am trying to do this, but when I run the length-matching script, I sometimes get cases where the audio is shorter than the GTA mel spectrogram implies:
What should I do in this case? Also, the synthesized speech sometimes does not match the pauses of the real audio: for example, where there is a comma, the pause in the real audio lasts 0.5 seconds, while in the synthesized spectrogram it can be a full second. Will this affect the training quality, and is there a way to align these?
I think your code is OK to deal with such a case; see `parallel_wavegan/bin/preprocess.py`, lines 168 to 171 at commit bb32b19.
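A minimal sketch of that length matching, assuming a hop size of 256; the padding branch (for audio shorter than the mel implies) is written in the spirit of those lines, not a verbatim copy:

```python
import numpy as np

def match_length(audio, mel, hop_size=256):
    """Trim or pad audio so that len(audio) == len(mel) * hop_size."""
    target_length = len(mel) * hop_size
    if len(audio) < target_length:
        # audio shorter than the mel implies: pad by repeating the edge value
        audio = np.pad(audio, (0, target_length - len(audio)), mode="edge")
    else:
        # audio longer: trim the tail
        audio = audio[:target_length]
    assert len(audio) == target_length
    return audio
```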
It can happen, and it is difficult to explicitly control the pause length.
Regarding `parallel_wavegan/bin/preprocess.py`, lines 158 to 171 at bb32b19:
I get the mel spectrogram from Tacotron, synthesized from the same text as the ground-truth audio, not from `logmelfilterbank`. So sometimes I get this:
and get an assertion error.
Sorry, I don't quite understand how your vocoder works. In this case, I want to train your vocoder on GTA spectrograms and the original audio. When we get the mel spectrogram from the ground-truth audio, it contains the correct pauses and pronunciation. In the case of GTA, some words will be pronounced with a shift in time; for example, the word "hello" will occur at second 3 instead of second 2 as in the original. If the model is trained this way, will it train correctly?
I think you may have misunderstood what a GTA mel-spectrogram is.
@kan-bayashi To generate the mel spectrograms, I took the text of the ground-truth audio and synthesized the mel spectrogram with Tacotron 2. I thought the vocoder could be trained on such synthesized spectrograms.
Tacotron 2 is an auto-regressive (AR) model, so to generate a mel-spectrogram whose length matches the ground-truth waveform, you need to feed the ground-truth mel-spectrogram as the AR input (teacher forcing); that is what makes it "ground-truth aligned" (GTA).
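To illustrate teacher-forced (GTA) generation, here is a hypothetical sketch; the model object and its call signature are placeholders, not the actual Rayhane-mamah API:

```python
import torch

def generate_gta_mel(tacotron2, text_ids, text_lengths, gt_mel):
    """Generate a ground-truth-aligned mel by teacher-forcing the decoder.

    `tacotron2` stands for any AR acoustic model whose forward() accepts
    the ground-truth mel as the decoder input (signature is assumed).
    """
    tacotron2.eval()
    with torch.no_grad():
        # Each decoder step consumes the previous *ground-truth* frame,
        # so the output has exactly as many frames as gt_mel.
        gta_mel = tacotron2(text_ids, text_lengths, mels=gt_mel)
    assert gta_mel.shape == gt_mel.shape  # aligned by construction
    return gta_mel
```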
Now I understand, thanks for the help! |
@kan-bayashi, I am trying to run the training command directly, without run.sh. What structure should the audio files have? I made a GTA dump and brought the audio to the form produced by stage 1 (npy format, with lengths matched to the GTA mels), listed in training_feats.scp: and kept 10% for the dev part in the same format. When I run parallel-wavegan-train with the parameters above, I get this error:
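For reference, the direct invocation usually looks like this (flag names as documented in the repository README; the config path and directories below are placeholders):

```sh
parallel-wavegan-train \
    --config conf/melgan.v1.yaml \
    --train-dumpdir dump/train \
    --dev-dumpdir dump/dev \
    --outdir exp/train_melgan_gta
```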
The file in
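If you train from dump directories in npy format, the dataset loader matches files by name pattern, so the layout should look roughly like this (a sketch based on the file queries used in `parallel_wavegan/bin/train.py`; exact patterns may differ between versions):

```
dump/train/
├── utt001-wave.npy   # waveform, shape (num_samples,)
├── utt001-feats.npy  # mel spectrogram, shape (num_frames, num_mels)
├── utt002-wave.npy
├── utt002-feats.npy
└── ...
```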
@kan-bayashi, the config params for the vocoder are the same as in melgan.v1.
Tensorboard losses for eval:
Synthesized audio:
@kan-bayashi, thanks, I will try it.
Hmm. One possible reason is the quality of the GTA mel-spectrogram.
Hey. I trained a Rayhane-mamah Tacotron 2 synthesizer without a vocoder. As the vocoder, I wanted to use your repository; could you please tell me how to properly train WaveGAN? Do I need to train on GTA mels? If so, how do I do that, given that the preprocessing procedure in run.sh prepares mel spectrograms from the ground-truth audio in stage 1?