WaveGAN training on Tacotron outputs. #178

Closed
Alexey322 opened this issue Jul 6, 2020 · 17 comments
Labels
question Further information is requested

Comments

@Alexey322

Hey. I trained a Rayhane-mamah Tacotron 2 synthesizer without a vocoder and would like to use your repository as the vocoder.
Could you please tell me how to train WaveGAN properly? Do I need to train on GTA mels? If so, how do I do that,
given that the preprocessing in run.sh itself prepares mel spectrograms from ground truth audio at stage 1?

@kan-bayashi kan-bayashi added the question Further information is requested label Jul 6, 2020
@kan-bayashi
Owner

The use of GTA may improve the quality, but I think the use of natural features is enough.
So you need to check the feature settings carefully, for example:

  • Mel range
  • FFT / shift size
  • Log basis
  • Mel basis
  • Normalize

I'm not sure about the feature extraction settings in Rayhane's repository, so please check them yourself; a rough comparison sketch follows below.
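A minimal sanity-check sketch (the values below are purely illustrative, not recommendations; the log basis and normalization may not be visible in either config and have to be checked in the feature extraction code itself):

```python
# Illustrative check that the vocoder feature settings match the Tacotron ones.
# Values are placeholders; replace them with the settings from your own configs.
tacotron_hparams = {"num_mels": 80, "fmin": 0, "fmax": 7600,
                    "fft_size": 1024, "hop_size": 256, "win_length": 1024}
vocoder_config = {"num_mels": 80, "fmin": 0, "fmax": 7600,
                  "fft_size": 1024, "hop_size": 256, "win_length": 1024}
for key, value in tacotron_hparams.items():
    assert vocoder_config[key] == value, f"mismatch in {key}"
```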

The following issues may help you.

@Alexey322
Author

@kan-bayashi thank you very much for your reply! I changed the vocoder parameters to the same ones used in tacotron 2.

I will train on natural features, but still, if I used GTA mels, would I need to first convert them to the same form that ./run.sh --stage 1 produces? Is there a function in your repository that does this?

@kan-bayashi
Owner

kan-bayashi commented Jul 6, 2020

I changed the vocoder parameters to the same ones used in tacotron 2.

Please carefully check the feature extraction function definitions in both repositories.
There may be differences that cannot be controlled by the config.

It's a bit complicated, but you can follow this procedure:

  1. Dump GTA outputs as .npy format for training and dev set
  2. Make feats.scp files of training and dev set like
    utt_id_1 /path/to/utt_id_1_npy_file.npy
    utt_id_2 /path/to/utt_id_2_npy_file.npy
    ...	
    
    You need to keep the utt_id the same as in the wav.scp created in stage 0 (see the sketch below).
  3. Run training function, e.g.,
    parallel-wavegan-train \
            --config "${conf}" \
            --train-wav-scp /path/to/train_wav_scp \
            --train-feats-scp /path/to/train_feats_scp \
            --dev-wav-scp /path/to/dev_wav_scp \
            --dev-feats-scp /path/to/dev_feats_scp \
            --outdir "${expdir}" \
            --resume "${resume}" \
            --verbose "${verbose}"
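For step 2, a minimal sketch for writing feats.scp from a folder of dumped GTA .npy files (the paths and the `<utt_id>.npy` naming are assumptions; adapt them to your dump layout):

```python
# Sketch: build feats.scp from dumped GTA .npy files.
# Assumes each file is named <utt_id>.npy; utt_ids must match wav.scp from stage 0.
from pathlib import Path

gta_dir = Path("dump/gta/train")  # wherever the GTA mels were dumped
with open("data/train/feats.scp", "w") as f:
    for npy_path in sorted(gta_dir.glob("*.npy")):
        utt_id = npy_path.stem
        f.write(f"{utt_id} {npy_path.resolve()}\n")
```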

@Alexey322
Author

Thanks!

@Alexey322 Alexey322 reopened this Jul 31, 2020
@Alexey322
Author

@kan-bayashi, I am trying to do this, but when I run the script for length matching, I sometimes find that the audio is shorter than the GTA mel spectrogram:

audio = np.pad(audio, (0, hparams.filter_length), mode="edge")
audio = audio[:len(mel) * hparams.hop_length]

What should I do in this case?

Also, the synthesized speech sometimes does not match the pauses in the real audio; for example, at a comma the pause in the real audio lasts 0.5 seconds, while in the synthesized spectrogram it can be a full second. Will this affect the quality of training, and is there a way to align these?

@kan-bayashi
Owner

kan-bayashi commented Jul 31, 2020

What should I do in this case?

I think your code is OK for dealing with such a case.
I do similar processing.

# make sure the audio length and feature length are matched
audio = np.pad(audio, (0, config["fft_size"]), mode="reflect")
audio = audio[:len(mel) * config["hop_size"]]
assert len(mel) * config["hop_size"] == len(audio)

Also, the synthesized speech sometimes does not match the pauses in the real audio; for example, at a comma the pause in the real audio lasts 0.5 seconds, while in the synthesized spectrogram it can be a full second. Will this affect the quality of training, and is there a way to align these?

It can happen, and it is difficult to explicitly control the pause length.
Do you mean synthesis with teacher forcing?

@Alexey322
Author

Alexey322 commented Jul 31, 2020

@kan-bayashi ,

I think your code is OK for dealing with such a case.
I do similar processing.

mel = logmelfilterbank(x,
                       sampling_rate=sampling_rate,
                       hop_size=hop_size,
                       fft_size=config["fft_size"],
                       win_length=config["win_length"],
                       window=config["window"],
                       num_mels=config["num_mels"],
                       fmin=config["fmin"],
                       fmax=config["fmax"])
# make sure the audio length and feature length are matched
audio = np.pad(audio, (0, config["fft_size"]), mode="reflect")
audio = audio[:len(mel) * config["hop_size"]]
assert len(mel) * config["hop_size"] == len(audio)

I get the mel spectrogram from Tacotron, synthesized from the same text as the ground truth audio, not from logmelfilterbank. So sometimes I get this:

len(mel) * hparams.hop_length: 27136
len(audio): 23552

and get an assertion error.

It can happen, and it is difficult to explicitly control the pause length.
Do you mean synthesis with teacher forcing?

Sorry, I don't quite understand how your vocoder works. In this case, I want to train your vocoder on GTA spectrograms and the original audio. When we get the mel spectrogram from the ground truth audio, it contains the correct pauses and pronunciation. With GTA, some words will be shifted in time; for example, the word "hello" may occur at the third second instead of the second, as in the original. If the model is trained this way, will it train correctly?

@kan-bayashi
Owner

kan-bayashi commented Jul 31, 2020

I get the mel spectrogram from Tacotron, synthesized from the same text as the ground truth audio, not from logmelfilterbank. So sometimes I get this:

I think you may have misunderstood the GTA mel-spectrogram.
In the case of GTA (= synthesis with teacher forcing), the length of the generated mel-spectrogram is not changed.
Please explain how you generated your mel-spectrogram.

@Alexey322
Author

@kan-bayashi
Yes, it looks like I misunderstood the concept of GTA. I retrained the NVIDIA Tacotron 2 synthesizer on a specific voice. If you look at the mel spectrograms in the image, you will see that the predicted spectrogram is heavily smoothed, so based on my past experience I think the voice will sound robotic.

[image: ground-truth vs. predicted mel spectrograms]

To generate the mel spectrograms, I took text from ground truth audio and synthesized the mel spectrum using Tacotron 2. I thought the vocoder could be trained on such synthesized spectrograms.

@kan-bayashi
Owner

To generate the mel spectrograms, I took text from ground truth audio and synthesized the mel spectrum using Tacotron 2. I thought the vocoder could be trained on such synthesized spectrograms.

"text from ground truth audio" is very misleading; maybe "text from the training data" is what you mean?
If you generate the mel-spectrogram from text alone (i.e., free-running), the length of the mel-spectrogram will differ from the groundtruth waveform, as you saw in your post.
Then you cannot use such a mel-spectrogram for vocoder training.

Tacotron 2 is an auto-regressive (AR) model, so to generate a mel-spectrogram whose length matches the groundtruth waveform, you need to feed the groundtruth mel-spectrogram as the AR input
(this is teacher forcing, i.e., a groundtruth-aligned mel-spectrogram).
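Roughly, GTA dumping looks like the sketch below (`teacher_forced_forward` is a placeholder name; use the training-mode forward pass of whichever Tacotron 2 implementation you have):

```python
# Sketch of dumping a GTA mel with teacher forcing; `teacher_forced_forward`
# is a placeholder for the training-mode forward pass of your Tacotron 2 model.
import numpy as np
import torch

@torch.no_grad()
def dump_gta_mel(model, text_ids, gt_mel, out_path):
    """The decoder consumes the groundtruth mel frame by frame (teacher forcing),
    so the predicted mel has exactly as many frames as the groundtruth mel."""
    model.eval()
    pred_mel = model.teacher_forced_forward(text_ids, gt_mel)  # placeholder API
    pred_mel = pred_mel.squeeze(0).cpu().numpy()               # (num_mels, frames)
    assert pred_mel.shape[-1] == gt_mel.shape[-1]
    np.save(out_path, pred_mel.T)  # check the (frames, num_mels) layout the vocoder expects
```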

@Alexey322
Author

Now I understand, thanks for the help!

@Alexey322
Author

Alexey322 commented Aug 4, 2020

I changed the vocoder parameters to the same ones used in tacotron 2.

Please carefully check the feature extraction function definitions in both repositories.
There may be differences that cannot be controlled by the config.

It's a bit complicated, but you can follow this procedure:

1. Dump GTA outputs as `.npy` format for training and dev set

2. Make `feats.scp` files of training and dev set like
   ```
   utt_id_1 /path/to/utt_id_1_npy_file.npy
   utt_id_2 /path/to/utt_id_2_npy_file.npy
   ...	
   ```
   
   
   You need to keep the `utt_id` the same as in the `wav.scp` created in stage 0.

3. Run training function, e.g.,
   ```shell
   parallel-wavegan-train \
           --config "${conf}" \
           --train-wav-scp /path/to/train_wav_scp \
           --train-feats-scp /path/to/train_feats_scp \
           --dev-wav-scp /path/to/dev_wav_scp \
           --dev-feats-scp /path/to/dev_feats_scp \
           --outdir "${expdir}" \
           --resume "${resume}" \
           --verbose "${verbose}"
   ```

@kan-bayashi, I am trying to run this training command without using run.sh. What structure should the audio files have? I made a GTA dump and converted the audio to the form produced at stage 1 (npy format, with the length matched to the GTA mel).
I split the files into 4 parts:

training_feats.scp:
utt_id_1 /path/to/utt_id_1.mel.npy
utt_id_2 /path/to/utt_id_2.mel.npy
training_wavs.scp:
utt_id_1 /path/to/utt_id_1.wav.npy
utt_id_2 /path/to/utt_id_2.wav.npy

and 10% for the dev part in the same format.

When I run the parallel-wavegan-train command with the parameters above, I get the error:
...
python3.6.7\lib\site-packages\kaldiio\utils.py:376: UserWarning: An error happens at loading "wavs/0x000f4ed7.wav.npy
...
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
It sounds like the audio should be in a different format. Can you please tell me what I'm doing wrong?

@Alexey322 Alexey322 reopened this Aug 4, 2020
@kan-bayashi
Owner

The files listed in wav.scp must be in .wav format, not .npy (feats.scp is OK).
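If your length-matched audio currently exists only as .npy arrays, one option is to write it back to .wav, e.g. with soundfile (a sketch; it assumes float waveform arrays and that you know the sampling rate):

```python
# Sketch: write the length-matched numpy waveforms back to .wav files so that
# wav.scp can point at real .wav files. Paths and the sampling rate are assumptions.
from pathlib import Path
import numpy as np
import soundfile as sf

sampling_rate = 22050  # use the sampling rate of your corpus
for npy_path in sorted(Path("wavs").glob("*.wav.npy")):
    audio = np.load(npy_path)
    out_path = npy_path.with_suffix("")  # "xxx.wav.npy" -> "xxx.wav"
    sf.write(str(out_path), audio, sampling_rate)
```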

@Alexey322
Author

Alexey322 commented Aug 5, 2020

@kan-bayashi ,
I train the model on GTA mel spectrograms without the vocoder's standardization and log10, using the synthesized mel spectrograms from NVIDIA Tacotron 2, which after synthesis are normalized and converted with the natural logarithm. The vocoder is melgan.v1; after 260k iterations the audio quality is not very good, and the losses stopped decreasing after a certain number of iterations. How can I improve the quality?

The config params for the vocoder are the same as in melgan.v1.
Tacotron was trained on an 80/7600 mel basis.
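As a side note on the log basis: if the only mismatch were natural log vs. log10, the features would differ just by a constant scale (a sketch with illustrative paths; any normalization/centering applied by the synthesizer would still have to be handled separately):

```python
# Sketch: converting a natural-log mel spectrogram to log10 is a constant rescaling,
# since log10(x) = ln(x) / ln(10). Paths are illustrative.
import numpy as np

mel_ln = np.load("dump/gta/utt_id_1.npy")  # mel assumed to be in natural log
mel_log10 = mel_ln / np.log(10.0)
```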

Here are the TensorBoard losses for eval:
[image: TensorBoard eval losses]

and for train:
[image: TensorBoard train losses]

Synthesized audio:
https://drive.google.com/file/d/1AyaF46aSdFSwF0SahYRmm32K9NZ0bmur/view?usp=sharing

@kan-bayashi
Owner

melgan.v1 uses MelGAN generator + PWG discriminator.
If you want to use normal MelGAN, a full_band_melgan-based config will give better results.
https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/ljspeech/voc1/conf/full_band_melgan.v2.yaml

@Alexey322
Author

@kan-bayashi, thanks, I will try it.
But why does the audio sound so bad? Could this be because I did not apply centering and normalization to the mel spectrograms? Or maybe the mel basis is not suitable for a male voice? I listened to your generated audio for LJSpeech with the melgan.v1 config, and the quality is much better.

@kan-bayashi
Owner

Hmm. One possible reason is the quality of the GTA mel-spectrogram.
If you compare with a model trained using groundtruth mel-spectrograms, you can get more insight.
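For example, a quick diagnostic sketch (paths and the (frames, num_mels) layout are assumptions) to compare basic statistics of a groundtruth mel and the corresponding GTA mel before vocoder training:

```python
# Sketch: compare basic statistics of a groundtruth mel and the corresponding GTA mel.
import numpy as np

gt = np.load("dump/gt/utt_id_1.npy")    # groundtruth mel, assumed (frames, num_mels)
gta = np.load("dump/gta/utt_id_1.npy")  # GTA mel, same layout assumed

for name, m in [("ground truth", gt), ("GTA", gta)]:
    print(f"{name}: shape={m.shape}, min={m.min():.3f}, max={m.max():.3f}, "
          f"mean={m.mean():.3f}, std={m.std():.3f}")
```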
