WaveGAN training on Tacotron outputs. #178

Closed
Alexey322 opened this issue Jul 6, 2020 · 17 comments
Labels
question Further information is requested

Comments

@Alexey322

Hey. I trained a Rayhane-mamah Tacotron 2 synthesizer without a vocoder and would like to use your repository as the vocoder.
Could you please tell me how to train WaveGAN properly? Do I need to train on GTA mels? If so, how do I do that,
given that the preprocessing in run.sh itself prepares mel spectrograms from ground truth audio at stage 1?

@kan-bayashi kan-bayashi added the question Further information is requested label Jul 6, 2020
@kan-bayashi
Owner

The use of GTA may improve the quality, but I think the use of natural features is enough.
So you need to check the feature settings carefully, for example:

  • Mel range
  • FFT / shift size
  • Log basis
  • Mel basis
  • Normalize

I'm not sure about the feature extraction settings in Rayhane's repository, so please check them yourself; a rough comparison sketch follows below.
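A minimal sanity-check sketch (the values below are purely illustrative, not recommendations; the log basis and normalization may not be visible in either config and have to be checked in the feature extraction code itself):

```python
# Illustrative check that the vocoder feature settings match the Tacotron ones.
# Values are placeholders; replace them with the settings from your own configs.
tacotron_hparams = {"num_mels": 80, "fmin": 0, "fmax": 7600,
                    "fft_size": 1024, "hop_size": 256, "win_length": 1024}
vocoder_config = {"num_mels": 80, "fmin": 0, "fmax": 7600,
                  "fft_size": 1024, "hop_size": 256, "win_length": 1024}
for key, value in tacotron_hparams.items():
    assert vocoder_config[key] == value, f"mismatch in {key}"
```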

The following issues may help you.

@Alexey322
Author

@kan-bayashi thank you very much for your reply! I changed the vocoder parameters to the same ones used in tacotron 2.

I will train on natural features, but still, if I used GTA mels, would I need to first convert them to the same form that ./run.sh --stage 1 produces? Is there a function in your repository that does this?

@kan-bayashi
Owner

kan-bayashi commented Jul 6, 2020

I changed the vocoder parameters to the same ones used in tacotron 2.

Please carefully check the feature extraction function definitions in both repositories.
There may be differences that cannot be controlled by the config.

It's a bit complicated, but you can follow this procedure:

  1. Dump GTA outputs as .npy format for training and dev set
  2. Make feats.scp files of training and dev set like
    utt_id_1 /path/to/utt_id_1_npy_file.npy
    utt_id_2 /path/to/utt_id_2_npy_file.npy
    ...	
    
    You need to keep the utt_id the same as in the wav.scp created in stage 0 (see the sketch below).
  3. Run training function, e.g.,
    parallel-wavegan-train \
            --config "${conf}" \
            --train-wav-scp /path/to/train_wav_scp \
            --train-feats-scp /path/to/train_feats_scp \
            --dev-wav-scp /path/to/dev_wav_scp \
            --dev-feats-scp /path/to/dev_feats_scp \
            --outdir "${expdir}" \
            --resume "${resume}" \
            --verbose "${verbose}"
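For step 2, a minimal sketch for writing feats.scp from a folder of dumped GTA .npy files (the paths and the `<utt_id>.npy` naming are assumptions; adapt them to your dump layout):

```python
# Sketch: build feats.scp from dumped GTA .npy files.
# Assumes each file is named <utt_id>.npy; utt_ids must match wav.scp from stage 0.
from pathlib import Path

gta_dir = Path("dump/gta/train")  # wherever the GTA mels were dumped
with open("data/train/feats.scp", "w") as f:
    for npy_path in sorted(gta_dir.glob("*.npy")):
        utt_id = npy_path.stem
        f.write(f"{utt_id} {npy_path.resolve()}\n")
```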

@Alexey322
Author

Thanks!

@Alexey322 Alexey322 reopened this Jul 31, 2020
@Alexey322
Author

@kan-bayashi, I am trying to do this, but when I run the script for length matching, I sometimes find that the audio is shorter than the GTA mel spectrogram:

audio = np.pad(audio, (0, hparams.filter_length), mode="edge")
audio = audio[:len(mel) * hparams.hop_length]

What should I do in this case?

Also, the synthesized speech sometimes does not match the pauses in the real audio; for example, at a comma the pause in the real audio lasts 0.5 seconds, while in the synthesized spectrogram it can be a full second. Will this affect the quality of training, and is there a way to align these?

@kan-bayashi
Owner

kan-bayashi commented Jul 31, 2020

What should I do in this case?

I think your code is OK for dealing with such a case.
I do similar processing.

# make sure the audio length and feature length are matched
audio = np.pad(audio, (0, config["fft_size"]), mode="reflect")
audio = audio[:len(mel) * config["hop_size"]]
assert len(mel) * config["hop_size"] == len(audio)

Also, the synthesized speech sometimes does not match the pauses in the real audio; for example, at a comma the pause in the real audio lasts 0.5 seconds, while in the synthesized spectrogram it can be a full second. Will this affect the quality of training, and is there a way to align these?

It can happen, and it is difficult to explicitly control the pause length.
Do you mean synthesis with teacher forcing?

@Alexey322
Author

Alexey322 commented Jul 31, 2020

@kan-bayashi ,

I think your code is OK for dealing with such a case.
I do similar processing.

mel = logmelfilterbank(x,
                       sampling_rate=sampling_rate,
                       hop_size=hop_size,
                       fft_size=config["fft_size"],
                       win_length=config["win_length"],
                       window=config["window"],
                       num_mels=config["num_mels"],
                       fmin=config["fmin"],
                       fmax=config["fmax"])
# make sure the audio length and feature length are matched
audio = np.pad(audio, (0, config["fft_size"]), mode="reflect")
audio = audio[:len(mel) * config["hop_size"]]
assert len(mel) * config["hop_size"] == len(audio)

I get the mel spectrogram from Tacotron, synthesized from the same text as the ground truth audio, not from logmelfilterbank. So sometimes I get this:

len(mel) * hparams.hop_length: 27136
len(audio): 23552

and get an assertion error.

It can happen, and it is difficult to explicitly control the pause length.
Do you mean synthesis with teacher forcing?

Sorry, I don't quite understand how your vocoder works. In this case, I want to train your vocoder on GTA spectrograms and the original audio. When we get the mel spectrogram from the ground truth audio, it contains the correct pauses and pronunciation. With GTA, some words will be shifted in time; for example, the word "hello" may occur at the third second instead of the second, as in the original. If the model is trained this way, will it train correctly?

@kan-bayashi
Owner

kan-bayashi commented Jul 31, 2020

I get the mel spectrogram from Tacotron, synthesized from the same text as the ground truth audio, not from logmelfilterbank. So sometimes I get this:

I think you may have misunderstood the GTA mel-spectrogram.
In the case of GTA (= synthesis with teacher forcing), the length of the generated mel-spectrogram is not changed.
Please explain how you generated your mel-spectrogram.

@Alexey322
Author

@kan-bayashi
Yes, it looks like I misunderstood the concept of GTA. I retrained the NVIDIA Tacotron 2 synthesizer on a specific voice. If you look at the mel spectrograms in the image, you will see that the predicted spectrogram is heavily smoothed, so based on my past experience I think the voice will sound robotic.

[image: ground-truth vs. predicted mel spectrograms]

To generate the mel spectrograms, I took text from ground truth audio and synthesized the mel spectrum using Tacotron 2. I thought the vocoder could be trained on such synthesized spectrograms.

@kan-bayashi
Owner

To generate the mel spectrograms, I took text from ground truth audio and synthesized the mel spectrum using Tacotron 2. I thought the vocoder could be trained on such synthesized spectrograms.

"text from ground truth audio" is very misleading; maybe "text from the training data" is what you mean?
If you generate the mel-spectrogram from text alone (i.e., free-running), the length of the mel-spectrogram will differ from the groundtruth waveform, as you saw in your post.
Then you cannot use such a mel-spectrogram for vocoder training.

Tacotron 2 is an auto-regressive (AR) model, so to generate a mel-spectrogram whose length matches the groundtruth waveform, you need to feed the groundtruth mel-spectrogram as the AR input
(this is teacher forcing, i.e., a groundtruth-aligned mel-spectrogram).
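Roughly, GTA dumping looks like the sketch below (`teacher_forced_forward` is a placeholder name; use the training-mode forward pass of whichever Tacotron 2 implementation you have):

```python
# Sketch of dumping a GTA mel with teacher forcing; `teacher_forced_forward`
# is a placeholder for the training-mode forward pass of your Tacotron 2 model.
import numpy as np
import torch

@torch.no_grad()
def dump_gta_mel(model, text_ids, gt_mel, out_path):
    """The decoder consumes the groundtruth mel frame by frame (teacher forcing),
    so the predicted mel has exactly as many frames as the groundtruth mel."""
    model.eval()
    pred_mel = model.teacher_forced_forward(text_ids, gt_mel)  # placeholder API
    pred_mel = pred_mel.squeeze(0).cpu().numpy()               # (num_mels, frames)
    assert pred_mel.shape[-1] == gt_mel.shape[-1]
    np.save(out_path, pred_mel.T)  # check the (frames, num_mels) layout the vocoder expects
```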

@Alexey322
Author

Now I understand, thanks for the help!

@Alexey322
Author

Alexey322 commented Aug 4, 2020

I changed the vocoder parameters to the same ones used in tacotron 2.

Please carefully check the feature extraction function definitions in both repositories.
There may be differences that cannot be controlled by the config.

It's a bit complicated, but you can follow this procedure:

1. Dump GTA outputs as `.npy` format for training and dev set

2. Make `feats.scp` files of training and dev set like
   ```
   utt_id_1 /path/to/utt_id_1_npy_file.npy
   utt_id_2 /path/to/utt_id_2_npy_file.npy
   ...	
   ```
   
   
   You need to keep the `utt_id` the same as in the `wav.scp` created in stage 0.

3. Run training function, e.g.,
   ```shell
   parallel-wavegan-train \
           --config "${conf}" \
           --train-wav-scp /path/to/train_wav_scp \
           --train-feats-scp /path/to/train_feats_scp \
           --dev-wav-scp /path/to/dev_wav_scp \
           --dev-feats-scp /path/to/dev_feats_scp \
           --outdir "${expdir}" \
           --resume "${resume}" \
           --verbose "${verbose}"
   ```

@kan-bayashi, I am trying to run this training command without using run.sh. What structure should the audio files have? I made a GTA dump and converted the audio to the form produced at stage 1 (npy format, with the length matched to the GTA mel).
I split the files into 4 parts:

training_feats.scp:
utt_id_1 /path/to/utt_id_1.mel.npy
utt_id_2 /path/to/utt_id_2.mel.npy
training_wavs.scp:
utt_id_1 /path/to/utt_id_1.wav.npy
utt_id_2 /path/to/utt_id_2.wav.npy

and 10% for the dev part in the same format.

When I run the parallel-wavegan-train command with the parameters above, I get the error:
...
python3.6.7\lib\site-packages\kaldiio\utils.py:376: UserWarning: An error happens at loading "wavs/0x000f4ed7.wav.npy
...
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
It sounds like the audio should be in a different format. Can you please tell me what I'm doing wrong?

@Alexey322 Alexey322 reopened this Aug 4, 2020
@kan-bayashi
Owner

The files listed in wav.scp must be in .wav format, not .npy (feats.scp is OK).
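If your length-matched audio currently exists only as .npy arrays, one option is to write it back to .wav, e.g. with soundfile (a sketch; it assumes float waveform arrays and that you know the sampling rate):

```python
# Sketch: write the length-matched numpy waveforms back to .wav files so that
# wav.scp can point at real .wav files. Paths and the sampling rate are assumptions.
from pathlib import Path
import numpy as np
import soundfile as sf

sampling_rate = 22050  # use the sampling rate of your corpus
for npy_path in sorted(Path("wavs").glob("*.wav.npy")):
    audio = np.load(npy_path)
    out_path = npy_path.with_suffix("")  # "xxx.wav.npy" -> "xxx.wav"
    sf.write(str(out_path), audio, sampling_rate)
```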

@Alexey322
Author

Alexey322 commented Aug 5, 2020

@kan-bayashi ,
I train the model on GTA mel spectrograms without the vocoder's standardization and log10, using the synthesized mel spectrograms from NVIDIA Tacotron 2, which after synthesis are normalized and converted with the natural logarithm. The vocoder is melgan.v1; after 260k iterations the audio quality is not very good, and the losses stopped decreasing after a certain number of iterations. How can I improve the quality?

The config params for the vocoder are the same as in melgan.v1.
Tacotron was trained on an 80/7600 mel basis.
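As a side note on the log basis: if the only mismatch were natural log vs. log10, the features would differ just by a constant scale (a sketch with illustrative paths; any normalization/centering applied by the synthesizer would still have to be handled separately):

```python
# Sketch: converting a natural-log mel spectrogram to log10 is a constant rescaling,
# since log10(x) = ln(x) / ln(10). Paths are illustrative.
import numpy as np

mel_ln = np.load("dump/gta/utt_id_1.npy")  # mel assumed to be in natural log
mel_log10 = mel_ln / np.log(10.0)
```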

Here are the TensorBoard losses for eval:
[image: TensorBoard eval losses]

and for train:
[image: TensorBoard train losses]

Synthesized audio:
https://drive.google.com/file/d/1AyaF46aSdFSwF0SahYRmm32K9NZ0bmur/view?usp=sharing

@kan-bayashi
Owner

melgan.v1 uses MelGAN generator + PWG discriminator.
If you want to use normal MelGAN, a full_band_melgan-based config will give better results.
https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/ljspeech/voc1/conf/full_band_melgan.v2.yaml

@Alexey322
Author

@kan-bayashi, thanks, I will try it.
But why does the audio sound so bad? Could this be because I did not apply centering and normalization to the mel spectrograms? Or maybe the mel basis is not suitable for a male voice? I listened to your generated audio for LJSpeech with the melgan.v1 config, and the quality is much better.

@kan-bayashi
Owner

Hmm. One possible reason is the quality of the GTA mel-spectrogram.
If you compare with a model trained using groundtruth mel-spectrograms, you can get more insight.
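For example, a quick diagnostic sketch (paths and the (frames, num_mels) layout are assumptions) to compare basic statistics of a groundtruth mel and the corresponding GTA mel before vocoder training:

```python
# Sketch: compare basic statistics of a groundtruth mel and the corresponding GTA mel.
import numpy as np

gt = np.load("dump/gt/utt_id_1.npy")    # groundtruth mel, assumed (frames, num_mels)
gta = np.load("dump/gta/utt_id_1.npy")  # GTA mel, same layout assumed

for name, m in [("ground truth", gt), ("GTA", gta)]:
    print(f"{name}: shape={m.shape}, min={m.min():.3f}, max={m.max():.3f}, "
          f"mean={m.mean():.3f}, std={m.std():.3f}")
```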
