How to inference using MelGAN given a tacotron mel spec output? #46

Open
OswaldoBornemann opened this issue Mar 9, 2020 · 11 comments

@OswaldoBornemann

OswaldoBornemann commented Mar 9, 2020

When I trained MelGAN on mel spectrograms computed from the original wavs, the results were good.

But when I feed the Tacotron mel spectrogram output into the trained MelGAN model, the audio is nothing but a beeping noise. Would you mind sharing some advice? Thanks a lot. @seungwonpark

@CookiePPP

Can you upload sound samples?

@OswaldoBornemann
Author

@CookiePPP Please turn the volume all the way down first... I don't want to hurt your ears...

bad result.wav.zip

@CookiePPP

CookiePPP commented Mar 9, 2020

Do you have the code you used to feed the tacotron outputs into melgan uploaded somewhere?
That's definitely bugged out.

@OswaldoBornemann
Author

OswaldoBornemann commented Mar 9, 2020

@CookiePPP The process is roughly as follows:

First I get the mel spectrogram output from Tacotron, like this:

# mel_sent has shape (spec_length, 80)
mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

Then I unsqueeze and transpose the mel result to feed it into MelGAN.

checkpoint_path = "./melgan/chkpt/id_test1/id_test1_aca5990_0700.pt"
config = "./melgan/config/id_test1.yaml"

checkpoint = torch.load(checkpoint_path)
# if args.config is not None:
#     hp = HParam(config)
# else:
hp = load_hparam_str(checkpoint['hp_str'])

melgan_model = Generator(hp.audio.n_mel_channels).cuda()
melgan_model.load_state_dict(checkpoint['model_g'])
melgan_model.eval()

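with torch.no_grad():
    # (spec_length, 80) -> (1, 80, spec_length), the layout the generator expects
    mel = torch.from_numpy(mel_sent).unsqueeze(0).transpose(2, 1)
    mel = mel.cuda()

    # Call the MelGAN generator here, not the Tacotron `model` used above.
    audio = melgan_model.inference(mel)
    audio = audio.cpu().detach().numpy()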
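
Before suspecting the vocoder itself, it can help to confirm the layout and value range of what is being fed in. The following is a minimal NumPy-only sketch of that check, not code from either repo: `to_melgan_input` and `describe_range` are hypothetical helper names, and the random array only stands in for a real `mel_sent`.

```python
import numpy as np

def to_melgan_input(mel_sent):
    """Turn a (frames, n_mels) array into a (1, n_mels, frames) float32 batch."""
    mel = np.asarray(mel_sent, dtype=np.float32)
    if mel.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {mel.shape}")
    # Same effect as torch's unsqueeze(0).transpose(2, 1) in the snippet above.
    return mel.T[np.newaxis, :, :]

def describe_range(mel):
    """Return (min, max, mean) for comparison with the vocoder's training data."""
    return float(mel.min()), float(mel.max()), float(mel.mean())

# Stand-in for a real Tacotron output: 120 frames, 80 mel channels (made-up values).
fake_mel_sent = np.random.default_rng(0).uniform(-4.0, 4.0, size=(120, 80))
batch = to_melgan_input(fake_mel_sent)
print(batch.shape, describe_range(batch))
```

Running `describe_range` on both a mel spectrogram from the MelGAN training set and one from the synthesizer makes a scale mismatch visible at a glance.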

@CookiePPP

CookiePPP commented Mar 9, 2020

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

Where does this line come from? This repo is designed to interface with NVIDIA's Tacotron 2 implementation (NVIDIA/tacotron2).
NVIDIA uses its own spectrogram conversion, which I believe outputs values between -12 and 2.
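
For context on where a lower bound near -12 could come from, here is a hedged sketch of that style of conversion. It assumes the compression is a natural log applied after clamping magnitudes at 1e-5, which is my recollection of the NVIDIA code rather than something stated in this thread; the function name and input values are made up for illustration.

```python
import numpy as np

def nvidia_style_log_compress(magnitudes, clip_val=1e-5):
    """Natural log of the magnitudes after clamping them to at least clip_val.

    Assumption: this mirrors the dynamic range compression used for the
    NVIDIA-style mel spectrograms; check the actual repo before relying on it.
    """
    return np.log(np.clip(magnitudes, clip_val, None))

# Made-up linear mel magnitudes, including digital silence (0.0).
mags = np.array([0.0, 1e-5, 0.01, 1.0, 7.0])
compressed = nvidia_style_log_compress(mags)
print(compressed.min(), compressed.max())
```

Under that assumption the floor is ln(1e-5), roughly -11.5, which is consistent with the "-12 to 2" estimate.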

@OswaldoBornemann
Author

OswaldoBornemann commented Mar 9, 2020

@CookiePPP I see. I'm using Mozilla TTS instead.

@OswaldoBornemann
Author

@CookiePPP I would like to know whether we could use Tacotron GTA (ground-truth-aligned) output to train MelGAN.

@CookiePPP

CookiePPP commented Mar 9, 2020

@tsungruihon
You should be able to scale the output and get an audible result. I don't know what range Mozilla TTS has, but try to transform the Mozilla output to match the Nvidia one.
e.g.

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)
mel_sent = (mel_sent * 0.5) + 2

and replace 0.5 and +2 with the values that map the spectrogram into the range -12 to 2.
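
One way to pick those two numbers is to solve for the linear map that sends the source minimum and maximum onto -12 and 2. The sketch below assumes a source range of [-4, 4] purely for illustration (the thread does not say what range Mozilla TTS produces), and `fit_linear_rescale` is a hypothetical helper, not part of either library.

```python
import numpy as np

def fit_linear_rescale(src_min, src_max, dst_min=-12.0, dst_max=2.0):
    """Return (scale, offset) so that x * scale + offset maps src range to dst range."""
    if src_max <= src_min:
        raise ValueError("src_max must be greater than src_min")
    scale = (dst_max - dst_min) / (src_max - src_min)
    offset = dst_min - scale * src_min
    return scale, offset

# Hypothetical source range; measure the real one on your own synthesizer output.
scale, offset = fit_linear_rescale(-4.0, 4.0)

mel_sent = np.array([[-4.0, 0.0, 4.0]])  # toy stand-in for a real spectrogram
rescaled = mel_sent * scale + offset
print(scale, offset, rescaled)
```

A linear map like this can only be expected to help if both spectrograms are log-scaled and were computed with matching STFT and mel filterbank settings; if those differ, rescaling alone is unlikely to be enough.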

@CookiePPP I would like to know whether we could use Tacotron GTA (ground-truth-aligned) output to train MelGAN.

Not sure. I'm busy today, so I can't really help you there.

@OswaldoBornemann
Author

@CookiePPP Really appreciated. Thanks a lot.

@mennatallah644

I'm facing the same problem. Did you find a solution?
@tsungruihon

@OswaldoBornemann
Author

Please visit https://github.com/mozilla/TTS

3 participants