Simplified upsampling #4
Hi @geneing, thanks for all your hard work! I was wondering why you decided to abandon the simplified upsampling in your model_simplification branch. Was the audio quality significantly worse?

@bshall Well, the main reason for the simplified upsampling was to improve data flow. The upsampling part contains a 5-tap convolution, which requires padding the input mels with at least 2 empty frames on each side. This adds a significant amount of work when doing parallel synthesis (splitting the input mels in time and synthesizing the pieces in parallel, since each piece has to be padded), and one has to be very careful when stitching the padded waveform pieces back together. It turned out that the network-based upsampling actually shifts the mels slightly in time, which simple interpolation wasn't doing; this resulted in slightly lower-quality speech. Keep in mind that upsampling is a tiny part of the overall timing; most of the work is done in the RNN and the post-net FC layers. I'm starting to think about implementing streaming synthesis for the C++ library (i.e. don't wait for all the mel frames to be ready, but generate as mel frames are added), so I may take another look at upsampling to avoid doing convolutions.
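For anyone following along, here is a minimal PyTorch sketch of the padding bookkeeping described above. Only the 5-tap kernel comes from the comment; the channel count, chunk size, and all variable names are assumptions for illustration, not code from the repository:

```python
import torch
import torch.nn.functional as F

# Hypothetical 5-tap convolution over the time axis, standing in for the
# upsampling network's conv. 80 mel channels and all shapes are assumed.
conv = torch.nn.Conv1d(in_channels=80, out_channels=80, kernel_size=5)

mels = torch.randn(1, 80, 1000)  # (batch, n_mels, frames)

# Full-utterance pass: pad 2 frames on each side so the output keeps
# the same number of frames as the input.
full = conv(F.pad(mels, (2, 2)))

# Chunked (parallel) pass: each chunk needs 2 frames of real context
# from its neighbours (or zero padding at the utterance edges), and the
# stitched result only matches the full pass if that context is handled
# exactly right -- the bookkeeping the comment refers to.
chunk_size = 250
chunks = []
for start in range(0, mels.size(-1), chunk_size):
    lo = max(0, start - 2)
    hi = min(mels.size(-1), start + chunk_size + 2)
    piece = F.pad(mels[:, :, lo:hi],
                  (2 - (start - lo), 2 - (hi - start - chunk_size)))
    chunks.append(conv(piece))
stitched = torch.cat(chunks, dim=-1)

assert torch.allclose(full, stitched)  # identical only with careful padding
```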
Thanks for the response @geneing. Yeah, streaming synthesis would be really cool. I was wondering whether simple "nearest" upsampling would be good enough to replace the upsampling network.
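A sketch of what that replacement might look like, assuming a hypothetical hop_length of 275 frames per mel step; repeat_interleave is just one way to express nearest-neighbour upsampling, and it needs no padding, so chunks can be synthesized independently:

```python
import torch

hop_length = 275                 # assumed value, not from the repository
mels = torch.randn(1, 80, 100)   # (batch, n_mels, frames)

# Nearest-neighbour upsampling: each mel frame is repeated hop_length
# times along the time axis, with no learned parameters.
upsampled = mels.repeat_interleave(hop_length, dim=-1)
print(upsampled.shape)           # torch.Size([1, 80, 27500])
```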