Description
Paper
Link: https://arxiv.org/pdf/2106.09660.pdf
Year: 2021
Summary
- WaveGrad 2 is a text-to-speech model that synthesizes the waveform directly from a phoneme sequence, without using hand-designed intermediate features (e.g., spectrograms)
Methods
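Three modules:
- encoder: takes the phoneme sequence as input and extracts hidden representations
- resampling: changes the time resolution of the encoder output to match the output time scale
- decoder: generates the waveform by iterative refinement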
encoder:
- 3 convolution layers with batch normalization and dropout
- a bidirectional LSTM layer
- zoneout regularization on the LSTM
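To make the zoneout item concrete, here is a minimal sketch of the idea in plain Python: during training each hidden unit randomly keeps its previous value instead of taking the new one, and at inference the expected blend is used. The function name `zoneout`, its arguments, and the list-based representation are illustrative assumptions, not taken from the paper, and the zoneout rate used by WaveGrad 2 is not stated in these notes.

```python
import random


def zoneout(h_prev, h_new, rate, training=True, rng=random):
    """Blend previous and candidate hidden states, unit by unit.

    h_prev, h_new: lists of floats of equal length (hypothetical state vectors).
    rate: probability that a unit keeps its previous value.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if training:
        # Each unit keeps its previous value with probability `rate`.
        return [p if rng.random() < rate else n for p, n in zip(h_prev, h_new)]
    # At inference, replace the random mask by its expected value.
    return [rate * p + (1.0 - rate) * n for p, n in zip(h_prev, h_new)]
```

With `rate=0.0` the state is always updated, and with `rate=1.0` it never changes, so the rate controls how strongly the recurrent state is held in place during training.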
resampling:
- Gaussian upsampling, as introduced in Non-Attentive Tacotron
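As a rough illustration of Gaussian upsampling, the sketch below places one Gaussian per input token, centered in the middle of that token's duration, and forms each output frame as a normalized weighted sum of the token vectors. The function name, the list-based representation, and the choice to evaluate each frame at its midpoint are assumptions made for this example. In the actual model the durations and standard deviations are predicted by the network, whereas here they are simply passed in.

```python
import math


def gaussian_upsample(hidden, durations, sigmas):
    """Upsample token vectors to frame vectors with Gaussian weights.

    hidden: list of token vectors (lists of floats).
    durations: per-token durations, in output frames.
    sigmas: per-token standard deviations, in output frames (must be > 0).
    """
    # Token centers: c_i = (sum of durations up to and including i) - d_i / 2.
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    n_frames = int(round(end))
    dim = len(hidden[0])
    output = []
    for t in range(n_frames):
        pos = t + 0.5  # frame midpoint (an assumption of this sketch)
        # Gaussian density of each token at this frame (constant factor dropped).
        dens = [math.exp(-0.5 * ((pos - c) / s) ** 2) / s
                for c, s in zip(centers, sigmas)]
        total = sum(dens)
        weights = [x / total for x in dens]  # weights sum to 1 over tokens
        output.append([sum(w * h[k] for w, h in zip(weights, hidden))
                       for k in range(dim)])
    return output
```

For example, `gaussian_upsample([[1.0, 0.0], [0.0, 1.0]], [2, 3], [0.5, 0.5])` returns five frames, with the first frames dominated by the first token and the last frames by the second. Larger standard deviations blur the boundary between neighboring tokens.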
decoder:
- consists of upsampling blocks and downsampling blocks
Results
- the number of refinement steps trades off fidelity against speed: fewer steps run faster, more steps improve fidelity
- experiments show that WaveGrad 2 generates high-fidelity audio, comparable to strong baselines
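The fidelity/speed tradeoff follows from the structure of the sampling loop, which can be sketched as below: generation starts from Gaussian noise and calls the network once per refinement step, so the cost grows linearly with the number of steps. This is a generic diffusion-style sampler written under stated assumptions, not the authors' code. `predict_noise` is a hypothetical stand-in for the trained decoder (text conditioning is omitted), and the noise schedule `betas` is chosen by the caller.

```python
import math
import random


def refine(predict_noise, betas, length, rng=random):
    """Run len(betas) refinement steps, starting from Gaussian noise.

    predict_noise(y, noise_level): hypothetical network returning a noise
        estimate with the same length as y.
    betas: noise schedule, one value in (0, 1) per refinement step.
    """
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)  # cumulative product of alphas

    y = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for n in reversed(range(len(betas))):
        # One network call per step: this is what makes more steps slower.
        eps = predict_noise(y, math.sqrt(alpha_bars[n]))
        coef = (1.0 - alphas[n]) / math.sqrt(1.0 - alpha_bars[n])
        y = [(yi - coef * ei) / math.sqrt(alphas[n]) for yi, ei in zip(y, eps)]
        if n > 0:
            # Add fresh noise on every step except the last one.
            var = (1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n]
            y = [yi + math.sqrt(var) * rng.gauss(0.0, 1.0) for yi in y]
    return y
```

Passing a shorter `betas` list gives faster generation at the cost of a coarser refinement, which is the knob the results refer to.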