You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
text to speech synthesis, sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms
Methods
network is composed of an encoder and a decoder with attention
encoder converts a character sequence into representations
decoder consumes to predict a spectrogram
attention network which summarizes the full encoded sequence as a fixed-length context vector for each decoder output step
concatenation of the LSTM output and the attention context vector is projected through a linear transform to predict the target spectrogram frame