Description
Paper
Link: https://arxiv.org/pdf/1711.10433.pdf
Year: 2017
Summary
- Parallel WaveNet: high-fidelity speech synthesis at faster-than-real-time speeds, obtained by distilling an autoregressive WaveNet into a feed-forward model via Probability Density Distillation
Contributions and Distinctions from Previous Works
- generates high-fidelity speech samples more than 20 times faster than real time, with no significant difference in quality from the original (much slower) WaveNet
Methods
- modifies WaveNet into an inverse autoregressive flow (IAF), so that all timesteps of an utterance can be generated in parallel instead of one sample at a time
- uses an already-trained WaveNet as a 'teacher' from which a parallel WaveNet 'student' can efficiently learn
- the student is trained to match the teacher's probabilities: it minimises the KL divergence between its distribution and the teacher's, which amounts to maximising the log-likelihood of its own samples under the teacher while simultaneously maximising its own entropy
- introduces three additional loss terms alongside the distillation loss: power loss, perceptual loss, and contrastive loss
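The parallel-generation idea behind the IAF can be sketched in a few lines: unlike WaveNet, where output x_t depends on previous *outputs* x_{<t} (forcing sequential sampling), an IAF makes x_t depend only on previous *noise inputs* z_{<t}, which are all known upfront. This is a minimal numpy toy, not the paper's architecture; `scale_shift` is an illustrative causal stand-in for the WaveNet-style network that predicts per-step scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6  # toy number of timesteps

def scale_shift(z):
    """Illustrative causal network: outputs at step t depend only on z[:t]."""
    prefix = np.concatenate([[0.0], np.cumsum(z)[:-1]])  # prefix[t] = sum of z[:t]
    s = 1.0 + 0.1 * np.tanh(prefix)   # per-step scale (toy)
    mu = 0.5 * np.tanh(prefix)        # per-step shift (toy)
    return s, mu

z = rng.normal(size=T)   # white-noise input, all timesteps known at once
s, mu = scale_shift(z)
x = z * s + mu           # one IAF layer: every x[t] computed in parallel
```

Because `scale_shift` reads only the prefix of `z`, perturbing `z[t]` cannot change any earlier `x[:t]`, which is exactly the causal structure that lets generation run in one parallel pass.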
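The distillation objective above can be made concrete with a small numpy sketch, assuming toy per-timestep categorical distributions (the real models use mixtures over audio samples; all sizes and names here are illustrative). KL(student || teacher) decomposes into a cross-entropy term, estimated from student samples scored under the teacher, minus the student's own entropy:

```python
import numpy as np

rng = np.random.default_rng(0)
T, Q = 5, 8  # toy timesteps and quantisation levels

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Toy per-timestep output distributions for "teacher" and "student".
teacher_probs = softmax(rng.normal(size=(T, Q)))
student_probs = softmax(rng.normal(size=(T, Q)))

def distillation_loss(student_probs, teacher_probs, n_samples=1000):
    """Monte Carlo estimate of KL(student || teacher)
    = H(student, teacher) - H(student):
    maximise log-likelihood of student samples under the teacher
    while maximising the student's own entropy."""
    T, Q = student_probs.shape
    # Cross-entropy term, estimated from samples drawn from the student.
    ce = 0.0
    for t in range(T):
        samples = rng.choice(Q, size=n_samples, p=student_probs[t])
        ce += -np.log(teacher_probs[t, samples]).mean()
    # Student entropy has a closed form for categorical distributions.
    ent = -(student_probs * np.log(student_probs)).sum()
    return ce - ent

loss = distillation_loss(student_probs, teacher_probs)
```

Distilling the student against itself drives this estimate to roughly zero, while any mismatch with the teacher makes it positive (up to Monte Carlo noise), matching the behaviour of a KL divergence.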
Results
- deployed in production, serving Google Assistant queries
- models audio at a 24 kHz sample rate instead of 16 kHz