Parallel WaveNet has been implemented; partial code will be posted here soon.
Citing 1: Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Citing 2: WaveNet: A Generative Model for Raw Audio
Citing 3: Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
Citing 4: Tacotron: Towards End-to-End Speech Synthesis
Citing 5: PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications
Citing 6: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth
Citing 7: https://github.com/keras-team/keras/blob/master/keras/backend/tensorflow_backend.py#L3254
Citing 8: https://github.com/openai/pixel-cnn
Citing 9: https://github.com/keithito/tacotron
You should read citing6's code first; then you can implement the original WaveNet.
For convenience, we use mel-scale spectrograms computed from real wavs as local conditions. You can train a Tacotron model to get predicted mel-scale spectrograms instead.
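For reference, a minimal sketch of extracting such mel-scale spectrograms with librosa. The sample rate, FFT/hop sizes, and n_mels=80 here are illustrative assumptions, not values fixed by this repo:

```python
import librosa
import numpy as np

def wav_to_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Load a wav and compute a log mel-scale spectrogram as local condition."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(np.maximum(mel, 1e-5))  # log-compress for stable training
    return wav, log_mel.T                    # [frames, n_mels]
```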
A good teacher network is VERY VERY VERY important for training the student network.
- Replace the causal conv1d in citing6 (masked.py) with Keras's implementation; refer to citing7. (See the causal-convolution sketch below.)
- Implement a datafeeder that provides mel and wav pairs; refer to citing9's datafeeder.py. (See the datafeeder sketch below.)
- Use a discretized mixture-of-logistics distribution instead of the 256-way categorical distribution; refer to citing8's nn.py. (See the mixture-of-logistics sketch below.)
- Modify citing6's h512_bo16.py to build the original WaveNet with local conditioning. (See the conditioned-layer sketch below.)
- Train with Adam.
- Modify the teacher's datafeeder to also provide white noise Z drawn from a single logistic mixture: np.random.logistic(size=wav.shape).
- Modify the teacher's h512_bo16.py to build the parallel WaveNet.
- Add the power loss, the cross-entropy loss, etc. (See the power-loss sketch below.)
- Restore the teacher's weights, then train the student. (See the restore sketch below.)
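Sketches for some of the steps above follow. First, causal conv1d in the style of Keras's backend (citing7): left-pad the time axis by (kernel_size - 1) * dilation_rate, then run a VALID convolution, so output[t] never sees future samples. The [batch, time, channels] layout is an assumption:

```python
import tensorflow as tf

def causal_conv1d(x, kernel, dilation_rate=1):
    """x: [batch, time, in_channels]; kernel: [width, in_channels, out_channels]."""
    kernel_size = kernel.shape.as_list()[0]
    pad = (kernel_size - 1) * dilation_rate
    x = tf.pad(x, [[0, 0], [pad, 0], [0, 0]])  # pad the time axis on the left only
    return tf.nn.convolution(x, kernel, padding='VALID',
                             dilation_rate=[dilation_rate])
```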
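A minimal datafeeder sketch in the spirit of citing9's datafeeder.py (which additionally runs a background thread and queue). The .npy layout and the hop size are assumptions about your own preprocessing:

```python
import random
import numpy as np

class DataFeeder:
    def __init__(self, metadata, batch_size=2, seg_frames=64, hop_length=256):
        self._metadata = metadata      # list of (mel_path, wav_path) pairs
        self._batch_size = batch_size
        self._seg_frames = seg_frames  # mel frames per training segment
        self._hop = hop_length         # wav samples per mel frame

    def next_batch(self):
        mels, wavs = [], []
        for mel_path, wav_path in random.sample(self._metadata, self._batch_size):
            mel = np.load(mel_path)    # [frames, n_mels]
            wav = np.load(wav_path)    # [samples], aligned with mel
            start = random.randint(0, mel.shape[0] - self._seg_frames)
            mels.append(mel[start:start + self._seg_frames])
            wavs.append(wav[start * self._hop:(start + self._seg_frames) * self._hop])
        return np.stack(mels), np.stack(wavs)
```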
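A simplified sketch of the discretized mixture-of-logistics loss from citing8's nn.py, reduced to the 1-D audio case (the edge bins at x = ±1 are omitted here; citing8 handles them). `params` packs (logit, mean, log_scale) per mixture component, and the 1/255 bin half-width is carried over from citing8 as an assumption:

```python
import tensorflow as tf

def discretized_mix_logistic_loss(x, params):
    """x: real wav in [-1, 1], [batch, time]; params: [batch, time, 3 * nr_mix]."""
    logit_probs, means, log_scales = tf.split(params, 3, axis=-1)
    log_scales = tf.maximum(log_scales, -7.0)    # clamp for numerical stability
    x = tf.expand_dims(x, -1)                    # broadcast against the mixtures
    inv_s = tf.exp(-log_scales)
    cdf_plus = tf.sigmoid(inv_s * (x - means + 1. / 255.))
    cdf_min = tf.sigmoid(inv_s * (x - means - 1. / 255.))
    # Log probability mass of the bin around x, per mixture component.
    log_probs = tf.log(tf.maximum(cdf_plus - cdf_min, 1e-12))
    log_probs += tf.nn.log_softmax(logit_probs)  # mixture weights
    return -tf.reduce_mean(tf.reduce_logsumexp(log_probs, axis=-1))
```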
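A sketch of local conditioning in one gated WaveNet layer: the mel encoding, upsampled to wav resolution, is projected with 1x1 convolutions and added inside the tanh/sigmoid gate. The filter width and channel count are placeholder assumptions; `causal_conv1d` is the helper sketched above:

```python
import tensorflow as tf

def conditioned_gated_layer(x, mel_up, dilation, channels=128, name='layer'):
    """x: [batch, time, channels]; mel_up: mel encoding upsampled to wav rate."""
    with tf.variable_scope(name):
        w_f = tf.get_variable('w_f', [3, channels, channels])
        w_g = tf.get_variable('w_g', [3, channels, channels])
        filt = causal_conv1d(x, w_f, dilation)
        gate = causal_conv1d(x, w_g, dilation)
        filt += tf.layers.conv1d(mel_up, channels, 1, name='cond_f')  # local condition
        gate += tf.layers.conv1d(mel_up, channels, 1, name='cond_g')
        return tf.tanh(filt) * tf.sigmoid(gate)
```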
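A sketch of the power loss, comparing STFT magnitudes of the student's sample and the real wav. Frame sizes are assumptions; on older TF 1.x releases tf.signal lives under tf.contrib.signal:

```python
import tensorflow as tf

def power_loss(sample, wav, frame_length=1024, frame_step=256):
    """sample, wav: [batch, samples]."""
    s = tf.abs(tf.signal.stft(sample, frame_length, frame_step))
    w = tf.abs(tf.signal.stft(wav, frame_length, frame_step))
    return tf.reduce_mean(tf.squared_difference(s, w))
```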
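Finally, a sketch of restoring only the teacher's weights before distillation. It assumes the teacher's variables were built under a 'teacher' variable scope; the scope name and checkpoint path are assumptions about your own setup:

```python
import tensorflow as tf

teacher_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='teacher')
saver = tf.train.Saver(var_list=teacher_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize student + the rest
    saver.restore(sess, 'logdir/teacher.ckpt')   # hypothetical path; overwrites teacher vars
```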
Teacher training:

```
Data:
    encoding: mel-scale spectrogram
    x: real wav
    θe: encoding network's parameters
    θt: teacher's parameters
Result:
    mu_t, scale_t: teacher's output (logistic mixture parameters)
Procedure:
    for x, encoding in zip(X, ENCODING):
        new_x = shiftright(x)
        new_enc = F(encoding, θe)
        for i in range(layers - 1):
            new_x = H_i(new_x, θt_i)
            new_x += new_enc
        mu_t, scale_t = H_last(new_x, θt_last)  # last layer
        predict_x = logistic(mu_t, scale_t)     # citing8
        loss = cross_entropy(predict_x, x)      # citing8
```
Student training (probability density distillation):

```
Data:
    encoding: mel-scale spectrogram
    z: white noise, z ~ logistic distribution L(0, 1), one mixture
    x: real wav
    θe: encoding network's parameters
    θt: teacher's parameters (frozen)
    θs: student's parameters
    mu_t, scale_t: teacher's output
Result:
    mu_tot, scale_tot: student's output
Procedure:
    for x, z, encoding in zip(X, Z, ENCODING):
        new_enc = F(encoding, θe)
        ### student ###
        mu_tot = 0
        scale_tot = 1
        for f in range(flows):
            new_z = shiftright(z)
            for i in range(layers - 1):
                new_z = H_i(new_z, θs_f_i)
                new_z += new_enc
            mu_s_f, scale_s_f = H_last(new_z, θs_f_last)  # last layer
            mu_tot = mu_s_f + mu_tot * scale_s_f
            scale_tot = scale_tot * scale_s_f
            z = z * scale_s_f + mu_s_f
        sample_x = z  # a sample from logistic(mu_tot, scale_tot), since the input z ~ L(0, 1)
        Power_loss = (|stft(sample_x)| - |stft(x)|)**2
        H(Ps)_loss = log(scale_tot) + 2  # entropy of a logistic with scale s is ln(s) + 2
        ### teacher ###
        new_z = shiftright(sample_x)
        for i in range(layers - 1):
            new_z = H_i(new_z, θt_i)
            new_z += new_enc
        mu_t, scale_t = H_last(new_z, θt_last)  # last layer
        predict_x = logistic(mu_t, scale_t)
        H(Ps,Pt)_loss = cross_entropy(predict_x, sample_x)
        loss = H(Ps,Pt)_loss - H(Ps)_loss + Power_loss
```
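A minimal sketch of assembling the student loss from the pseudocode above. `teacher_log_prob` stands for the teacher's discretized-logistic log-likelihood of the student sample (see the mixture-of-logistics sketch), `power_loss` is the helper sketched earlier, and all names are assumptions:

```python
import tensorflow as tf

def student_loss(sample, wav, log_scale_tot, teacher_log_prob):
    h_ps = tf.reduce_mean(log_scale_tot + 2.0)   # H(Ps): entropy of L(mu, s) is ln(s) + 2
    h_ps_pt = -tf.reduce_mean(teacher_log_prob)  # H(Ps, Pt), estimated on student samples
    return h_ps_pt - h_ps + power_loss(sample, wav)
```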