Signal Processing with Python and Librosa
#1) Voice Reconstruction Using Vq-VAE
This notebook proposes a method for reconstructing speech using Vq-VAE, which was first introduced by Oord et al.
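Before the model sees any speech, the raw waveform is typically converted into a spectrogram-like representation. The short sketch below is not from the original notebook; it shows one plausible preprocessing step with librosa, where the file name `speech.wav`, the sample rate, and the number of mel bands are illustrative assumptions.

```python
import librosa
import numpy as np

# Assumed example values; the notebook's actual path and parameters may differ.
path = "speech.wav"   # hypothetical input file
sr = 22050            # assumed sample rate
n_mels = 80           # assumed number of mel bands

# Load the waveform and compute a log-mel spectrogram that could be fed to the Vq-VAE encoder.
y, sr = librosa.load(path, sr=sr)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```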
#2) Vq-VAE vs VAE
The main difference between Vq-VAE and VAE is that a VAE learns a continuous latent representation of a given dataset, whereas a Vq-VAE learns a discrete latent representation of the dataset.
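To make the contrast concrete, here is a toy NumPy sketch (my own illustration, not the notebook's code): a VAE samples a continuous latent vector from a learned Gaussian, while a Vq-VAE snaps the encoder output to the nearest entry of a fixed-size discrete codebook. All values below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 8  # latent dimension and codebook size (toy values)

# VAE: continuous latent sampled from a learned Gaussian (mu, log_var are stand-ins).
mu, log_var = rng.normal(size=d), rng.normal(size=d)
z_vae = mu + np.exp(0.5 * log_var) * rng.normal(size=d)   # any real-valued vector

# Vq-VAE: discrete latent chosen as the nearest codebook vector to the encoder output.
codebook = rng.normal(size=(k, d))
z_e = rng.normal(size=d)                                   # encoder output
index = np.argmin(np.sum((codebook - z_e) ** 2, axis=1))   # L2 nearest neighbour
z_q = codebook[index]                                      # one of k fixed vectors
```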
#3) Architecture
- At the beginning, the encoder takes a batch of images with input shape $X: (n, h, w, c)$ and outputs $Z_{e}: (n, h, w, d)$.
- Then the vector quantization layer takes $Z_{e}$ and, for each vector in $Z_{e}$, selects the nearest vector from the codebook based on the $L_{2}$ norm, producing $Z_{q}$.
- Finally, the decoder takes $Z_{q}$ and reconstructs the input $X$ (see the sketch after this list).
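A minimal sketch of this pipeline (encoder, vector quantization layer, decoder) is shown below, assuming a small convolutional model in PyTorch. The layer sizes, the codebook size `k`, and the channel-first $(n, c, h, w)$ tensor layout are assumptions for illustration, not the notebook's actual architecture; the straight-through gradient estimator and the codebook/commitment losses are omitted.

```python
import torch
import torch.nn as nn

# Toy sizes; the notebook's real encoder/decoder and codebook are not specified here.
c, d, k = 1, 64, 512

class VQVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: X (n, c, h, w) -> Z_e (n, d, h/2, w/2)
        self.encoder = nn.Sequential(
            nn.Conv2d(c, d, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d, d, 3, padding=1),
        )
        # Codebook of k vectors of dimension d.
        self.codebook = nn.Embedding(k, d)
        # Decoder: Z_q -> reconstruction of X
        self.decoder = nn.Sequential(
            nn.Conv2d(d, d, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(d, c, 4, stride=2, padding=1),
        )

    def quantize(self, z_e):
        # Replace each d-dimensional vector in Z_e by its nearest codebook vector (L2).
        n, d_, h, w = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, d_)        # (nhw, d)
        dists = torch.cdist(flat, self.codebook.weight)       # (nhw, k)
        z_q = self.codebook(dists.argmin(dim=1))              # (nhw, d)
        return z_q.reshape(n, h, w, d_).permute(0, 3, 1, 2)   # (n, d, h, w)

    def forward(self, x):
        z_e = self.encoder(x)
        z_q = self.quantize(z_e)
        return self.decoder(z_q)

if __name__ == "__main__":
    x = torch.randn(2, c, 32, 32)   # dummy batch
    print(VQVAE()(x).shape)         # torch.Size([2, 1, 32, 32])
```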
#4) Detailed View on Vq-VAE Architecture
- Reshaping: First of all, we reshape the encoder output from $(n, h, w, d)$ to $(nhw, d)$.
- Calculating Distances: For each of the $nhw$ d-dimensional vectors, we calculate its distance to each of the $k$ d-dimensional vectors in the codebook and get a matrix of shape $(nhw, k)$.
- Argmin: Next, for each row of this matrix, we apply the argmin function to get the index of the nearest codebook vector and one-hot encode each row (the position of the nearest vector is set to 1 and the rest to 0).
- Index from Codebook: After that, we multiply the one-hot matrix by the whole codebook and get back a matrix of shape $(nhw, d)$.
- Finally, we reshape $(nhw, d)$ to $(n, h, w, d)$ and give it to the decoder to reconstruct the input data (a step-by-step sketch follows this list).
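The following NumPy sketch walks through these five steps on toy shapes; the sizes `n, h, w, d, k` and the random tensors are illustrative assumptions, not the notebook's actual values, and gradient handling is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, w, d, k = 2, 4, 4, 8, 16           # toy shapes (assumed)

z_e = rng.normal(size=(n, h, w, d))       # encoder output
codebook = rng.normal(size=(k, d))        # k codebook vectors of dimension d

# 1) Reshaping: (n, h, w, d) -> (nhw, d)
flat = z_e.reshape(-1, d)

# 2) Calculating distances: squared L2 distance to every codebook vector -> (nhw, k)
dists = np.sum((flat[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)

# 3) Argmin + one-hot encoding: 1 at the nearest codebook index, 0 elsewhere
indices = np.argmin(dists, axis=1)
one_hot = np.eye(k)[indices]              # (nhw, k)

# 4) Index from codebook: one-hot matrix times the codebook -> (nhw, d)
z_q_flat = one_hot @ codebook

# 5) Reshape back to (n, h, w, d) and hand the result to the decoder
z_q = z_q_flat.reshape(n, h, w, d)
print(z_q.shape)  # (2, 4, 4, 8)
```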
#5) Some High Resolution Reconstructed Images