Description
Paper
Link: https://arxiv.org/pdf/2006.11477v2.pdf
Year: 2020
Summary
wav2vec 2.0 is a framework for self-supervised learning of representations from raw audio data: it masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations.
Methods
- Feature encoder. A multi-layer convolutional neural network that takes raw audio as input and outputs latent speech representations; it consists of several blocks, each containing a temporal convolution followed by layer normalization and a GELU activation function (a minimal sketch follows this list). Spans of the resulting latent representations are then masked.
- Contextualized representations with Transformers. The masked latent representations are fed to a Transformer network that builds contextualized representations capturing information from the entire sequence; the model is trained via a contrastive task in which the true latent must be distinguished from distractors (a sketch of the contrastive loss follows this list).
- Instead of fixed positional embeddings, which encode absolute positional information, a convolutional layer with kernel size 128 and 16 groups acts as a relative positional embedding; the output of the convolution, followed by a GELU, is added to the inputs, and layer normalization is then applied (see the sketch after this list).
- Quantization module. Discrete speech units are learned via a Gumbel softmax to represent the latent representations in the contrastive task; the Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way (see the quantizer sketch after this list).
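A minimal PyTorch sketch of the feature encoder described above, assuming the block structure from the paper (temporal convolution, layer normalization, GELU); the channel sizes, kernels, and strides shown here are illustrative and the stack is shortened for brevity:

```python
import torch
import torch.nn as nn

class ConvFeatureBlock(nn.Module):
    """One feature-encoder block: temporal conv -> LayerNorm -> GELU."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int, stride: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride)
        self.norm = nn.LayerNorm(out_ch)  # normalizes over the channel dim
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = self.conv(x)
        x = x.transpose(1, 2)  # LayerNorm expects channels last
        x = self.norm(x)
        x = x.transpose(1, 2)
        return self.act(x)

# Stack of blocks mapping raw waveform (batch, 1, samples) to latent
# speech representations (shortened; the paper uses seven blocks).
encoder = nn.Sequential(
    ConvFeatureBlock(1, 512, kernel=10, stride=5),
    ConvFeatureBlock(512, 512, kernel=3, stride=2),
    ConvFeatureBlock(512, 512, kernel=3, stride=2),
)
latents = encoder(torch.randn(2, 1, 16000))  # ~1 s of 16 kHz audio
```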
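The convolutional relative positional embedding could look like the following sketch: a grouped temporal convolution (kernel size 128, 16 groups) whose GELU-activated output is added to the inputs before layer normalization. Trimming the extra frame produced by the even kernel size is an implementation assumption to keep sequence lengths aligned:

```python
import torch
import torch.nn as nn

class ConvPositionalEmbedding(nn.Module):
    """Relative positional embedding via a grouped temporal convolution."""
    def __init__(self, dim: int = 768, kernel: int = 128, groups: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=groups)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) masked latent representations
        pos = self.conv(x.transpose(1, 2))
        pos = pos[..., : x.size(1)]  # even kernel yields one extra frame; trim it
        x = x + self.act(pos.transpose(1, 2))  # add positional signal to inputs
        return self.norm(x)
```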
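A sketch of the quantization module using PyTorch's built-in `torch.nn.functional.gumbel_softmax`: with `hard=True`, a discrete codebook entry is selected in the forward pass while gradients flow through the soft distribution (straight-through), which is what makes the choice fully differentiable. The product-quantization sizes (2 codebook groups of 320 entries) follow the paper; the projection layer and output dimension here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    """Product quantization of latents with a straight-through Gumbel softmax."""
    def __init__(self, dim: int = 512, groups: int = 2,
                 entries: int = 320, out_dim: int = 256):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.to_logits = nn.Linear(dim, groups * entries)
        # One codebook of `entries` codewords per group.
        self.codebook = nn.Parameter(torch.randn(groups, entries, out_dim // groups))

    def forward(self, z: torch.Tensor, tau: float = 2.0) -> torch.Tensor:
        # z: (batch, time, dim) latent speech representations
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.entries)
        # hard=True: discrete one-hot selection forward, soft gradients backward
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        q = torch.einsum("btge,gev->btgv", onehot, self.codebook)
        return q.reshape(b, t, -1)  # concatenate the group codewords
```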
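Finally, a sketch of the contrastive objective from the Transformer bullet above: the context vector at a masked time step must identify the true quantized latent among distractors, using cosine similarity scaled by a temperature. The function name and the single-step framing are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c: torch.Tensor, q_true: torch.Tensor,
                     distractors: torch.Tensor, kappa: float = 0.1) -> torch.Tensor:
    # c: (dim,) context vector at a masked step; q_true: (dim,) its quantized latent
    # distractors: (K, dim) quantized latents sampled from other masked steps
    candidates = torch.cat([q_true.unsqueeze(0), distractors], dim=0)  # (K+1, dim)
    sims = F.cosine_similarity(c.unsqueeze(0), candidates, dim=-1) / kappa
    # Cross-entropy with the true latent at index 0.
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```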
Results
- Ultra-low-resource speech recognition: using only 10 minutes of labeled data, the approach achieves a word error rate (WER) of 5.2/8.6 on the clean/noisy test sets of Librispeech.