Description
Paper
Link: https://arxiv.org/pdf/2002.02562.pdf
Year: 2020
Summary
- uses the attention mechanism from Transformer-XL and applies it to speech recognition
- end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system
Contributions and Distinctions from Previous Works
- replaces the LSTM encoders of RNN-T with Transformer encoders
- Unlike a typical attention-based sequence-to-sequence model, which attends over the entire input for every prediction in the output sequence, the RNN-T model produces a probability distribution over the label space at every time step; the output label space includes an additional blank (null) label to indicate the lack of output for that time step, similar to the Connectionist Temporal Classification (CTC) framework. Unlike CTC, however, this label distribution is also conditioned on the previous label history (see the decoding sketch below)
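A minimal greedy-decoding sketch of how the blank label and the label-history conditioning interact. The `label_encoder.start`/`label_encoder.step` and `joint` interfaces here are hypothetical placeholders, not the paper's API:

```python
import torch

def greedy_decode(audio_enc, label_encoder, joint, blank_id, max_per_frame=5):
    # audio_enc: (T, d) sequence of encoded audio frames
    hyp = []                                      # emitted label history
    label_vec = label_encoder.start()             # encoding of the empty history
    t, emitted = 0, 0
    while t < audio_enc.size(0):
        logits = joint(audio_enc[t], label_vec)   # distribution over labels + blank
        k = int(logits.argmax())
        if k == blank_id or emitted >= max_per_frame:
            t, emitted = t + 1, 0                 # blank: consume a frame, emit nothing
        else:
            hyp.append(k)                         # emit a label, stay on the same frame
            label_vec = label_encoder.step(k)     # condition on the new label history
            emitted += 1
    return hyp
```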
Methods
- The RNN-T architecture parameterizes P(z|x) with an audio encoder, a label encoder, and a joint network. The two encoders are neural networks that encode the input audio sequence and the target label sequence, respectively (see the joint-network sketch after this list)
- In addition, to model sequential order, the model uses the relative positional encoding proposed in Transformer-XL. With relative positional encoding, the positions affect only the attention scores, not the values being summed (see the second sketch below).
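A minimal sketch of the RNN-T joint network combining the two encoder outputs, assuming PyTorch; the dimension names and the tanh combination are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    def __init__(self, audio_dim, label_dim, hidden_dim, vocab_size):
        super().__init__()
        # vocab_size includes the extra blank (null) label
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.label_proj = nn.Linear(label_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio_enc, label_enc):
        # audio_enc: (B, T, audio_dim)  frames from the audio encoder
        # label_enc: (B, U, label_dim)  states from the label encoder
        # broadcast-add to form a (B, T, U, hidden_dim) lattice
        joint = torch.tanh(
            self.audio_proj(audio_enc).unsqueeze(2)
            + self.label_proj(label_enc).unsqueeze(1)
        )
        return self.out(joint)  # logits over labels + blank, per (t, u) pair
```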
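And a simplified sketch of relative positional encoding entering only the attention scores. Transformer-XL's global content/position biases and its efficient shift trick are omitted here for clarity; this just shows that the values V are never touched by position information:

```python
import torch

def rel_attention_scores(q, k, rel_pos_emb):
    # q, k: (T, d) queries and keys for a single head (batch omitted for clarity)
    # rel_pos_emb: (2T-1, d) embeddings for relative offsets -(T-1)..(T-1)
    T, d = q.shape
    content = q @ k.T                                 # content-based term
    # gather the relative embedding for each (i, j) pair: offset i - j
    offsets = torch.arange(T).unsqueeze(1) - torch.arange(T).unsqueeze(0)
    r = rel_pos_emb[offsets + (T - 1)]                # (T, T, d)
    position = torch.einsum('id,ijd->ij', q, r)       # position-based term
    # positions modify only the scores; the values are summed unchanged
    return (content + position) / d ** 0.5
```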
Comments
Conformer: Convolution-augmented Transformer for Speech Recognition later reports results that beat this model's performance