
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss #44

Paper

Link: https://arxiv.org/pdf/2002.02562.pdf
Year: 2020

Summary

  • uses the attention mechanism from Transformer-XL and applies it to speech recognition
  • an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system (see the mask sketch after this list)
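
For streaming, the key constraint is that each audio frame may attend only to a bounded window of past frames and little or no future. A minimal sketch of such a self-attention mask in PyTorch; the function name and context sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch

def streaming_attention_mask(num_frames: int, left_context: int,
                             right_context: int) -> torch.Tensor:
    """Boolean mask (True = may attend): frame i sees frames j with
    i - left_context <= j <= i + right_context, so the encoder can
    run incrementally as audio arrives."""
    idx = torch.arange(num_frames)
    offset = idx[None, :] - idx[:, None]   # offset[i, j] = j - i
    return (offset >= -left_context) & (offset <= right_context)

# Example: 10 frames, 3 frames of left context, no look-ahead.
mask = streaming_attention_mask(10, left_context=3, right_context=0)
```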

Contributions and Distinctions from Previous Works

  • replaces the LSTM encoders of the original RNN-T with Transformer encoders
  • Unlike a typical attention-based sequence-to-sequence model, which attends over the entire input for every prediction in the output sequence, the RNN-T model gives a probability distribution over the label space at every time step, and the output label space includes an additional blank (null) label to indicate that no output is produced at that step, similar to the Connectionist Temporal Classification (CTC) framework. Unlike CTC, however, this label distribution is also conditioned on the previous label history (see the joint-network sketch after this list)
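
The conditioning on label history comes from the label encoder (prediction network): at each (time t, label position u), a joint network fuses the audio encoding f_t with the label-history encoding g_u into a distribution over the vocabulary plus blank. A minimal sketch in PyTorch; the layer sizes, names, and tanh fusion are common RNN-T choices assumed here, not taken from the paper:

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Fuses audio-encoder state f_t and label-encoder state g_u
    into log P(k | t, u) over the vocabulary plus a blank label."""
    def __init__(self, enc_dim: int, pred_dim: int,
                 joint_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for blank

    def forward(self, f_t: torch.Tensor, g_u: torch.Tensor) -> torch.Tensor:
        h = torch.tanh(self.proj(torch.cat([f_t, g_u], dim=-1)))
        return self.out(h).log_softmax(dim=-1)
```

Unlike CTC, g_u changes as labels are emitted, so the distribution at a given frame depends on what has been predicted so far.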

Methods

(Figure: RNN-T model architecture, showing the audio encoder, label encoder, and joint network)

  • The RNN-T architecture parameterizes P(z|x) with an audio encoder, a label encoder, and a joint network. The encoders are two neural networks that encode the input audio sequence and the target label sequence, respectively
  • In addition, to model sequential order, the model uses the relative positional encoding proposed in Transformer-XL. With relative positional encoding, position information affects only the attention scores, not the values being summed (see the sketch below)
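
In the Transformer-XL formulation, the query-key score decomposes into content terms and relative-position terms, while the value vectors stay position-free. A single-head sketch in PyTorch; it indexes a relative embedding per query/key pair for clarity rather than using Transformer-XL's efficient shift trick, and all tensor names are illustrative:

```python
import torch

def rel_attention_scores(q, k, rel_emb, u, v):
    """Transformer-XL-style attention scores for one head.
    q, k:    (seq, d) query and key vectors (content only)
    rel_emb: (2*seq - 1, d) embeddings for offsets -(seq-1)..(seq-1)
    u, v:    (d,) learned global biases for content / position terms
    Positions enter only these scores; the values averaged by the
    resulting attention weights carry no positional signal."""
    seq, d = q.shape
    idx = torch.arange(seq)
    # r[i, j] = embedding of relative offset (i - j)
    r = rel_emb[(idx[:, None] - idx[None, :]) + (seq - 1)]  # (seq, seq, d)
    content = (q + u) @ k.t()                        # q·k and u·k terms
    position = torch.einsum('id,ijd->ij', q + v, r)  # q·r and v·r terms
    return (content + position) / d ** 0.5
```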

Comments

The Conformer paper (Conformer: Convolution-augmented Transformer for Speech Recognition) reports better results than this model
