
Conformer: Convolution-augmented Transformer for Speech Recognition #43

@jinglescode

Description


Paper

Link: https://arxiv.org/abs/2005.08100
Year: 2020

Summary

  • combine convolutions with self-attention in ASR models
  • self-attention captures the global interactions, while the convolutions efficiently capture the relative-offset-based local correlations

Methods

[figure: Conformer encoder model architecture]

  • Multi-Headed Self-Attention Module: employ multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL, the relative sinusoidal positional encoding scheme

[figure: Multi-Headed Self-Attention module]
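A minimal sketch of the relative sinusoidal positional encodings borrowed from Transformer-XL: instead of encoding absolute positions, one vector is produced per relative offset between query and key positions. The function name and layout are illustrative, not from any library.

```python
import math

def relative_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings for relative offsets T-1 .. -(T-1).

    Returns 2*seq_len - 1 vectors, one per relative offset, as used
    in Transformer-XL style relative attention (illustrative sketch).
    """
    positions = range(seq_len - 1, -seq_len, -1)  # T-1, ..., 0, ..., -(T-1)
    encodings = []
    for pos in positions:
        vec = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            vec.append(math.sin(angle))  # even dims: sine
            vec.append(math.cos(angle))  # odd dims: cosine
        encodings.append(vec[:d_model])
    return encodings

enc = relative_positional_encoding(seq_len=4, d_model=8)
# 2*4 - 1 = 7 relative offsets, each encoded in 8 dimensions
```

Because the encoding depends only on the offset, the attention scores generalize better to utterance lengths not seen in training.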

  • Convolution Module: a pointwise convolution with a GLU activation (which halves the channel dimension), followed by a 1-D depthwise convolution, batch normalization, a Swish activation, and a final pointwise convolution

[figure: Convolution module]
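The GLU gating at the front of the convolution module can be sketched in a few lines: the channels are split in half, and one half gates the other through a sigmoid. This is a plain-Python illustration on a single channel vector, not the batched tensor op a real implementation would use.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu(vec):
    """Gated Linear Unit: split channels in half and gate one half
    by the sigmoid of the other (halves the channel dimension)."""
    half = len(vec) // 2
    a, b = vec[:half], vec[half:]
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]

out = glu([1.0, 2.0, 0.0, 0.0])  # gates are sigmoid(0) = 0.5
# → [0.5, 1.0]
```

The pointwise convolution before the GLU doubles the channels precisely so that this gating restores the original width.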

  • Feed Forward Module: two linear transformations with a Swish activation in between, with dropout for regularization

[figure: Feed Forward module]
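A tiny sketch of the feed-forward module's core: two linear maps with the Swish activation (x times its sigmoid) between them. Dropout and layer normalization are omitted, and the weight-matrix arguments are illustrative.

```python
import math

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def feed_forward(vec, w1, w2):
    """Two linear layers with Swish in between (dropout omitted).

    w1 and w2 are weight matrices given as lists of rows; biases
    are left out to keep the sketch minimal.
    """
    hidden = [swish(sum(wi * xi for wi, xi in zip(row, vec))) for row in w1]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w2]
```

Swish is smooth and non-monotonic near zero, which the paper credits with faster convergence than ReLU in this setting.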

  • Conformer Block: two Feed Forward modules sandwiching the Multi-Headed Self-Attention module and the Convolution module, as shown in Figure 1

  • the Conformer block differs from a Transformer block in a number of ways, most notably the inclusion of a convolution module and a pair of FFNs surrounding the block in the Macaron style. A Macaron-style FFN pair is also more effective than a single FFN with the same number of parameters
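The Macaron sandwich above can be sketched as a composition of residual sub-blocks: each FFN contributes with a half-step (0.5) residual weight, MHSA and convolution with full residuals, and a final layer normalization closes the block. The module internals are stubbed out as callables here; only the wiring is from the paper.

```python
def conformer_block(x, ffn1, mhsa, conv, ffn2, layer_norm):
    """Conformer block wiring: half-step FFN, MHSA, convolution,
    half-step FFN, each with a residual connection, then LayerNorm.
    The five module arguments are callables; their internals are
    stubbed for illustration."""
    x = x + 0.5 * ffn1(x)   # first Macaron FFN, half-step residual
    x = x + mhsa(x)         # multi-headed self-attention, full residual
    x = x + conv(x)         # convolution module, full residual
    x = x + 0.5 * ffn2(x)   # second Macaron FFN, half-step residual
    return layer_norm(x)    # final layer normalization (stubbed)

# With identity stubs on a scalar, the residual arithmetic is visible:
identity = lambda v: v
out = conformer_block(1.0, identity, identity, identity, identity, identity)
# 1.0 → 1.5 → 3.0 → 6.0 → 9.0
```

In the paper's ablations, dropping either half of the FFN pair or the convolution module degrades accuracy, which is why all four sub-blocks appear in every layer.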

Results

They demonstrate the effectiveness of combining Transformers and convolution in a single neural network. The Conformer outperforms the Transformer Transducer (a streamable speech recognition model with Transformer encoders and RNN-T loss).

[figure: LibriSpeech results comparison]

  • the model achieves better accuracy with fewer parameters than previous work on the LibriSpeech dataset, setting a new state of the art
