Paper
Link: https://arxiv.org/abs/2005.08100
Year: 2020
Summary
- combines convolutions with self-attention in end-to-end ASR models
- self-attention learns the global interactions, while the convolutions efficiently capture the relative-offset-based local correlations
Methods
- Multi-Headed Self-Attention Module: employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL, the relative sinusoidal positional encoding scheme
- Convolution Module: a pointwise convolution with a gated linear unit (GLU), followed by a 1-D depthwise convolution, batch normalization, and a Swish activation
- Feed Forward Module: two linear transformations with a Swish activation in between, plus dropout, applied as a pre-norm residual unit
- Conformer Block: two Feed Forward modules sandwiching the Multi-Headed Self-Attention module and the Convolution module, as shown in Figure 1
- the Conformer block differs from a Transformer block in a number of ways, in particular the inclusion of a Convolution module and a pair of FFNs surrounding the block in Macaron style; a Macaron-style FFN pair is also more effective than a single FFN with the same number of parameters (a minimal sketch of the block follows this list)
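
A minimal PyTorch sketch of a single Conformer block, assuming the structure described above: half-step Macaron FFNs, MHSA, and the Convolution module, each as a pre-norm residual unit. This is not the authors' code; the relative sinusoidal positional encoding from Transformer-XL is omitted for brevity (plain `nn.MultiheadAttention` is used instead), and the expansion factor 4 and depthwise kernel size 31 follow the paper. Class names are hypothetical.

```python
# Sketch of a Conformer block (assumptions: no relative positional encoding,
# hyperparameters roughly follow the paper's small/medium configurations).
import torch
import torch.nn as nn


class FeedForwardModule(nn.Module):
    """Pre-norm FFN: LayerNorm -> Linear -> Swish -> Dropout -> Linear -> Dropout."""
    def __init__(self, d_model, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                      # Swish activation
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConvolutionModule(nn.Module):
    """LayerNorm -> pointwise conv -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv."""
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, time, d_model)
        y = self.layer_norm(x).transpose(1, 2)   # to (batch, d_model, time) for conv layers
        y = self.glu(self.pointwise1(y))
        y = self.swish(self.batch_norm(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return y


class ConformerBlock(nn.Module):
    """Half-step FFN -> MHSA -> Conv -> half-step FFN -> final LayerNorm, all with residuals."""
    def __init__(self, d_model=256, n_heads=4, dropout=0.1):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model, dropout=dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvolutionModule(d_model, dropout=dropout)
        self.ffn2 = FeedForwardModule(d_model, dropout=dropout)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)               # first Macaron half-step residual
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)               # second Macaron half-step residual
        return self.final_norm(x)


# Usage: a batch of 2 utterances, 100 frames, 256-dim encoder features.
block = ConformerBlock(d_model=256, n_heads=4)
out = block(torch.randn(2, 100, 256))
print(out.shape)                                 # torch.Size([2, 100, 256])
```

In the full model a convolutional subsampling front end feeds a stack of such blocks; this sketch only covers one block to make the Macaron FFN / MHSA / Convolution ordering concrete.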
Results
- demonstrates the effectiveness of combining Transformers and convolution in a single neural network; outperforms Transformer Transducer ("A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss")
- the model exhibits better accuracy with fewer parameters than previous work on the LibriSpeech dataset and achieves a new state of the art (2.1%/4.3% WER on test/test-other without a language model and 1.9%/3.9% with an external language model for the large model)