
Exploring Transformers for Large-Scale Speech Recognition #61


Description


Paper

Link: https://arxiv.org/abs/2005.09684
Year: 2020

Summary

  • performs ASR with a streaming approach based on the Transformer-XL network
  • compares BLSTM with Transformer and Transformer-XL

Findings

  • Depth-scaled Initialization and Warmup Training (see the initialization sketch after this list)
    • no convergence issues when training a shallow Transformer with random initialization on a single machine with 4-8 GPUs
    • poor convergence or even divergence when parallelizing across multiple machines with 32-64 GPUs -> perform warmup training on a single machine until X hours before switching to cross-machine training
    • when the depth was increased to 24 layers and beyond, the model did not converge even during the warmup stage -> depth-scaled model initialization
    • with depth-scaled initialization, there were no convergence issues during warmup training, and a Transformer-XL with 100 layers and over 200 million parameters could be trained in the warmup stage
  • Pre-Norm vs. Post-Norm (see the Pre-/Post-Norm sketch after this list)
    • layer normalization (LN) has been the de facto standard in Transformers for smooth model convergence during training
    • Post-Norm is observed to result in poor convergence on machine translation tasks for deeper Transformers
    • the Post-Norm approach worked well for 12-layer Transformers
    • with cross-machine parallelization on 32 or 64 GPUs, training diverged under Post-Norm -> switched to Pre-Norm
  • Convolution Layers (see the VGG front-end sketch after this list)
    • the Transformer performed poorly without any convolution layers, compared with interleaving 1D convolutions with self-attention or using a VGG encoder
    • when a VGG encoder is applied, it is more beneficial to remove the interleaved convolutions and instead increase the model dimension of the self-attention layers
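A minimal sketch of depth-scaled initialization, assuming the common scheme of shrinking each layer's initial weights by 1/sqrt(2L) for an L-layer stack; the paper's exact scaling rule is not reproduced in these notes, and `depth_scaled_init_` is a hypothetical helper, not the authors' code.

```python
import math

import torch
import torch.nn as nn

def depth_scaled_init_(layer: nn.Module, num_layers: int) -> None:
    """Xavier-initialize every Linear weight in `layer`, then shrink it by
    1/sqrt(2 * num_layers), so a deep stack starts with small residual updates.

    The 1/sqrt(2L) factor is an assumption borrowed from depth-scaled
    initialization literature, not necessarily the paper's exact constant.
    """
    scale = 1.0 / math.sqrt(2.0 * num_layers)
    for m in layer.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            with torch.no_grad():
                m.weight.mul_(scale)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# Usage: initialize each block of a deep (e.g. 100-layer) Transformer stack.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(100)
)
for block in layers:
    depth_scaled_init_(block, num_layers=len(layers))
```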
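A minimal sketch contrasting the two LayerNorm placements; `PreNormBlock` and `PostNormBlock` are illustrative names wrapping an arbitrary sublayer (attention or feed-forward), not the paper's implementation.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-Norm: normalize before the sublayer; the residual path stays an
    identity, which is what kept the deep, large-batch runs converging."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """Post-Norm (original Transformer): normalize after the residual sum;
    fine at ~12 layers, but diverged in the 32-64 GPU cross-machine runs."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

# Usage with a feed-forward sublayer:
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = PreNormBlock(512, ffn)
out = block(torch.randn(4, 100, 512))  # (batch, time, dim)
```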
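A minimal sketch of a VGG-style convolutional front-end feeding the self-attention stack; the channel counts, kernel sizes, 4x subsampling, and the `VGGFrontend` name are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VGGFrontend(nn.Module):
    """Two VGG-style conv blocks with 2x max-pooling each (4x total time and
    frequency subsampling), then a projection to the self-attention model
    dimension. All sizes here are illustrative."""
    def __init__(self, d_model: int = 512, n_mels: int = 80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_mels) log-Mel filterbank features
        x = self.conv(feats.unsqueeze(1))   # (batch, 64, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(x)                 # (batch, time/4, d_model)

# Usage: 1 second of 10 ms frames -> 25 subsampled frames for self-attention.
frontend = VGGFrontend()
tokens = frontend(torch.randn(2, 100, 80))  # -> torch.Size([2, 25, 512])
```

With a front-end like this in place, the interleaved 1D convolutions become redundant, which matches the finding above that their capacity is better spent on a larger self-attention model dimension.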
