Description
Paper
Link: https://arxiv.org/abs/2005.09684
Year: 2020
Summary
- performs ASR with a streaming approach based on the Transformer-XL network (sketched below)
- compares BLSTM with Transformer and Transformer-XL encoders
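As a rough illustration of the streaming setup, here is a minimal PyTorch sketch of chunk-wise self-attention with a Transformer-XL-style cached memory. The class name, chunk size, and memory length are illustrative assumptions, and the relative positional encoding used in Transformer-XL is omitted; this is not the paper's implementation.

```python
# Minimal sketch: streaming self-attention with a cached memory of past
# hidden states (Transformer-XL style). Dimensions are illustrative.
import torch
import torch.nn as nn

class StreamingSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, mem_len=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_len = mem_len
        self.memory = None  # cached states from previously seen chunks

    def forward(self, chunk):
        # chunk: (batch, chunk_len, d_model)
        if self.memory is None:
            context = chunk
        else:
            # Attend over cached past states plus the current chunk.
            context = torch.cat([self.memory, chunk], dim=1)
        out, _ = self.attn(chunk, context, context, need_weights=False)
        # Keep only the most recent states; no gradient flows through the
        # cache, as in Transformer-XL segment-level recurrence.
        self.memory = context[:, -self.mem_len:].detach()
        return out

# Usage: feed the layer chunk by chunk instead of the whole utterance.
layer = StreamingSelfAttention()
stream = torch.randn(1, 320, 512)  # fake feature sequence
outputs = [layer(stream[:, t:t + 32]) for t in range(0, 320, 32)]
```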
Findings
- Depth-scale Initialization and Warmup Training (see the sketch after this sub-list)
    - did not experience convergence issues when training a shallow Transformer with random initialization on a single machine with 4-8 GPUs
    - observed poor convergence or even divergence when parallelizing across multiple machines with 32-64 GPUs -> perform warmup training on a single machine for X hours before switching to cross-machine parallelization
    - when the depth of the Transformer was increased to 24 layers and beyond, the model did not converge even during the warmup stage -> depth-scale model initialization
    - with depth-scale initialization, we did not observe convergence issues during warmup training, and were able to train a Transformer-XL with 100 layers and over 200 million parameters in the warmup stage
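A hedged sketch of what depth-scale initialization can look like in PyTorch: each layer's weight matrices are initialized with a scale that shrinks with depth so very deep stacks stay stable at the start of warmup. The 1/sqrt(layer index) factor and the function name are assumptions for illustration, not necessarily the exact rule used in the paper.

```python
# Depth-scale initialization sketch: deeper layers get smaller initial weights.
import math
import torch.nn as nn

def depth_scale_init(encoder_layers):
    for idx, layer in enumerate(encoder_layers, start=1):
        scale = 1.0 / math.sqrt(idx)          # illustrative scaling rule
        for name, param in layer.named_parameters():
            if param.dim() > 1:
                nn.init.xavier_uniform_(param, gain=scale)  # weight matrices
            elif name.endswith("bias"):
                nn.init.zeros_(param)                       # biases
            # LayerNorm gains are left at their default value of 1.

layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
     for _ in range(24)]
)
depth_scale_init(layers)
```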
- Pre-Norm vs. Post-Norm (contrasted in the sketch after this sub-list)
    - layer normalization (LN) has become the de facto standard in Transformers for smooth model convergence during training
    - Post-Norm is observed to result in poor convergence on machine translation tasks for deeper Transformers
    - the Post-Norm approach worked well for 12-layer Transformers
    - when running cross-machine parallelization with 32 or 64 GPUs, divergence was observed during training with Post-Norm -> switched to Pre-Norm
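The difference between the two orderings is easiest to see in code. The block below is a minimal PyTorch sketch (dropout and attention details omitted); `sublayer` stands for either the self-attention or the feed-forward module, and the class names are illustrative.

```python
# Post-Norm vs. Pre-Norm residual blocks.
import torch.nn as nn

class PostNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm applied after the residual sum (original Transformer).
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm applied to the sublayer input; the residual path stays
        # unnormalized, which tends to stabilize very deep stacks.
        return x + self.sublayer(self.norm(x))

# Usage: wrap any sublayer, e.g. a feed-forward module.
block = PreNormBlock(512, nn.Linear(512, 512))
```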
- Convolution Layers (a VGG front-end sketch follows this sub-list)
    - the Transformer performed poorly without any convolution layers, compared with interleaving 1D convolutions with self-attention or using a VGG encoder
    - when applying a VGG encoder, it is more beneficial to remove the interleaved convolutions and instead increase the model dimension of the self-attention layers
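A minimal PyTorch sketch of a VGG-style 2D-convolution front end feeding the self-attention layers; the channel counts, the 4x time subsampling, and the `VGGFrontend` name are assumptions for illustration, not the paper's exact configuration.

```python
# VGG-style convolutional front end: 2D convolutions over (time, frequency)
# with pooling for 4x time subsampling, then projection to the model dimension.
import torch
import torch.nn as nn

class VGGFrontend(nn.Module):
    def __init__(self, n_mels=80, d_model=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 2x subsampling in time/freq
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 4x total subsampling
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-Mel features
        x = self.conv(feats.unsqueeze(1))         # (batch, 64, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(x)                       # (batch, time/4, d_model)

frontend = VGGFrontend()
out = frontend(torch.randn(4, 200, 80))           # -> (4, 50, 512)
```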