Open
Description
Paper
Link: https://arxiv.org/abs/1910.05895
Year: 2019
Summary
- ScaleNorm: normalization with a single learned scale parameter (instead of LayerNorm's per-feature gain and bias), for faster training and better performance
Results
- ScaleNorm is faster than LayerNorm
- enables warmup-free training (no learning-rate warmup schedule needed)
- the authors propose 3 changes to the Transformer: PreNorm + FixNorm + ScaleNorm
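As a sketch of the core idea: ScaleNorm normalizes each vector to unit L2 norm along the feature axis and multiplies by a single learned scalar g (the paper initializes g to √d, where d is the model dimension). A minimal NumPy version, assuming the usual epsilon guard against zero vectors:

```python
import numpy as np

def scale_norm(x, g, eps=1e-5):
    # ScaleNorm: project each vector onto the sphere of radius g.
    # x: array of shape (..., d); g: a single learned scalar.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return g * x / np.maximum(norm, eps)

d = 4
g = np.sqrt(d)  # initialization suggested in the paper
x = np.array([[3.0, 4.0, 0.0, 0.0]])
out = scale_norm(x, g)  # each row now has L2 norm g
```

Unlike LayerNorm, there is no mean subtraction and only one parameter total, which is where the speedup comes from.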
Comments
Presentation: https://tnq177.github.io/data/transformers_without_tears.pdf