
Transformers without Tears: Improving the Normalization of Self-Attention #62

Description

@jinglescode

Paper

Link: https://arxiv.org/abs/1910.05895
Year: 2019

Summary

  • ScaleNorm: replaces LayerNorm's per-feature gain and bias with a single learned scale g, rescaling each activation vector to length g; this gives faster training and better performance (sketch below)

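A minimal PyTorch sketch of ScaleNorm, following the paper's definition ScaleNorm(x) = g · x / ‖x‖₂ with a single learned scalar g (the √d initialization follows the paper's recommendation); the module and argument names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn


class ScaleNorm(nn.Module):
    """ScaleNorm(x) = g * x / ||x||_2 along the last dimension.

    Unlike LayerNorm, which learns d_model gains and biases, ScaleNorm
    learns a single scalar g (initialized to sqrt(d_model) per the paper).
    """

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(d_model ** 0.5))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rescale each vector to length g, guarding against a zero norm
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```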
Results

  • ScaleNorm is faster than LayerNorm
  • warmup-free training
  • the authors propose three changes to the Transformer: PreNorm + FixNorm + ScaleNorm (see the sketch after this list)

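A hedged sketch of how the three pieces could fit together: a pre-norm residual wrapper (x + sublayer(Norm(x))) that uses the ScaleNorm module from the sketch above, and FixNorm as unit-length word embeddings. The wrapper and function names are my own for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# reuses the ScaleNorm module defined in the earlier sketch


class PreNormResidual(nn.Module):
    """Pre-norm residual block: x + Dropout(sublayer(Norm(x)))."""

    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.norm = ScaleNorm(d_model)   # PreNorm + ScaleNorm
        self.sublayer = sublayer         # e.g. self-attention or feed-forward
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.dropout(self.sublayer(self.norm(x)))


def fixnorm_embedding(embedding: nn.Embedding, tokens: torch.Tensor) -> torch.Tensor:
    """FixNorm sketch: look up word embeddings and rescale them to unit length."""
    return F.normalize(embedding(tokens), dim=-1)
```
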
Comments

presentation: https://tnq177.github.io/data/transformers_without_tears.pdf
