PyTorch implementation of RealFormer: Transformer Likes Residual Attention.
Clone this repository.
git clone https://github.com/jaketae/realformer.git
Navigate to the cloned directory. You can start using the model via
>>> from realformer import RealFormerEncoder
>>> model = RealFormerEncoder()
By default, the model comes with the following parameters:
RealFormerEncoder(
d_model=512,
num_heads=8,
expansion_factor=2,
dropout=0.5,
max_len=512,
num_layers=6,
)
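To sanity-check the defaults, you can pass a random tensor through the encoder. The sketch below assumes the forward pass accepts a `(batch_size, seq_len, d_model)` float tensor; treat the shapes as illustrative rather than a documented interface.

```python
import torch
from realformer import RealFormerEncoder

model = RealFormerEncoder()    # defaults listed above
x = torch.randn(8, 128, 512)   # (batch_size, seq_len, d_model); shape is an assumption
out = model(x)                 # encoders typically preserve the input shape
```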
Residual Attention Layer Transformer, abbreviated as RealFormer, is a transformer variant that incorporates residual skip connections so that attention scores from previous layers propagate through the entire network. It outperforms canonical transformers on a variety of tasks and datasets, including masked language modeling (MLM), GLUE, and SQuAD.
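A minimal sketch of the core idea: each attention layer adds the previous layer's raw (pre-softmax) attention scores to its own before applying softmax, so attention information flows through the network as a residual stream. The snippet below is illustrative only and is not the code in this repository; the `ResidualAttention` name and the single-head simplification are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ResidualAttention(nn.Module):
    # Single-head scaled dot-product attention with residual attention scores
    # (illustrative sketch; the repository's implementation is multi-headed).
    def __init__(self, d_model):
        super().__init__()
        self.scale = d_model ** -0.5
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, prev_scores=None):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Raw (pre-softmax) attention scores for this layer.
        scores = q @ k.transpose(-2, -1) * self.scale
        if prev_scores is not None:
            # Residual attention: add the previous layer's raw scores.
            scores = scores + prev_scores
        out = F.softmax(scores, dim=-1) @ v
        # Return the raw scores so the next layer can reuse them.
        return out, scores
```

In the full model, each encoder layer passes its raw per-head score matrices to the next layer; this is the only change relative to a vanilla transformer encoder.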
- Just like `torch.nn.TransformerEncoder`, the `RealFormerEncoder` does not include any embedding layers. It is recommended that you implement positional encoding schemes (e.g. sinusoidal tables, learnable embeddings) as needed; one possible pattern is sketched after this list.
- The authors note that RealFormer layers can be used as drop-in replacements in any transformer model, whether autoencoding (encoders) or autoregressive (decoders). We closely follow the flow of the paper and include only an encoder version of the implementation for now.
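Since the encoder expects dense inputs, a typical pattern is to combine a token embedding with a sinusoidal positional encoding before calling the encoder. The sketch below is one way to do this; the vocabulary size, batch and sequence sizes, and the assumption that `RealFormerEncoder` takes a `(batch_size, seq_len, d_model)` tensor are illustrative, not part of this repository's API.

```python
import math
import torch
from torch import nn

from realformer import RealFormerEncoder

class SinusoidalPositionalEncoding(nn.Module):
    # Fixed sinusoidal table from "Attention Is All You Need" (illustrative sketch).
    def __init__(self, d_model=512, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add the first seq_len positions.
        return x + self.pe[: x.size(1)]

# Hypothetical glue code: vocab_size and the encoder's input shape are assumptions.
vocab_size = 30522
embed = nn.Embedding(vocab_size, 512)
pos_enc = SinusoidalPositionalEncoding(d_model=512, max_len=512)
encoder = RealFormerEncoder()

token_ids = torch.randint(0, vocab_size, (8, 128))  # (batch_size, seq_len)
hidden = encoder(pos_enc(embed(token_ids)))         # assumed (batch, seq, d_model) interface
```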