In this repo, I built two types of language models: recurrent and transformer. I have a pipeline for training and running hparam sweeps using W&B. Below the implementation notes, I give a brief intro to language models and the transformer architecture, then outline the key components of the implementation.
Note: I am actively working on and improving this repo.
```bash
git clone https://github.com/khajash/language-models.git
cd language-models
python -m venv .env
source .env/bin/activate
pip install -r requirements.txt
```
I am using WikiText2 from torchtext. Below is an excerpt from the dataset. As you can see, it is already preprocessed, with rare words replaced by the `<unk>` token. The vocabulary contains a total of 28,782 tokens.
```
= Valkyria Chronicles III =
Senjō no Valkyria 3 : Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Raven " .
```
- Create a json config file to specify network parameters and learning rate schedule in `lmlib/configs`
- Recurrent Network

  ```
  python train-recurrent.py --config ./configs/simple-gru-invsqrt.json
  ```

  - Select different layer types, e.g. RNN, LSTM, GRU, under `cell_type` in the config file
- Transformer

  ```
  python train.py --config ./configs/simple-transformer-cosine.json
  ```

- Command line configs include:
  - `seed`: Random seed. (int, default = 0)
  - `n_epochs`: Number of epochs to run the training. (int, default = 50)
  - `batch_size`: Batch size for mini-batch training. (int, default = 20)
  - `eval_batch_size`: Batch size for evaluation. (int, default = 20)
  - `seq_len`: Max length of a sequence. (int, default = 35)
  - `save_model`: Save best model while training and last model when done.
  - `dryrun`: Run in dryrun mode without wandb.
- Go to the WandB project in the web console. If you don't have a project already, create a new one.
- Select Sweeps > Create Sweep
- Copy the yaml sweep config from `sweep-bayes.yml` or write a custom one. The autogenerated yaml is not helpful in my setup. Press `Initialize Sweep`.
- Make sure the default params in `parsers.py` are correct, especially the base config file.
- Set `RUN_WANDB_SWEEP = True` at the beginning of the `train.py` file. This allows wandb to override hparams in the sweep; otherwise it will keep the default configuration.
- In the WandB sweep console, copy the launch agent command `wandb agent user/project/agentID` and run it in an open terminal. This will start your sweep. Do this on as many separate machines as you want for distributed tuning.
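For reference, the override pattern that `RUN_WANDB_SWEEP` enables looks roughly like the sketch below; the config keys and project name are placeholders, not the actual `train.py` code.

```python
import wandb

RUN_WANDB_SWEEP = True  # allow the sweep agent to override the defaults below

defaults = {"lr": 1e-3, "batch_size": 20, "seq_len": 35}  # placeholder hparams

# When launched by `wandb agent`, the sweep's hyperparameters override the values
# passed to wandb.init; otherwise the defaults are used unchanged.
run = wandb.init(project="language-models", config=defaults)
config = wandb.config if RUN_WANDB_SWEEP else defaults

print(config["lr"], config["batch_size"])
```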
- Sinusoidal Positional Encoding - same as Vaswani et al. (2017)
  - In the PyTorch implementation, we take advantage of the log-exp trick to make the math a bit easier.
- Learning Rate Schedulers - configured in the json config file (see the sketch after this list)
  - StepLR
  - Inverse Square Root with Warm-up
  - Cosine with Warm-up
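As a rough illustration (not the exact `lmlib` code; the warm-up length and base learning rate are placeholders), the inverse square root schedule with warm-up can be expressed with `torch.optim.lr_scheduler.LambdaLR`:

```python
import torch

def inv_sqrt_warmup(warmup_steps: int):
    """Multiplier for LambdaLR: linear warm-up, then decay proportional to 1/sqrt(step)."""
    def fn(step: int) -> float:
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps       # linear warm-up phase
        return (warmup_steps / step) ** 0.5  # inverse square root decay
    return fn

model = torch.nn.Linear(8, 8)                # stand-in for the language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt_warmup(4000))

for step in range(100):
    optimizer.step()                         # ... training step ...
    scheduler.step()                         # advance the schedule once per step
```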
- Generating new text is currently only set up for the transformer architecture. It supports two decoding methods:
- Greedy Search - greedily selects the word at each step with the highest probability - this method tends to generate sequences that may repeat words or subsequences multiple times
- Sampling - randomly selects the word at each timestep based on its conditional probability distribution
```
python generate_text.py --model path/to/model.pt --config ./configs/config-file.json --seq_len 20 --decoding sampling
```
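For intuition, here is a rough sketch of the two decoding loops. The `model` call signature (sequence-first input, logits output) is an assumption, not the exact `generate_text.py` interface.

```python
import torch

@torch.no_grad()
def decode(model, input_ids: torch.Tensor, steps: int, method: str = "sampling") -> torch.Tensor:
    """Autoregressively extend `input_ids` (shape [seq_len]) by `steps` tokens."""
    for _ in range(steps):
        logits = model(input_ids.unsqueeze(1))        # assumed output shape: [seq_len, 1, ntoken]
        probs = torch.softmax(logits[-1, 0], dim=-1)  # distribution over the next token
        if method == "greedy":
            next_id = probs.argmax()                  # pick the single most likely token
        else:
            next_id = torch.multinomial(probs, 1)[0]  # sample from the distribution
        input_ids = torch.cat([input_ids, next_id.view(1)])
    return input_ids
```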
A language model estimates the probability distribution over a sequence of tokens, e.g. words. Given a previous set of tokens $x_1, \dots, x_{t-1}$, it estimates the probability of the next token, $P(x_t \mid x_1, \dots, x_{t-1})$.
In practice, a language model takes in a sequence of tokens and feeds them through an embedding layer, a decoder model, and a softmax function to output probabilities over the vocabulary (vocab size = `ntoken`). The decoder model is typically either a recurrent model (RNN, LSTM, GRU, etc.) or a transformer. Recurrent models process each word in sequence, while a transformer can process the sequence in parallel using a mask.
Fig.1 High-Level Diagram of Language Model
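To make that data flow concrete, here is a minimal sketch of such a model. The layer sizes and the `TinyRecurrentLM` name are illustrative, not the `lmlib` code.

```python
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    """Embedding -> recurrent decoder -> linear head over the vocabulary."""
    def __init__(self, ntoken: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(ntoken, d_model)
        self.decoder = nn.GRU(d_model, d_model)  # could be RNN/LSTM/GRU, or a transformer
        self.head = nn.Linear(d_model, ntoken)   # logits over the vocabulary

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [seq_len, batch] of token ids
        x = self.embed(tokens)                   # [seq_len, batch, d_model]
        x, _ = self.decoder(x)                   # process the sequence step by step
        return self.head(x)                      # [seq_len, batch, ntoken]

# Probabilities over the vocabulary come from a softmax over the logits:
logits = TinyRecurrentLM(ntoken=28782)(torch.randint(0, 28782, (35, 20)))
probs = torch.softmax(logits, dim=-1)
```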
The Transformer architecture used here is similar to that employed in Liu et al. (2018) and the original GPT (Radford et al., 2018). For a language model, we do not need the full encoder-decoder architecture used for neural machine translation (NMT) in Vaswani et al. (2017). Instead, we can just use a decoder network. This decoder block is similar to the encoder block in Vaswani et al. (2017), as it consists of only two sublayers: Self-Attention and Feed-Forward. The key difference between the encoder block in Vaswani et al. (2017) and the decoder here is that we use Masked Self-Attention rather than unmasked.
Fig.2 Diagram of Transformer Decoder Language Model
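One common way to realize this in PyTorch (as in the PyTorch transformer tutorial listed in the references) is to reuse `nn.TransformerEncoderLayer`, which has exactly these two sublayers, and pass a causal mask. This is a sketch of that idea, not necessarily how `lmlib` structures it.

```python
import torch
import torch.nn as nn

d_model, nhead, nlayers = 256, 4, 2
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=1024)
decoder = nn.TransformerEncoder(layer, num_layers=nlayers)

seq_len, batch = 35, 20
x = torch.randn(seq_len, batch, d_model)  # embedded + position-encoded input

# Causal mask: position i may only attend to positions <= i (masked self-attention).
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
out = decoder(x, mask=causal_mask)        # [seq_len, batch, d_model]
```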
Below, I'll dive into the three important components of the transformer architecture: positional encoding, scaled dot-product attention, and multi-head attention. Here are some of the key parameters we'll be using in this doc.
- $d_\text{model}$ : dimension of the embeddings and of the layers within the model
- $d_\text{vocab}$ : size of the vocabulary - listed as `ntoken` in the diagrams
We use positional encodings to inject information about relative or absolute position into the model. The encoding has the same dimension as the embeddings, $d_\text{model}$, so the two can be summed.
Sinusoidal Positional Encoding
For this method, we precompute a PE matrix and store it in a buffer.
- Each row represents the encoding for a specific word at position $i$
- Each column represents a sinusoidal function at a different wavelength - alternating between sine and cosine - the banding in the upper dimensions appears because the wavelength there is much larger.
Fig.3 Positional Encoding Matrix
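A condensed sketch of how the PE matrix can be precomputed using the log-exp trick (the `max_len` of 5000 is just a placeholder; exact tensor shapes in `lmlib` may differ):

```python
import math
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Precompute the [max_len, d_model] positional encoding matrix."""
    position = torch.arange(max_len).unsqueeze(1)  # [max_len, 1]
    # log-exp trick: 1 / 10000^(2i/d_model) == exp(-(2i/d_model) * ln(10000))
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even columns: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd columns: cosine
    return pe

pe = sinusoidal_pe(max_len=5000, d_model=256)      # added to the (scaled) token embeddings
```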
Before looking at Multi-Head Attention, it's important to understand Scaled Dot-Product Attention.
Queries, Keys and Values
So, what are Queries, Keys and Values? They are three views of the word representations: each Query is compared against every Key to measure how well two positions align, and the resulting weights are used to combine the corresponding Values.
Operations
- MatMul - $QK^T$ - Calculate the alignment score to see how much two word embeddings match - computed between each query $Q$ and key $K$
- Scale - $\frac{1}{\sqrt{d_k}}$ - Divide by $\sqrt{d_k}$ for more stable gradients; acts as a form of regularization and improves performance for larger models - $d_k$ is the dimension of the keys
- Mask - (optional) Mask out future positions
- Softmax - Apply the softmax function to obtain the weights for the values $V$
- MatMul - Apply the weights to the values $V$
Fig.4 Scaled Dot Product Attention. Diagram from Vaswani et al. (2017)
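The whole operation in a few lines of PyTorch (a generic sketch, not copied from `lmlib`):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: [..., seq_len, d_k]; mask: additive mask with -inf at disallowed positions."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # MatMul + Scale
    if mask is not None:
        scores = scores + mask                                # Mask (optional)
    weights = torch.softmax(scores, dim=-1)                   # Softmax
    return weights @ v                                        # MatMul with the values
```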
Rather than performing a single attention function with the scaled dot-product attention function, we linearly project $Q$, $K$, and $V$ $h$ times with different learned projections and run the attention function on each projection in parallel.
Operations
- Linear - Linearly project $Q$, $K$, and $V$, each with its own set of weights. Do not use an activation function here.
  - This is where we project into different subspaces and learn alignments for different representations
- Scaled Dot-Product Attention - For each projected version, perform the scaled dot-product attention function in parallel
- Concat - Concatenate all of the scaled dot-product attention heads $(\text{head}_1, \dots, \text{head}_h)$
- Linear - Project the concatenated heads back to the original space to produce the final values
Why Multi-head attention?
- Word representations encode many different characteristics of the word. A single Scaled Dot-Product Attention layer would only be able to query these characteristics in one shot, e.g. maybe it determines that a word is a verb but not that it is past tense.
- Multi-Head Attention applies multiple linear transformations to $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, allowing the model to apply many different projections of the word representations into different subspaces, each focusing on a subset of the word's characteristics.
- Vaswani et al. (2017) used $h=8$ parallel attention layers with $d_k = d_v = d_{\text{model}} / h = 64$.
  - Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
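PyTorch packages this whole pipeline as `nn.MultiheadAttention`; a quick usage sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

d_model, nhead, seq_len, batch = 256, 8, 35, 20  # each head works in d_model // nhead = 32 dims
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=nhead)

x = torch.randn(seq_len, batch, d_model)         # self-attention: Q = K = V = x
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

out, attn_weights = mha(x, x, x, attn_mask=causal_mask)
print(out.shape)                                 # torch.Size([35, 20, 256])
```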
Key points from Vaswani et al. (2017):
- Architecture
  - Encoder-Decoder Transformer model for Machine Translation
  - At the beginning of both the encoder and decoder is an embedding layer for the relevant vocabulary and a sinusoidal positional encoding layer.
    - Embeddings are multiplied by $\sqrt{d_{\text{model}}}$
  - 512 dimensional states with 8 attention heads
- Encoder
  - Stack of $N = 6$ identical layers consisting of two sublayers: self-attention and feed-forward network.
  - Around each sublayer is a residual connection.
  - Following the residual connection is layer normalization.
- Decoder
  - Stack of $N = 6$ identical layers consisting of three sublayers: masked self-attention, encoder-decoder attention, and feed-forward network.
  - Around each sublayer is a residual connection.
  - Following the residual connection is layer normalization.
- Position-wise Feed-Forward Networks (see the sketch after this list)
  - Two linear layers with a ReLU activation in between: $$\text{FFN}(x) = \text{max}(0, xW_1 + b_1)W_2 + b_2$$
  - Input and output dims: $d_{\text{model}}=512$
  - Inner-layer dims: $d_{ff}=2048$
- Optimization
  - Adam optimizer with $\beta_1 = 0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$
  - Used linear warmup with inverse square root decay afterwards
  - Used byte-pair encoding (BPE) vocabulary with a target vocabulary of ~37,000 tokens
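For reference, the position-wise feed-forward sublayer summarized above is just two linear layers with a ReLU in between; a sketch with the paper's dimensions:

```python
import torch.nn as nn

# FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position
ffn = nn.Sequential(
    nn.Linear(512, 2048),  # d_model -> d_ff
    nn.ReLU(),
    nn.Linear(2048, 512),  # d_ff -> d_model
)
```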
Key points from the original GPT (Radford et al., 2018):
- Architecture
  - 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads)
  - Position-wise feed-forward networks - used 3072 dimensional inner states
- Optimization
  - Adam optimizer with max lr of 2.5e-4
  - lr scheduler: increased linearly from zero over the first 2000 updates, annealing to 0 using a cosine schedule
  - 100 epochs using minibatches of 64 randomly sampled, contiguous sequences of 512 tokens
  - Weight initialization of $N(0, 0.02)$ is sufficient because layer norm is used throughout
  - Used byte-pair encoding (BPE) vocabulary with 40,000 merges
  - Residual, embedding and attention dropouts with rate of 0.1 for regularization
  - Modified version of L2 regularization with $w=0.01$ on all non-bias or gain weights
  - GELU activation function
  - Used learned position embeddings instead of sinusoidal
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn and TensorFlow. O’Reilly.
- Karpathy, A. (2023). MinGPT [Python]. https://github.com/karpathy/minGPT (Original work published 2020)
- Language Modeling with nn.Transformer and TorchText. (2022). PyTorch. https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., & Shazeer, N. (2018). Generating Wikipedia by Summarizing Long Sequences. https://doi.org/10.48550/arXiv.1801.10198
- Platen, P. (2020). How to generate text: Using different decoding methods for language generation with Transformers. Hugging Face. https://huggingface.co/blog/how-to-generate
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. http://arxiv.org/abs/1706.03762