Welcome to the GPT-1 from Scratch with PyTorch repository! This project is an educational endeavor to recreate the original GPT-1 model from scratch using PyTorch, albeit with fewer parameters to accommodate training on a modest GPU setup (first pre-trained on a T4 GPU and then on an A10G). The primary goal was to deepen my understanding of key concepts such as multi-head attention, tokenization, checkpointing, learning rate scheduling, and the intricacies involved in model pre-training from the ground up.
Complete Google Colab script to run this repo:
- Custom Implementation: Recreated GPT-1 architecture from scratch using PyTorch.
- Multi-Head Attention: Implemented based on the GPT-1 paper.
- Tokenization: Trained a custom tokenizer with the Hugging Face Tokenizers library, tailored to the dataset.
- Training Utilities: Includes checkpointing, learning rate scheduling, and more.
- Educational Focus: Designed to serve as a learning resource for those interested in transformer-based models.
The model architecture closely follows the original GPT-1 design, consisting of:
- Embedding Layer: Converts input tokens into dense vectors.
- Positional Encoding: Adds information about the position of each token in the sequence.
- Twelve Transformer Blocks, each containing:
- Multiple Heads of Self-Attention
- Feed-Forward Neural Networks
- Layer Normalization and Residual Connections
- Output Layer: Generates logits (unnormalized scores over the vocabulary) for next-token prediction.
Despite having fewer parameters, the simplified architecture maintains the core functionalities of GPT-1, making it suitable for educational purposes and experimentation on limited hardware.
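To make the layout concrete, here is a minimal sketch of how these pieces can be wired together in PyTorch. The class and attribute names (`GPT`, `token_emb`, and so on) are illustrative rather than the exact code in `src/model.py`, and `TransformerBlock` refers to the block sketched further down in the Transformer Blocks section.

```python
import torch
import torch.nn as nn

class GPT(nn.Module):
    """Decoder-only GPT-1 style model: token + learned positional embeddings,
    a stack of transformer blocks, and a linear head over the vocabulary."""
    def __init__(self, vocab_size, embed_dim, context_length, n_layers, num_heads, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(context_length, embed_dim)   # learned positions
        self.blocks = nn.ModuleList(
            [TransformerBlock(embed_dim, num_heads, dropout) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(embed_dim)                      # final layer norm (a common choice)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx):                                      # idx: (batch, seq_len) token IDs
        seq_len = idx.shape[1]
        pos = torch.arange(seq_len, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)              # (batch, seq_len, embed_dim)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                        # logits: (batch, seq_len, vocab_size)
```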
Tokenization is the process of converting raw text into a sequence of tokens that the model can understand. This implementation uses a custom tokenizer trained with the Hugging Face Tokenizers library and tailored to the dataset; it builds a vocabulary from the corpus and assigns a unique integer ID to each token. The vocabulary consists of 15k unique tokens.
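As an illustration, a BPE tokenizer can be trained with the Hugging Face `tokenizers` library roughly as follows. The corpus path, special tokens, and trainer settings here are assumptions for the sketch and may differ from what the repository actually uses.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer and train it on the raw corpus (corpus path is hypothetical).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=15_000, special_tokens=["[PAD]", "[EOS]", "[UNK]"])
tokenizer.train(files=["data/raw/corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

# Encoding turns raw text into the integer token IDs the model consumes.
ids = tokenizer.encode("Once upon a time").ids
```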
Multi-Head Attention (MHA) is a critical component of the transformer architecture, allowing the model to focus on different parts of the input sequence simultaneously. This implementation follows the approach outlined in the GPT-1 paper and involves the steps below (a code sketch follows the list):
- Scaled Dot-Product Attention: Computes attention scores between queries and keys, scaled by the square root of their dimension.
- 12 Attention Heads per block: Splits the queries, keys, and values into multiple heads to capture diverse contextual relationships between tokens. For each token in the input, the model generates three vectors:
  - Query (Q): Represents the current token's perspective. I like to think of it as what you're searching for.
  - Key (K): Represents how each token can be referenced: labels that help identify where to find what you need.
  - Value (V): Represents the actual information each token holds: what you retrieve once you find the right labels.
- Calculating Attention Scores: The model computes how much focus to place on other tokens by taking the dot product of the Query with all Keys; this determines the relevance of each token to the current one.
- Weighted Sum of Values: The attention scores are normalized (usually with softmax) to create weights, which are then used to compute a weighted sum of the Values. That weighted sum becomes the output for the token.
- Multiple Heads in Parallel: This process runs several times in parallel (hence "multi-head") to capture different types of relationships and patterns in the data. By attending to different aspects of the input simultaneously, the model builds the contextual understanding it needs when generating each token.
- Concatenation and Linear Transformation: Merges the outputs from all heads and projects them back to the original dimension.
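The sketch below spells out these steps explicitly: a single module that projects the input into queries, keys, and values, splits them into heads, applies scaled dot-product attention with a causal mask, and concatenates the heads back together. The class name and layer layout are illustrative, not necessarily the repository's exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Causal multi-head self-attention written out explicitly for clarity."""
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # project to Q, K, V in one shot
        self.proj = nn.Linear(embed_dim, embed_dim)      # merge heads back to embed_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                # x: (B, T, embed_dim)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention scores, masked so tokens cannot attend to the future.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)                 # (B, H, T, T)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        weights = self.dropout(F.softmax(scores, dim=-1))                           # attention weights
        out = weights @ v                                                           # weighted sum of values
        out = out.transpose(1, 2).contiguous().view(B, T, C)                        # concatenate heads
        return self.proj(out)                                                       # final linear projection
```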
Since transformers lack inherent positional awareness, positional encodings are added to the token embeddings to provide information about the token positions within the sequence. This implementation uses learned positional embeddings.
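In code, learned positional embeddings are simply a second embedding table indexed by position and added to the token embeddings. The sizes below mirror the hyperparameters listed later in this README and are only illustrative.

```python
import torch
import torch.nn as nn

context_length, embed_dim, vocab_size = 512, 768, 20_000
token_emb = nn.Embedding(vocab_size, embed_dim)      # token ID -> vector
pos_emb = nn.Embedding(context_length, embed_dim)    # position index -> learned vector

idx = torch.randint(0, vocab_size, (1, 16))          # a dummy batch of 16 token IDs
positions = torch.arange(idx.shape[1])
x = token_emb(idx) + pos_emb(positions)              # what the token is + where it sits
```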
The entire architecture has 12 transformer blocks in total. Each transformer block consists of the following (a block sketch in code appears after this list):
- Multi-Head Self-Attention (12 heads per block): Allows the model to attend to different parts of the input sequence simultaneously.
- Feed-Forward Neural Network: Processes the attention outputs through two linear layers with a GELU activation.
- Layer Normalization: Applied before each sub-layer to stabilize and accelerate training.
- Residual Connections: Adds the input of each sub-layer to its output to facilitate gradient flow.
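Putting those pieces together, a pre-norm block can look like the sketch below. It reuses the `MultiHeadSelfAttention` module sketched in the Multi-Head Attention section above, and the 4x expansion in the feed-forward layer follows the GPT-1 paper's 768-to-3072 ratio; treat the exact shapes as assumptions rather than the repository's code.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LayerNorm block: LN -> attention -> residual, then LN -> feed-forward -> residual."""
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads, dropout)  # from the sketch above
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(                       # position-wise feed-forward network
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual connection around attention
        x = x + self.ffn(self.ln2(x))    # residual connection around the feed-forward network
        return x
```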
context_length = 512
pad_index = 0
EOS_token = 1
embed_dim = 768        # also called model dimension d_model (768 in the GPT-1 paper)
num_of_heads = 12      # attention heads
dropout = 0.1          # regularization
batch_size = 6
device = 'cuda' if torch.cuda.is_available() else 'cpu'
n_layers = 12          # transformer blocks
eval_iters = 200
max_iter = 7000
evaluation_intervals = 200
learning_rate = 2.5e-4
vocab_size = 20_000
The model is trained to predict the next token in a sequence using the following steps (a minimal training-loop sketch appears after the list):
- Data Preparation: Tokenize the input text and create batches. I used a batch size of 6.
- Forward Pass: Compute the logits for each token in the sequence.
- Loss Calculation: Use Cross-Entropy Loss to measure the discrepancy between predicted and actual tokens.
- Backward Pass: Compute gradients and update model parameters using an optimizer with learning rate scheduling.
- Checkpointing: Save model states at intervals to prevent loss of progress and facilitate resuming training.
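A stripped-down version of such a loop is sketched below, using the hyperparameters from the configuration above. The `model` and `get_batch` objects are hypothetical stand-ins, and the choice of AdamW with cosine annealing is an assumption; the actual logic lives in `train.py`.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=7000)

for step in range(7000):                       # max_iter
    x, y = get_batch("train")                  # x: (B, T) inputs, y: (B, T) next-token targets
    logits = model(x)                          # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), ignore_index=0)  # pad_index = 0
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()

    if step % 200 == 0:                        # periodic checkpointing (evaluation_intervals)
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"checkpoints/model_step_{step}.pth")
```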
Ensure you have the following installed:
- Python: 3.7 or higher
- PyTorch: 1.7.1 or higher
- CUDA: For GPU acceleration (optional but recommended)
- Other Dependencies: Listed in `requirements.txt`
- Clone the Repository: `git clone https://github.com/MyDarapy/gpt-1-from-scratch.git` and `cd gpt-1-from-scratch`
- Create a Virtual Environment: `python3 -m venv venv` then `source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
- Install Dependencies: `pip install -r requirements.txt`
To train the GPT-1 model from scratch:
- Prepare Your Dataset: Ensure your dataset is in a suitable text format and place it in the `data/` directory.
- Run the Training Script: `python train.py --config config/train_config.yaml`
  Optional arguments can be specified in the configuration file or via command-line parameters.
After training, you can generate text using the trained model:
`python generate.py --model_path checkpoints/model_epoch_X.pth --prompt "Once upon a time"`

Replace `model_epoch_X.pth` with the path to your trained model checkpoint and provide a suitable prompt.
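Under the hood, generation is an autoregressive sampling loop along these lines; the function name, signature, and temperature-based sampling here are assumptions for illustration rather than the exact contents of `generate.py`.

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=1.0, context_length=512):
    """Feed the prompt, sample one token at a time, append it, and repeat."""
    model.eval()
    device = next(model.parameters()).device
    idx = torch.tensor([tokenizer.encode(prompt).ids], device=device)
    for _ in range(max_new_tokens):
        logits = model(idx[:, -context_length:])             # crop to the model's context window
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)     # sample the next token ID
        idx = torch.cat([idx, next_id], dim=1)
    return tokenizer.decode(idx[0].tolist())
```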
The data loader is responsible for:
- Loading the Dataset: Reads the text data from the specified directory.
- Tokenization: Converts raw text into sequences of token IDs.
- Batching: Organizes data into batches for efficient training.
- Shuffling: Randomizes data order to improve training robustness.
Implementation details can be found in `data_loader.py`. It leverages PyTorch's `Dataset` and `DataLoader` classes to handle large datasets efficiently; a minimal sketch of this pattern is shown below.
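A minimal version of this pattern, assuming the corpus has already been tokenized into one flat list of IDs (`all_token_ids` is a hypothetical variable), could look like this:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Serves (input, target) pairs where the target is the input shifted by one token."""
    def __init__(self, token_ids, context_length=512):
        self.ids = torch.tensor(token_ids, dtype=torch.long)
        self.context_length = context_length

    def __len__(self):
        return len(self.ids) - self.context_length

    def __getitem__(self, i):
        chunk = self.ids[i : i + self.context_length + 1]
        return chunk[:-1], chunk[1:]                 # inputs and next-token targets

# Batching and shuffling are handled by DataLoader:
# loader = DataLoader(TextDataset(all_token_ids), batch_size=6, shuffle=True)
```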
Key hyperparameters for the model and training process include:
- Model Parameters
  - `vocab_size`: Size of the tokenizer vocabulary.
  - `embedding_dim`: Dimension of token embeddings.
  - `num_heads`: Number of attention heads in MHA.
  - `num_layers`: Number of transformer blocks.
  - `hidden_dim`: Dimension of the feed-forward network.
  - `max_seq_length`: Maximum sequence length.
- Training Parameters
  - `batch_size`: Number of samples per batch.
  - `learning_rate`: Initial learning rate for the optimizer.
  - `num_epochs`: Total number of training epochs.
  - `weight_decay`: Weight decay (L2 regularization) factor.
  - `dropout`: Dropout rate for regularization.
  - `gradient_clip`: Maximum gradient norm for clipping.
All hyperparameters are configurable via the `config/train_config.yaml` file.
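For reference, the configuration can be read in Python with PyYAML roughly as follows; the key names are assumptions based on the parameter list above.

```python
import yaml

# Load the training configuration (path matches the repository layout below).
with open("config/train_config.yaml") as f:
    config = yaml.safe_load(f)

print(config["learning_rate"], config["batch_size"])   # hypothetical keys
```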
gpt-1-from-scratch/
├── data/
│ ├── raw/
│ └── processed/
├── checkpoints/
├── src/
│ ├── __init__.py
│ ├── model.py
│ ├── data_loader.py
│ ├── train.py
│ ├── generate.py
│ └── utils.py
├── config/
│ └── train_config.yaml
├── requirements.txt
├── README.md
└── LICENSE
- data/: Contains raw and processed datasets.
- checkpoints/: Stores model checkpoints during training.
- src/: Source code for the model, data loading, training, and generation scripts.
- config/: Configuration files for training and model parameters.
- requirements.txt: Python dependencies.
- README.md: Project documentation.
- LICENSE: License information.
Contributions are welcome! If you have suggestions, improvements, or bug fixes, feel free to open an issue or submit a pull request.
- Fork the Repository
- Create a Feature Branch: `git checkout -b feature/YourFeature`
- Commit Your Changes: `git commit -m "Add Your Feature"`
- Push to the Branch: `git push origin feature/YourFeature`
- Open a Pull Request
This project is licensed under the MIT License.
- OpenAI GPT-1 Paper
- PyTorch
- Inspired by the original GPT-1 architecture and various educational resources on transformer models.
This repository is solely for educational purposes. All rights to the original GPT-1 model and associated materials belong to OpenAI.