This project is a from-scratch implementation of a GPT model, inspired by Build a Large Language Model (From Scratch) by Sebastian Raschka. It covers key concepts such as attention mechanisms, the transformer architecture, and fine-tuning large language models for real-world applications.
| Topic | Description | README Section | Code File |
|---|---|---|---|
| Understanding Large Language Models | Introduction to LLMs, the transformer architecture, and training objectives. | Read More | 📜 No Code - add papers |
| Working with Text Data | Covers tokenization, byte pair encoding (BPE), word embeddings, and positional embeddings. | Read More | 📜 Code |
| Coding Attention Mechanisms | Explains self-attention, causal masking, and multi-head attention. | Read More | 📜 Code |
| Implementing a GPT Model from Scratch | Step-by-step implementation of a GPT model, including transformer blocks and text generation. | Read More | 📜 Code |
| Pretraining on Unlabeled Data | Covers loss functions, decoding strategies, and loading pre-trained weights. | Read More | 📜 Code |
| Finetuning for Text Classification | Adapts the model for supervised tasks like spam detection, adding classification heads, and loss calculation. | Read More | 📜 Code |
| Instruction Finetuning | Covers supervised instruction tuning, dataset preparation, and response extraction. | Read More | 📜 Code |
- A Large Language Model (LLM) is a deep neural network trained on vast amounts of text data to understand and generate human-like text. LLMs use the transformer architecture, which enables them to focus on different parts of the input using an attention mechanism. These models, trained via next-word prediction, power applications like chatbots, text summarization, and code generation.
- LLMs are widely used for:
- Text generation
- Machine translation
- Sentiment analysis
- Summarization
- Question answering
- Conversational AI (e.g., ChatGPT, Gemini, Claude)
LLM training generally involves two key stages:
- Pretraining – Training on a massive dataset to learn general language structures using next-word prediction.
- Finetuning – Adapting the pretrained model to specific tasks using labeled datasets (e.g., instruction tuning or classification).
This two-step approach allows LLMs to be customized for specific applications while leveraging the knowledge learned from large-scale text corpora.
Most modern LLMs rely on the transformer architecture, introduced in the 2017 paper Attention Is All You Need. The original transformer was developed for machine translation.
Architecture
- The original transformer consists of two submodules: an encoder and a decoder.
- The encoder processes the input text into numerical representations (embeddings).
- The decoder generates the output text from these embeddings.
- The GPT architecture is simpler: it uses only the decoder part, without the encoder. Because decoder-style models like GPT generate text one word at a time, they are considered autoregressive models.
- Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves coherence of the resulting text.
- GPT models are trained only for next-word prediction, yet they can also perform tasks such as translation that they were not explicitly trained for, a phenomenon called "emergent behavior." This ability arises from exposure to multilingual data in varied contexts, which lets a single large generative model handle diverse tasks without task-specific training.
- Deep neural networks cannot process raw text directly; the text must first be converted into numerical form. Embeddings map words or other discrete data into a continuous vector space, which lets neural networks handle text, images, or audio.
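The autoregressive loop described in the architecture notes above can be sketched with a toy stand-in for the model. This is only an illustration: `train_bigram_counts` and `generate` are hypothetical helper names, and the "model" here is a bigram count table that looks at the last token only, whereas GPT conditions on the whole preceding sequence. What carries over is the loop itself: each predicted token is appended to the sequence and becomes input for the next prediction.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each token follows another (a stand-in for a trained model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Greedy autoregressive decoding: every prediction is appended to the input."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        candidates = counts.get(sequence[-1])
        if not candidates:  # no known continuation: stop early
            break
        next_token = candidates.most_common(1)[0][0]
        sequence.append(next_token)  # the output becomes part of the next input
    return sequence


corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_counts(corpus)
generated = generate(model, ["the"], max_new_tokens=4)
print(generated)
```

Greedy selection (always taking the most frequent continuation) is the simplest decoding choice; sampling strategies such as temperature scaling and top-k replace only the line that picks `next_token`.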
When training a Large Language Model (LLM), we need to convert raw text into a numerical format that the model can process. This involves several key steps:
Tokenizing Text
Before converting words into numerical representations, we split text into tokens.
A tokenizer breaks down input text into:
- Words ("Hello world" → ["Hello", "world"])
- Subwords ("unfamiliar" → ["unfam", "iliar"])
- Characters (if needed)
- Special tokens ([BOS], [EOS], [PAD], [UNK]):
  - [BOS] (Beginning of sequence)
  - [EOS] (End of sequence)
  - [PAD] (Padding to equalize sequence lengths)
  - [UNK] (Unknown words that don’t exist in the vocabulary)
Each token is then mapped to a unique integer (token ID) using a vocabulary.
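As a minimal sketch of the tokenize-then-map-to-IDs step, the snippet below uses a simple word-level tokenizer with the special tokens listed above. The function names (`tokenize`, `build_vocab`, `encode`, `decode`) and the regular expression are assumptions made for illustration, not the tokenizer this project or GPT actually uses.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens (a simple word-level scheme)."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens, special_tokens=("[PAD]", "[UNK]", "[BOS]", "[EOS]")):
    """Assign one integer ID per unique token, reserving the first IDs for special tokens."""
    vocab = {tok: idx for idx, tok in enumerate(special_tokens)}
    for tok in sorted(set(tokens)):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text, vocab):
    """Map tokens to IDs, falling back to [UNK] and wrapping with [BOS]/[EOS]."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokenize(text)]
    return [vocab["[BOS]"]] + ids + [vocab["[EOS]"]]


def decode(ids, vocab):
    """Map IDs back to their token strings."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    return [id_to_token[i] for i in ids]


vocab = build_vocab(tokenize("Hello world. Hello tokens!"))
token_ids = encode("Hello unseen world", vocab)
print(token_ids)                 # [2, 6, 1, 8, 3]
print(decode(token_ids, vocab))  # ['[BOS]', 'Hello', '[UNK]', 'world', '[EOS]']
```

The word "unseen" is not in the vocabulary, so it collapses to [UNK]. That loss of information is the weakness subword tokenization is meant to address.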
Byte Pair Encoding (BPE): GPT’s Tokenization Method
Why BPE?
LLMs need to handle words outside their vocabulary (out-of-vocabulary words). Instead of storing every possible word, BPE breaks unfamiliar words into smaller subword units, down to single characters if necessary.
- This allows the model to generalize to words it hasn't explicitly seen during training.
- GPT-2's BPE tokenizer is available through OpenAI’s tiktoken library, which implements the core BPE algorithm in Rust for better efficiency.
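To illustrate the idea behind BPE, here is a toy character-level version that learns merges from a tiny corpus and applies them to an unseen word. It is a simplified sketch, not tiktoken's implementation: GPT-2's tokenizer works on bytes and uses a large pretrained merge table, and the names `learn_bpe` and `apply_bpe` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent pair, starting from characters."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


def apply_bpe(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))


merges = learn_bpe("low low low lower lowest newer newer", num_merges=3)
print(apply_bpe("slower", merges))  # ['s', 'low', 'er']
```

The word "slower" never appears in the toy corpus, yet it is segmented into known pieces instead of being mapped to an unknown token.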
Preparing Input-Target Pairs for Training
To train an LLM, we need to structure the data properly:
- Chunking: split the tokenized text into smaller sequences.
- Next-word prediction: the model predicts the next word given the previous words. Example:
  - Input: ["The", "cat", "sat", "on"]
  - Target: ["cat", "sat", "on", "the"]
- The target is the input shifted by one position, so each target token is the word that follows the corresponding input token.
- Using DataLoaders in PyTorch: the Dataset and DataLoader classes load the data efficiently in mini-batches.
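The chunking and shifting steps above can be sketched in plain Python with a sliding window. The names `make_input_target_pairs` and `batches`, and the `context_length` and `stride` parameters, are assumptions for illustration; in a PyTorch setup the same logic would typically live in a `Dataset` subclass, with a `DataLoader` handling the batching.

```python
def make_input_target_pairs(token_ids, context_length, stride):
    """Slide a window over the tokens; the target is the input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start:start + context_length]
        targets = token_ids[start + 1:start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


def batches(pairs, batch_size):
    """Yield mini-batches, the role a PyTorch DataLoader would play."""
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i + batch_size]


# Words stand in for token IDs here to keep the example readable.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = make_input_target_pairs(tokens, context_length=4, stride=1)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A `stride` equal to `context_length` produces non-overlapping chunks, while a smaller stride reuses tokens across chunks and yields more training examples from the same text.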
Creating Token Embeddings (Converting Tokens into Vectors)
- Token IDs are arbitrary integers, so we convert them into meaningful numerical representations (embedding vectors).
- Here, token IDs are converted into 256-dimensional embedding vectors (GPT-3 uses 12,288 dimensions).
- Embedding layer: maps token IDs to high-dimensional embedding vectors. Example: if a token ID is 3, the layer retrieves the row at index 3 of the embedding matrix.
- Why embeddings? They allow words with similar meanings to have similar numerical representations.
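The embedding lookup can be sketched without any framework: the embedding matrix has one row per token ID, and embedding a sequence amounts to selecting rows. The helper names and the tiny sizes below (a vocabulary of 6 and 3 dimensions) are assumptions chosen for readability. In PyTorch the same role is played by `torch.nn.Embedding`, whose rows are updated during training.

```python
import random


def make_embedding_matrix(vocab_size, embedding_dim, seed=0):
    """One row of small random values per token ID; training would adjust these."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]


def embed(token_ids, embedding_matrix):
    """An embedding lookup is just row selection by token ID."""
    return [embedding_matrix[token_id] for token_id in token_ids]


embedding_matrix = make_embedding_matrix(vocab_size=6, embedding_dim=3)
vectors = embed([2, 3, 5, 1], embedding_matrix)

# Token ID 3 retrieves the row at index 3 of the matrix.
print(vectors[1] == embedding_matrix[3])
```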
Encoding Word Positions (Positional Embeddings)
- The embedding layer converts a token ID into the same vector regardless of where the token appears in the input sequence.
- Without extra information, an LLM's attention mechanism processes tokens without knowing their order, which can cause problems. To fix this, we add positional embeddings, which give the model a sense of word order.
- There are two types of positional embeddings:
  - Absolute positional embeddings (used in GPT models): assign a distinct embedding to each position in a sequence. These embeddings are optimized during training.
  - Relative positional embeddings: instead of storing absolute positions, they encode distances between words. For example, "cat" and "sat" may have a distance of 1, while "cat" and "mat" may have a distance of 3.
Final Processing Before Training
- To create the input embeddings used in an LLM, we simply add the token embeddings and the absolute positional embeddings element-wise: input embedding = token embedding + positional embedding.
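A minimal sketch of that final step, assuming small random tables in place of trained embedding layers: each token vector is added element-wise to the vector for its position. The function name `input_embeddings` and the table sizes are hypothetical.

```python
import random


def random_matrix(rows, cols, rng):
    """Small random table standing in for a trained embedding layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def input_embeddings(token_ids, token_table, position_table):
    """Add the embedding of each token to the embedding of its position."""
    if len(token_ids) > len(position_table):
        raise ValueError("sequence is longer than the supported context length")
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[pos])]
        for pos, token_id in enumerate(token_ids)
    ]


rng = random.Random(0)
token_table = random_matrix(rows=10, cols=4, rng=rng)    # vocab_size x dim
position_table = random_matrix(rows=8, cols=4, rng=rng)  # context_length x dim

# The same token ID (7) at positions 0 and 2 gets different input vectors.
embedded = input_embeddings([7, 3, 7], token_table, position_table)
print(embedded[0] != embedded[2])
```

Because the position table has one row per position, the context length is bounded by the number of rows. This is why absolute positional embeddings fix a maximum sequence length.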
- 3.3.1 A simple self-attention mechanism without trainable weights
- 3.3.2 Computing attention weights for all input tokens
- 3.4.1 Computing the attention weights step by step
- 3.4.2 Implementing a compact SelfAttention class
- 3.5.1 Applying a causal attention mask
- 3.5.2 Masking additional attention weights with dropout
- 3.5.3 Implementing a compact causal self-attention class
- 3.6.1 Stacking multiple single-head attention layers
- 3.6.2 Implementing multi-head attention with weight splits
- 5.1.1 Using GPT to Generate Text
- 5.1.2 Calculating the Text Generation Loss: Cross Entropy and Perplexity
- 5.1.3 Calculating the Training and Validation Set Losses
- 5.3.1 Temperature Scaling
- 5.3.2 Top-k Sampling
- 5.3.3 Modifying the Text Generation Function with Above Strategies
- 7.3.1 Creating Target Token IDs for Training
