GPT Model from Scratch

This project implements a GPT model from scratch, inspired by Build a Large Language Model (From Scratch) by Sebastian Raschka. It covers key concepts such as attention mechanisms, the transformer architecture, and finetuning large language models for real-world applications.

| Topic | Description | README Section | Code File |
| --- | --- | --- | --- |
| Understanding Large Language Models | Introduction to LLMs, transformer, and training objectives. | Read More | 📜 No Code - add papers |
| Working with Text Data | Covers tokenization, Byte-pair encoding, word embeddings, and positional embeddings. | Read More | 📜 Code |
| Coding Attention Mechanisms | Explains self-attention, causal masking, and multi-head attention. | Read More | 📜 Code |
| Implementing a GPT Model from Scratch | Step-by-step implementation of a GPT model, including transformer blocks and text generation. | Read More | 📜 Code |
| Pretraining on Unlabeled Data | Covers loss functions, decoding strategies, and loading pre-trained weights. | Read More | 📜 Code |
| Finetuning for Text Classification | Adapts the model for supervised tasks like spam detection, adding classification heads, and loss calculation. | Read More | 📜 Code |
| Instruction Finetuning | Covers supervised instruction tuning, dataset preparation, and response extraction. | Read More | 📜 Code |

1. Understanding Large Language Models

What is an LLM?

A Large Language Model (LLM) is a deep neural network trained on vast amounts of text data to understand and generate human-like text. LLMs use the transformer architecture, which enables them to focus on different parts of the input using an attention mechanism. These models, trained via next-word prediction, power applications like chatbots, text summarization, and code generation.
LLMs are widely used for:
- Text generation
- Machine translation
- Sentiment analysis
- Summarization
- Question answering
- Conversational AI (e.g., ChatGPT, Gemini, Claude)

Stages of Building and Using LLMs

LLM training generally involves two key stages:

- Pretraining – Training on a massive dataset to learn general language structures using next-word prediction.
- Finetuning – Adapting the pretrained model to specific tasks using labeled datasets (e.g., instruction tuning or classification).

This two-step approach allows LLMs to be customized for specific applications while leveraging the knowledge learned from large-scale text corpora.

Transformer Architecture

Most modern LLMs rely on the transformer architecture, introduced in the 2017 paper Attention Is All You Need. The original transformer was developed for machine translation.

Architecture

The original transformer consists of two submodules, an encoder and a decoder:
- The encoder processes the input text into numerical representations (embeddings).
- The decoder generates the output text from these representations.

GPT Architecture

The GPT architecture is relatively simple: it is just the decoder part of the transformer, without the encoder. Because decoder-style models like GPT generate text by predicting one word at a time, they are considered autoregressive models.
Autoregressive models feed their previous outputs back in as inputs for future predictions. In GPT, each new word is therefore chosen based on the sequence that precedes it, which improves the coherence of the resulting text.
Although GPT models are trained only for next-word prediction, they can also perform tasks they were never explicitly trained for, such as translation. This phenomenon is called "emergent behavior". It arises from exposure to multilingual data and lets the model handle diverse tasks without specialized training, which showcases the power of large-scale generative models.
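
The autoregressive loop described above can be sketched in a few lines of plain Python. This is only an illustration: the `NEXT_WORD` table and the `predict_next` function are hypothetical stand-ins for a trained model, and unlike a real GPT the toy predictor looks only at the last word instead of the whole preceding sequence.

```python
# Toy next-word table standing in for a trained model (hypothetical data).
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}


def predict_next(context):
    """Stand-in for a model: this toy version uses only the last word."""
    return NEXT_WORD.get(context[-1], "<unk>")


def generate(prompt, max_new_words):
    """Greedy autoregressive generation: each prediction is appended to
    the sequence and becomes part of the input for the next step."""
    words = list(prompt)
    for _ in range(max_new_words):
        words.append(predict_next(words))
    return words


generated = generate(["the"], max_new_words=4)
print(generated)
```

The important part is the loop in `generate`: the output of one step is fed back in as input to the next, which is what makes the model autoregressive.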

Building a large language model in 3 stages

2. Working with Text Data

Word Embeddings

Deep neural networks can't process raw text directly; the text must first be converted into numerical form. Embeddings map words or other discrete data into a continuous vector space, which lets neural networks work with text, images, or audio.

Preparing Embeddings for LLMs

When training a Large Language Model (LLM), we need to convert raw text into a numerical format that the model can process. This involves several key steps:

Tokenizing Text
Before converting words into numerical representations, we split text into tokens.

A tokenizer breaks input text down into:

- Words ("Hello world" → ["Hello", "world"])
- Subwords ("unfamiliar" → ["unfam", "iliar"])
- Characters (if needed)
- Special tokens:
  - [BOS] (beginning of sequence)
  - [EOS] (end of sequence)
  - [PAD] (padding to equalize sequence lengths)
  - [UNK] (unknown words that don't exist in the vocabulary)

Each token is then mapped to a unique integer (token ID) using a vocabulary.
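
As a rough sketch of the steps above, the following plain-Python example splits text into word and punctuation tokens, builds a vocabulary, and maps tokens to IDs with an [UNK] fallback. The regular expression and the helper names (`tokenize`, `build_vocab`, `encode`, `decode`) are assumptions chosen for illustration, not the tokenizer used in the notebooks.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens):
    """Assign an integer ID to each unique token, plus one ID for [UNK]."""
    vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
    vocab["[UNK]"] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to token IDs; tokens missing from the vocabulary map to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokenize(text)]


def decode(ids, vocab):
    """Convert token IDs back to a space-joined string of tokens."""
    inverse = {idx: token for token, idx in vocab.items()}
    return " ".join(inverse[i] for i in ids)


training_text = "Hello world. Hello again, world!"
vocab = build_vocab(tokenize(training_text))

ids = encode("Hello there, world!", vocab)
print(ids)
print(decode(ids, vocab))
```

The word "there" never appears in the training text, so it is encoded as [UNK]. This loss of information is the limitation that subword tokenization is meant to avoid.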

Byte Pair Encoding (BPE): GPT's Tokenization Method

Why BPE? LLMs need to handle words that are not in their vocabulary (out-of-vocabulary words). Instead of storing every possible word, Byte Pair Encoding (BPE) breaks words into subword units.

This allows the model to handle words it has not explicitly seen during training.
The GPT-2 BPE tokenizer is available through OpenAI's tiktoken library, which implements the core BPE code in Rust for better efficiency.
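
To make the idea of BPE concrete, here is a toy sketch of how merges are learned: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. It is a simplified character-level illustration with hypothetical helper names and a made-up corpus. It is not how tiktoken is implemented, and the real GPT-2 tokenizer works on bytes with a much larger, pretrained merge table.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


corpus = ["low", "low", "lower", "lowest"]
merges, segmented = learn_bpe(corpus, num_merges=2)
print(merges)
print(segmented)
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings stay split into characters. An unseen word can still be represented by falling back to smaller pieces.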

Preparing Input-Target Pairs for Training
To train an LLM, we need to structure the data properly:

- Chunking: split the text into smaller sequences.
- Next-word prediction: the model predicts the next word given the previous words. Example:
  Input: ["The", "cat", "sat", "on"]
  Target: ["cat", "sat", "on", "the"]
  The target is the input shifted by one position, so each target token is the word that follows the corresponding input token.
- Using DataLoaders in PyTorch: the Dataset and DataLoader classes load the data efficiently in mini-batches.
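
The chunking and shifting described above can be sketched with a sliding window over a list of token IDs. The function name and the `max_length` and `stride` parameters are illustrative assumptions, and the token IDs are made up.

```python
def make_input_target_pairs(token_ids, max_length, stride):
    """Slide a window over token_ids; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]  # made-up token IDs
pairs = make_input_target_pairs(token_ids, max_length=4, stride=4)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With `stride` equal to `max_length` the windows do not overlap; a smaller stride produces overlapping chunks and therefore more training pairs. In PyTorch, pairs like these would typically be wrapped in a Dataset and batched with a DataLoader.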

Creating Token Embeddings (Converting Tokens into Vectors)

Token IDs are just integers, so we need to convert them into continuous vector representations that the model can learn from:
- Each token ID is converted into an embedding vector, for example with 256 dimensions (the largest GPT-3 model uses 12,288 dimensions).
- The embedding layer is a lookup table that maps token IDs to embedding vectors. Example: for token ID 3, it retrieves the corresponding row of the embedding matrix.

Why embeddings? Because they are optimized during training, words with similar meanings can end up with similar numerical representations.
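
The lookup behaviour of an embedding layer can be illustrated without any deep learning library. In this sketch the embedding matrix is a plain list of rows with random initial values; the sizes and the `embed` helper are hypothetical and kept tiny for readability.

```python
import random

random.seed(123)  # fixed seed so the example is repeatable

vocab_size = 6     # hypothetical tiny vocabulary
embedding_dim = 3  # tiny for readability; the text above mentions 256

# Embedding matrix: one row of (initially random) weights per token ID.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_matrix[token_id] for token_id in token_ids]


vectors = embed([3, 1, 3])
for token_id, vector in zip([3, 1, 3], vectors):
    print(token_id, vector)
```

Both occurrences of token ID 3 return exactly the same row. In PyTorch, `torch.nn.Embedding` performs this kind of lookup, with the rows treated as trainable weights.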

Encoding Word Positions (Positional Embeddings)

The embedding layer maps each token ID to the same vector regardless of where the token appears in the input sequence.

The self-attention mechanism is itself position-agnostic, so without extra information the model has no sense of word order. To fix this, we add positional embeddings.
There are two types of positional embeddings:
- Absolute positional embeddings (used in GPT models): each position in the sequence gets its own embedding vector. In GPT models, these embeddings are optimized during training.
- Relative positional embeddings: instead of encoding absolute positions, these encode the distance between words. For example:
  - "cat" and "sat" may have a distance of 1.
  - "cat" and "mat" may have a distance of 3.

Final Processing Before Training

To create the input embeddings used in an LLM, we simply add the token embeddings and the absolute positional embeddings element-wise, position by position.
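
A minimal sketch of this addition in plain Python is shown below. The matrix sizes, the `random_matrix` helper, and the `input_embeddings` function are assumptions for illustration; in a real model both matrices are trainable and far larger.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

# Hypothetical toy sizes.
vocab_size, context_length, embedding_dim = 8, 4, 3


def random_matrix(rows, cols):
    """Create a rows x cols matrix of random values in [-1, 1]."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


token_embedding = random_matrix(vocab_size, embedding_dim)         # one row per token ID
position_embedding = random_matrix(context_length, embedding_dim)  # one row per position


def input_embeddings(token_ids):
    """Add each token's embedding to the embedding of its position."""
    if len(token_ids) > context_length:
        raise ValueError("sequence is longer than the supported context length")
    return [
        [t + p for t, p in zip(token_embedding[token_id], position_embedding[pos])]
        for pos, token_id in enumerate(token_ids)
    ]


embedded = input_embeddings([5, 2, 5])
for row in embedded:
    print(row)
```

Token ID 5 appears at positions 0 and 2 and gets a different input vector each time, because a different positional embedding is added. This is how the model can tell identical tokens at different positions apart.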

3. Coding Attention Mechanisms

3.1 The problem with modeling long sequences

3.2 Capturing data dependencies with attention mechanisms

3.3 Attending to different parts of the input with self-attention

3.3.1 A simple self-attention mechanism without trainable weights
3.3.2 Computing attention weights for all input tokens

3.4 Implementing self-attention with trainable weights

3.4.1 Computing the attention weights step by step
3.4.2 Implementing a compact SelfAttention class

3.5 Hiding future words with causal attention

3.5.1 Applying a causal attention mask
3.5.2 Masking additional attention weights with dropout
3.5.3 Implementing a compact causal self-attention class

3.6 Extending single-head attention to multi-head attention

3.6.1 Stacking multiple single-head attention layers
3.6.2 Implementing multi-head attention with weight splits

4. Implementing a GPT Model from Scratch to Generate Text

4.1 Coding an LLM architecture

4.2 Normalizing activations with layer normalization

4.3 Implementing a feed forward network with GELU activations

4.4 Adding shortcut connections

4.5 Connecting attention and linear layers in a transformer block

4.6 Coding the GPT model

4.7 Generating text

5. Pretraining on Unlabeled Data

5.1 Evaluating Generative Text Models

5.1.1 Using GPT to Generate Text
5.1.2 Calculating the Text Generation Loss: Cross Entropy and Perplexity
5.1.3 Calculating the Training and Validation Set Losses

5.2 Training an LLM

5.3 Decoding Strategies to Control Randomness

5.3.1 Temperature Scaling
5.3.2 Top-k Sampling
5.3.3 Modifying the Text Generation Function with Above Strategies

5.4 Loading and Saving the Weights in PyTorch

5.5 Loading the Pre-trained Weights from OpenAI

6. Finetuning for Text Classification

6.1 Finetuning

6.2 Data Preparation (Spam Data)

6.3 Creating Data Loaders

6.4 Initializing the Model with Pre-trained Weights

6.5 Adding a Classification Head

6.6 Calculating the Classification Loss and Accuracy

6.7 Finetuning the Models on Supervised Data

6.8 Using the LLM as a Spam Classifier

7. Instruction Finetuning

7.1 Introduction to instruction finetuning

7.2 Preparing Dataset for Supervised Instruction Finetuning

7.3 Organizing data into training batches

7.3.1 Creating Target Token IDs for Training

Table of Contents

1. Understanding Large Language Models

What is an LLM?

Stages of Building and Using LLMs

Transformer Architecture

GPT Architecture

Building a large language model in 3 stages

2. Working with Text Data

Word Embeddings

Preparing Embeddings for LLMs

3. Coding Attention Mechanisms

3.1 The problem with modeling long sequences

3.2 Capturing data dependencies with attention mechanisms

3.3 Attending to different parts of the input with self-attention

3.4 Implementing self-attention with trainable weights

3.5 Hiding future words with causal attention

3.6 Extending single-head attention to multi-head attention

4. Implementing a GPT Model from Scratch to Generate Text

4.1 Coding an LLM architecture

4.2 Normalizing activations with layer normalization

4.3 Implementing a feed forward network with GELU activations

4.4 Adding shortcut connections

4.5 Connecting attention and linear layers in a transformer block

4.6 Coding the GPT model

4.7 Generating text

5. Pretraining on Unlabeled Data

5.1 Evaluating Generative Text Models

5.2 Training an LLM

5.3 Decoding Strategies to Control Randomness

5.4 Loading and Saving the Weights in PyTorch

5.5 Loading the Pre-trained Weights from OpenAI

6. Finetuning for Text Classification

6.1 Finetuning

6.2 Data Preparation (Spam Data)

6.3 Creating Data Loaders

6.4 Initializing the Model with Pre-trained Weights

6.5 Adding a Classification Head

6.6 Calculating the Classification Loss and Accuracy

6.7 Finetuning the Models on Supervised Data

6.8 Using the LLM as a Spam Classifier

7. Instruction Finetuning

7.1 Introduction to instruction finetuning

7.2 Preparing Dataset for Supervised Instruction Finetuning

7.3 Organizing data into training batches

7.4 Creating data loaders for an instruction dataset

7.5 Loading a pretrained LLM

7.6 Finetuning the LLM on instruction data

7.7 Extracting and saving responses

7.8 Evaluating the finetuned LLM
