A foundational character-level Bigram Language Model built from scratch using PyTorch.
This project is designed to be a clear and simple implementation, perfect for understanding the basics of language modeling before moving on to more complex architectures like Transformers.
The model is trained on the text of "The Wonderful Wizard of Oz" and can generate new, albeit simplistic, text in a similar style.
- Character-Level Tokenization: The model learns from individual characters, not words.
- Simple Bigram Architecture: Predicts the next character based only on the previous one, implemented with a single nn.Embedding layer.
- Configuration Driven: All hyperparameters, file paths, and settings are managed in a single config.yaml file.
- Clean Code Structure: The project is organized into logical modules for data handling, model definition, training, and inference.
- Reproducible: The training script handles everything from data processing to saving the final model and character mappings.
The repository is organized to separate source code from data and model outputs, making it clean and easy to navigate.
.
├── config.yaml # All hyperparameters and paths
├── data/
│ └── wizard_of_oz.txt # Your training dataset
├── model_output/ # All generated files are saved here
│ ├── bigram.pth # The trained model weights
│ └── mappings.json # Character-to-integer mappings
├── data_utils.py # Handles data loading, encoding, and batching
├── model.py # The PyTorch nn.Module definition
├── train.py # Script to train the model
├── inference.py # Script to generate text from a trained model
└── README.md # This file
Follow these steps to get the project running on your local machine.
git clone https://github.com/shivendra-dev54/bigram-model
cd bigram-model

It's good practice to create a virtual environment to manage project dependencies.
# For Windows
python -m venv venv
venv\Scripts\activate
# For macOS/Linux
python3 -m venv venv
source venv/bin/activate

This project requires PyTorch and PyYAML.
pip install torch pyyaml

Running the project involves three main steps: preparing the data, training the model, and running inference.
Place your training data, a single plain text file (e.g., wizard_of_oz.txt), inside the data/ directory. Make sure the data_path in config.yaml points to this file.
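The character-level encoding that data_utils.py performs can be sketched as follows (a simplified illustration with a made-up snippet of text; the project's actual helper names may differ):

```python
# Build character-level mappings from the raw text (simplified sketch)
text = "Dorothy lived in the midst of the great Kansas prairies"
chars = sorted(set(text))                     # vocabulary = every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char

def encode(s: str) -> list[int]:
    """Turn a string into a list of integer character IDs."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Turn a list of integer IDs back into a string."""
    return "".join(itos[i] for i in ids)
```

These same mappings are what get saved to mappings.json, so inference can decode the model's integer outputs back into characters.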
You can modify the hyperparameters in config.yaml to experiment with different settings.
# In config.yaml
batch_size: 32
block_size: 128
max_iters: 5000
learning_rate: 3e-4
# ... and other settings

Run the training script from the root directory. This will process your data, train the model, and save the model weights (bigram.pth) and character mappings (mappings.json) to the model_output/ directory.
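Presumably train.py reads these values with PyYAML; a minimal sketch of that step (not the project's actual code, shown here with an inline string instead of the file):

```python
import yaml  # PyYAML, installed above

raw = """
batch_size: 32
block_size: 128
max_iters: 5000
learning_rate: 3e-4
"""
config = yaml.safe_load(raw)

# Note: PyYAML loads bare scientific notation like 3e-4 as a string
# (its float resolver requires a decimal point), so an explicit cast
# is a safe habit before handing the value to an optimizer.
lr = float(config["learning_rate"])
```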
python train.py

You should see output indicating the training progress:
Using device: cuda
✅ Mappings saved to ./model_output/mappings.json
Vocabulary size: 81
🚀 Starting training...
step 0: train loss 4.6821, val loss 4.6759
step 500: train loss 2.8913, val loss 2.9104
...
step 4999: train loss 2.4812, val loss 2.5031
✅ Model saved to ./model_output/bigram.pth
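The training loop behind that log can be sketched in a few lines (a self-contained toy version, not the project's train.py: it uses a tiny inline corpus, no batching or train/val split, and an exaggerated learning rate so the loss drop is visible quickly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy corpus standing in for wizard_of_oz.txt
text = "the wonderful wizard of oz " * 4
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])

vocab_size = len(chars)
model = nn.Embedding(vocab_size, vocab_size)  # the whole bigram model: a (V, V) logit table
opt = torch.optim.AdamW(model.parameters(), lr=0.1)

x, y = data[:-1], data[1:]  # every character is trained to predict its successor
losses = []
for step in range(200):
    logits = model(x)                  # (T, V) next-character logits
    loss = F.cross_entropy(logits, y)  # compare against the true next characters
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The real script adds batching over random block_size windows, a held-out validation split, and saving the weights and mappings at the end.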
Once the model is trained, you can generate new text by running the inference script.
python inference.py

This script will load the saved model and mappings from model_output/ and print the generated text to the console:
✍️ Generating text...
--- GENERATED OUTPUT ---
The oot t a's s,
"Id as,
he s wef, l.
"
Th o
"
"I dyo an.
Tine sas, thet lof thavane, shor, se th l l,
Asorof t he ood athe s."
Dorot we s aind t anororot sse s asor asorof t s t hind wind," sanot t s s, the we a t aind t aind," se,
s,
"I d, thet sorot lof t we, sor the s, thet aind, s," se, sorof t we s, shind t wef l, the wef wef,
------------------------
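The sampling loop that produces output like this can be sketched as follows (a hypothetical simplification of inference.py; here the embedding table is freshly initialized rather than loaded from bigram.pth, and the start index is arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
V = 81                   # vocabulary size from the training log
table = nn.Embedding(V, V)  # stands in for the trained bigram table

idx = torch.zeros(1, dtype=torch.long)  # assumed start index
generated = [int(idx)]
for _ in range(100):
    logits = table(idx)                     # (1, V) logits for the next character
    probs = torch.softmax(logits, dim=-1)   # turn logits into a distribution
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)  # sample, don't argmax
    generated.append(int(idx))
```

Sampling with torch.multinomial rather than always taking the argmax is what gives the varied (if incoherent) output above; greedy decoding from a bigram table would quickly fall into a repeating loop.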
A Bigram Model is one of the simplest types of language models. Its core assumption is that the probability of the next character in a sequence depends only on the immediately preceding character.
It completely ignores any context before that single character. For example, when predicting the character that follows the second "l" in "hello", the model uses only that "l", ignoring the preceding "h", "e", and "l".
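As a toy illustration of this assumption, the conditional distribution over next characters can be obtained just by counting pairs (a sketch with made-up text, unrelated to the project's code):

```python
from collections import Counter, defaultdict

text = "hello hello"
pairs = Counter(zip(text, text[1:]))  # count each (previous, next) character pair
following = defaultdict(Counter)
for (prev, nxt), n in pairs.items():
    following[prev][nxt] += n

# After "l", the model only knows "l": here it has seen 'l'->'l' and 'l'->'o'
dist = following["l"]
total = sum(dist.values())
probs = {ch: n / total for ch, n in dist.items()}
```

In this text, "l" is followed by "l" and by "o" equally often, so both get probability 0.5; the neural version below learns the same kind of table from gradients instead of counts.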
In this implementation, the nn.Embedding layer acts as a direct lookup table. For a given vocabulary size V, the embedding table is of size (V, V). When you input the index of a character, the model simply looks up that row in the table. This row contains V numbers (logits), representing the model's confidence for every possible character in the vocabulary being the next one.
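That lookup can be demonstrated directly (a minimal illustration of the shapes involved, not the project's model.py; the vocabulary size 81 is taken from the training log and the input index is arbitrary):

```python
import torch
import torch.nn as nn

V = 81                       # matches "Vocabulary size: 81" in the training log
table = nn.Embedding(V, V)   # one row of V logits per character in the vocabulary

idx = torch.tensor([5])                  # index of the current character (arbitrary)
logits = table(idx)                      # simply returns row 5 of the table
probs = torch.softmax(logits, dim=-1)    # confidence over every possible next character
```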
While this approach is too simple for generating coherent, long-form text, it's a fundamental concept and a great starting point for understanding how models learn sequential patterns.
This project is licensed under the MIT License. See the LICENSE file for details.