Simple Bigram Language Model in PyTorch

A foundational character-level Bigram Language Model built from scratch using PyTorch.

This project is designed to be a clear and simple implementation, perfect for understanding the basics of language modeling before moving on to more complex architectures like Transformers.

The model is trained on the text of "The Wonderful Wizard of Oz" and can generate new, albeit simplistic, text in a similar style.


🚀 Key Features

  • Character-Level Tokenization: The model learns from individual characters, not words.
  • Simple Bigram Architecture: Predicts the next character based only on the previous one, implemented with a simple nn.Embedding layer.
  • Configuration Driven: All hyperparameters, file paths, and settings are managed in a single config.yaml file.
  • Clean Code Structure: The project is organized into logical modules for data handling, model definition, training, and inference.
  • Reproducible: The training script handles everything from data processing to saving the final model and character mappings.
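The character-level tokenization can be sketched as follows (a minimal sketch; the actual helper names in data_utils.py may differ):

```python
# Build character-to-integer mappings from the raw training text
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # -> [3, 2, 4, 4, 5]
print(decode(ids))  # -> hello
```

These mappings are what get saved to mappings.json so that inference can decode the model's integer outputs back into text.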

📂 Project Structure

The repository is organized to separate source code from data and model outputs, making it clean and easy to navigate.

.
├── config.yaml               # All hyperparameters and paths
├── data/
│   └── wizard_of_oz.txt      # Your training dataset
├── model_output/             # All generated files are saved here
│   ├── bigram.pth            # The trained model weights
│   └── mappings.json         # Character-to-integer mappings
├── data_utils.py             # Handles data loading, encoding, and batching
├── model.py                  # The PyTorch nn.Module definition
├── train.py                  # Script to train the model
├── inference.py              # Script to generate text from a trained model
└── README.md                 # This file

🛠️ Setup and Installation

Follow these steps to get the project running on your local machine.

1. Clone the Repository

git clone https://github.com/shivendra-dev54/bigram-model
cd bigram-model

2. Create a Virtual Environment (Recommended)

It's good practice to create a virtual environment to manage project dependencies.

# For Windows
python -m venv venv
venv\Scripts\activate

# For macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

This project requires PyTorch and PyYAML.

pip install torch pyyaml

⚙️ Usage

Running the project involves three main steps: preparing the data, training the model, and running inference.

Step 1: Add Your Data

Place your training data, a single plain text file (e.g., wizard_of_oz.txt), inside the data/ directory. Make sure the data_path in config.yaml points to this file.

Step 2: Configure Your Training (Optional)

You can modify the hyperparameters in config.yaml to experiment with different settings.

# In config.yaml
batch_size: 32
block_size: 128
max_iters: 5000
learning_rate: 3e-4
# ... and other settings
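Loading these settings with PyYAML might look like the sketch below (parsing an inline snippet for illustration; train.py would open the real config.yaml). One gotcha worth knowing: PyYAML parses 3e-4 as a string, because its float pattern requires a decimal point, so numeric-looking values are safest cast explicitly.

```python
import yaml

# train.py would use: yaml.safe_load(open("config.yaml"))
raw = """
batch_size: 32
block_size: 128
max_iters: 5000
learning_rate: 3e-4
"""
config = yaml.safe_load(raw)

# Cast defensively: "3e-4" arrives as a string under YAML 1.1 rules
lr = float(config["learning_rate"])
print(config["batch_size"], lr)  # -> 32 0.0003
```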

Step 3: Train the Model

Run the training script from the root directory. This will process your data, train the model, and save the model weights (bigram.pth) and character mappings (mappings.json) to the model_output/ directory.

python train.py

You should see output indicating the training progress:

Using device: cuda
✅ Mappings saved to ./model_output/mappings.json
Vocabulary size: 81
🚀 Starting training...
step 0: train loss 4.6821, val loss 4.6759
step 500: train loss 2.8913, val loss 2.9104
...
step 4999: train loss 2.4812, val loss 2.5031
✅ Model saved to ./model_output/bigram.pth
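Under the hood, the loop producing those loss values can be sketched as follows (a minimal sketch with random stand-in data; the real batching and evaluation logic live in data_utils.py and train.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 81
data = torch.randint(0, vocab_size, (1000,))  # stand-in for the encoded text

# Bigram model: a (V, V) embedding whose rows are next-char logits
emb = nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(emb.parameters(), lr=3e-4)

def get_batch(batch_size=32, block_size=8):
    # Sample random contiguous chunks; targets are inputs shifted by one
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

for step in range(100):
    xb, yb = get_batch()
    logits = emb(xb)  # (batch, time, vocab)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

With real text instead of random data, the cross-entropy loss falls as the table learns which characters tend to follow which, which is exactly the drop from ~4.68 to ~2.48 shown above.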

Step 4: Generate Text (Inference)

Once the model is trained, you can generate new text by running the inference script.

python inference.py

This script will load the saved model and mappings from model_output/ and print the generated text to the console:

✍️ Generating text...

--- GENERATED OUTPUT ---

The oot t a's s,
"Id as,
he s wef, l.
"
Th o
"
"I dyo an.
Tine sas, thet lof thavane, shor, se th l l,
Asorof t he ood athe s."
Dorot we s aind t anororot sse s asor asorof t s t hind wind," sanot t s s, the we a t aind t aind," se,
s,
"I d, thet sorot lof t we, sor the s, thet aind, s," se, sorof t we s, shind t wef l, the wef wef,

------------------------

🔬 How It Works

A Bigram Model is one of the simplest types of language models. Its core assumption is that the probability of the next character in a sequence depends only on the immediately preceding character.

It completely ignores any context before that single character. For example, when trying to predict the character that follows the first "l" in the word "hello", it uses only that "l" to make its prediction, ignoring the "h" and "e" before it.

In this implementation, the nn.Embedding layer acts as a direct lookup table. For a given vocabulary size V, the embedding table is of size (V, V). When you input the index of a character, the model simply looks up that row in the table. This row contains V numbers (logits), representing the model's confidence for every possible character in the vocabulary being the next one.
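In code, that lookup-plus-sampling loop might look like this (a minimal sketch; the class in model.py may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # (V, V) table: row i holds the logits for the char after char i
        self.token_embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.token_embedding(idx)  # (batch, time, vocab) logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits = self(idx)[:, -1, :]       # only the last char matters
            probs = F.softmax(logits, dim=-1)  # logits -> probabilities
            nxt = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, nxt], dim=1)
        return idx

model = BigramLanguageModel(vocab_size=81)
out = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20)
print(out.shape)  # -> torch.Size([1, 21])
```

Note that generate only ever reads the final column of the logits: that single-character dependence is the bigram assumption made concrete.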

While this approach is too simple for generating coherent, long-form text, it's a fundamental concept and a great starting point for understanding how models learn sequential patterns.


📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

Built by Shivendra Devadhe
