A compact, GPT-style decoder-only Transformer trained on TinyStories. The goal is simple: build and understand an end-to-end small language model that can generate short, coherent stories while staying small enough to run on a single GPU or Colab.
It's written in plain PyTorch with no training frameworks, so you can see every moving part.
- Architecture: GPT-2-style blocks with learned token and position embeddings, multi-head causal self-attention, GELU MLP, residual connections, tied output projection.
- Tiny but capable: default config is ~30M params, 6 layers, 6 heads, 384 hidden size, context length 128.
- Data pipeline: Hugging Face `roneneldan/TinyStories`, tokenized with `tiktoken`, memory-mapped into `train.bin` and `validation.bin` for fast batched reads.
- Trainer you can actually read: AdamW, linear warmup, cosine decay, gradient accumulation, mixed precision with `torch.amp`, gradient clipping, periodic eval, and best-checkpoint saving.
- Inference with temperature and optional top-k sampling.
```sh
# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # pick the wheel that fits your setup
pip install datasets tiktoken numpy tqdm matplotlib
```

Colab works too; the script already includes minimal Colab-friendly bits.
The repository currently uses a single script:

```
Small Language model.py
```

Just run it as a notebook in Colab, or as a script locally. If you run it locally and Python chokes on lines like `!pip install ...`, remove those `!` installs and make sure you installed the packages in the step above.
On first run it will:

- Download TinyStories via `datasets`
- Tokenize with `tiktoken`
- Write `train.bin` and `validation.bin`
- Start training and periodically compute validation loss
- Save the best model to `best_model_params.pt`
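The tokenize-and-dump step above amounts to encoding every story, appending an end-of-text separator, and writing the raw ids to disk as `uint16`. A minimal sketch of the idea — the `write_split` helper and its signature are my own illustration, not the script's API:

```python
import numpy as np

def write_split(texts, path, encode, eot_token=50256):
    """Tokenize each story, append the GPT-2 end-of-text id to separate
    stories, and dump the whole token stream to disk as uint16
    (all GPT-2 ids fit in 16 bits)."""
    ids = []
    for text in texts:
        ids.extend(encode(text))
        ids.append(eot_token)
    np.array(ids, dtype=np.uint16).tofile(path)

# In the script this would be driven by the real tokenizer and dataset, e.g.:
#   enc = tiktoken.get_encoding("gpt2")
#   ds = load_dataset("roneneldan/TinyStories")
#   write_split(ds["train"]["text"], "train.bin", enc.encode_ordinary)
```

Writing a flat binary file means later reads can slice arbitrary windows without ever loading the whole dataset into RAM.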
Default model and training knobs live inside the script. Change them in place if you want different sizes.
Model:

```python
config = GPTConfig(
    vocab_size=50257,  # tiktoken GPT-2 vocab
    block_size=128,    # context length
    n_layer=6,         # transformer blocks
    n_head=6,          # attention heads
    n_embd=384,        # embedding width
    dropout=0.1,
    bias=True,
)
```

Training:

```python
learning_rate = 1e-4
max_iters = 20_000
warmup_steps = 1_000
min_lr = 5e-5  # suggested small floor, adjust as you like
eval_iters = 500
batch_size = 32
block_size = 128
gradient_accumulation_steps = 32
```

The trainer uses:
- AdamW with weight decay
- Linear warmup, then cosine decay to `min_lr`
- Autocast mixed precision on CUDA
- Gradient clipping at 0.5
- Best checkpointing on the lowest validation loss
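The warmup-then-cosine schedule can be written in a few lines. A sketch using the knobs above — the script's own implementation may differ in details:

```python
import math

def get_lr(it, learning_rate=1e-4, min_lr=5e-5,
           warmup_steps=1_000, max_iters=20_000):
    """Linear warmup to learning_rate, then cosine decay down to min_lr."""
    if it < warmup_steps:
        # ramp linearly from ~0 up to the peak learning rate
        return learning_rate * (it + 1) / warmup_steps
    if it > max_iters:
        return min_lr
    # cosine decay between warmup_steps and max_iters
    progress = (it - warmup_steps) / (max_iters - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The shape matters more than the exact constants: warmup avoids early instability with AdamW, and the cosine tail lets the loss settle.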
- Dataset: `roneneldan/TinyStories` from Hugging Face.
- Tokenizer: `tiktoken` GPT-2 encoder.
- Storage: tokens are written to `train.bin` and `validation.bin` as `uint16` via `numpy.memmap` for fast slice reads.
- Batching: random contiguous blocks of length `block_size` with next-token targets.
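The batching scheme above is simple to sketch. A NumPy version of the idea, assuming the `.bin` layout described; in the script the arrays would become torch tensors on the GPU:

```python
import numpy as np

def get_batch(path, batch_size=32, block_size=128, rng=None):
    """Sample random contiguous blocks of length block_size from a
    memory-mapped uint16 token file, with next-token targets."""
    rng = rng or np.random.default_rng()
    data = np.memmap(path, dtype=np.uint16, mode="r")
    # random start offsets; leave room for the shifted target window
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix]).astype(np.int64)
    return x, y
```

Because `memmap` only pages in the slices you touch, this stays fast even when the token file is much larger than RAM.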
If you want a different dataset, replace the `load_dataset` call and the `process` function. Everything downstream stays the same.
Just run the script. You'll see logs like:

```
Epoch 1000: train loss 2.45, val loss 2.60
Saved best model to best_model_params.pt
```

Loss is cross-entropy; if you want perplexity, compute `ppx = exp(val_loss)`.
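For example, taking the validation loss from the log line above:

```python
import math

val_loss = 2.60                 # cross-entropy in nats, from the log above
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")      # 13.46
```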
Tips:

- If you're on a smaller GPU, lower `batch_size` and raise `gradient_accumulation_steps` to keep the effective batch size roughly constant.
- If training is unstable, try `learning_rate = 5e-5` or raise `warmup_steps`.
- If you hit OOM, reduce `block_size` or the model width `n_embd`.
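To make the "effective batch size" trade-off concrete, here is the arithmetic with the default knobs:

```python
batch_size = 32
gradient_accumulation_steps = 32
block_size = 128

# the optimizer step sees the gradients of this many sequences / tokens
sequences_per_step = batch_size * gradient_accumulation_steps  # 1024
tokens_per_step = sequences_per_step * block_size              # 131072

# Halving batch_size to 16 while doubling accumulation to 64 keeps both
# numbers, and hence the optimization dynamics, roughly unchanged.
print(sequences_per_step, tokens_per_step)
```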
After training finishes, the script loads `best_model_params.pt` and runs a couple of prompts:

```python
sentence = "Once upon a time there was a pumpkin."
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(0)
y = model.generate(context, max_new_tokens=200, temperature=1.0, top_k=50)
print(enc.decode(y.squeeze().tolist()))
```

You can tweak:

- `temperature` for creativity
- `top_k` to limit sampling to the k most likely tokens
- `max_new_tokens` for output length
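Per generated token, temperature and top-k boil down to one small step. A NumPy sketch of the idea (the script itself does this with torch tensors inside `model.generate`):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=50, rng=None):
    """One decoding step on a 1-D logits vector: scale by temperature,
    optionally keep only the top-k candidates, then sample from softmax."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None and top_k < logits.size:
        kth = np.sort(logits)[-top_k]            # smallest surviving logit
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())        # stable softmax
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

Low temperature sharpens the distribution toward the argmax; `top_k` hard-caps how far down the tail sampling can reach.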
- Embeddings: token `wte`, position `wpe`, then dropout.
- Block × `n_layer`: LayerNorm, causal self-attention, MLP with GELU, residuals.
- Attention: fused QKV projection, PyTorch SDPA when available, manual masked matmul fallback otherwise.
- Weight tying: the LM head shares weights with the token embedding.
- Param count: with the default config it's about 30M parameters.
This is intentionally close to the GPT-2 paper recipe so you can map ideas back and forth.
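The ~30M figure is easy to check by hand from the default config (a back-of-the-envelope count; biases and LayerNorm parameters are omitted for brevity):

```python
vocab_size, block_size, n_layer, n_embd = 50257, 128, 6, 384

wte = vocab_size * n_embd        # token embedding, shared with the LM head
wpe = block_size * n_embd        # position embedding
attn = 4 * n_embd * n_embd       # fused QKV (3x n_embd) + output projection
mlp = 2 * n_embd * (4 * n_embd)  # up- and down-projection, 4x expansion
per_block = attn + mlp

total = wte + wpe + n_layer * per_block
print(f"{total / 1e6:.1f}M parameters")  # 30.0M parameters
```

Note how the tied embedding alone accounts for roughly two thirds of the total at this width: small models are embedding-dominated.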
With the defaults, the model learns the TinyStories distribution and produces short, simple, grammatical stories. It won't be factual. It won't follow long instructions. That's expected: the context is 128 tokens and the model is small. Raise `block_size`, `n_layer`, and `n_embd` if you want more headroom, then budget your GPU accordingly.
- CUDA OOM: lower `batch_size`, `block_size`, or `n_embd`. Increase `gradient_accumulation_steps` to keep the effective batch size similar.
- Loss spikes: try a smaller `learning_rate`, a longer warmup, or turn off dropout for a bit.
- Slow data loading: the `memmap` approach avoids holding everything in RAM. Keep it; it's faster than you think.
- Add CLI args, config files, and a proper `requirements.txt`
- Gradient checkpointing for deeper models
- Packing multiple sequences per block for higher token efficiency
- Optional RoPE and RMSNorm variants
- WandB or TensorBoard logging
- Unit tests for sampling and masking
- Vizuara AI Labs Small Language Model scratch workshop for the inspiration and outline.
- A few utilities and batching tricks are adapted from nanoGPT-style training.
MIT License.
If this project helped you learn or ship something, a star or mention is appreciated. If you publish results, feel free to cite the repo and TinyStories dataset.