A compact, GPT-style decoder-only Transformer trained on TinyStories. The goal is simple: build and understand an end-to-end small language model that can generate short, coherent stories while staying small enough to run on a single GPU or Colab.
It's written in plain PyTorch with no training frameworks, so you can see every moving part.
- Architecture: GPT-2-style blocks with learned token and position embeddings, multi-head causal self-attention, GELU MLP, residual connections, tied output projection.
- Tiny but capable: default config is ~30M params, 6 layers, 6 heads, 384 hidden size, context length 128.
- Data pipeline: Hugging Face `roneneldan/TinyStories`, tokenized with `tiktoken`, memory-mapped into `train.bin` and `validation.bin` for fast batched reads.
- Trainer you can actually read: AdamW, linear warmup, cosine decay, gradient accumulation, mixed precision with `torch.amp`, gradient clipping, periodic eval, and best-checkpoint saving.
- Inference with temperature and optional top-k sampling.
```sh
# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # pick the wheel that fits your setup
pip install datasets tiktoken numpy tqdm matplotlib
```

Colab works too; the script already includes minimal Colab-friendly bits.
The repository currently uses a single script:

```
Small Language model.py
```

Just run it as a notebook in Colab, or as a script locally. If you run it locally and Python chokes on lines like `!pip install ...`, remove those `!` installs and make sure you installed the packages in the step above.
On first run it will:

- Download TinyStories via `datasets`
- Tokenize with `tiktoken`
- Write `train.bin` and `validation.bin`
- Start training and periodically compute validation loss
- Save the best model to `best_model_params.pt`
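The tokenize-and-dump step above amounts to encoding every story, appending an end-of-text separator, and writing the raw ids to disk as `uint16`. A minimal sketch of the idea — the `write_split` helper and its signature are my own illustration, not the script's API:

```python
import numpy as np

def write_split(texts, path, encode, eot_token=50256):
    """Tokenize each story, append the GPT-2 end-of-text id to separate
    stories, and dump the whole token stream to disk as uint16
    (all GPT-2 ids fit in 16 bits)."""
    ids = []
    for text in texts:
        ids.extend(encode(text))
        ids.append(eot_token)
    np.array(ids, dtype=np.uint16).tofile(path)

# In the script this would be driven by the real tokenizer and dataset, e.g.:
#   enc = tiktoken.get_encoding("gpt2")
#   ds = load_dataset("roneneldan/TinyStories")
#   write_split(ds["train"]["text"], "train.bin", enc.encode_ordinary)
```

Writing a flat binary file means later reads can slice arbitrary windows without ever loading the whole dataset into RAM.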
Default model and training knobs live inside the script. Change them in place if you want different sizes.
Model:

```python
config = GPTConfig(
    vocab_size=50257,  # tiktoken GPT-2 vocab
    block_size=128,    # context length
    n_layer=6,         # transformer blocks
    n_head=6,          # attention heads
    n_embd=384,        # embedding width
    dropout=0.1,
    bias=True,
)
```

Training:

```python
learning_rate = 1e-4
max_iters = 20_000
warmup_steps = 1_000
min_lr = 5e-5  # suggested small floor, adjust as you like
eval_iters = 500
batch_size = 32
block_size = 128
gradient_accumulation_steps = 32
```

The trainer uses:
- AdamW with weight decay
- Linear warmup, then cosine decay to `min_lr`
- Autocast mixed precision on CUDA
- Gradient clipping at 0.5
- Best checkpointing on the lowest validation loss
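The warmup-then-cosine schedule can be written in a few lines. A sketch using the knobs above — the script's own implementation may differ in details:

```python
import math

def get_lr(it, learning_rate=1e-4, min_lr=5e-5,
           warmup_steps=1_000, max_iters=20_000):
    """Linear warmup to learning_rate, then cosine decay down to min_lr."""
    if it < warmup_steps:
        # ramp linearly from ~0 up to the peak learning rate
        return learning_rate * (it + 1) / warmup_steps
    if it > max_iters:
        return min_lr
    # cosine decay between warmup_steps and max_iters
    progress = (it - warmup_steps) / (max_iters - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The shape matters more than the exact constants: warmup avoids early instability with AdamW, and the cosine tail lets the loss settle.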
- Dataset: `roneneldan/TinyStories` from Hugging Face.
- Tokenizer: `tiktoken` GPT-2 encoder.
- Storage: tokens are written to `train.bin` and `validation.bin` as `uint16` via `numpy.memmap` for fast slice reads.
- Batching: random contiguous blocks of length `block_size` with next-token targets.
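The batching scheme above is simple to sketch. A NumPy version of the idea, assuming the `.bin` layout described; in the script the arrays would become torch tensors on the GPU:

```python
import numpy as np

def get_batch(path, batch_size=32, block_size=128, rng=None):
    """Sample random contiguous blocks of length block_size from a
    memory-mapped uint16 token file, with next-token targets."""
    rng = rng or np.random.default_rng()
    data = np.memmap(path, dtype=np.uint16, mode="r")
    # random start offsets; leave room for the shifted target window
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix]).astype(np.int64)
    return x, y
```

Because `memmap` only pages in the slices you touch, this stays fast even when the token file is much larger than RAM.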
If you want a different dataset, replace the `load_dataset` call and the `process` function. Everything downstream stays the same.
Just run the script. You'll see logs like:

```
Epoch 1000: train loss 2.45, val loss 2.60
Saved best model to best_model_params.pt
```

Loss is cross-entropy; if you want perplexity, compute `ppx = exp(val_loss)`.
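For example, taking the validation loss from the log line above:

```python
import math

val_loss = 2.60                 # cross-entropy in nats, from the log above
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")      # 13.46
```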
Tips:

- If you're on a smaller GPU, lower `batch_size` and raise `gradient_accumulation_steps` to keep the effective batch size roughly constant.
- If training is unstable, try `learning_rate = 5e-5` or raise `warmup_steps`.
- If you hit OOM, reduce `block_size` or the model width `n_embd`.
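To make the "effective batch size" trade-off concrete, here is the arithmetic with the default knobs:

```python
batch_size = 32
gradient_accumulation_steps = 32
block_size = 128

# the optimizer step sees the gradients of this many sequences / tokens
sequences_per_step = batch_size * gradient_accumulation_steps  # 1024
tokens_per_step = sequences_per_step * block_size              # 131072

# Halving batch_size to 16 while doubling accumulation to 64 keeps both
# numbers, and hence the optimization dynamics, roughly unchanged.
print(sequences_per_step, tokens_per_step)
```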
After training finishes, the script loads `best_model_params.pt` and runs a couple of prompts:

```python
sentence = "Once upon a time there was a pumpkin."
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(0)
y = model.generate(context, max_new_tokens=200, temperature=1.0, top_k=50)
print(enc.decode(y.squeeze().tolist()))
```

You can tweak:

- `temperature` for creativity
- `top_k` to limit sampling to the k most likely tokens
- `max_new_tokens` for output length
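Per generated token, temperature and top-k boil down to one small step. A NumPy sketch of the idea (the script itself does this with torch tensors inside `model.generate`):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=50, rng=None):
    """One decoding step on a 1-D logits vector: scale by temperature,
    optionally keep only the top-k candidates, then sample from softmax."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None and top_k < logits.size:
        kth = np.sort(logits)[-top_k]            # smallest surviving logit
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())        # stable softmax
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

Low temperature sharpens the distribution toward the argmax; `top_k` hard-caps how far down the tail sampling can reach.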
- Embeddings: token `wte`, position `wpe`, then dropout.
- Block × `n_layer`: LayerNorm, causal self-attention, MLP with GELU, residuals.
- Attention: fused QKV projection, PyTorch SDPA when available, manual masked matmul fallback otherwise.
- Weight tying: the LM head shares weights with the token embedding.
- Param count: with the default config it's about 30M parameters.
This is intentionally close to the GPT-2 paper recipe so you can map ideas back and forth.
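The ~30M figure is easy to check by hand from the default config (a back-of-the-envelope count; biases and LayerNorm parameters are omitted for brevity):

```python
vocab_size, block_size, n_layer, n_embd = 50257, 128, 6, 384

wte = vocab_size * n_embd        # token embedding, shared with the LM head
wpe = block_size * n_embd        # position embedding
attn = 4 * n_embd * n_embd       # fused QKV (3x n_embd) + output projection
mlp = 2 * n_embd * (4 * n_embd)  # up- and down-projection, 4x expansion
per_block = attn + mlp

total = wte + wpe + n_layer * per_block
print(f"{total / 1e6:.1f}M parameters")  # 30.0M parameters
```

Note how the tied embedding alone accounts for roughly two thirds of the total at this width: small models are embedding-dominated.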
With the defaults, the model learns the TinyStories distribution and produces short, simple, grammatical stories. It won't be factual. It won't follow long instructions. That's expected: the context is 128 tokens and the model is small. Raise `block_size`, `n_layer`, and `n_embd` if you want more headroom, then budget your GPU accordingly.
- CUDA OOM: lower `batch_size`, `block_size`, or `n_embd`. Increase `gradient_accumulation_steps` to keep the effective batch size similar.
- Loss spikes: try a smaller `learning_rate`, a longer warmup, or turn off dropout for a bit.
- Slow data loading: the `memmap` approach avoids holding everything in RAM. Keep it; it's faster than you think.
- Add CLI args, config files, and a proper `requirements.txt`
- Gradient checkpointing for deeper models
- Packing multiple sequences per block for higher token efficiency
- Optional RoPE and RMSNorm variants
- WandB or TensorBoard logging
- Unit tests for sampling and masking
- Vizuara AI Labs Small Language Model scratch workshop for the inspiration and outline.
- A few utilities and batching tricks are adapted from nanoGPT-style training.
MIT License.
If this project helped you learn or ship something, a star or mention is appreciated. If you publish results, feel free to cite the repo and TinyStories dataset.