Small Language Model from Scratch

A compact, GPT-style decoder-only Transformer trained on TinyStories. The goal is simple: build and understand an end-to-end small language model that can generate short, coherent stories while staying small enough to run on a single GPU or in Colab.

It’s written in plain PyTorch with no training frameworks, so you can see every moving part.

Highlights

  • Architecture: GPT-2-style blocks with learned token and position embeddings, multi-head causal self-attention, a GELU MLP, residual connections, and a tied output projection.
  • Tiny but capable: the default config is ~30M params, 6 layers, 6 heads, 384 hidden size, context length 128.
  • Data pipeline: Hugging Face roneneldan/TinyStories, tokenized with tiktoken and memory-mapped into train.bin and validation.bin for fast batched reads.
  • A trainer you can actually read: AdamW, linear warmup, cosine decay, gradient accumulation, mixed precision with torch.amp, gradient clipping, periodic eval, and best-checkpoint saving.
  • Inference with temperature and optional top-k sampling.

Quickstart

1) Environment

# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # pick the wheel that fits your setup
pip install datasets tiktoken numpy tqdm matplotlib

Colab works too; the script already includes minimal Colab-friendly bits.

2) Run the script

The repository currently uses a single script:

Small Language model.py

Run it as a notebook in Colab, or as a script locally. If you run locally and Python chokes on lines like !pip install ..., remove those ! installs and make sure you installed the packages in the step above.

On first run it will:

  • Download TinyStories via datasets
  • Tokenize with tiktoken
  • Write train.bin and validation.bin
  • Start training and periodically compute validation loss
  • Save the best model to best_model_params.pt

Configuration

Default model and training knobs live inside the script. Edit them in place if you want different sizes.

Model

config = GPTConfig(
    vocab_size=50257,  # tiktoken GPT-2 vocab
    block_size=128,    # context length
    n_layer=6,         # transformer blocks
    n_head=6,          # attention heads
    n_embd=384,        # embedding width
    dropout=0.1,
    bias=True
)

Training

learning_rate = 1e-4
max_iters = 20_000
warmup_steps = 1_000
min_lr = 5e-5            # suggested small floor, adjust as you like
eval_iters = 500
batch_size = 32
block_size = 128
gradient_accumulation_steps = 32
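With these defaults, the effective batch per optimizer step works out as follows (a quick sanity check, not code from the script):

```python
batch_size = 32
block_size = 128
gradient_accumulation_steps = 32

# sequences seen per optimizer step
sequences_per_step = batch_size * gradient_accumulation_steps  # 1024
# tokens seen per optimizer step
tokens_per_step = sequences_per_step * block_size              # 131072
```

So each update is driven by roughly 131k tokens, which is why lowering batch_size while raising gradient_accumulation_steps keeps training dynamics similar.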

The trainer uses:

  • AdamW with weight decay
  • Linear warmup, then cosine decay to min_lr
  • Autocast mixed precision on CUDA
  • Grad clip at 0.5
  • Best checkpointing on the lowest validation loss
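The warmup-then-cosine schedule can be sketched as below. Variable names match the config above; the exact function in the script may differ slightly, so treat this as an illustration of the shape, not the script's code.

```python
import math

learning_rate = 1e-4
min_lr = 5e-5
warmup_steps = 1_000
max_iters = 20_000

def get_lr(it):
    # linear warmup from ~0 up to learning_rate
    if it < warmup_steps:
        return learning_rate * (it + 1) / warmup_steps
    # past the schedule, hold at the floor
    if it > max_iters:
        return min_lr
    # cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_steps) / (max_iters - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```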

Dataset and Tokenization

  • Dataset: roneneldan/TinyStories from Hugging Face.
  • Tokenizer: the tiktoken GPT-2 encoder.
  • Storage: tokens are written to train.bin and validation.bin as uint16 via numpy.memmap for fast slice reads.
  • Batching: random contiguous blocks of length block_size with next-token targets.

If you want a different dataset, replace the load_dataset call and the process function. Everything downstream stays the same.
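The memmap batching described above can be sketched like this (a nanoGPT-style pattern; the exact names in the script may differ):

```python
import numpy as np
import torch

block_size, batch_size = 128, 32

def get_batch(split):
    # reopen the memmap each call so we never hold the whole file in RAM
    data = np.memmap(f"{split}.bin", dtype=np.uint16, mode="r")
    # random start offsets for contiguous blocks
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    # targets are the same blocks shifted right by one token
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```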


Training

Just run the script. You’ll see logs like:

Epoch 1000: train loss 2.45, val loss 2.60
Saved best model to best_model_params.pt

Loss is cross-entropy. If you want perplexity, compute ppx = exp(val_loss).
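For example, using the val loss from the sample log line above:

```python
import math

val_loss = 2.60          # cross-entropy from the log line above
ppx = math.exp(val_loss)  # perplexity ≈ 13.5
```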

Tips:

  • If you’re on a smaller GPU, lower batch_size and raise gradient_accumulation_steps to keep the effective batch size roughly constant.
  • If training is unstable, try learning_rate = 5e-5 or raise warmup_steps.
  • If you hit OOM, reduce block_size or model width n_embd.

Inference

After training finishes, the script loads best_model_params.pt and runs a couple of prompts:

sentence = "Once upon a time there was a pumpkin."
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(0)
y = model.generate(context, max_new_tokens=200, temperature=1.0, top_k=50)
print(enc.decode(y.squeeze().tolist()))

You can tweak:

  • temperature for creativity
  • top_k to limit to the k most likely tokens
  • max_new_tokens for output length
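To see how temperature and top-k interact, here is a minimal sketch of one sampling step (illustrative names, not the script's exact generate implementation):

```python
import torch

def sample_next(logits, temperature=1.0, top_k=None):
    # logits: (vocab_size,) scores for the next token at the last position
    logits = logits / temperature  # <1.0 sharpens, >1.0 flattens the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        # mask everything below the k-th largest logit
        logits[logits < v[-1]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```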

Model Internals, at a glance

  • Embeddings: token (wte) and position (wpe), then dropout.
  • Block × n_layer: LayerNorm, causal self-attention, MLP with GELU, and residual connections.
  • Attention: fused QKV projection, PyTorch SDPA when available, with a manual masked-matmul fallback otherwise.
  • Weight tying: the LM head shares weights with the token embedding.
  • Param count: with the default config it’s about 30M parameters.

This is intentionally close to the GPT-2 paper recipe so you can map ideas back and forth.
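A back-of-the-envelope parameter count for the default config supports the ~30M figure (biases and LayerNorm weights are ignored, since they contribute well under 1%):

```python
vocab_size, block_size = 50257, 128
n_layer, n_embd = 6, 384

emb = vocab_size * n_embd + block_size * n_embd  # token + position embeddings
attn = 4 * n_embd ** 2                           # fused QKV (3·d²) + output projection (d²)
mlp = 8 * n_embd ** 2                            # up-projection (d·4d) + down-projection (4d·d)
total = emb + n_layer * (attn + mlp)             # tied LM head adds no extra matrix

print(f"{total / 1e6:.1f}M")  # → 30.0M
```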


Results you should expect

With the defaults, the model learns the TinyStories distribution and produces short, simple, grammatical stories. It won’t be factual, and it won’t follow long instructions. That’s expected: the context is 128 tokens and the model is small. Raise block_size, n_layer, and n_embd if you want more headroom, and budget your GPU accordingly.


Troubleshooting

  • CUDA OOM: lower batch_size, block_size, or n_embd. Increase gradient_accumulation_steps to keep the effective batch size similar.
  • Loss spikes: try a smaller learning_rate, a longer warmup, or turn off dropout for a while.
  • Slow data loading: the memmap approach avoids holding everything in RAM. Keep it; it’s faster than you might think.

Roadmap

  • Add CLI args, config files, and proper requirements.txt
  • Gradient checkpointing for deeper models
  • Packing multiple sequences per block for higher token efficiency
  • Optional RoPE and RMSNorm variants
  • WandB or TensorBoard logging
  • Unit tests for sampling and masking

Acknowledgments

  • Vizuara AI Labs Small Language Model scratch workshop for the inspiration and outline.
  • A few utilities and batching tricks are adapted from nanoGPT style training.

License

MIT License.


Citation

If this project helped you learn or ship something, a star or mention is appreciated. If you publish results, feel free to cite the repo and TinyStories dataset.

About

Implementation of the TinyStories paper; my first exposure to small language models.
