Building a Small Language Model from Scratch in Python
Part 1: Foundations — From Text to Tokens
- Introduction: What Is a Small Language Model (SLM)? → Explain architecture, tokenization, and what makes “small” models special (a parameter-count sketch follows this outline).
- Collecting and Cleaning Your Dataset → Use open text sources (TinyStories, Gutenberg, wikitext). Show cleaning and normalization in Python (see the cleaning sketch below).
- Building a Simple Tokenizer from Scratch → Implement a Byte-Pair Encoding (BPE) or WordPiece tokenizer in Python, step by step (a minimal BPE sketch follows below).
- Converting Text into Numerical Data → Create a vocabulary, encode text into token IDs, and build a dataloader with PyTorch (see the dataloader sketch at the end of this outline).
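
To make “small” concrete before diving in, here is a rough back-of-the-envelope sketch that estimates the parameter count of a decoder-only transformer from a few architectural choices. The configuration values are assumptions for illustration, not a prescription for the model built later in the series.

```python
def transformer_param_count(vocab_size, d_model, n_layers, d_ff=None):
    """Rough parameter count for a decoder-only transformer.

    Counts the token embedding table, per-layer attention projections
    (Q, K, V, output) and feed-forward weights; layer norms and biases
    are small and ignored in this estimate.
    """
    d_ff = d_ff or 4 * d_model
    embed = vocab_size * d_model       # token embedding table
    attn = 4 * d_model * d_model       # Wq, Wk, Wv, Wo per layer
    ffn = 2 * d_model * d_ff           # up- and down-projection per layer
    return embed + n_layers * (attn + ffn)

# An illustrative "small" configuration (assumed values, not a recommendation):
small = transformer_param_count(vocab_size=8_000, d_model=256, n_layers=6)
print(f"small model: ~{small / 1e6:.1f}M parameters")  # a few million parameters
```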
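
For the dataset step, a minimal cleaning and normalization sketch is below. The sample string and the specific rules (Unicode NFKC normalization, quote straightening, whitespace collapsing) are illustrative choices, not a fixed recipe; in practice the input would be text you downloaded from one of the open sources above.

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize Unicode, straighten curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)                   # unify Unicode forms
    text = text.replace("\u201c", '"').replace("\u201d", '"')   # “ ” -> "
    text = text.replace("\u2018", "'").replace("\u2019", "'")   # ‘ ’ -> '
    text = re.sub(r"[ \t]+", " ", text)                         # collapse spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                      # keep at most one blank line
    return text.strip()

# In practice `raw` would come from a downloaded file, e.g.
# raw = pathlib.Path("book.txt").read_text(encoding="utf-8")
raw = "  “Hello,”   she said…\n\n\n\nIt was a   quiet night.  "
print(clean_text(raw))
```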
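
For the tokenizer step, here is a compact sketch of the core BPE training loop: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. It works on whitespace-split words represented as tuples of symbols and is meant to illustrate the algorithm, not to stand in for a production tokenizer.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in word_freqs.items():
        symbols, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(word[i] + word[i + 1])
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        merged[tuple(symbols)] = freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of sentences."""
    word_freqs = Counter(tuple(w) + ("</w>",) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

merges = train_bpe(["the cat sat on the mat", "the cat ate"], num_merges=10)
print(merges)  # learned merge rules, e.g. [('a', 't'), ...]
```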
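
For the final step, a minimal sketch of vocabulary building, encoding, and a PyTorch dataloader is below. It uses a character-level vocabulary for brevity (the BPE tokenizer above would plug in the same way), and names such as `block_size` are illustrative.

```python
import torch
from torch.utils.data import Dataset, DataLoader

text = "the cat sat on the mat"  # stand-in for your cleaned corpus

# Build a vocabulary: character-level here; a BPE vocab plugs in the same way.
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

def encode(s):
    """Map text to a list of integer token IDs."""
    return [vocab[ch] for ch in s]

class TextDataset(Dataset):
    """Yields (input, target) windows where the target is the input shifted by one."""
    def __init__(self, ids, block_size):
        self.ids = torch.tensor(ids, dtype=torch.long)
        self.block_size = block_size

    def __len__(self):
        return len(self.ids) - self.block_size

    def __getitem__(self, idx):
        chunk = self.ids[idx : idx + self.block_size + 1]
        return chunk[:-1], chunk[1:]  # x predicts the next token y

dataset = TextDataset(encode(text), block_size=8)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```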