BenardKemp/Building-a-Small-Language-Model

Building-a-Small-Language-Model

Building a Small Language Model from Scratch in Python

Part 1: Foundations — From Text to Tokens

  1. Introduction: What Is a Small Language Model (SLM)? → Explain architecture, tokenization, and what makes “small” models special.

  2. Collecting and Cleaning Your Dataset → Use open text sources (TinyStories, Project Gutenberg, WikiText). Show cleaning and normalization in Python.

  3. Building a Simple Tokenizer from Scratch → Implement a Byte-Pair Encoding (BPE) or WordPiece tokenizer in Python, step by step.

  4. Converting Text into Numerical Data → Create a vocabulary, encode text into token IDs, and build a dataloader with PyTorch.
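
As a rough preview of steps 2 to 4, here is a minimal sketch of the pipeline in plain Python: normalize the text, learn a few BPE merges, and encode text into token IDs. It is not the repository's implementation. All function names (`clean_text`, `train_bpe`, `tokenize`, `build_vocab`, `encode`), the `</w>` end-of-word marker, the `<unk>` token, and the toy corpus are assumptions made for illustration. The PyTorch dataloader from step 4 is left out to keep the example dependency-free.

```python
import re
import unicodedata
from collections import Counter


def clean_text(text):
    """Normalize Unicode to NFKC, lowercase, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-separated words."""
    # Each distinct word becomes a tuple of characters plus an end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge to every word before the next round.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(merge_pair(list(symbols), best))] += freq
        words = new_words
    return merges


def tokenize(word, merges):
    """Split one word into subword tokens by replaying the merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def build_vocab(text, merges):
    """Map tokens to IDs: specials first, then characters, then merged symbols."""
    chars = sorted(set(text.replace(" ", "")))
    tokens = ["<unk>", "</w>"] + chars + [a + b for a, b in merges]
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}


def encode(text, merges, vocab):
    """Clean the text and convert it to token IDs; unseen tokens map to <unk>."""
    ids = []
    for word in clean_text(text).split():
        ids.extend(vocab.get(tok, vocab["<unk>"]) for tok in tokenize(word, merges))
    return ids


# Toy corpus, purely illustrative.
corpus = clean_text("Low  lower LOWEST\nlow low newer newest")
merges = train_bpe(corpus, num_merges=10)
vocab = build_vocab(corpus, merges)
ids = encode("lower low", merges, vocab)
print(merges)
print(ids)
```

Replaying the merges in training order at encoding time is the simplest approach, though it is slow on large vocabularies. The resulting list of IDs is what a PyTorch `Dataset` would later slice into fixed-length training windows.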