
LLMs Unplugged

Understanding how AI language models work starts with building one yourself. This teaching project shows you how to create N-gram language models from scratch: either by hand with pen and paper in about 20 minutes, or with automated tools that turn any text corpus into dice-powered text-generation booklets.

The core insight is simple: language models predict what comes next by counting word patterns. A bigram model asks "after seeing word X, what usually comes next?" By building this yourself rather than treating it as a black box, you develop intuition for how larger models work.
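
For a concrete sense of what "counting word patterns" means, here is a minimal Rust sketch (illustrative only, not the project's actual cli/ code) that counts bigrams in a toy corpus and then answers the "what usually comes next?" question for one word:

use std::collections::HashMap;

fn main() {
    // A toy "training corpus": the model only ever sees adjacent word pairs.
    let text = "the cat sat on the mat the cat slept";
    let words: Vec<&str> = text.split_whitespace().collect();

    // Count how often each word follows each other word (bigram counts).
    let mut counts: HashMap<(&str, &str), u32> = HashMap::new();
    for pair in words.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }

    // "After seeing 'the', what usually comes next?"
    let mut followers: Vec<(&str, u32)> = counts
        .iter()
        .filter(|((prev, _), _)| *prev == "the")
        .map(|((_, next), &count)| (*next, count))
        .collect();
    followers.sort_by(|a, b| b.1.cmp(&a.1));
    println!("after 'the': {:?}", followers); // [("cat", 2), ("mat", 1)]
}

The pen-and-paper grid in the lessons records exactly these counts, just with a pencil instead of a HashMap.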

Website: www.llmsunplugged.org

This is a Cybernetic Studio artefact by Ben Swift as part of the Human-Scale AI project.

What's in this repository

This repository contains both teaching materials and software tools. The teaching materials (lesson plans and worksheets in the handouts/ directory) can be used standalone, without any software installation. The software tools (the llms_unplugged CLI tool and other helper scripts in the cli/ directory) are only needed if you want to create your own pre-trained N-gram booklets from custom text corpora.

The website/ directory contains the source for the project website at www.llmsunplugged.org.

Which path should I take?

This project offers several entry points depending on your goals:

Want to understand the fundamentals in 20 minutes? Use the pen and paper approach with the grid template and step-by-step instructions (in lessons 01 and 02). No software required.

Teaching a class or workshop? Explore the teaching lessons and instructor notes for structured lesson plans and materials.

Want to create your own N-gram booklet? You have two options:

  • Use a pre-built release: Download the binary for your platform from the releases page, unpack it, and run the llms_unplugged binary on a .txt file containing your training data (see data/frankenstein.txt for an example)
  • Build from source: Use the Rust toolchain to compile and customize the tool yourself

Creating your own N-gram booklets

Process any text corpus into a typeset N-gram model booklet for dice-based text generation.

You'll need:

  • The llms_unplugged binary, either from a pre-built release or compiled from source with the Rust toolchain
  • A plain-text corpus with YAML frontmatter (see the Input file format section below; data/frankenstein.txt is included as an example)

Quickstart

If you've downloaded the release tarball:

# Unpack the release archive
tar -xzf llms_unplugged-v1.0.0.tar.gz
cd llms_unplugged

# Generate a booklet (JSON + PDF) from the included sample text
# (use the binary for your platform from the bin/ directory)
./bin/llms_unplugged-linux-x86_64 pdf --target frankenstein-2-1 --input data/frankenstein.txt --out-dir out

The resulting PDF contains your N-gram model formatted for dice-roll-based text generation. For all options see --help.

Input file format

Your input text file must include YAML frontmatter with these keys:

---
title: "Title of the Text"
author: "Author Name"
url: "https://source.url"
---
Your text content here...

The tokenizer lowercases text and removes punctuation (except apostrophes in contractions) to keep the model small.
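
As a rough illustration of that behaviour, here is a sketch of such a normalizer in Rust (the real implementation lives in cli/src/text.rs and may differ in detail, e.g. in how it handles quotes and Unicode):

// Illustrative sketch only; see cli/src/text.rs for the actual normalizer.
fn normalize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|word| {
            word.to_lowercase()
                .chars()
                // Keep letters, digits and apostrophes; drop everything else.
                .filter(|c| c.is_alphanumeric() || *c == '\'')
                .collect::<String>()
        })
        // Trim apostrophes at the edges (stray quote marks), but keep the
        // one inside contractions like "don't".
        .map(|w| w.trim_matches('\'').to_string())
        .filter(|w| !w.is_empty())
        .collect()
}

fn main() {
    let tokens = normalize("\"Don't stop,\" she said -- twice!");
    assert_eq!(tokens, vec!["don't", "stop", "she", "said", "twice"]);
    println!("{:?}", tokens);
}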

Subcommands and key options

  • build - Produce JSON only.
    • --n <N>: N-gram size (default 2)
    • --books <N>: Split large models into multiple JSON files
    • --raw: Emit raw counts (no dice scaling)
  • pdf - Produce PDFs (and JSON if needed).
    • --target name-n-books: Matches Makefile targets (e.g. frankenstein-3-2)
    • --paper-size, --columns, --template book.typ, --subtitle
    • --pdf-only / --json-only for incremental builds
  • tsv - Export a bigram TSV matrix for spreadsheets (n=2 only).

By default, counts are scaled for d10 dice using 10^k-1 scaling (e.g., 0-9, 0-99, 0-999), making it easy to add more dice for larger ranges.
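
As a rough picture of what that scaling does (an illustrative sketch only; the CLI's exact rounding and range logic may differ), each follower word gets a slice of the 0 to 10^k-1 roll range proportional to its raw count, so k ten-sided dice read as digits pick the next word with roughly the right probability:

// Illustrative sketch of d10-range scaling; not the CLI's exact algorithm.
fn dice_ranges(followers: &[(&str, u32)], k: u32) -> Vec<(String, u32, u32)> {
    let total: u32 = followers.iter().map(|(_, c)| c).sum();
    let scale = 10u32.pow(k); // e.g. k = 2 -> rolls 00..=99
    let mut ranges = Vec::new();
    let mut start = 0u32;
    for (word, count) in followers {
        // Each word's slice of the roll range is proportional to its count.
        let width = ((*count as u64 * scale as u64) / total as u64) as u32;
        let end = (start + width.max(1)).min(scale) - 1;
        ranges.push((word.to_string(), start, end));
        start = end + 1;
    }
    // Hand any leftover rolls (lost to rounding down) to the last entry.
    if let Some(last) = ranges.last_mut() {
        last.2 = scale - 1;
    }
    ranges
}

fn main() {
    // "After 'the'": cat seen twice, mat once; two d10s give rolls 00..=99.
    for (word, lo, hi) in dice_ranges(&[("cat", 2), ("mat", 1)], 2) {
        println!("{word}: roll {lo:02}-{hi:02}");
    }
    // Prints roughly a 2:1 split: cat 00-65, mat 66-99.
}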

How the pipeline works

text file → Rust CLI → model.json → Typst → PDF booklet

The Rust tool (cli/src/main.rs, cli/src/lib.rs) processes your text through a unified normalizer (cli/src/text.rs) to generate N-gram statistics. The Typst template (cli/book.typ) reads model.json and typesets it into a printable booklet with guide words, proper pagination, and dice-roll ranges.

For large trigram models, use the --books (-b) flag to split the model across multiple books.

Project structure

  • cli/ - Rust CLI tool and booklet generation pipeline
    • src/ - Rust source code for N-gram processing and CLI
    • book.typ - Main booklet template
  • data/ - Input text corpora (*.txt files with YAML frontmatter)
  • handouts/ - Teaching materials (lessons, worksheets, runsheets)
  • website/ - Project website source (Vitepress)
  • backlog/ - Task management

Testing

# Rust CLI tests (from cli/ directory)
cd cli && cargo test

# Website tests (from website/ directory)
cd website && npm run test

Tests cover capitalization rules, tokenization edge cases, and full end-to-end integration. Test output must be pristine, with zero failures.

Citation

If you use these teaching materials, please cite them:

@misc{swift2025llmsunplugged,
  author = {Swift, Ben},
  title = {LLMs Unplugged: Understand how AI language models work by building one yourself.},
  year = {2025},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17403824},
  url = {https://doi.org/10.5281/zenodo.17403824}
}

Author

(c) 2025 Ben Swift

This work is a project of the Cybernetic Studio at the ANU School of Cybernetics.

License

Source code for this project is licensed under the MIT License. See the LICENSE file for details.

Documentation (in handouts/) and any typeset "N-gram model booklets" are licensed under a CC BY-NC-SA 4.0 license. See handouts/LICENSE for the full license text.

The source texts used as input to the language models retain the licenses described in their original sources.
