This repo follows Stanford CS336 (Spring 2025) Assignment 1: Basics as a starting point, and is evolving into a personal “LLM from scratch” implementation. The original assignment materials and structure are preserved where helpful, with custom modifications added over time.
We manage our environments with uv to ensure reproducibility, portability, and ease of use.
Install uv here (recommended), or install it with pip install uv or brew install uv.
We recommend reading a bit about managing projects in uv here (you will not regret it!).
You can now run any code in the repo using
uv run <python_file_path>
and the environment will be automatically solved and activated when necessary.
Run the unit tests with
uv run pytest
Initially, all tests should fail with NotImplementedError.
To connect your implementation to the tests, complete the functions in ./tests/adapters.py.
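The adapter idea can be sketched roughly as follows. This is a minimal illustration, not the contents of the real ./tests/adapters.py: the names run_softmax, run_softmax_stub, and my_softmax are hypothetical, and the actual adapter signatures in the repo may differ. The point is only that an adapter starts out raising NotImplementedError and, once completed, forwards the test's inputs to your own code.

```python
import math


def my_softmax(values: list[float]) -> list[float]:
    """Hypothetical stand-in for an implementation written elsewhere in the repo."""
    peak = max(values)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def run_softmax_stub(values: list[float]) -> list[float]:
    """Hypothetical unfilled adapter: any test that calls it fails here."""
    raise NotImplementedError


def run_softmax(values: list[float]) -> list[float]:
    """Hypothetical completed adapter: it only delegates to your implementation."""
    return my_softmax(values)
```

Keeping the adapter this thin means the tests never need to know how your own modules are organized.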
Download the TinyStories data and a subsample of OpenWebText:
mkdir -p data
cd data
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
gunzip owt_train.txt.gz
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
gunzip owt_valid.txt.gz
cd ..
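If wget or gunzip is not available (for example on Windows), the same steps can be approximated in Python. The sketch below is an assumption-laden alternative to the commands above, not part of the assignment: it swaps wget for urllib.request and gunzip for the gzip module, and the helper names gunzip and fetch_all are made up for illustration. The URLs are the four listed above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

BASE = "https://huggingface.co/datasets"
URLS = [
    f"{BASE}/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt",
    f"{BASE}/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt",
    f"{BASE}/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz",
    f"{BASE}/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz",
]


def gunzip(src: Path) -> Path:
    """Decompress src next to itself and delete the archive, like gunzip does."""
    dst = src.with_suffix("")  # owt_train.txt.gz -> owt_train.txt
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks; the files are large
    src.unlink()
    return dst


def fetch_all(data_dir: Path = Path("data")) -> list[Path]:
    """Download every URL into data_dir, skipping files that already exist."""
    data_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for url in URLS:
        target = data_dir / url.rsplit("/", 1)[-1]
        final = target.with_suffix("") if target.suffix == ".gz" else target
        if not final.exists():
            urllib.request.urlretrieve(url, target)
            if target.suffix == ".gz":
                final = gunzip(target)
        results.append(final)
    return results
```

Calling fetch_all() from the repo root should leave the same four .txt files in data/ as the shell commands do. Unlike the wget version, it skips files that are already present, so it can be re-run after an interrupted download once any partial file is removed.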