Educational repository for building LLM understanding from first principles, plus a working Sparse MoE Transformer implementation.
Educational implementations for understanding neural networks and transformers from scratch:
- Backpropagation (`backprop_from_scratch.py`) - Pure NumPy implementation with XOR training (a minimal sketch appears below)
- Attention Mechanisms (`attention_mechanism.py`) - Multi-head attention exploration
- Interactive Jupyter notebooks for hands-on learning
Purpose: Build foundational understanding before tackling advanced architectures.
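The sketch below illustrates the kind of explicit, pure-NumPy backpropagation that `backprop_from_scratch.py` walks through: a tiny tanh network trained on XOR with hand-derived gradients. The network shape, hyperparameters, and variable names are illustrative assumptions, not code copied from the repository.

```python
# Illustrative only (not the repo's actual code): a 2-4-1 tanh network
# trained on XOR with manually derived gradients.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)
lr = 0.5

for step in range(5000):
    # Forward pass: keep every intermediate so the backward pass stays explicit.
    z1 = X @ W1 + b1
    h1 = np.tanh(z1)
    z2 = h1 @ W2 + b2
    out = 1 / (1 + np.exp(-z2))            # sigmoid output

    loss = np.mean((out - y) ** 2)

    # Backward pass: chain rule applied layer by layer.
    d_out = 2 * (out - y) / len(X)         # dL/d(out) for the mean-squared error
    d_z2 = d_out * out * (1 - out)         # sigmoid'
    d_W2 = h1.T @ d_z2
    d_b2 = d_z2.sum(axis=0)
    d_h1 = d_z2 @ W2.T
    d_z1 = d_h1 * (1 - h1 ** 2)            # tanh'
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0)

    # Plain gradient descent.
    W2 -= lr * d_W2; b2 -= lr * d_b2
    W1 -= lr * d_W1; b1 -= lr * d_b1

print(np.round(out, 2))  # typically converges toward [[0], [1], [1], [0]]
```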
A fully functional Sparse Mixture-of-Experts Transformer that trains and runs efficiently on consumer hardware:
Performance Highlights:
- 158M parameters, 2.2 validation loss on TinyStories
- 87.5% compute savings through sparse expert routing (see the arithmetic sketch after this list)
- 37ms inference latency on 4GB GPU
- Trains on consumer hardware (Quadro T1000, RTX 3050+)
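The savings figure follows directly from the fraction of experts left inactive per token. The expert count and top-k below are assumptions chosen only because they reproduce the 87.5% number; the project's own documentation records the actual configuration.

```python
# Illustrative only: how sparse routing translates into expert-FFN compute savings.
num_experts = 8        # hypothetical expert count
experts_per_token = 1  # hypothetical top-k routing

active_fraction = experts_per_token / num_experts
savings = 1 - active_fraction
print(f"Expert-FFN compute savings: {savings:.1%}")  # 87.5%
```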
Key Files:
- `src/train.py` - Training script with full implementation
- `tests/test_sparse_activation.py` - Verification suite
- `README.md` - Complete documentation
Planned expansion areas for transformer components, training techniques, and deployment strategies.
```bash
pip install -r requirements.txt
```

```bash
# Backpropagation from scratch
python phase1_foundation/src/backprop_from_scratch.py

# Explore in Jupyter
jupyter notebook phase1_foundation/notebooks/backprop.ipynb
```

```bash
cd sparse_moe_transformer

python src/train.py --mode train --output_dir output/my_model

# Benchmark the model
python src/train.py --mode benchmark --model_path output/my_model/best_model.pt
```

This codebase prioritizes understanding over production readiness in the phase directories:
- Favor clarity and explicit operations over optimization
- Include comments explaining mathematical concepts
- Show intermediate steps rather than hiding them in abstractions
- Compare manual implementations with framework versions (see the gradient-check sketch below)
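One concrete way to follow the last guideline is a gradient check: derive a gradient by hand and compare it against a framework's autograd result. The snippet below is a hedged PyTorch sketch, not code taken from this repository.

```python
# Sketch: verify a hand-derived gradient for y = sum(tanh(xW)) against PyTorch autograd.
import torch

x = torch.randn(5, 3)
W = torch.randn(3, 4, requires_grad=True)
y = torch.tanh(x @ W).sum()
y.backward()

# Manual gradient: dy/dW = x^T @ (1 - tanh(xW)^2)
manual_grad = (x.T @ (1 - torch.tanh(x @ W) ** 2)).detach()
print(torch.allclose(W.grad, manual_grad, atol=1e-6))  # True
```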
The `sparse_moe_transformer/` directory demonstrates production-quality implementations.
Phase 1 - Foundation:
- Manual gradient computation and backpropagation
- Weight initialization strategies
- Activation functions (tanh, GELU)
- Multi-head attention mechanisms (minimal sketch after this list)
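As a companion to the multi-head attention item, here is a minimal NumPy sketch of scaled dot-product attention split across heads. The function signature and shapes are illustrative assumptions, independent of `attention_mechanism.py`'s actual interface.

```python
# Illustrative multi-head attention in NumPy: project, split into heads,
# apply scaled dot-product attention per head, concatenate, project out.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(W):  # (seq, d_model) -> (heads, seq, d_head)
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    attn = softmax(scores, axis=-1)
    out = attn @ v                                          # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return out @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, heads = 16, 6, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
print(multi_head_attention(x, Wq, Wk, Wv, Wo, heads).shape)  # (6, 16)
```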
Sparse MoE Transformer:
- Sparse Mixture-of-Experts architecture
- Expert routing with learned gating (sketched after this list)
- Memory-efficient training on 4GB GPUs
- Load balancing and gradient flow
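The following is a hedged PyTorch sketch of top-k expert routing with a learned gate, showing how each token is dispatched only to its selected experts. Class and variable names are assumptions; the repository's actual layer in `src/train.py` may differ in structure and details.

```python
# Sketch of sparse top-k expert routing with a learned gate (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.gate(x)                    # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue                         # expert receives no tokens: skipped entirely
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

moe = TinyMoE(d_model=32)
print(moe(torch.randn(10, 32)).shape)            # torch.Size([10, 32])
```

Because each expert only processes the tokens routed to it, per-token FFN cost scales with `top_k / num_experts`, which is where the compute savings claimed above come from.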
The Sparse MoE Transformer in this repo is not the biologically-inspired "Baby Dragon Hatchling" (BDH) model from arxiv.org/abs/2509.26507. That paper describes Hebbian learning and spiking neurons. Our implementation uses standard transformer architecture with sparse expert routing for computational efficiency.
This is an educational project. Contributions that enhance learning clarity are welcome:
- Improved explanations and comments
- Additional toy problems for validation
- Comparisons between manual and framework implementations
- Documentation of theoretical concepts
- ✅ Phase 1 Foundation: Backpropagation, Attention
- ✅ Sparse MoE Transformer: Fully functional, tested
- 🚧 Phase 2-5: Planned expansion areas
- Attention Is All You Need - Original Transformer paper
- Switch Transformers - Sparse MoE at scale
- Project Documentation: See individual directory READMEs
MIT License - See LICENSE file for details