Forgather ML

Forgather is a training framework for language-model experiments on hardware you actually own — a single 24 GB GPU, two consumer cards sharing a desktop's PCIe bus, or a few boxes linked by 1 Gbit Ethernet, no InfiniBand. Full-parameter fine-tune a 7B model at 53 K context on one RTX 3090 / 4090 / 5090, pretrain Llama / Mistral / Qwen3 / Gemma-3 across two machines that DDP and FSDP would choke on, or run optimizer and scaling-law ablations overnight. Under the hood it's configuration-driven (template inheritance, no fork-the-training-script sprawl); the headline is what fits on your GPUs.

📚 Documentation: forgather.readthedocs.io or docs/README.md. New users should head straight to Getting Started.

🖥️ Web UI: an IDE for model-training workflows. Forgather ships with a single-user web frontend over the same APIs the CLI uses. Browse projects, edit templates with Forgather-aware syntax highlighting, queue runs into a GPU-aware scheduler, watch jobs through a live TTY with per-card training-stat cards, then chat with the trained model in-browser without leaving the page. Less tmux-and-glue, more time on the experiment. The Forgather server walkthrough tours the whole thing end-to-end, from a fresh install through training a small model and chatting with it.

Why Forgather?

Most training scripts fork. You copy train.py to try a thing; six months later you have ten near-identical scripts, and the small bugs — a loss function wired wrong, a scheduler silently reset on resume, a CLI flag that didn't actually reach the tokenizer — hide across the forks. Every variation gets expensive to try.

A Forgather project config extends a parent; both are plain YAML with Jinja2 preprocessing. Every override is named and explicit — a silently-reset scheduler shows up as a one-line diff on a documented knob, not a buried fork waiting to bite three months later.

Key Benefits

Full fine-tunes on a single 24 GB GPU. Full-parameter (not LoRA) fine-tuning of 7B models at ~53 K context on one RTX 3090 / 4090 / 5090, via gradient checkpointing, activation offload, and fused kernels (full list under Key Features).
Train across the boxes you have. Pipeline-parallel and DiLoCo trainers need dramatically less cross-device communication than DDP or FSDP — Forgather has trained a 7B model across two desktops linked only by 1 Gbit Ethernet, and the same design avoids the PCIe stalls FSDP hits on consumer hardware.
Multi-node training, without the multi-node tax. Spinning up a pipeline-parallel finetune across machines normally means hand-rolled rendezvous, dataset distribution, port forwarding, mTLS, and coordinated job control — a different chore per job. With Forgather, you install on each peer, start forgather server --cluster <name>, and mDNS handles discovery; forgather cluster submit fans a training bundle across the hosts/GPUs you pick. A workstation with two 3090s, a couple of borrowed gaming PCs on Ethernet, and a DGX Spark show up in the same Nodes panel — heterogeneous boxes are fine.
No config duplication. Inherit from a base template and override only what changes. Types are hyperparameters too: swap optimizers, models, or trainers in YAML via !partial / !factory / !singleton with no Python edits.
Standalone, framework-portable models. Each run writes the equivalent PyTorch source into output_models/, loadable by plain AutoModelForCausalLM. Or run forgather convert --reverse to emit a canonical HF Llama / Mistral / Qwen3 / Gemma-3 checkpoint, from which llama.cpp's convert_hf_to_gguf.py produces a GGUF for llama.cpp / ollama / LM Studio.
Live job control + GPU-aware web UI. Save, stop, or abort running training jobs from another shell, coordinated across DDP / FSDP-2 / pipeline workers; the web frontend drops ▶ Run jobs into a priority + GPU-policy queue with live TTY and an in-browser chat client.
HF-compatible distributed checkpoints. Standard Safetensors shards readable by transformers, vLLM, and the llama.cpp converter; explicit state-sharing patterns above the on-disk format so PP / FSDP-2 runs checkpoint correctly without per-trainer custom code.

Where does Forgather fit?

If LoRA / QLoRA is what you need, axolotl and unsloth are great starting points — Forgather doesn't ship a LoRA path today. Forgather's bet is that full-parameter training of 7B-class models is now feasible on a single consumer GPU, and that for many workloads it's the right tool. The hard problems Forgather targets are training-side — multi-GPU coordination, pipeline schedules, multi-node over Ethernet, optimizer and precision research, custom architectures, reproducible experiments — rather than inference-side fine-tuning UX. If those are your problems, Forgather is closer to what you want than the LoRA-first tools.

Hardware

Tested on NVIDIA consumer cards (RTX 3090 / 4090 / 5090) up to 4× and 6× 4090 boxes, and DGX Spark (GB10, aarch64).
Minimum useful config for LM work: one 24 GB GPU. The 7B-at-53K fine-tune fits on a single 3090.
Multi-node: any LAN ≥ 1 Gbit works for pipeline-parallel or DiLoCo. NVLink / InfiniBand are not required.
CUDA-only today. AMD / ROCm and Apple Silicon may work (Forgather avoids hard CUDA dependencies where possible) but are not tested, so treat them as experimental. ROCm contributions welcome.
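
To see why a 1 Gbit LAN can be enough for pipeline-parallel training but not for DDP-style gradient synchronisation, a back-of-envelope estimate helps. The sketch below is illustrative only: the hidden size, sequence length, and micro-batch size are assumptions chosen to resemble a 7B-class Llama, and protocol overhead and compute/communication overlap are ignored.

```python
# Back-of-envelope estimate; every number below is an assumption for illustration.
PARAMS = 7e9                  # 7B-class model
BYTES_PER_VALUE = 2           # bf16
LINK_BYTES_PER_S = 1e9 / 8    # 1 Gbit/s Ethernet, ignoring protocol overhead

# Data parallelism (DDP-style): every optimizer step exchanges a full set of
# gradients. With two nodes, a ring all-reduce sends roughly one gradient copy
# per node.
grad_bytes = PARAMS * BYTES_PER_VALUE
ddp_seconds = grad_bytes / LINK_BYTES_PER_S

# Pipeline parallelism: only the activations at the stage boundary cross the wire.
MICRO_BATCH = 1               # assumed micro-batch size
SEQ_LEN = 4096                # assumed sequence length
HIDDEN = 4096                 # assumed hidden size for a 7B-class Llama
act_bytes = MICRO_BATCH * SEQ_LEN * HIDDEN * BYTES_PER_VALUE
pp_seconds = act_bytes / LINK_BYTES_PER_S

print(f"gradient sync per step     : {grad_bytes / 1e9:.1f} GB -> {ddp_seconds:.0f} s")
print(f"activations per micro-batch: {act_bytes / 1e6:.1f} MB -> {pp_seconds:.2f} s")
```

Under these assumptions a gradient exchange costs on the order of minutes per step, while a stage-boundary activation transfer costs a fraction of a second. The backward pass sends an activation gradient of the same size in the other direction, which doubles the pipeline figure but does not change the conclusion.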

Quick Start

Full install walkthrough and first-training-run tutorial: docs/getting-started/README.md.

# If running remotely over ssh,
# set up port forwarding
ssh -L 8765:localhost:8765 \
    -L 8137:localhost:8137 \
    -L 6006:localhost:6006 \
    -L 8000:localhost:8000 \
    user@dev-host

# Install with Docker
git clone https://github.com/jdinalt/forgather.git
cd forgather
docker/build                  # generic image; works for any host user
docker/run                    # interactive shell, --gpus all, ports forwarded

# Inside the container:

# Start the webui...
forgather server

# control-click on `http://localhost:8765/?token=4c4febdc07830cdd...` to connect with your browser

# ...or use the CLI
forgather --help
cd examples/tutorials/tiny_llama
forgather -t v2.yaml train

Requires Docker Engine 24+ and (for GPU training) the NVIDIA Container Toolkit; host-venv install is also supported. See docs/getting-started/ for install details, the Tiny Llama tutorial for the full train → monitor → control → eval → inference → export walkthrough, or the Forgather server walkthrough for the same flow through the web UI.

What's new

⚠️ Heads up. vLLM integration is currently broken — Forgather has moved to Transformers v5, which vLLM does not yet support. Upstream is working on v5 compatibility; the integration will be re-enabled once that lands.

Latest release: 1.2.0 (May 2026). The headline is multi-node training: forgather server --cluster <name> puts a node into cluster mode, peers discover each other over mDNS, and a new forgather cluster CLI plus a Cluster panel in the web UI fan training bundles across selected hosts/GPUs. The release also adds native TLS / mTLS, a cluster-shared dataset server with O(1) resume, in-place server restart, a distributable runtime Docker image, and DGX Spark (GB10, aarch64) as a first-class cluster member. Multi-node guide: docs/guides/multi-node-training.md.

For the full timeline (pre-1.2.0 highlights: web UI, sharded-checkpoint abstraction, Triton Adafactor, fused linear-CE, model conversion, packed sequences + Flex Attention, …) see docs/release-notes/.

Key Features

Template inheritance

Create new experiments by inheriting from existing configs and specifying only the differences:

-- extends 'base_experiment.yaml'

[config_metadata]
    == super()
    -- set ns.seq_len = 16384        # longer context

[optimizer]
    == super()
    lr: 1.0e-3                       # override the LR, keep everything else

Dynamic type system

Use any Python class or function directly in configs. Custom YAML tags (!partial, !factory, !singleton, !var, !call) describe how to build live Python objects from the parsed graph:

optimizer: !partial:torch.optim.AdamW
    lr: 1.0e-3
    weight_decay: 0.01

[layer_factory]
# Experiment: swap PreLayerNorm for PostLayerNorm
layer_factory: &layer_factory !partial:.post_ln_layer:PostLNLayer@layer_factory
    feedforward_factory: *feedforward_factory
    attention_factory: *attention_factory
    norm_factory: *layer_norm_factory
    dropout: !var "layer_dropout"
    residual_dropout: !var "residual_dropout"

See Syntax Reference for the full list of line statements and YAML tags.
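
As a rough mental model for what a !partial node does, the Python sketch below resolves an import spec to a callable and binds keyword arguments without calling it. This is an analogy built on functools.partial, not Forgather's implementation: resolve and make_partial are hypothetical helper names, relative specs such as .post_ln_layer:PostLNLayer are not handled, and a stdlib class stands in for torch.optim.AdamW so the sketch runs without PyTorch.

```python
import functools
import importlib


def resolve(spec: str):
    """Resolve 'package.module:attr' or 'package.module.attr' to a Python object."""
    if ":" in spec:
        module_name, attr = spec.split(":", 1)
    else:
        module_name, attr = spec.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), attr)


def make_partial(spec: str, **kwargs):
    """Hypothetical stand-in for a !partial node: bind kwargs, do not call yet."""
    return functools.partial(resolve(spec), **kwargs)


# Stand-in for `!partial:torch.optim.AdamW` with lr / weight_decay bound in YAML:
# some arguments are fixed by the config, the rest are supplied by the caller.
make_delta = make_partial("datetime:timedelta", hours=1)
delta = make_delta(minutes=30)   # remaining argument supplied at call time
print(delta.total_seconds())
```

The useful property is that the config fixes the hyperparameters while the consumer supplies whatever is only known at runtime, much as a trainer would supply the model parameters to an optimizer factory.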

Model source export

When you construct a model for the first time, Forgather writes the equivalent standalone PyTorch source into the run's output directory. The generated code has no Forgather dependency: any HF-compatible consumer (transformers, vLLM, llama.cpp via convert_hf_to_gguf.py, etc.) can load the model with trust_remote_code=True. If you want plain-HF weights without trust_remote_code, run forgather convert --reverse --model-type {llama,mistral,qwen3,gemma3_text} <src> <dst>.

forgather code --target X prints the Python equivalent of any node in the config graph — useful when you want to see what a !partial / !factory chain actually constructs.

Trainers, optimizers, precision

All first-class trainers share the same config surface, so swapping between them is a YAML override, not a rewrite: basic (single-GPU), ddp (with optional PostLocalSGD), fsdp2 (FSDP-2 with configurable dtypes and CPU offload), pipeline (GPipe / 1F1B / Interleaved-1F1B / zero-bubble), and DiLoCo for very-low-bandwidth multi-machine training. An AccelTrainer wrapper and a Transformers-Trainer compatibility shim also exist for legacy code.

On the optimizer side, the distinctive one is a fused Triton Adafactor with per-parameter bf16 stochastic rounding — to our knowledge the only Adafactor+SR implementation available, and faster than every other Adafactor we've benchmarked. SR matters for pure-bf16 training (no fp32 master weights): without it, updates below the bf16 precision step round to zero and weight norms drift. A stochastic-rounding AdamW ships alongside; if you also want quantized state, torchao.optim.AdamW4bit works (see the adam4bit config). Apollo / Apollo-mini (low-rank gradient projection, experimental), SinkGD, SGD, and Muon are also configurable, plus a regex-based multiopt helper for per-parameter-group assignment. Mixed precision covers bf16, fp16, and FP8-via-torchao (tensorwise / rowwise / rowwise_with_gw_hp); schedulers cover Warmup-Stable-Decay, Cosine, and Infinite-LR with token- or step-budgeted warmup/decay.
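
The effect of stochastic rounding on pure-bf16 weights can be reproduced in plain Python. The sketch below is not the Triton kernel: it emulates bf16 by keeping the top 16 bits of a float32, ignores NaN / inf / overflow, and uses made-up values for the update size and step count.

```python
import random
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after conversion to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Float32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def bf16_nearest(x: float) -> float:
    """Round to bf16 (top 16 bits of float32) with round-to-nearest-even."""
    b = f32_bits(x)
    b += 0x7FFF + ((b >> 16) & 1)
    return bits_f32(b & 0xFFFF0000)


def bf16_stochastic(x: float, rng: random.Random) -> float:
    """Round to bf16, rounding away from zero with probability equal to the
    fraction of a bf16 step that the discarded low bits represent."""
    b = f32_bits(x) + rng.getrandbits(16)
    return bits_f32(b & 0xFFFF0000)


rng = random.Random(0)
UPDATE, STEPS, N = 1e-4, 1000, 100   # illustrative values, far below the bf16 step
w_nearest = 1.0
w_sr = [1.0] * N                     # N independent weights, averaged to cut noise
for _ in range(STEPS):
    w_nearest = bf16_nearest(w_nearest + UPDATE)
    w_sr = [bf16_stochastic(w + UPDATE, rng) for w in w_sr]

mean_sr = sum(w_sr) / N
print("exact target      :", 1.0 + UPDATE * STEPS)
print("round-to-nearest  :", w_nearest)
print("stochastic (mean) :", mean_sr)
```

Near 1.0 the bf16 step is 2^-7, so an update of 1e-4 is always rounded away under round-to-nearest and the weight never moves. With stochastic rounding each weight occasionally jumps a full step, and the average tracks the exact sum.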

Distributed checkpointing

Weight shards are written as a standard Hugging Face Safetensors layout, so transformers, vLLM, llama.cpp conversion, and remote eval harnesses all read the trained model directly. Sitting above the on-disk format, Forgather's coordination layer uses explicit state-sharing patterns (GLOBAL / PER_RANK / REPLICATED / PER_GROUP / PER_NODE) — every checkpoint component declares its sharing pattern and the trainer derives barriers and load paths from that, so pipeline-parallel and FSDP-2 runs checkpoint correctly without per-trainer custom code. Resume restores optimizer, scheduler, dataset iterator, RNG, and step counter; optional replication validation (NONE / QUICK / TENSOR / FULL) hashes parameters across replicas to catch DDP-sync bugs.

See docs/checkpointing/ for the full abstraction.
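
The idea behind replication validation can be sketched with the standard library alone. The example below hashes each replica's parameters and reports ranks that disagree with rank 0. It is a minimal illustration under stated assumptions, not Forgather's code: fingerprint and diverged_ranks are hypothetical names, parameters are plain Python lists, and all replicas live in one process, whereas a real run would compute one hash per rank and compare them through a collective.

```python
import hashlib
import struct


def fingerprint(params: dict[str, list[float]]) -> str:
    """Hash parameter names and float32 values in a deterministic order."""
    h = hashlib.sha256()
    for name in sorted(params):
        h.update(name.encode("utf-8"))
        h.update(struct.pack(f"<{len(params[name])}f", *params[name]))
    return h.hexdigest()


def diverged_ranks(replicas: list[dict[str, list[float]]]) -> list[int]:
    """Return the ranks whose fingerprint differs from rank 0's."""
    reference = fingerprint(replicas[0])
    return [rank for rank, p in enumerate(replicas) if fingerprint(p) != reference]


replicas = [
    {"layer.weight": [0.5, -1.25], "layer.bias": [0.0]},
    {"layer.weight": [0.5, -1.25], "layer.bias": [0.0]},
    {"layer.weight": [0.5, -1.25], "layer.bias": [0.125]},  # simulated missed sync
]
print(diverged_ranks(replicas))  # prints [2]
```

Comparing short digests instead of full tensors keeps the check cheap enough to run at checkpoint time.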

Core Concepts

Projects

Every Forgather experiment is a Project with this structure:

my_project/
├── meta.yaml              # Project metadata
├── templates/
│   ├── project.yaml       # Project-wide defaults
│   └── configs/           # Experiment configurations
│       ├── baseline.yaml
│       └── experiment_a.yaml
├── output_models/         # Generated code + runs (per config)
└── project_index.ipynb    # Optional interactive notebook

A workspace groups related projects and centralises template search paths. Use forgather ws create to scaffold one and forgather project create to add projects to it.

Template language

Forgather uses Jinja2 + YAML with custom syntax:

-- extends 'template.yaml' — template inheritance (single parent)
[block_name] — named override-able sections
== super() — include parent's version of the current block
-- set ns.var = value — set a variable in the namespace
-- include 'template.yaml' — include template content inline
#---- inline.template.name ---- — split a document into multiple templates
!partial:module:Class / !factory:... / !singleton:... — construct Python objects
!var "name" — variable references

Config pipeline

Every config goes through the same pipeline, and each intermediate step is inspectable:

Templates → YAML → Node Graph → Python Objects
                       │
                       └──> (optional) Python source code
                            - model source export (for HF
                              trust_remote_code loading)
                            - debugging / pedagogy

Forgather materialises the node graph directly into Python objects at runtime; the Python-source path is a separate export, not an intermediary step. Model construction uses the export path so the resulting model is framework-portable; everything else (trainer, optimizer, dataset, callbacks) is built by walking the graph.

Inspection commands:

forgather -t config.yaml pp                      # Preprocess Jinja2 → YAML
forgather -t config.yaml graph --format yaml     # Parsed node graph
forgather -t config.yaml targets                 # Constructable objects in the graph
forgather -t config.yaml code --target model     # Python-source export of a target (debug / model export)
forgather -t config.yaml construct --target model --call
                                                 # Materialise and show the constructed object

When you hit a config bug, start with forgather ls -d (dumps the preprocessed file with YAML errors, or the Jinja2 error if preprocessing itself failed), then escalate to pp --debug (dumps every template in the chain).

Learning Forgather

Recommended path

examples/tutorials/tiny_llama — trains a 5M-param Llama in ~10 minutes; covers config anatomy, dynamic CLI args, monitoring, control, eval, inference, and exporting to plain HF format. Start here.
examples/tutorials/projects_overview — how Forgather's multi-project layout is organised.
examples/tutorials/project_composition — cross-project composition (datasets / models / evals as independent projects that reference each other).
examples/tutorials/hp_lovecraft_project — fine-tune Mistral-7B / Llama-2-7B on the complete works of H.P. Lovecraft on a single 24 GB GPU. Long-context (up to 53K tokens), YaRN, gradient checkpointing, activation offloading.

Interactive shell

forgather -i

Drops you into a shell where every subcommand works without the forgather prefix (so pp, ls, train instead of forgather pp, etc.). Convenient for quick iteration inside a single project.

Featured Examples

Forgather ships with a library of worked examples that go well beyond the tutorials — each has a detailed README with reproducible commands and, where relevant, a headline result. The table below picks the best starting point for each journey. For per-project write-ups (the "why this is interesting" version of each row), see the Featured Examples highlights; for the full directory map, see examples/README.md.

| Journey | Project | Headline |
| --- | --- | --- |
| Pretrain from scratch | `examples/pretrain/small-llm` | 162M Llama on SmolLM, Chinchilla-scaling plots |
| Fine-tune a 7B model (multi-GPU) | `examples/finetune/samantha` | Every trainer backend, ~8.9K tok/s on 4× RTX 4090 |
| Instruction / reasoning fine-tune | `examples/finetune/open-orca` | 1B Llama, 1B-token budget, ~11h on 4× 4090 |
| Long-context fine-tuning + RoPE recipes | `examples/tutorials/hp_lovecraft_project` | 7B at 53 K context on one 24 GB GPU |
| Cut peak memory | `examples/tiny_experiments/peak_memory` | 81% peak-memory reduction at ~2.7× throughput |
| Pick an optimizer | `examples/tiny_experiments/optimizers` | Ten-optimizer bake-off; Muon wins at small batch |
| Pipeline-parallel recipes | `examples/tiny_experiments/pipeline_parallel` | GPipe / 1F1B / ZBV / interleaved test harness |
| Decentralised / bandwidth-limited training | `examples/tiny_experiments/diloco` | DiLoCo with pseudo-gradient compression |

Building your own

Scaffold a new project with forgather project create (inside an existing workspace) or forgather ws create (a brand-new workspace). These commands generate a minimum-working meta.yaml + templates/ tree that extends the recommended base templates. Full walk-through: the Tiny Llama tutorial.
examples/base_lm_project — a bare harness that drives the raw projects/lm_training_project.yaml template with no project-specific overrides. Useful for inspecting what the base template does on its own, and for debugging changes to the base template itself, but not a typical starting point for new work.
