Forgather is a training framework for language-model experiments on hardware you actually own — a single 24 GB GPU, two consumer cards sharing a desktop's PCIe bus, or a few boxes linked by 1 Gbit Ethernet, no InfiniBand. Full-parameter fine-tune a 7B model at 53 K context on one RTX 3090 / 4090 / 5090, pretrain Llama / Mistral / Qwen3 / Gemma-3 across two machines that DDP and FSDP would choke on, or run optimizer and scaling-law ablations overnight. Under the hood it's configuration-driven (template inheritance, no fork-the-training-script sprawl); the headline is what fits on your GPUs.
📚 Documentation: forgather.readthedocs.io or docs/README.md. New users should head straight to Getting Started.
🖥️ Web UI: an IDE for model-training workflows. Forgather ships with a single-user web frontend over the same APIs the CLI uses. Browse projects, edit templates with Forgather-aware syntax highlighting, queue runs into a GPU-aware scheduler, watch jobs through a live TTY with per-card training-stat cards, then chat with the trained model in-browser without leaving the page. Less tmux-and-glue, more time on the experiment. The Forgather server walkthrough tours the whole thing end-to-end, from a fresh install through training a small model and chatting with it.
Most training scripts fork. You copy train.py to try a thing; six
months later you have ten near-identical scripts, and the small bugs
— a loss function wired wrong, a scheduler silently reset on resume,
a CLI flag that didn't actually reach the tokenizer — hide across the
forks. Every variation gets expensive to try.
A Forgather project config extends a parent; both are plain YAML with Jinja2 preprocessing. Every override is named and explicit — a silently-reset scheduler shows up as a one-line diff on a documented knob, not a buried fork waiting to bite three months later.
- Full fine-tunes on a single 24 GB GPU. Full-parameter (not LoRA) fine-tuning of 7B models at ~53 K context on one RTX 3090 / 4090 / 5090, via gradient checkpointing, activation offload, and fused kernels (full list under Key Features).
- Train across the boxes you have. Pipeline-parallel and DiLoCo trainers need dramatically less cross-device communication than DDP or FSDP — Forgather has trained a 7B model across two desktops linked only by 1 Gbit Ethernet, and the same design avoids the PCIe stalls FSDP hits on consumer hardware.
- Multi-node training, without the multi-node tax. Spinning up a pipeline-parallel finetune across machines normally means hand-rolled rendezvous, dataset distribution, port forwarding, mTLS, and coordinated job control: a different chore per job. With Forgather, you install on each peer, start `forgather server --cluster <name>`, and mDNS handles discovery; `forgather cluster submit` fans a training bundle across the hosts/GPUs you pick. A workstation with two 3090s, a couple of borrowed gaming PCs on Ethernet, and a DGX Spark show up in the same Nodes panel; heterogeneous boxes are fine.
- No config duplication. Inherit from a base template and override only what changes. Types are hyperparameters too: swap optimizers, models, or trainers in YAML via `!partial` / `!factory` / `!singleton` with no Python edits.
- Standalone, framework-portable models. Each run writes the equivalent PyTorch source into `output_models/`, loadable by plain `AutoModelForCausalLM`. Or run `forgather convert --reverse` to emit a canonical HF Llama / Mistral / Qwen3 / Gemma-3 checkpoint, from which llama.cpp's `convert_hf_to_gguf.py` produces a GGUF for llama.cpp / ollama / LM Studio.
- Live job control + GPU-aware web UI. Save, stop, or abort running training jobs from another shell, coordinated across DDP / FSDP-2 / pipeline workers; the web frontend drops `▶ Run` jobs into a priority + GPU-policy queue with live TTY and an in-browser chat client.
- HF-compatible distributed checkpoints. Standard Safetensors shards readable by `transformers`, vLLM, and the llama.cpp converter; explicit state-sharing patterns above the on-disk format so PP / FSDP-2 runs checkpoint correctly without per-trainer custom code.
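To make the bandwidth argument in the list above concrete, here is a back-of-envelope sketch in Python. The hidden size of 4096, the 4096-token micro-batch, and 2-byte (bf16) gradients and activations are illustrative assumptions, not Forgather measurements. The estimate also ignores latency, protocol overhead, and any overlap of communication with compute.

```python
# Back-of-envelope comparison of network traffic over 1 Gbit Ethernet.
# All model and batch dimensions below are illustrative assumptions.

LINK_BITS_PER_S = 1e9  # 1 Gbit Ethernet, ideal throughput


def transfer_seconds(num_bytes: float, link_bits_per_s: float = LINK_BITS_PER_S) -> float:
    """Ideal time to push num_bytes through the link (no latency or overhead)."""
    return num_bytes * 8 / link_bits_per_s


# Data parallel (DDP-style): every step synchronises a full gradient copy.
# With two nodes, a ring all-reduce sends roughly one gradient copy per node.
params = 7e9
bytes_per_value = 2  # bf16
ddp_bytes = params * bytes_per_value
ddp_seconds = transfer_seconds(ddp_bytes)

# Pipeline parallel: only the activations at a stage boundary cross the wire
# (plus the matching activation gradients on the backward pass).
tokens_per_microbatch = 4096  # assumed: one 4096-token sequence
hidden_size = 4096            # assumed: typical for a 7B Llama-style model
pp_bytes = tokens_per_microbatch * hidden_size * bytes_per_value
pp_seconds = transfer_seconds(pp_bytes)

print(f"DDP gradient sync:      {ddp_bytes / 1e9:.1f} GB -> {ddp_seconds:.0f} s per step")
print(f"Pipeline boundary send: {pp_bytes / 1e6:.1f} MB -> {pp_seconds:.2f} s per micro-batch")
```

Under these assumptions a full gradient sync costs about 112 s of pure transfer time per step, while one pipeline boundary transfer costs about 0.27 s per micro-batch in each direction.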
If LoRA / QLoRA is what you need, axolotl and unsloth are great starting points — Forgather doesn't ship a LoRA path today. Forgather's bet is that full-parameter training of 7B-class models is now feasible on a single consumer GPU, and that for many workloads it's the right tool. The hard problems Forgather targets are training-side — multi-GPU coordination, pipeline schedules, multi-node over Ethernet, optimizer and precision research, custom architectures, reproducible experiments — rather than inference-side fine-tuning UX. If those are your problems, Forgather is closer to what you want than the LoRA-first tools.
- Tested on NVIDIA consumer cards (RTX 3090 / 4090 / 5090) up to 4× and 6× 4090 boxes, and DGX Spark (GB10, aarch64).
- Minimum useful config for LM work: one 24 GB GPU. The 7B-at-53K finetune fits a single 3090.
- Multi-node: any LAN ≥ 1 Gbit works for pipeline-parallel or DiLoCo. NVLink / InfiniBand are not required.
- CUDA-only today. AMD / ROCm and Apple Silicon may work (Forgather avoids hard CUDA dependencies where possible) but are not tested, so treat them as experimental. ROCm contributions welcome.
Full install walkthrough and first-training-run tutorial: docs/getting-started/README.md.
# If running remotely over ssh,
# set up port forwarding
ssh -L 8765:localhost:8765 \
-L 8137:localhost:8137 \
-L 6006:localhost:6006 \
-L 8000:localhost:8000 \
user@dev-host
# Install with Docker
git clone https://github.com/jdinalt/forgather.git
cd forgather
docker/build # generic image; works for any host user
docker/run # interactive shell, --gpus all, ports forwarded
# Inside the container:
# Start the webui...
forgather server
# control-click on `http://localhost:8765/?token=4c4febdc07830cdd...` to connect with your browser
# ...or use the CLI
forgather --help
cd examples/tutorials/tiny_llama
forgather -t v2.yaml train

Requires Docker Engine 24+ and (for GPU training) the
NVIDIA Container Toolkit;
host-venv install is also supported. See
docs/getting-started/ for install
details, the Tiny Llama tutorial
for the full train → monitor → control → eval → inference → export
walkthrough, or the
Forgather server walkthrough
for the same flow through the web UI.
⚠️ Heads up. vLLM integration is currently broken — Forgather has moved to Transformers v5, which vLLM does not yet support. Upstream is working on v5 compatibility; the integration will be re-enabled once that lands.
Latest release: 1.2.0 (May 2026).
The headline is multi-node training: forgather server --cluster <name>
puts a node into cluster mode, peers discover each other over mDNS,
and a new forgather cluster CLI plus a Cluster panel in the web UI
fan training bundles across selected hosts/GPUs. The release also
adds native TLS / mTLS, a cluster-shared dataset server with O(1)
resume, in-place server restart, a distributable runtime Docker
image, and support for DGX Spark (GB10, aarch64) as a first-class
cluster member. Multi-node guide:
docs/guides/multi-node-training.md.
For the full timeline (pre-1.2.0 highlights: web UI, sharded-checkpoint
abstraction, Triton Adafactor, fused linear-CE, model conversion,
packed sequences + Flex Attention, …) see
docs/release-notes/.
Create new experiments by inheriting from existing configs and specifying only the differences:
-- extends 'base_experiment.yaml'
[config_metadata]
== super()
-- set ns.seq_len = 16384 # longer context
[optimizer]
== super()
lr: 1.0e-3 # override the LR, keep everything else

Use any Python class or function directly in configs. Custom YAML tags
(!partial, !factory, !singleton, !var, !call) describe how to
build live Python objects from the parsed graph:
optimizer: !partial:torch.optim.AdamW
    lr: 1.0e-3
    weight_decay: 0.01

[layer_factory]
# Experiment: swap PreLayerNorm for PostLayerNorm
layer_factory: &layer_factory !partial:.post_ln_layer:PostLNLayer@layer_factory
    feedforward_factory: *feedforward_factory
    attention_factory: *attention_factory
    norm_factory: *layer_norm_factory
    dropout: !var "layer_dropout"
    residual_dropout: !var "residual_dropout"

See Syntax Reference for the full list of line statements and YAML tags.
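As a rough Python analogy for the construction tags above, the sketch below shows one plausible reading of `!partial`, `!factory`, and `!singleton`. It assumes the tags behave as their names suggest; the Syntax Reference is the authority on their exact semantics. `AdamWStandIn`, `make_factory`, and `make_singleton` are hypothetical names invented for this illustration and are not part of Forgather.

```python
from functools import partial


class AdamWStandIn:
    """Hypothetical stand-in for torch.optim.AdamW so the sketch runs without PyTorch."""

    def __init__(self, params, lr=1e-3, weight_decay=0.0):
        self.params = list(params)
        self.lr = lr
        self.weight_decay = weight_decay


# Roughly what `optimizer: !partial:torch.optim.AdamW` with lr / weight_decay
# describes: a callable with some arguments bound, completed later by whoever
# has the remaining arguments (here, the model parameters).
optimizer_factory = partial(AdamWStandIn, lr=1.0e-3, weight_decay=0.01)
optimizer = optimizer_factory([0.1, 0.2, 0.3])


# Assumed reading of the other two tags, based on their names:
# a factory builds a fresh object on every use,
# a singleton builds one object and hands out the same one every time.
def make_factory(cls, **kwargs):
    return lambda: cls(**kwargs)


def make_singleton(cls, **kwargs):
    instance = cls(**kwargs)
    return lambda: instance


fresh = make_factory(dict, dropout=0.1)
shared = make_singleton(dict, dropout=0.1)

print(optimizer.lr, optimizer.weight_decay)
print(fresh() is fresh(), shared() is shared())
```

The distinction matters when the same node is referenced from several places in a config: a shared instance keeps one copy of any state, while a factory gives each consumer its own.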
When you construct a model for the first time, Forgather writes the
equivalent standalone PyTorch source into the run's output directory.
The generated code has no Forgather dependency: any HF-compatible
consumer (transformers, vLLM, llama.cpp via convert_hf_to_gguf.py,
etc.) can load the model with trust_remote_code=True. If you want
plain-HF weights without trust_remote_code, run forgather convert --reverse --model-type {llama,mistral,qwen3,gemma3_text} <src> <dst>.
forgather code --target X prints the Python equivalent of any node
in the config graph — useful when you want to see what a !partial /
!factory chain actually constructs.
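A minimal sketch of loading an exported model from Python, using only the standard Hugging Face entry points. The directory name in the usage comment is hypothetical, and the sketch assumes the export directory also contains tokenizer files; adjust both to match what your run actually wrote.

```python
def load_exported_model(model_dir: str):
    """Load an exported model directory with plain Hugging Face APIs."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: tokenizer files were saved alongside the model.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # trust_remote_code=True lets transformers execute the generated model source
    # that ships inside the directory.
    model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
    return tokenizer, model


# Usage (hypothetical path):
#   tokenizer, model = load_exported_model("output_models/my_model")
#   inputs = tokenizer("Once upon a time", return_tensors="pt")
#   output_ids = model.generate(**inputs, max_new_tokens=20)
#   print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

After a `forgather convert --reverse` conversion, the same call should work without `trust_remote_code=True`, since the checkpoint then uses a stock architecture.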
All first-class trainers share the same config surface, so swapping
between them is a YAML override, not a rewrite: basic (single-GPU),
ddp (with optional PostLocalSGD), fsdp2 (FSDP-2 with configurable
dtypes and CPU offload), pipeline (GPipe / 1F1B / Interleaved-1F1B /
zero-bubble), and DiLoCo for very-low-bandwidth multi-machine
training. An AccelTrainer wrapper and a Transformers-Trainer
compatibility shim also exist for legacy code.
On the optimizer side, the distinctive one is a fused Triton
Adafactor with per-parameter bf16 stochastic rounding — to our
knowledge the only Adafactor+SR implementation available, and faster
than every other Adafactor we've benchmarked. SR matters for
pure-bf16 training (no fp32 master weights): without it, updates
below the bf16 precision step round to zero and weight norms drift.
A stochastic-rounding AdamW ships alongside; if you also want
quantized state, torchao.optim.AdamW4bit works (see the
adam4bit config).
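The effect of stochastic rounding is easy to reproduce in a few lines. The sketch below is a pure-Python emulation of bf16 rounding (bf16 is the upper 16 bits of a float32), not Forgather's Triton kernel, and the weight value, update size, and step count are arbitrary choices for illustration.

```python
import random
import struct


def _f32_to_bits(x: float) -> int:
    """Bit pattern of x after conversion to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_to_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bf16_nearest(x: float) -> float:
    """Round-to-nearest-even to bf16 (finite inputs only)."""
    bits = _f32_to_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)
    return _bits_to_f32(bits & 0xFFFF0000)


def to_bf16_stochastic(x: float, rng: random.Random) -> float:
    """Round up with probability equal to the discarded fraction (finite inputs only)."""
    bits = _f32_to_bits(x)
    bits += rng.getrandbits(16)  # uniform noise over the 16 bits about to be dropped
    return _bits_to_f32(bits & 0xFFFF0000)


def train(round_fn, steps: int = 10_000, update: float = 1e-4) -> float:
    """Apply the same tiny update repeatedly to a weight stored in bf16."""
    w = 1.0
    for _ in range(steps):
        w = round_fn(w + update)
    return w


rng = random.Random(0)
w_nearest = train(to_bf16_nearest)
w_stochastic = train(lambda x: to_bf16_stochastic(x, rng))
print(f"round-to-nearest: {w_nearest}")
print(f"stochastic:       {w_stochastic}")
```

Near 1.0 the bf16 spacing is 2^-7 (about 0.0078), so a 1e-4 update is always rounded away under round-to-nearest and the weight never moves. With stochastic rounding each update survives in expectation, so after 10,000 steps the weight ends up close to the 2.0 that exact arithmetic would give.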
Apollo / Apollo-mini (low-rank gradient projection, experimental),
SinkGD, SGD, and Muon are also configurable, plus a regex-based
multiopt helper for per-parameter-group assignment. Mixed precision
covers bf16, fp16, and FP8-via-torchao (tensorwise / rowwise /
rowwise_with_gw_hp); schedulers cover Warmup-Stable-Decay, Cosine,
and Infinite-LR with token- or step-budgeted warmup/decay.
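For readers unfamiliar with Warmup-Stable-Decay, here is a generic sketch of the schedule shape. It is not Forgather's scheduler: the function name, its parameters, and the linear warmup and decay ramps are assumptions made for illustration (other decay shapes are common). The `step` argument can equally be read as a token count when the budget is expressed in tokens.

```python
def wsd_lr(
    step: int,
    peak_lr: float,
    warmup_steps: int,
    stable_steps: int,
    decay_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at `step` for a warmup-stable-decay schedule with linear ramps."""
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Long constant plateau.
        return peak_lr
    decay_step = step - warmup_steps - stable_steps
    if decay_step >= decay_steps:
        return min_lr
    # Linear decay from peak_lr down to min_lr.
    frac = decay_step / decay_steps
    return peak_lr + (min_lr - peak_lr) * frac


if __name__ == "__main__":
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, peak_lr=1e-3, warmup_steps=10, stable_steps=100, decay_steps=20))
```

A practical appeal of this shape is that the plateau has no fixed end: a run can be extended by lengthening the stable phase and applying the decay only when a final checkpoint is wanted.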
Weight shards are written as a standard Hugging Face Safetensors
layout, so transformers, vLLM, llama.cpp conversion, and remote
eval harnesses all read the trained model directly. Sitting above
the on-disk format, Forgather's coordination layer uses explicit
state-sharing patterns (GLOBAL / PER_RANK / REPLICATED /
PER_GROUP / PER_NODE) — every checkpoint component declares its
sharing pattern and the trainer derives barriers and load paths from
that, so pipeline-parallel and FSDP-2 runs checkpoint correctly
without per-trainer custom code. Resume restores optimizer,
scheduler, dataset iterator, RNG, and step counter; optional
replication validation (NONE / QUICK / TENSOR / FULL) hashes
parameters across replicas to catch DDP-sync bugs.
See docs/checkpointing/ for the full
abstraction.
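The idea behind replication validation can be illustrated with a small sketch: hash each replica's parameters and check that the digests agree. This is only an illustration of the principle, using plain Python lists in place of tensors. `param_digest` and `replicas_in_sync` are hypothetical names, and the sketch does not attempt to model Forgather's NONE / QUICK / TENSOR / FULL levels or how digests are exchanged between ranks.

```python
import hashlib
import struct


def param_digest(params: dict[str, list[float]]) -> str:
    """SHA-256 over parameter names and float32 values, in a fixed name order."""
    h = hashlib.sha256()
    for name in sorted(params):
        values = params[name]
        h.update(name.encode("utf-8"))
        h.update(struct.pack(f"<{len(values)}f", *values))
    return h.hexdigest()


def replicas_in_sync(replicas: list[dict[str, list[float]]]) -> bool:
    """True when every replica hashes to the same digest."""
    return len({param_digest(p) for p in replicas}) == 1


rank0 = {"w": [1.0, 2.0], "b": [0.5]}
rank1 = {"w": [1.0, 2.0], "b": [0.5]}
drifted = {"w": [1.0, 2.0001], "b": [0.5]}

print(replicas_in_sync([rank0, rank1]))
print(replicas_in_sync([rank0, drifted]))
```

In a real distributed run each rank would compute its own digest and only the short digests would be exchanged, which keeps the check cheap relative to comparing full tensors.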
Every Forgather experiment is a Project with this structure:
my_project/
├── meta.yaml                 # Project metadata
├── templates/
│   ├── project.yaml          # Project-wide defaults
│   └── configs/              # Experiment configurations
│       ├── baseline.yaml
│       └── experiment_a.yaml
├── output_models/            # Generated code + runs (per config)
└── project_index.ipynb       # Optional interactive notebook
A workspace groups related projects and centralises template
search paths. Use forgather ws create to scaffold one and
forgather project create to add projects to it.
Forgather uses Jinja2 + YAML with custom syntax:
- `-- extends 'template.yaml'`: template inheritance (single parent)
- `[block_name]`: named override-able sections
- `== super()`: include parent's version of the current block
- `-- set ns.var = value`: set a variable in the namespace
- `-- include 'template.yaml'`: include template content inline
- `#---- inline.template.name ----`: split a document into multiple templates
- `!partial:module:Class` / `!factory:...` / `!singleton:...`: construct Python objects
- `!var "name"`: variable references
Every config goes through the same pipeline, and each intermediate step is inspectable:
Templates → YAML → Node Graph → Python Objects
│
└──> (optional) Python source code
- model source export (for HF
trust_remote_code loading)
- debugging / pedagogy
Forgather materialises the node graph directly into Python objects at runtime; the Python-source path is a separate export, not an intermediate step. Model construction uses the export path so the resulting model is framework-portable; everything else (trainer, optimizer, dataset, callbacks) is built by walking the graph.
Inspection commands:
forgather -t config.yaml pp # Preprocess Jinja2 → YAML
forgather -t config.yaml graph --format yaml # Parsed node graph
forgather -t config.yaml targets # Constructable objects in the graph
forgather -t config.yaml code --target model # Python-source export of a target (debug / model export)
forgather -t config.yaml construct --target model --call
# Materialise and show the constructed object

When you hit a config bug, start with forgather ls -d (dumps the
preprocessed file with YAML errors, or the Jinja2 error if
preprocessing itself failed), then escalate to pp --debug (dumps
every template in the chain).
- examples/tutorials/tiny_llama — trains a 5M-param Llama in ~10 minutes; covers config anatomy, dynamic CLI args, monitoring, control, eval, inference, and exporting to plain HF format. Start here.
- examples/tutorials/projects_overview — how Forgather's multi-project layout is organised.
- examples/tutorials/project_composition — cross-project composition (datasets / models / evals as independent projects that reference each other).
- examples/tutorials/hp_lovecraft_project — fine-tune Mistral-7B / Llama-2-7B on the complete works of H.P. Lovecraft on a single 24 GB GPU. Long-context (up to 53K tokens), YaRN, gradient checkpointing, activation offloading.
forgather -i

Drops you into a shell where every subcommand works without the
forgather prefix (so pp, ls, train instead of forgather pp,
etc.). Convenient for quick iteration inside a single project.
Forgather ships with a library of worked examples that go well beyond
the tutorials — each has a detailed README with reproducible commands
and, where relevant, a headline result. The table below picks the
best starting point for each journey. For per-project write-ups (the
"why this is interesting" version of each row), see the
Featured Examples highlights;
for the full directory map, see
examples/README.md.
| Journey | Project | Headline |
|---|---|---|
| Pretrain from scratch | examples/pretrain/small-llm | 162M Llama on SmolLM, Chinchilla-scaling plots |
| Fine-tune a 7B model (multi-GPU) | examples/finetune/samantha | Every trainer backend, ~8.9K tok/s on 4× RTX 4090 |
| Instruction / reasoning fine-tune | examples/finetune/open-orca | 1B Llama, 1B-token budget, ~11h on 4× 4090 |
| Long-context fine-tuning + RoPE recipes | examples/tutorials/hp_lovecraft_project | 7B at 53 K context on one 24 GB GPU |
| Cut peak memory | examples/tiny_experiments/peak_memory | 81% peak-memory reduction at ~2.7× throughput |
| Pick an optimizer | examples/tiny_experiments/optimizers | Ten-optimizer bake-off; Muon wins at small batch |
| Pipeline-parallel recipes | examples/tiny_experiments/pipeline_parallel | GPipe / 1F1B / ZBV / interleaved test harness |
| Decentralised / bandwidth-limited training | examples/tiny_experiments/diloco | DiLoCo with pseudo-gradient compression |
- Scaffold a new project with `forgather project create` (inside an existing workspace) or `forgather ws create` (a brand-new workspace). These commands generate a minimal working `meta.yaml` and `templates/` tree that extends the recommended base templates. Full walk-through: the Tiny Llama tutorial.
- `examples/base_lm_project`: a bare harness that drives the raw `projects/lm_training_project.yaml` template with no project-specific overrides. Useful for inspecting what the base template does on its own, and for debugging changes to the base template itself, but not a typical starting point for new work.
