Tags: Zeerg/parameter-golf
Add gen-15 results (val_bpb=1.6807, dim=640 wider model)
6 phys × 2 loops, dim=640, mlp_mult=2, 20.3M params. Wider model has
lower train loss but higher val_bpb than gen-13 at 5000 steps — needs
more steps to converge with tiny Mac batches. Would likely win on H100s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add gen-14 convergence results (val_bpb=1.4995, 20k steps)
5 phys × 2 loops, dim=576, mlp_mult=2 — same config as gen-12/13.
20k steps on Mac M3 Max, 3.6 hours. Loss still dropping at completion.
On H100s with proper batch sizes this architecture should beat baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add gen-13 convergence results (val_bpb=1.6300) and port looping to CUDA
gen-13: same looped config as gen-12, 5000 steps — val_bpb=1.6300
(was 2.0597 at 500 steps). Loss still dropping, no plateau from looping.
Port layer looping to train_gpt.py (CUDA/H100 submission version):
- num_physical_layers + loop_scales in GPT class
- Modular block indexing in forward pass
- loop_scales in optimizer scalar params and control tensor patterns
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
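The modular block indexing named in the gen-13 commit can be sketched as follows. This is a minimal illustration, not the actual train_gpt.py change: `num_physical_layers` and `loop_scales` are the names from the commit, while the schedule construction and the residual-scaling role of `loop_scales` are assumptions.

```python
# Sketch of modular block indexing for layer looping (assumed shape of
# the real train_gpt.py change): with N physical blocks looped L times,
# effective step i reuses physical block i % N.
num_physical_layers = 5
num_loops = 2
num_effective = num_physical_layers * num_loops  # 10 effective layers

# Which physical block each effective step runs:
schedule = [i % num_physical_layers for i in range(num_effective)]
print(schedule)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# loop_scales would then be one learned scalar per effective step,
# scaling that step's block output (an assumption about its role).
loop_scales = [1.0] * num_effective
```

The same blocks run twice, so depth doubles while the weight count stays at five blocks' worth.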
Add gen-12 layer looping results (val_bpb=2.0597, new best)
5 physical layers × 2 loops = 10 effective, dim=576, mlp_mult=2.
13.9M params, 6.5MB artifact, ~6GB memory — no OOM.
Layer looping validated: wider dims with shared layers beat more unique
narrow layers. Updated experiment log and looping docs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
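The trade-off behind "wider dims with shared layers" can be checked with back-of-the-envelope arithmetic. This is a rough sketch: the per-block formula is an assumption that ignores embeddings, norms, biases, and grouped KV heads, so the totals are illustrative rather than a reconstruction of the reported 13.9M.

```python
def block_params(dim, mlp_mult):
    # Rough transformer-block estimate (assumption): four dim*dim
    # attention projections (Q, K, V, output) plus an up and a down
    # MLP projection with hidden size mlp_mult * dim.
    attn = 4 * dim * dim
    mlp = 2 * mlp_mult * dim * dim
    return attn + mlp

# gen-12: 5 shared blocks at dim=576 give 10 effective layers
shared = 5 * block_params(576, 2)    # 13,271,040 block params
# Roughly the same budget spent on 10 unique, narrower blocks
unique = 10 * block_params(408, 2)   # 13,317,120 block params
```

At a near-equal parameter budget, sharing buys dim=576 instead of roughly dim=408 at the same effective depth, which is the comparison gen-12 settles in favor of width.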
Add gen-9 through gen-11 results (best val_bpb=2.1446)
gen-9: 9 layers, dim=512, mlp_mult=1 — val_bpb=2.1770, 10.6M params
gen-10: 11 layers, dim=576 rejected (~18.4M params)
gen-11: 8 layers, dim=512, mlp_mult=2, kv_heads=4 — val_bpb=2.1446,
15.2M params (new best)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add gen-7 (rejected, oversized) and gen-8 results (val_bpb=2.1863)
gen-7: mlp_mult=3 + dim=512 rejected pre-training (~17.8M params)
gen-8: 8 layers, dim=576, mlp_mult=1, kv_heads=1 — 11.8M params,
artifact 9.5MB, val_bpb=2.1863 at 500 steps
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>