🧠 HSLM Training Farm — Live Status
📊 Farm Overview (2026-03-13)
| Account | Slots | Training | Building | Queued | Crashed | Fixed |
|---|---|---|---|---|---|---|
| Primary | 2/25 | 2 ✅ | 0 | 0 | 0 | - |
| Farm-2 | 8/25 | 0→8 🔧 | 8 (rebuilding) | 0 | 0 | 8 just fixed |
| Farm-3 | 16/25 | 1 (R20) | 6 | 7 | 0 | 2 just fixed |
| Total | 26/75 | 3 | 14 | 7 | 0 | 10 fixed |
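The Total row can be re-derived from the per-account rows. The snippet below is a minimal sketch of that check, using only the numbers from the table above; the dictionary layout and the names `SLOT_LIMIT`, `farm`, and `column_totals` are illustrative assumptions, not part of the farm tooling.

```python
# Sanity check for the Total row, using values copied from the overview table.
# All names here are hypothetical; only the numbers come from the status page.
SLOT_LIMIT = 25  # slots per account, per the stated goal of 3 accounts x 25

farm = {
    "Primary": {"slots": 2, "training": 2, "building": 0, "queued": 0, "crashed": 0},
    "Farm-2": {"slots": 8, "training": 0, "building": 8, "queued": 0, "crashed": 0},
    "Farm-3": {"slots": 16, "training": 1, "building": 6, "queued": 7, "crashed": 0},
}


def column_totals(accounts):
    """Sum every numeric column across accounts."""
    totals = {}
    for counts in accounts.values():
        for column, value in counts.items():
            totals[column] = totals.get(column, 0) + value
    return totals


totals = column_totals(farm)
capacity = SLOT_LIMIT * len(farm)
utilization = totals["slots"] / capacity

print(totals)
print(f"{totals['slots']}/{capacity} slots used ({utilization:.1%})")
```

Farm-3 lists 16 slots but only 14 services under Training, Building, and Queued; the remaining 2 correspond to the redeployed services in the Fixed column.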
🔧 Actions Taken This Cycle
- ✅ Fixed startCommand=null on ALL services (was referencing deleted entrypoint-train.sh)
- ✅ Set HSLM_* env vars on all 8 farm-2 services
- ✅ Redeployed all 10 crashed services (8 farm-2 + 2 farm-3)
- ✅ Fixed primary services startCommand for future redeploys
📈 Active Training Metrics
| Service | Account | Step | AvgLoss | PPL | LR | Tok/s |
|---|---|---|---|---|---|---|
| hslm-v11 | primary | 21K | 6.12 | ~455 | 1.41e-4 | 12,113 |
| hslm-train | primary | 4.7K | 6.18 | ~483 | 2.82e-4 | 12,830 |
| hslm-r20 | farm-3 | ? | ? | ? | ? | ? |
🏆 Best Known Result
v4R: PPL=125 (Loss=4.83) — Adam, LR=3e-4, cosine, 100K steps
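The PPL figures reported here are consistent with perplexity computed as the exponential of the average loss. The sketch below assumes AvgLoss is a mean cross-entropy in nats per token; that is inferred from the numbers, not stated by the training code, and the `perplexity` helper is a hypothetical name.

```python
import math


def perplexity(avg_loss: float) -> float:
    """Perplexity, assuming avg_loss is mean cross-entropy in nats per token."""
    return math.exp(avg_loss)


# Loss values copied from the metrics table and the best known result.
reported = [("hslm-v11", 6.12), ("hslm-train", 6.18), ("v4R", 4.83)]

for name, loss in reported:
    print(f"{name}: loss={loss} -> PPL ~{perplexity(loss):.0f}")
```

Under that assumption, closing the gap from the current ~455 to the best known 125 means lowering the average loss by about 1.29 nats.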
🎯 Goal: 75/75 slots utilized (3 accounts × 25)
Wave 4+5 Experiment Matrix
| Run | Optimizer | LR | Schedule | Special | Account | Status |
|---|---|---|---|---|---|---|
| R5 | adam | 3e-4 | cosine | baseline | farm-2 | 🔄 rebuilding |
| R6 | adamw | 1e-3 | cosine | - | farm-2 | 🔄 rebuilding |
| R10 | lamb | 3e-4 | cosine | ga=2 | farm-2 | 🔄 rebuilding |
| R11 | adam | 3e-4 | cosine | restarts | farm-2 | 🔄 rebuilding |
| R12 | adam | 3e-4 | cosine | ga=4 | farm-2 | 🔄 rebuilding |
| R13 | lamb | 1e-3 | cosine | ga=4 | farm-2 | 🔄 rebuilding |
| R14 | adam | 5e-4 | cosine | higher LR | farm-3 | 🔄 redeployed |
| R15 | adamw | 5e-4 | cosine | WD=0.05 | farm-3 | ⏳ queued |
| R16 | adam | 3e-4 | sacred | φ-scale | farm-3 | ⏳ queued |
| R17 | adam | 3e-4 | cosine | adaptive-sparsity | farm-3 | ⏳ queued |
| R18 | adam | 3e-4 | cosine | ternary-schedule | farm-2 | 🔄 rebuilding |
| R19 | lamb | 3e-3 | cosine | ga=8 | farm-2 | 🔄 rebuilding |
| R20 | adam | 3e-4 | cosine | full-ternary | farm-3 | ✅ SUCCESS |
| R21 | lamb | 3e-4 | cosine | batch=128 | farm-3 | 🔄 redeployed |
| R22 | lamb | 5e-4 | cosine | batch=128 | farm-3 | 🔨 building |
| R23 | adam | 3e-4 | cosine | batch=128 | farm-3 | 🔨 building |
| R24 | lamb | 1e-3 | cosine | batch=66 | farm-3 | ⏳ queued |
| R25 | adam | 1e-3 | cosine | batch=128 | farm-3 | 🔨 building |
| R26 | adam | 3e-4 | cosine | ga=2 | farm-3 | ⏳ queued |
| R27 | lamb | 3e-4 | cosine | ga=4 | farm-3 | 🔨 building |
| R28 | adam | 5e-4 | cosine | batch=128 | farm-3 | 🔨 building |
| R29 | adamw | 3e-4 | cosine | WD=0.01 | farm-3 | 🔨 building |
| R30 | lamb | 3e-4 | cosine | warmup=5K | farm-3 | ⏳ queued |
| T1 | adam | 3e-4 | cosine | ctx=27, 30K | farm-3 | ⏳ queued |
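Most runs in the matrix use a cosine schedule, and R30 adds a 5K-step warmup. As a reference point only, here is a minimal sketch of a textbook cosine decay with optional linear warmup. The function name `cosine_lr`, the `min_lr` floor, and the linear warmup shape are assumptions for illustration; the farm's actual scheduler (and the "sacred" schedule in R16) may behave differently.

```python
import math


def cosine_lr(step, base_lr=3e-4, total_steps=100_000, warmup_steps=0, min_lr=0.0):
    """Textbook cosine decay with optional linear warmup (hypothetical helper).

    Defaults mirror the baseline row (LR=3e-4, cosine) and the 100K-step
    length of the best known result; they are not read from any config.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, (step - warmup_steps) / decay_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Baseline-style schedule versus an R30-style schedule with warmup=5K.
for step in (0, 5_000, 50_000, 100_000):
    plain = cosine_lr(step)
    warmed = cosine_lr(step, warmup_steps=5_000)
    print(f"step {step:>7}: no warmup {plain:.2e}, warmup=5K {warmed:.2e}")
```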
💰 Cost
~$0.67/run × 26 runs = ~$17.42 (spread across 3 accounts, covered by PRO credits)
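The cost line is straightforward arithmetic, sketched below with `Decimal` to avoid floating-point noise. The projection to 75 slots is an assumption layered on top of it: it supposes one run per slot at the same approximate per-run cost, which the status page does not state.

```python
from decimal import Decimal

COST_PER_RUN = Decimal("0.67")  # approximate USD per run, from the cost line
CURRENT_RUNS = 26               # slots in use, from the overview table
TARGET_RUNS = 75                # stated goal: 3 accounts x 25 slots

current_cost = COST_PER_RUN * CURRENT_RUNS
# Assumption: every slot hosts exactly one run at the same per-run cost.
projected_cost = COST_PER_RUN * TARGET_RUNS

print(f"current:   ~${current_cost}")
print(f"projected: ~${projected_cost} at full utilization")
```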
Auto-updated by /iterate cycle. Next update in ~15 min.