No error handling around rollout collection. If the WAA server fails mid-rollout (timeout, crash, undismissable dialog), the entire training run crashes.
Need:
- Proactive `health_check()` before rollouts
- try/except with retry in `collect_rollout`
- VM pool health monitoring endpoint (`GET /health -> {"status": "ready"|"busy"|"needs_recovery"}`)
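
Rough sketch of how the first two items could fit together, not a final design. `collect_with_retry`, `RolloutError`, and the `max_retries` / `backoff_s` / `sleep` parameters are hypothetical names made up for illustration; `collect` and `health_check` stand in for the real `collect_rollout` and `health_check()` calls, whose actual signatures may differ.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RolloutError(Exception):
    """Hypothetical error type for a rollout that could not be started."""


def collect_with_retry(
    collect: Callable[[], T],
    health_check: Callable[[], bool],
    max_retries: int = 3,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run one rollout with a health check before each attempt.

    Returns the rollout on success, or None once all attempts are used up,
    so the training loop can skip the rollout instead of crashing.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        if not health_check():
            # Server is not ready: do not start a rollout on this attempt.
            last_error = RolloutError("health check failed")
        else:
            try:
                return collect()
            except Exception as exc:  # timeout, crash, stuck dialog, ...
                last_error = exc
        if attempt < max_retries - 1:
            # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
            sleep(backoff_s * 2 ** attempt)
    print(f"rollout skipped after {max_retries} attempts: {last_error!r}")
    return None
```

In the training loop this would be something like `rollout = collect_with_retry(lambda: collect_rollout(task), health_check)`, dropping `None` results before the update step. Whether to skip or to trigger VM recovery on repeated failure is an open question that the `needs_recovery` status could drive.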