Fix elastic autoscaling: use base model for health checks #1745

Merged
JannikSt merged 1 commit into env-worker from fix/elastic-autoscaling-env-worker
Feb 9, 2026
JannikSt commented Feb 9, 2026

Summary

  • Fix elastic autoscaling not working for newly scaled-up inference servers
  • The health check used model_name, which is updated to the LoRA adapter name after training starts
  • New servers only have the base model loaded, so they failed health checks indefinitely

The bug

After training starts, update_model_name() changes model_name to the LoRA adapter name (e.g., rft-xxx). New inference servers only have the base model loaded. The health check would reject them, so they'd never get added to the pool and never receive the LoRA adapter.

The fix

Store the original model name as base_model_name at init time and use it for health checks, so the check no longer depends on the mutable model_name.
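
A minimal sketch of the idea, not the actual diff. `ElasticPoolSketch`, `is_healthy`, and the `served_models` argument are hypothetical names used only for illustration; only `model_name`, `base_model_name`, and `update_model_name()` come from the description above.

```python
class ElasticPoolSketch:
    """Hypothetical stand-in for the pool; models only the naming logic."""

    def __init__(self, model_name: str) -> None:
        # Mutable: later replaced by the LoRA adapter name.
        self.model_name = model_name
        # Fixed at init time: the name every fresh server is expected to serve.
        self.base_model_name = model_name

    def update_model_name(self, model_name: str) -> None:
        # Called once training starts; base_model_name is deliberately untouched.
        self.model_name = model_name

    def is_healthy(self, served_models: set[str]) -> bool:
        # Assumed check: a server is healthy if it serves the base model.
        return self.base_model_name in served_models


pool = ElasticPoolSketch("base-model")
pool.update_model_name("rft-xxx")

# A newly scaled-up server has only the base model loaded.
fresh_server_models = {"base-model"}

old_check = pool.model_name in fresh_server_models  # pre-fix behaviour
new_check = pool.is_healthy(fresh_server_models)    # post-fix behaviour
print(old_check, new_check)
```

With the old comparison the fresh server is rejected because it does not list `rft-xxx`; comparing against `base_model_name` accepts it, after which the adapter can be loaded onto it.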


Note

Low Risk
Low risk change limited to elastic pool health checks; main risk is unintentionally accepting servers that lack the current LoRA model, which is mitigated by the separate adapter sync logic.

Overview
Fixes elastic inference autoscaling by decoupling server health checks from the mutable model_name.

ElasticInferencePool now stores the initial model_name as base_model_name and uses it when validating /v1/models in _check_server_health, so newly scaled servers that only have the base model are considered healthy and can be added to the pool for subsequent LoRA adapter syncing.
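
The `/v1/models` validation could look roughly like the following. This is a sketch under assumptions: it presumes the server returns an OpenAI-style model list (`{"data": [{"id": ...}]}`), and `serves_model` is a hypothetical helper, not the body of `_check_server_health`.

```python
from typing import Any


def serves_model(models_payload: dict[str, Any], expected: str) -> bool:
    """Return True if an OpenAI-style /v1/models payload lists `expected`."""
    entries = models_payload.get("data", [])
    return any(entry.get("id") == expected for entry in entries)


# Assumed response from a freshly scaled server: base model only, no adapter yet.
fresh_payload = {"object": "list", "data": [{"id": "base-model", "object": "model"}]}

base_ok = serves_model(fresh_payload, "base-model")  # check against base_model_name
lora_ok = serves_model(fresh_payload, "rft-xxx")     # check against the mutable model_name
print(base_ok, lora_ok)
```

Checking the base name only proves the server is up and serving the right base weights; whether it also has the current adapter is left to the separate sync step.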

Written by Cursor Bugbot for commit 7d566a9.

The health check was using `model_name` which gets updated to the LoRA
adapter name after training starts. New inference servers only have the
base model, so they would fail health checks and never get added to the
pool.

Store the original model name as `base_model_name` and use it for health
checks, allowing new servers to be discovered and have the LoRA adapter
loaded on them.
@JannikSt JannikSt merged commit 02a29d3 into env-worker Feb 9, 2026
8 checks passed
samsja pushed a commit that referenced this pull request Feb 9, 2026
* start env servers for env group

* bump

* working reverse-text

* correctly set log levels

* use client configs everywhere + make val work

* cycle through clients via inference pool

* update model name

* deprecate env worker

* implement evals

* deprecate evals+synthesize

* deprecate serialization

* use all clients for evals

* simplify config

* fix types and use inference pool for opd

* bring back logging intercept

* setup env client/server in prime-rl

* externalize running env server

* style

* bring back rate limiter on scheduler

* revert vf branch

* back to custom branch

* do not double asyncio

* bring back eval

* fix cpu tests

* bump vf

* add math group config

* remove stop server call

* add logs

* remove last mentions of vf.State

* deprecate some configs

* more

* remove eval + cpu integration tests

* remove evals + synthesize configs

* bump vf

* fix branch with vlm cache

* do not reference rollout status

* use correct model name

* stop teacher infer pool if setup

* update changelog

* deprecate server discovery (unused)

* fix env id stripping

* use updated model name for evals

* strip env version on env server

* remove server discovery tests

* bump vf

* do not fail if env server not yet up

* add elastic sanity check

* update docs

* use extra env kwargs consistently across orch and env server

* update math group config

* update cfg

* use dynamic model name in final evals

* add extra_env_kwargs to changelog

* resolve vf merge conflicts

* assert lora name not None

* fix unit tests

* fix types

* add hendrycks math sanity check

* bump vf

* do not double repeat eval inputs

* do not duplicate eval inputs

* lower avg@

* Initialize logger in env-server before install_env (#1743)

install_env() calls get_logger() which requires the logger to be set up first.
This was missing in env-server but present in orchestrator.

* disable vf logging on orch

* Fix elastic autoscaling: use base model for health checks (#1745)

The health check was using `model_name` which gets updated to the LoRA
adapter name after training starts. New inference servers only have the
base model, so they would fail health checks and never get added to the
pool.

Store the original model name as `base_model_name` and use it for health
checks, allowing new servers to be discovered and have the LoRA adapter
loaded on them.

---------

Co-authored-by: JannikSt <JannikSt@users.noreply.github.com>
Co-authored-by: William Brown <williambrown97@gmail.com>