1. How rollouts work today
PUT /api/serve/applications/ and serve deploy both end at a single controller method, ServeController.apply_config (controller.py:1102). apply_config overwrites a single KV blob CONFIG_CHECKPOINT_KEY = "serve-app-config-checkpoint" and calls ApplicationStateManager.apply_app_configs(...), which per-app does either an in-place override (same code_version) or a fresh build_app task (application_state.py:1299).
Each app independently transitions DEPLOYING -> RUNNING | DEPLOY_FAILED | UNHEALTHY via ApplicationState._determine_app_status (:760). DEPLOY_FAILED is set when any underlying deployment is DEPLOY_FAILED, which the deployment FSM in common.py (handle_transition) triggers once replica startup failures exceed min(max_constructor_retry_count, target_num_replicas * MAX_PER_REPLICA_RETRY_COUNT) while the deployment is still UPDATING.
There is no previous-version retention today. The KV blob holds only the current target.
2. State machine introduced by this feature
```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Watching: new apply_config submitted
    Watching --> Promoted: all apps RUNNING
    Watching --> RollingBack: any app DEPLOY_FAILED
    Watching --> Cancelled: user submits another config
    RollingBack --> RolledBack: rollback reaches RUNNING
    RollingBack --> RollbackFailed: rollback also DEPLOY_FAILED
    Promoted --> Idle: promote pending to last_good
    RolledBack --> Idle
    RollbackFailed --> Idle: alert and stop, no auto retry
    Cancelled --> Watching: new submission becomes pending
```
3. Important considerations
Trigger is ApplicationStatus.DEPLOY_FAILED only: no progress timeout, no UNHEALTHY-driven rollback. This avoids false positives on slow-starting workloads (which a timeout would misflag) and on post-rollout regressions (which an UNHEALTHY trigger would conflate with a bad deploy).
Scope is per-submission. Even if only one app in a multi-app config fails, all apps revert together to the last-good ServeDeploySchema. This matches the atomic semantics of apply_config.
Declarative-only. Imperative serve.run / client.deploy_applications is out of scope for v1; the feature keys off the KV checkpoint that only declarative deploys produce.
"Last good" definition = the most recent ServeDeploySchema where every app reached RUNNING after apply_config. Promotion happens only on success, so we never roll back to a config that itself was failing.
First deploy has no previous good. If the very first apply_config fails, there is nothing to roll back to. Behavior: leave the partial deploy in DEPLOY_FAILED and emit a clear status message ("no previous successful config to roll back to"). Optionally tear down the failed apps if --rollback-on-failure is set; document the trade-off.
Concurrent submissions. If a user pushes a new config while we are watching or rolling back, the new submission wins (cancels the watch / rollback). Reconciliation already serializes through the controller actor, so no extra locking is needed beyond a pending_submission_id.
Avoid rollback ping-pong. If the last-good config also ends in DEPLOY_FAILED after rollback (e.g., environment regressed), do not auto-rollback again. Mark ROLLBACK_FAILED, log clearly, surface in serve status. Operator must intervene.
Persistence and recovery. The watcher state must survive controller crashes. We will extend CONFIG_CHECKPOINT_KEY with an additional dict so _recover_state_from_checkpoint can resume the watch (and re-check whether to roll back) after a restart.
Side effects of revert. Rolling back re-runs apply_config(last_good). Things that get reverted: code versions, replica counts, runtime envs, route prefixes, target_capacity, target_capacity_direction. Apps that the failed config newly added are deleted (since they are not in last-good). Apps the failed config deleted will be re-created from scratch — this resurrects deployments the user may have intentionally deleted, an important UX caveat to document.
HTTP/gRPC/proxy options in ServeDeploySchema are not currently part of apply_config's persisted-config flow (they are applied at start time via serve_head.py). Rollback semantics are deliberately scoped to applications + target_capacity to avoid restarting proxies.
Imperative apps coexisting with declarative apps. apply_app_configs only deletes apps with api_type == DECLARATIVE. Auto-rollback re-applies a ServeDeploySchema, so it inherits the same scope: imperative apps (deployed via serve.run) are untouched.
Idempotency. DeploymentState.deploy() already short-circuits on no-op; re-applying an unchanged app costs nothing. A rollback in which only the failed app changed is essentially free for the others.
Detection latency. ApplicationState.update() runs on each control loop tick, so DEPLOY_FAILED is observable within one or two control loops of the failure. The watcher polls inside the same loop, so no extra timer is required.
Status surface. New fields on ApplicationDetails and ServeInstanceDetails:
rollout_status: WATCHING | NONE | ROLLING_BACK | ROLLED_BACK | ROLLBACK_FAILED
rolled_back_from_deployment_time, last_good_deployment_time
serve status prints these alongside the existing ApplicationStatus.
Observability. Emit a controller log: "Auto-rollback triggered: app '<X>' is DEPLOY_FAILED, reverting to config from <ts>". Add metrics: serve_auto_rollback_triggered_total, serve_auto_rollback_succeeded_total, serve_auto_rollback_failed_total.
Opt-in vs opt-out. Recommendation: opt-in via a new field on ServeDeploySchema (rollout_strategy.auto_rollback: bool = False and serve deploy --rollback-on-failure), to avoid surprising existing users. The value is persisted with the config so it survives controller restarts.
Tests. Need coverage for: build-task failure rollback, replica-startup failure rollback, mid-rollout health failure rollback, controller restart during watch, controller restart during rollback, multi-app partial failure, rollback ping-pong (rollback also fails), and auto_rollback=False keeping current behavior. A skeleton follows this list.
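A hypothetical pytest skeleton enumerating these cases; the serve_instance fixture and scenario names are placeholders, not existing test utilities:

```python
import pytest

# Scenario names mirror the coverage list above; each would drive a
# deliberately failing deploy and assert on the resulting rollout_status.
ROLLBACK_SCENARIOS = [
    "build_task_failure",
    "replica_startup_failure",
    "mid_rollout_health_failure",
    "controller_restart_during_watch",
    "controller_restart_during_rollback",
    "multi_app_partial_failure",
    "rollback_ping_pong",
]


@pytest.mark.parametrize("scenario", ROLLBACK_SCENARIOS)
def test_auto_rollback(scenario, serve_instance):
    ...  # deploy a good config, deploy a failing one, assert rollback outcome


def test_auto_rollback_disabled_keeps_current_behavior(serve_instance):
    ...  # same failing deploy with auto_rollback=False; expect DEPLOY_FAILED
```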
4. High-level approach
4a. Persist the previous good config
Today CONFIG_CHECKPOINT_KEY stores (deployment_time, target_capacity, target_capacity_direction, config_dict) (controller.py:1143). Replace with a forward-compatible dict:
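```
{
    "version": 2,
    "current": {
        "deployment_time": float,
        "target_capacity": float | None,
        "target_capacity_direction": TargetCapacityDirection | None,
        "config_dict": Dict[str, app_config_dict],
        "auto_rollback_enabled": bool,
    },
    "last_good": { ... same shape ... } | None,
    "rollout_state": "WATCHING" | "NONE" | "ROLLING_BACK" | "ROLLBACK_FAILED",
    "rollout_started_at": float,
}
```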
_read_config_checkpoint (controller.py:778) gets a v1 -> v2 migration shim. _recover_state_from_checkpoint (controller.py:760) restores the watcher state and decides whether to resume rollback.
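A minimal sketch of that shim, assuming the v1 tuple layout above; the function name and migrated defaults are illustrative:

```python
# Hypothetical v1 -> v2 shim inside _read_config_checkpoint. The v1 tuple
# layout is the one described above; everything else is an assumption.
def _migrate_config_checkpoint(data):
    if isinstance(data, dict) and data.get("version") == 2:
        return data  # already v2
    deployment_time, target_capacity, direction, config_dict = data  # v1 tuple
    return {
        "version": 2,
        "current": {
            "deployment_time": deployment_time,
            "target_capacity": target_capacity,
            "target_capacity_direction": direction,
            "config_dict": config_dict,
            "auto_rollback_enabled": False,  # v1 predates the feature
        },
        "last_good": None,        # unknown pre-upgrade; nothing to roll back to
        "rollout_state": "NONE",
        "rollout_started_at": None,
    }
```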
4b. Add a RolloutSupervisor inside the controller
A small object owned by ServeController, ticked from the main control loop right after application_state_manager.update() (controller.py:585). Responsibilities (see the sketch after this list):
Track the current "pending submission" (app names, started_at, auto_rollback_enabled).
On each tick, read app statuses via ApplicationStateManager.list_app_statuses(...).
If every app in pending.config_dict is RUNNING -> promote: last_good := pending, clear pending, persist KV, emit metric.
If any app is DEPLOY_FAILED and auto_rollback_enabled:
If last_good is None -> mark rollout_state = "NONE"; surface "no prior good config" message; stop watching.
Else -> set rollout_state = "ROLLING_BACK", persist KV, call self._apply_config_internal(last_good_config, is_rollback=True). The rollback's reconciliation is itself watched: if it reaches RUNNING, set ROLLED_BACK -> NONE and keep last_good unchanged; if it ends in DEPLOY_FAILED, set ROLLBACK_FAILED and stop.
New submission supersedes anything in flight: clear any WATCHING / ROLLING_BACK state and start a fresh watch.
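A minimal Python sketch of this tick logic, assuming the checkpoint shape from 4a; RolloutSupervisor, arm, _save_config_checkpoint, and the string statuses are illustrative names, not existing controller API:

```python
from typing import Dict, Optional


class RolloutSupervisor:
    """Per-tick watcher owned by ServeController (all names are a sketch)."""

    def __init__(self, controller):
        self._controller = controller
        self.pending: Optional[dict] = None    # submission currently watched
        self.last_good: Optional[dict] = None  # last config whose apps all ran
        self.rollout_state: str = "NONE"

    def arm(self, submission: dict) -> None:
        # A new submission supersedes any in-flight watch or rollback.
        self.pending = submission
        self.rollout_state = "WATCHING"

    def update(self, app_statuses: Dict[str, str]) -> None:
        # Ticked once per control loop, after application_state_manager.update().
        if self.pending is None:
            return
        watched = [app_statuses.get(name) for name in self.pending["config_dict"]]

        if all(s == "RUNNING" for s in watched):
            if self.rollout_state == "WATCHING":
                self.last_good = self.pending  # promote
            # After a successful rollback, last_good stays unchanged.
            self.pending, self.rollout_state = None, "NONE"
            self._controller._save_config_checkpoint()
        elif any(s == "DEPLOY_FAILED" for s in watched):
            if self.rollout_state == "ROLLING_BACK":
                # The rollback itself failed: stop, never auto-retry.
                self.pending, self.rollout_state = None, "ROLLBACK_FAILED"
            elif not self.pending["auto_rollback_enabled"]:
                # Opt-out: keep today's behavior, stop watching.
                self.pending, self.rollout_state = None, "NONE"
            elif self.last_good is None:
                # First deploy: no prior good config to roll back to.
                self.pending, self.rollout_state = None, "NONE"
            else:
                self.rollout_state = "ROLLING_BACK"
                self._controller._apply_config_internal(
                    self.last_good, is_rollback=True
                )
                self.pending = self.last_good  # now watch the rollback itself
            self._controller._save_config_checkpoint()
```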
4c. Wire opt-in flag
Add an optional RolloutStrategySchema to ServeDeploySchema (schema.py:974):
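A minimal sketch, assuming the pydantic BaseModel style used elsewhere in schema.py; only the auto_rollback field is specified by this proposal:

```python
from pydantic import BaseModel, Field


class RolloutStrategySchema(BaseModel):
    # Opt-in so existing configs keep today's behavior (see section 3).
    auto_rollback: bool = Field(
        default=False,
        description=(
            "If True and any application in this config ends in DEPLOY_FAILED, "
            "revert to the last config whose applications all reached RUNNING."
        ),
    )


# Illustrative placement on ServeDeploySchema (schema.py:974):
#     rollout_strategy: RolloutStrategySchema = RolloutStrategySchema()
```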
Add --rollback-on-failure / --no-rollback-on-failure to serve deploy in scripts.py (:341); the flag mutates the schema before ServeSubmissionClient(...).deploy_applications(...). A sketch of the wiring follows this list.
Forward-compat preserved by extra="allow" already on ServeDeploySchema.
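A hedged sketch of that CLI wiring, assuming serve deploy is a click command as in scripts.py; only the option name follows the proposal, the body is illustrative:

```python
import click


# Hypothetical flag on the existing `serve deploy` command in scripts.py.
@click.command()
@click.option(
    "--rollback-on-failure/--no-rollback-on-failure",
    default=False,
    help="Revert to the last successful config if this deploy fails.",
)
def deploy(rollback_on_failure: bool):
    # Mutate the parsed ServeDeploySchema before
    # ServeSubmissionClient(...).deploy_applications(...):
    #     config.rollout_strategy.auto_rollback = rollback_on_failure
    ...
```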
4d. Refactor apply_config to feed the supervisor
Split current apply_config into:
public apply_config(config, deployment_time=0) (unchanged signature) -> calls the internal helper with is_rollback=False and arms the supervisor.
private _apply_config_internal(config, deployment_time, *, is_rollback: bool) -> does today's work (persist KV, call apply_app_configs, save_checkpoint). When is_rollback=True, the supervisor sets rollout_state="ROLLING_BACK" instead of WATCHING and does not arm a second auto-rollback (prevents ping-pong even if last_good itself fails).
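A minimal sketch of the split; the signatures come from this section, while the supervisor hand-off and helper body are assumptions:

```python
# Illustrative controller methods; only the signatures follow the proposal.
def apply_config(self, config, deployment_time: float = 0) -> None:
    # Public entry point: unchanged signature, always arms the watcher.
    self._apply_config_internal(config, deployment_time, is_rollback=False)
    self._rollout_supervisor.arm(
        {
            "config_dict": {app.name: app.dict() for app in config.applications},
            "auto_rollback_enabled": bool(
                config.rollout_strategy and config.rollout_strategy.auto_rollback
            ),
            "deployment_time": deployment_time,
        }
    )


def _apply_config_internal(
    self, config, deployment_time: float, *, is_rollback: bool
) -> None:
    ...  # today's logic: persist KV, apply_app_configs, save_checkpoint
```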
4e. Surface state to users
Extend ApplicationDetails / ServeInstanceDetails in schema.py with the new rollout fields.
Update ServeController.get_serve_instance_details (:1258) and scripts.py _get_status (:711) to render them.
Emit the new metrics through the metrics_pusher infrastructure used by DeploymentStateManager.
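A sketch of those schema additions, assuming pydantic models as in schema.py; the mixin name and Optional defaults are illustrative, while the field names come from section 3:

```python
from typing import Optional

from pydantic import BaseModel, Field


class RolloutDetailsMixin(BaseModel):
    # Hypothetical shared fields for ApplicationDetails / ServeInstanceDetails.
    rollout_status: str = Field(
        default="NONE",
        description="WATCHING | NONE | ROLLING_BACK | ROLLED_BACK | ROLLBACK_FAILED",
    )
    rolled_back_from_deployment_time: Optional[float] = None
    last_good_deployment_time: Optional[float] = None
```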