
[Serve][RFC] Auto-rollback for serve deploy #63016

@abrarsheikh

Description

1. How rollouts work today

  • The CLI/REST entry point PUT /api/serve/applications/ and serve deploy both end at a single controller method, ServeController.apply_config (controller.py:1102).
  • apply_config overwrites a single KV blob CONFIG_CHECKPOINT_KEY = "serve-app-config-checkpoint" and calls ApplicationStateManager.apply_app_configs(...), which per-app does either an in-place override (same code_version) or a fresh build_app task (application_state.py:1299).
  • Each app independently transitions DEPLOYING -> RUNNING | DEPLOY_FAILED | UNHEALTHY via ApplicationState._determine_app_status (:760). DEPLOY_FAILED is set when any underlying deployment is DEPLOY_FAILED, which the deployment FSM in common.py (handle_transition) triggers on:
    • build_app task error (import/runtime env)
    • replica startup retries reaching min(max_constructor_retry_count, target_num_replicas * MAX_PER_REPLICA_RETRY_COUNT)
    • health-check failure while still UPDATING
  • There is no previous-version retention today. The KV blob holds only the current target.
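The replica-startup failure threshold above is just a min over two retry budgets; a hypothetical sketch (the constant and function names here are illustrative, not the actual Serve internals):

```python
# Assumption: a fixed per-replica retry cap, as described in the bullet above.
MAX_PER_REPLICA_RETRY_COUNT = 3

def startup_retry_budget(max_constructor_retry_count: int,
                         target_num_replicas: int) -> int:
    """Total constructor retries allowed before the deployment FSM
    marks the deployment DEPLOY_FAILED."""
    return min(max_constructor_retry_count,
               target_num_replicas * MAX_PER_REPLICA_RETRY_COUNT)
```

With a large configured retry count, the budget scales with the replica count (e.g. 2 replicas allow 6 retries); a small configured count caps it regardless of scale.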

2. State machine introduced by this feature

```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Watching: new config submitted via apply_config
    Watching --> Promoted: all apps RUNNING
    Watching --> RollingBack: any app DEPLOY_FAILED
    Watching --> Cancelled: user submits another config
    RollingBack --> RolledBack: rollback reaches RUNNING
    RollingBack --> RollbackFailed: rollback also DEPLOY_FAILED
    Promoted --> Idle: promote pending to last_good
    RolledBack --> Idle
    RollbackFailed --> Idle: alert and stop, no auto retry
    Cancelled --> Watching: new submission becomes pending
```
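The diagram can also be read as a transition table. A minimal Python sketch, using event names of my own choosing (the real implementation keys off ApplicationStatus and controller events, not these labels):

```python
from enum import Enum, auto

class RolloutState(Enum):
    IDLE = auto()
    WATCHING = auto()
    PROMOTED = auto()
    ROLLING_BACK = auto()
    ROLLED_BACK = auto()
    ROLLBACK_FAILED = auto()
    CANCELLED = auto()

# (state, event) -> next state, mirroring the diagram above.
TRANSITIONS = {
    (RolloutState.IDLE, "config_submitted"): RolloutState.WATCHING,
    (RolloutState.WATCHING, "all_apps_running"): RolloutState.PROMOTED,
    (RolloutState.WATCHING, "app_deploy_failed"): RolloutState.ROLLING_BACK,
    (RolloutState.WATCHING, "new_config_submitted"): RolloutState.CANCELLED,
    (RolloutState.ROLLING_BACK, "rollback_running"): RolloutState.ROLLED_BACK,
    (RolloutState.ROLLING_BACK, "rollback_deploy_failed"): RolloutState.ROLLBACK_FAILED,
    (RolloutState.PROMOTED, "promoted_to_last_good"): RolloutState.IDLE,
    (RolloutState.ROLLED_BACK, "done"): RolloutState.IDLE,
    (RolloutState.ROLLBACK_FAILED, "operator_alerted"): RolloutState.IDLE,
    (RolloutState.CANCELLED, "new_submission_pending"): RolloutState.WATCHING,
}

def step(state: RolloutState, event: str) -> RolloutState:
    """Advance the rollout FSM; unknown (state, event) pairs raise KeyError."""
    return TRANSITIONS[(state, event)]
```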

3. Important considerations

  • Trigger is ApplicationStatus.DEPLOY_FAILED only. No progress timeout, no UNHEALTHY-driven rollback. This avoids false positives on slow-starting workloads and post-rollout regressions.
  • Scope is per-submission. Even if only one app in a multi-app config fails, all apps revert together to the last-good ServeDeploySchema. This matches the atomic semantics of apply_config.
  • Declarative-only. Imperative serve.run / client.deploy_applications is out of scope for v1; the feature keys off the KV checkpoint that only declarative deploys produce.
  • "Last good" definition = the most recent ServeDeploySchema where every app reached RUNNING after apply_config. Promotion happens only on success, so we never roll back to a config that itself was failing.
  • First deploy has no previous good. If the very first apply_config fails, there is nothing to roll back to. Behavior: leave the partial deploy in DEPLOY_FAILED and emit a clear status message ("no previous successful config to roll back to"). Optionally tear down the failed apps if --rollback-on-failure is set; document the trade-off.
  • Concurrent submissions. If a user pushes a new config while we are watching or rolling back, the new submission wins (cancels the watch / rollback). Reconciliation already serializes through the controller actor, so no extra locking is needed beyond a pending_submission_id.
  • Avoid rollback ping-pong. If the last-good config also ends in DEPLOY_FAILED after rollback (e.g., environment regressed), do not auto-rollback again. Mark ROLLBACK_FAILED, log clearly, surface in serve status. Operator must intervene.
  • Persistence and recovery. The watcher state must survive controller crashes. We will extend CONFIG_CHECKPOINT_KEY with an additional dict so _recover_state_from_checkpoint can resume the watch (and re-check whether to roll back) after a restart.
  • Side effects of revert. Rolling back re-runs apply_config(last_good). Things that get reverted: code versions, replica counts, runtime envs, route prefixes, target_capacity, target_capacity_direction. Apps that the failed config newly added are deleted (since they are not in last-good). Apps the failed config deleted will be re-created from scratch — this resurrects deployments the user may have intentionally deleted, an important UX caveat to document.
  • HTTP/gRPC/proxy options in ServeDeploySchema are not currently part of apply_config's persisted-config flow (they are applied at serve_head.py start time). Rollback semantics are deliberately scoped to applications + target_capacity to avoid restarting proxies.
  • Imperative apps coexisting with declarative apps. apply_app_configs only deletes apps with api_type == DECLARATIVE. Auto-rollback re-applies a ServeDeploySchema, so it inherits the same scope: imperative apps (deployed via serve.run) are untouched.
  • Idempotency. DeploymentState.deploy() already short-circuits on no-op; re-applying an unchanged app costs nothing. Rollback that touches only the failed app is essentially free for the others.
  • Detection latency. ApplicationState.update() runs on each control loop tick, so DEPLOY_FAILED is observable within one or two control loops of the failure. The watcher polls inside the same loop, so no extra timer is required.
  • Status surface. New fields on ApplicationDetails and ServeInstanceDetails:
    • rollout_status: WATCHING | NONE | ROLLING_BACK | ROLLED_BACK | ROLLBACK_FAILED
    • rolled_back_from_deployment_time, last_good_deployment_time
    • serve status prints these alongside the existing ApplicationStatus.
  • Observability. Emit a controller log: "Auto-rollback triggered: app '<X>' is DEPLOY_FAILED, reverting to config from <ts>". Add metrics: serve_auto_rollback_triggered_total, serve_auto_rollback_succeeded_total, serve_auto_rollback_failed_total.
  • Opt-in vs opt-out. Recommendation: opt-in via a new field on ServeDeploySchema (rollout_strategy.auto_rollback: bool = False and serve deploy --rollback-on-failure), to avoid surprising existing users. The value is persisted with the config so it survives controller restarts.
  • Tests. Need coverage for: build-task failure rollback, replica-startup failure rollback, mid-rollout health failure rollback, controller restart during watch, controller restart during rollback, multi-app partial failure, rollback ping-pong (rollback also fails), auto_rollback=False keeps current behavior.

4. High-level approach

4a. Persist the previous good config

Today CONFIG_CHECKPOINT_KEY stores (deployment_time, target_capacity, target_capacity_direction, config_dict) (controller.py:1143). Replace with a forward-compatible dict:

```
{
    "version": 2,
    "current": {
        "deployment_time": float,
        "target_capacity": float | None,
        "target_capacity_direction": TargetCapacityDirection | None,
        "config_dict": Dict[str, app_config_dict],
        "auto_rollback_enabled": bool,
    },
    "last_good": { ... same shape ... } | None,
    "rollout_state": "WATCHING" | "NONE" | "ROLLING_BACK" | "ROLLBACK_FAILED",
    "rollout_started_at": float,
}
```

_read_config_checkpoint (controller.py:778) gets a v1 -> v2 migration shim. _recover_state_from_checkpoint (controller.py:760) restores the watcher state and decides whether to resume rollback.
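The shim is a one-way wrap of the old flat blob. An illustrative sketch, assuming the v1 checkpoint deserializes to a dict with the four fields named above (the actual serialization and field names may differ):

```python
def migrate_checkpoint(blob: dict) -> dict:
    """Upgrade a v1 checkpoint blob to the v2 shape in place of reading it.

    v1 stored only the current target; v2 wraps it under "current" and
    starts with no last_good, so the first rollback is only possible after
    one successful post-upgrade deploy.
    """
    if blob.get("version") == 2:
        return blob  # already migrated
    return {
        "version": 2,
        "current": {
            "deployment_time": blob["deployment_time"],
            "target_capacity": blob.get("target_capacity"),
            "target_capacity_direction": blob.get("target_capacity_direction"),
            "config_dict": blob["config_dict"],
            "auto_rollback_enabled": False,  # v1 configs never opted in
        },
        "last_good": None,
        "rollout_state": "NONE",
        "rollout_started_at": blob["deployment_time"],
    }
```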

4b. Add a RolloutSupervisor inside the controller

A small object owned by ServeController, ticked from the main control loop right after application_state_manager.update() (controller.py:585). Responsibilities:

  • Track the current "pending submission" (app names, started_at, auto_rollback_enabled).
  • On each tick, read app statuses via ApplicationStateManager.list_app_statuses(...).
    • If every app in pending.config_dict is RUNNING -> promote: last_good := pending, clear pending, persist KV, emit metric.
    • If any app is DEPLOY_FAILED and auto_rollback_enabled:
      • If last_good is None -> mark rollout_state = "NONE"; surface "no prior good config" message; stop watching.
      • Else -> set rollout_state = "ROLLING_BACK", persist KV, and call self._apply_config_internal(last_good_config, is_rollback=True). The rollback's reconciliation is itself watched: if it reaches RUNNING, set ROLLED_BACK -> NONE and keep last_good unchanged; if it ends in DEPLOY_FAILED, set ROLLBACK_FAILED and stop.
  • New submission supersedes anything in flight: clear any WATCHING / ROLLING_BACK state and start a fresh watch.
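The tick logic above can be condensed into a small, dependency-free sketch. All names are hypothetical; status lookup and config application are injected, and persistence/metrics are omitted:

```python
class RolloutSupervisor:
    """Sketch of the per-submission watcher described above."""

    def __init__(self, apply_config_fn):
        self.apply_config_fn = apply_config_fn  # re-applies a config dict
        self.pending = None        # (config_dict, auto_rollback_enabled)
        self.last_good = None
        self.rollout_state = "NONE"

    def arm(self, config, auto_rollback_enabled):
        # A new submission supersedes any watch or rollback in flight.
        self.pending = (config, auto_rollback_enabled)
        self.rollout_state = "WATCHING"

    def tick(self, app_statuses):
        # Called once per control-loop tick, right after the state
        # manager's update(); config is a dict of app name -> app config.
        if self.pending is None:
            return
        config, auto_rollback = self.pending
        statuses = [app_statuses[name] for name in config]
        if all(s == "RUNNING" for s in statuses):
            if self.rollout_state == "ROLLING_BACK":
                self.rollout_state = "ROLLED_BACK"  # last_good unchanged
            else:
                self.last_good = config             # promote
                self.rollout_state = "NONE"
            self.pending = None
        elif any(s == "DEPLOY_FAILED" for s in statuses):
            if self.rollout_state == "ROLLING_BACK":
                self.rollout_state = "ROLLBACK_FAILED"  # stop; no ping-pong
                self.pending = None
            elif auto_rollback and self.last_good is not None:
                self.rollout_state = "ROLLING_BACK"
                self.pending = (self.last_good, False)
                self.apply_config_fn(self.last_good)
            else:
                self.rollout_state = "NONE"  # nothing to roll back to
                self.pending = None
```

Because the rollback's own watch is armed with auto_rollback disabled, a failing last_good terminates in ROLLBACK_FAILED rather than looping.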

4c. Wire opt-in flag

  • Add an optional RolloutStrategySchema to ServeDeploySchema (schema.py:974):

    rollout_strategy: Optional[RolloutStrategySchema]
        auto_rollback: bool = False
    
  • Add --rollback-on-failure / --no-rollback-on-failure to serve deploy in scripts.py (:341); the flag mutates the schema before ServeSubmissionClient(...).deploy_applications(...).

  • Forward-compat preserved by extra="allow" already on ServeDeploySchema.
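A hypothetical Pydantic sketch of the schema extension; field names mirror the RFC, not the shipped ray.serve.schema definitions, and the applications field is reduced to a bare list for brevity:

```python
from typing import Optional

from pydantic import BaseModel


class RolloutStrategySchema(BaseModel):
    # Opt-in default, so existing users see no behavior change.
    auto_rollback: bool = False


class ServeDeploySchema(BaseModel):
    class Config:
        extra = "allow"  # forward-compat: unknown fields are preserved

    applications: list = []
    rollout_strategy: Optional[RolloutStrategySchema] = None
```

A config without rollout_strategy validates exactly as before, which is what makes the field safe to add under extra="allow".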

4d. Refactor apply_config to feed the supervisor

Split current apply_config into:

  • public apply_config(config, deployment_time=0) (unchanged signature) -> calls internal helper with is_rollback=False and arms the supervisor.
  • private _apply_config_internal(config, deployment_time, *, is_rollback: bool) -> does today's work (persist KV, call apply_app_configs, save_checkpoint). When is_rollback=True, the supervisor sets rollout_state="ROLLING_BACK" instead of WATCHING and does not arm a second auto-rollback (prevents ping-pong even if last_good itself fails).

4e. Surface state to users

Reuse the status surface from section 3: add rollout_status, rolled_back_from_deployment_time, and last_good_deployment_time to ApplicationDetails and ServeInstanceDetails, and print them in serve status alongside the existing ApplicationStatus.