
[Serve][RFC] Auto-rollback for serve deploy #63016

@abrarsheikh

Description

1. How rollouts work today

  • The CLI/REST entry point PUT /api/serve/applications/ and serve deploy both end at a single controller method, ServeController.apply_config (controller.py:1102).
  • apply_config overwrites a single KV blob CONFIG_CHECKPOINT_KEY = "serve-app-config-checkpoint" and calls ApplicationStateManager.apply_app_configs(...), which per-app does either an in-place override (same code_version) or a fresh build_app task (application_state.py:1299).
  • Each app independently transitions DEPLOYING -> RUNNING | DEPLOY_FAILED | UNHEALTHY via ApplicationState._determine_app_status (:760). DEPLOY_FAILED is set when any underlying deployment is DEPLOY_FAILED, which the deployment FSM in common.py (handle_transition) triggers on:
    • build_app task error (import/runtime env)
    • replica startup retries reaching min(max_constructor_retry_count, target_num_replicas * MAX_PER_REPLICA_RETRY_COUNT)
    • health-check failure while still UPDATING
  • There is no previous-version retention today. The KV blob holds only the current target.
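The replica-startup failure threshold above is just a min over two retry budgets; a hypothetical sketch (the constant and function names here are illustrative, not the actual Serve internals):

```python
# Assumption: a fixed per-replica retry cap, as described in the bullet above.
MAX_PER_REPLICA_RETRY_COUNT = 3

def startup_retry_budget(max_constructor_retry_count: int,
                         target_num_replicas: int) -> int:
    """Total constructor retries allowed before the deployment FSM
    marks the deployment DEPLOY_FAILED."""
    return min(max_constructor_retry_count,
               target_num_replicas * MAX_PER_REPLICA_RETRY_COUNT)
```

With a large configured retry count, the budget scales with the replica count (e.g. 2 replicas allow 6 retries); a small configured count caps it regardless of scale.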

2. State machine introduced by this feature

```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Watching: new config submitted via apply_config
    Watching --> Promoted: all apps RUNNING
    Watching --> RollingBack: any app DEPLOY_FAILED
    Watching --> Cancelled: user submits another config
    RollingBack --> RolledBack: rollback reaches RUNNING
    RollingBack --> RollbackFailed: rollback also DEPLOY_FAILED
    Promoted --> Idle: promote pending to last_good
    RolledBack --> Idle
    RollbackFailed --> Idle: alert and stop, no auto retry
    Cancelled --> Watching: new submission becomes pending
```
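The diagram can also be read as a transition table. A minimal Python sketch, using event names of my own choosing (the real implementation keys off ApplicationStatus and controller events, not these labels):

```python
from enum import Enum, auto

class RolloutState(Enum):
    IDLE = auto()
    WATCHING = auto()
    PROMOTED = auto()
    ROLLING_BACK = auto()
    ROLLED_BACK = auto()
    ROLLBACK_FAILED = auto()
    CANCELLED = auto()

# (state, event) -> next state, mirroring the diagram above.
TRANSITIONS = {
    (RolloutState.IDLE, "config_submitted"): RolloutState.WATCHING,
    (RolloutState.WATCHING, "all_apps_running"): RolloutState.PROMOTED,
    (RolloutState.WATCHING, "app_deploy_failed"): RolloutState.ROLLING_BACK,
    (RolloutState.WATCHING, "new_config_submitted"): RolloutState.CANCELLED,
    (RolloutState.ROLLING_BACK, "rollback_running"): RolloutState.ROLLED_BACK,
    (RolloutState.ROLLING_BACK, "rollback_deploy_failed"): RolloutState.ROLLBACK_FAILED,
    (RolloutState.PROMOTED, "promoted_to_last_good"): RolloutState.IDLE,
    (RolloutState.ROLLED_BACK, "done"): RolloutState.IDLE,
    (RolloutState.ROLLBACK_FAILED, "operator_alerted"): RolloutState.IDLE,
    (RolloutState.CANCELLED, "new_submission_pending"): RolloutState.WATCHING,
}

def step(state: RolloutState, event: str) -> RolloutState:
    """Advance the rollout FSM; unknown (state, event) pairs raise KeyError."""
    return TRANSITIONS[(state, event)]
```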

3. Important considerations

  • Trigger is ApplicationStatus.DEPLOY_FAILED only. No progress timeout, no UNHEALTHY-driven rollback. This avoids false positives on slow-starting workloads and post-rollout regressions.
  • Scope is per-submission. Even if only one app in a multi-app config fails, all apps revert together to the last-good ServeDeploySchema. This matches the atomic semantics of apply_config.
  • Declarative-only. Imperative serve.run / client.deploy_applications is out of scope for v1; the feature keys off the KV checkpoint that only declarative deploys produce.
  • "Last good" definition = the most recent ServeDeploySchema where every app reached RUNNING after apply_config. Promotion happens only on success, so we never roll back to a config that itself was failing.
  • First deploy has no previous good. If the very first apply_config fails, there is nothing to roll back to. Behavior: leave the partial deploy in DEPLOY_FAILED and emit a clear status message ("no previous successful config to roll back to"). Optionally tear down the failed apps if --rollback-on-failure is set; document the trade-off.
  • Concurrent submissions. If a user pushes a new config while we are watching or rolling back, the new submission wins (cancels the watch / rollback). Reconciliation already serializes through the controller actor, so no extra locking is needed beyond a pending_submission_id.
  • Avoid rollback ping-pong. If the last-good config also ends in DEPLOY_FAILED after rollback (e.g., environment regressed), do not auto-rollback again. Mark ROLLBACK_FAILED, log clearly, surface in serve status. Operator must intervene.
  • Persistence and recovery. The watcher state must survive controller crashes. We will extend CONFIG_CHECKPOINT_KEY with an additional dict so _recover_state_from_checkpoint can resume the watch (and re-check whether to roll back) after a restart.
  • Side effects of revert. Rolling back re-runs apply_config(last_good). Things that get reverted: code versions, replica counts, runtime envs, route prefixes, target_capacity, target_capacity_direction. Apps that the failed config newly added are deleted (since they are not in last-good). Apps the failed config deleted will be re-created from scratch — this resurrects deployments the user may have intentionally deleted, an important UX caveat to document.
  • HTTP/gRPC/proxy options in ServeDeploySchema are not currently part of apply_config's persisted-config flow (they are applied at serve_head.py start time). Rollback semantics are deliberately scoped to applications + target_capacity to avoid restarting proxies.
  • Imperative apps coexisting with declarative apps. apply_app_configs only deletes apps with api_type == DECLARATIVE. Auto-rollback re-applies a ServeDeploySchema, so it inherits the same scope: imperative apps (deployed via serve.run) are untouched.
  • Idempotency. DeploymentState.deploy() already short-circuits on no-op; re-applying an unchanged app costs nothing. Rollback that touches only the failed app is essentially free for the others.
  • Detection latency. ApplicationState.update() runs on each control loop tick, so DEPLOY_FAILED is observable within one or two control loops of the failure. The watcher polls inside the same loop, so no extra timer is required.
  • Status surface. New fields on ApplicationDetails and ServeInstanceDetails:
    • rollout_status: WATCHING | NONE | ROLLING_BACK | ROLLED_BACK | ROLLBACK_FAILED
    • rolled_back_from_deployment_time, last_good_deployment_time
    • serve status prints these alongside the existing ApplicationStatus.
  • Observability. Emit a controller log: "Auto-rollback triggered: app '<X>' is DEPLOY_FAILED, reverting to config from <ts>". Add metrics: serve_auto_rollback_triggered_total, serve_auto_rollback_succeeded_total, serve_auto_rollback_failed_total.
  • Opt-in vs opt-out. Recommendation: opt-in via a new field on ServeDeploySchema (rollout_strategy.auto_rollback: bool = False and serve deploy --rollback-on-failure), to avoid surprising existing users. The value is persisted with the config so it survives controller restarts.
  • Tests. Need coverage for: build-task failure rollback, replica-startup failure rollback, mid-rollout health failure rollback, controller restart during watch, controller restart during rollback, multi-app partial failure, rollback ping-pong (rollback also fails), auto_rollback=False keeps current behavior.

4. High-level approach

4a. Persist the previous good config

Today CONFIG_CHECKPOINT_KEY stores (deployment_time, target_capacity, target_capacity_direction, config_dict) (controller.py:1143). Replace with a forward-compatible dict:

```
{
    "version": 2,
    "current": {
        "deployment_time": float,
        "target_capacity": float | None,
        "target_capacity_direction": TargetCapacityDirection | None,
        "config_dict": Dict[str, app_config_dict],
        "auto_rollback_enabled": bool,
    },
    "last_good": { ... same shape ... } | None,
    "rollout_state": "WATCHING" | "NONE" | "ROLLING_BACK" | "ROLLBACK_FAILED",
    "rollout_started_at": float,
}
```

_read_config_checkpoint (controller.py:778) gets a v1 -> v2 migration shim. _recover_state_from_checkpoint (controller.py:760) restores the watcher state and decides whether to resume rollback.
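The shim is a one-way wrap of the old flat blob. An illustrative sketch, assuming the v1 checkpoint deserializes to a dict with the four fields named above (the actual serialization and field names may differ):

```python
def migrate_checkpoint(blob: dict) -> dict:
    """Upgrade a v1 checkpoint blob to the v2 shape in place of reading it.

    v1 stored only the current target; v2 wraps it under "current" and
    starts with no last_good, so the first rollback is only possible after
    one successful post-upgrade deploy.
    """
    if blob.get("version") == 2:
        return blob  # already migrated
    return {
        "version": 2,
        "current": {
            "deployment_time": blob["deployment_time"],
            "target_capacity": blob.get("target_capacity"),
            "target_capacity_direction": blob.get("target_capacity_direction"),
            "config_dict": blob["config_dict"],
            "auto_rollback_enabled": False,  # v1 configs never opted in
        },
        "last_good": None,
        "rollout_state": "NONE",
        "rollout_started_at": blob["deployment_time"],
    }
```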

4b. Add a RolloutSupervisor inside the controller

A small object owned by ServeController, ticked from the main control loop right after application_state_manager.update() (controller.py:585). Responsibilities:

  • Track the current "pending submission" (app names, started_at, auto_rollback_enabled).
  • On each tick, read app statuses via ApplicationStateManager.list_app_statuses(...).
    • If every app in pending.config_dict is RUNNING -> promote: last_good := pending, clear pending, persist KV, emit metric.
    • If any app is DEPLOY_FAILED and auto_rollback_enabled:
      • If last_good is None -> mark rollout_state = "NONE"; surface "no prior good config" message; stop watching.
      • Else -> set rollout_state = "ROLLING_BACK", persist KV, and call self._apply_config_internal(last_good_config, is_rollback=True). The rollback's reconciliation is itself watched: if it reaches RUNNING, set ROLLED_BACK -> NONE and keep last_good unchanged; if it ends in DEPLOY_FAILED, set ROLLBACK_FAILED and stop.
  • New submission supersedes anything in flight: clear any WATCHING / ROLLING_BACK state and start a fresh watch.
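The tick logic above can be condensed into a small, dependency-free sketch. All names are hypothetical; status lookup and config application are injected, and persistence/metrics are omitted:

```python
class RolloutSupervisor:
    """Sketch of the per-submission watcher described above."""

    def __init__(self, apply_config_fn):
        self.apply_config_fn = apply_config_fn  # re-applies a config dict
        self.pending = None        # (config_dict, auto_rollback_enabled)
        self.last_good = None
        self.rollout_state = "NONE"

    def arm(self, config, auto_rollback_enabled):
        # A new submission supersedes any watch or rollback in flight.
        self.pending = (config, auto_rollback_enabled)
        self.rollout_state = "WATCHING"

    def tick(self, app_statuses):
        # Called once per control-loop tick, right after the state
        # manager's update(); config is a dict of app name -> app config.
        if self.pending is None:
            return
        config, auto_rollback = self.pending
        statuses = [app_statuses[name] for name in config]
        if all(s == "RUNNING" for s in statuses):
            if self.rollout_state == "ROLLING_BACK":
                self.rollout_state = "ROLLED_BACK"  # last_good unchanged
            else:
                self.last_good = config             # promote
                self.rollout_state = "NONE"
            self.pending = None
        elif any(s == "DEPLOY_FAILED" for s in statuses):
            if self.rollout_state == "ROLLING_BACK":
                self.rollout_state = "ROLLBACK_FAILED"  # stop; no ping-pong
                self.pending = None
            elif auto_rollback and self.last_good is not None:
                self.rollout_state = "ROLLING_BACK"
                self.pending = (self.last_good, False)
                self.apply_config_fn(self.last_good)
            else:
                self.rollout_state = "NONE"  # nothing to roll back to
                self.pending = None
```

Because the rollback's own watch is armed with auto_rollback disabled, a failing last_good terminates in ROLLBACK_FAILED rather than looping.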

4c. Wire opt-in flag

  • Add an optional RolloutStrategySchema to ServeDeploySchema (schema.py:974):

    rollout_strategy: Optional[RolloutStrategySchema]
        auto_rollback: bool = False
    
  • Add --rollback-on-failure / --no-rollback-on-failure to serve deploy in scripts.py (:341); the flag mutates the schema before ServeSubmissionClient(...).deploy_applications(...).

  • Forward-compat preserved by extra="allow" already on ServeDeploySchema.
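A hypothetical Pydantic sketch of the schema extension; field names mirror the RFC, not the shipped ray.serve.schema definitions, and the applications field is reduced to a bare list for brevity:

```python
from typing import Optional

from pydantic import BaseModel


class RolloutStrategySchema(BaseModel):
    # Opt-in default, so existing users see no behavior change.
    auto_rollback: bool = False


class ServeDeploySchema(BaseModel):
    class Config:
        extra = "allow"  # forward-compat: unknown fields are preserved

    applications: list = []
    rollout_strategy: Optional[RolloutStrategySchema] = None
```

A config without rollout_strategy validates exactly as before, which is what makes the field safe to add under extra="allow".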

4d. Refactor apply_config to feed the supervisor

Split current apply_config into:

  • public apply_config(config, deployment_time=0) (unchanged signature) -> calls internal helper with is_rollback=False and arms the supervisor.
  • private _apply_config_internal(config, deployment_time, *, is_rollback: bool) -> does today's work (persist KV, call apply_app_configs, save_checkpoint). When is_rollback=True, the supervisor sets rollout_state="ROLLING_BACK" instead of WATCHING and does not arm a second auto-rollback (prevents ping-pong even if last_good itself fails).

4e. Surface state to users

Reuse the status surface from section 3: add rollout_status, rolled_back_from_deployment_time, and last_good_deployment_time to ApplicationDetails and ServeInstanceDetails, and print them in serve status alongside the existing ApplicationStatus.