:::{warning}
Custom autoscaling policies are experimental and may change in future releases.
:::

Ray Serve’s built-in, request-driven autoscaling works well for most apps. Use **custom autoscaling policies** when you need more control—e.g., scaling on external metrics (CloudWatch, Prometheus), anticipating predictable traffic (scheduled batch jobs), or applying business logic that goes beyond queue thresholds.

Custom policies let you implement scaling logic based on any metrics or rules you choose.
### Custom policy for a deployment

A custom autoscaling policy is a user-provided Python function that takes an [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst) and returns a tuple `(target_replicas, policy_state)` for a single deployment. The context provides:
* **Current state:** Current replica count and deployment metadata.
* **Built-in metrics:** Total requests, queued requests, per-replica counts.
* **Custom metrics:** Values your deployment reports via `record_autoscaling_stats()`. (See below.)
* **Capacity bounds:** `min` / `max` replica limits adjusted for current cluster capacity.
* **Policy state:** A `dict` you can use to persist arbitrary state across control-loop iterations.
* **Timing:** Timestamps of the most recent scaling actions and the current time.

The following example showcases a policy that scales up during business hours and evening batch processing, and scales down during off-peak hours:
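
(A sketch under assumptions: the `ctx` attribute names used for the replica bounds, `capacity_adjusted_min_replicas` and `capacity_adjusted_max_replicas`, are illustrative; check the [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst) reference for the exact field names.)

```python
from datetime import datetime

from ray.serve.config import AutoscalingContext


def business_hours_policy(ctx: AutoscalingContext) -> tuple[int, dict]:
    """Scale up for business hours and the evening batch window, down otherwise."""
    hour = datetime.now().hour

    if 9 <= hour < 17:
        # Business hours: run at the maximum the cluster currently allows.
        target = ctx.capacity_adjusted_max_replicas
    elif 18 <= hour < 22:
        # Evening batch processing: keep a moderate number of replicas warm.
        target = max(ctx.capacity_adjusted_min_replicas, 4)
    else:
        # Off-peak hours: scale down to the minimum.
        target = ctx.capacity_adjusted_min_replicas

    # Clamp to the allowed bounds; the empty dict means this policy carries no
    # state across control-loop iterations.
    target = min(
        max(target, ctx.capacity_adjusted_min_replicas),
        ctx.capacity_adjusted_max_replicas,
    )
    return target, {}
```
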
Policies are defined **per deployment**. If you don’t provide one, Ray Serve falls back to its built-in request-based policy.
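
A policy is attached through the deployment's autoscaling configuration; the sketch below assumes it's passed as a `policy` field on `AutoscalingConfig` (confirm the exact field name against the `AutoscalingConfig` reference for your Ray version):

```python
from ray import serve
from ray.serve.config import AutoscalingConfig


@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        min_replicas=1,
        max_replicas=10,
        policy=business_hours_policy,  # assumed field name for the custom policy
    )
)
class MyModel:
    async def __call__(self, request) -> str:
        return "ok"


app = MyModel.bind()
```
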
The policy function is invoked by the Ray Serve controller every `RAY_SERVE_CONTROL_LOOP_INTERVAL_S` seconds (default **0.1s**), so your logic runs against near-real-time state.

:::{warning}
Keep policy functions **fast and lightweight**. Slow logic can block the Serve controller and degrade cluster responsiveness.
:::

### Custom metrics
You can make richer decisions by emitting your own metrics from the deployment. Implement `record_autoscaling_stats()` to return a `dict[str, float]`. Ray Serve will surface these values in the [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst).

This example demonstrates how deployments can provide their own metrics (CPU usage, memory usage) and how autoscaling policies can use these metrics to make scaling decisions:
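
(A sketch: `psutil` is used here only for the resource readings, and the `ctx` attribute names below, such as `aggregated_metrics`, `current_num_replicas`, and the capacity-adjusted bounds, are illustrative; check the [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst) reference for the exact names.)

```python
import psutil

from ray import serve
from ray.serve.config import AutoscalingContext


@serve.deployment
class ResourceAwareModel:
    async def __call__(self, request) -> str:
        return "ok"

    def record_autoscaling_stats(self) -> dict:
        # Report this replica's CPU and memory utilization as custom metrics.
        return {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
        }


def resource_based_policy(ctx: AutoscalingContext) -> tuple[int, dict]:
    """Add a replica when average CPU is high, remove one when it is low."""
    # Per-replica time-weighted averages of the reported metric.
    cpu_by_replica = ctx.aggregated_metrics.get("cpu_percent", {})
    if not cpu_by_replica:
        # No samples reported yet; hold the current replica count.
        return ctx.current_num_replicas, {}

    avg_cpu = sum(cpu_by_replica.values()) / len(cpu_by_replica)
    target = ctx.current_num_replicas
    if avg_cpu > 80:
        target += 1
    elif avg_cpu < 20:
        target -= 1

    # Clamp to the capacity-adjusted bounds.
    target = min(
        max(target, ctx.capacity_adjusted_min_replicas),
        ctx.capacity_adjusted_max_replicas,
    )
    return target, {}
```
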
:::{note}
The `record_autoscaling_stats()` method can be either synchronous or asynchronous. It must complete within the timeout specified by `RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S` (default 30 seconds).
:::

In your policy, access custom metrics via:
* **`ctx.raw_metrics[metric_name]`** — A mapping of replica IDs to lists of raw metric values.
  The number of data points stored for each replica depends on the [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst) (the sliding window size) and `RAY_SERVE_REPLICA_AUTOSCALING_METRIC_RECORD_INTERVAL_S` (the metric recording interval).
* **`ctx.aggregated_metrics[metric_name]`** — A time-weighted average computed from the raw metric values for each replica.
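
For example, inside a policy (the metric name is illustrative):

```python
def my_policy(ctx):
    # Raw view: replica ID -> list of samples recorded over the look-back window.
    raw_samples = ctx.raw_metrics.get("cpu_percent", {})
    latest_per_replica = {
        replica_id: samples[-1]
        for replica_id, samples in raw_samples.items()
        if samples
    }

    # Aggregated view: replica ID -> time-weighted average over the same window.
    per_replica_avg = ctx.aggregated_metrics.get("cpu_percent", {})
    fleet_avg = (
        sum(per_replica_avg.values()) / len(per_replica_avg)
        if per_replica_avg
        else 0.0
    )
    ...
```
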
> Today, aggregation is a time-weighted average. In future releases, additional aggregation options may be supported.