Commit d50cb5b

add docs for post API (#57698)
- Adds docs for the POST API; more details: https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?tab=t.0#heading=h.2vf4s2d7ca46
- Also updates the existing Serve application docs for enabling the external scaler; to be merged after #57554

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
1 parent 5435a56 commit d50cb5b

File tree

5 files changed: +223 -1 lines changed


doc/source/serve/advanced-guides/advanced-autoscaling.md

Lines changed: 93 additions & 0 deletions
@@ -723,3 +723,96 @@ When your custom autoscaling policy has complex dependencies or you want better
- **Contribute to Ray Serve**: If your policy is general-purpose and might benefit others, consider contributing it to Ray Serve as a built-in policy by opening a feature request or pull request on the [Ray GitHub repository](https://github.com/ray-project/ray/issues). The recommended location for the implementation is `python/ray/serve/autoscaling_policy.py`.
- **Ensure dependencies in your environment**: Make sure that the external dependencies are installed in your Docker image or environment.
:::


(serve-external-scale-api)=

### External scaling API

:::{warning}
This API is in alpha and may change before becoming stable.
:::

The external scaling API provides programmatic control over the number of replicas for any deployment in your Ray Serve application. Unlike Ray Serve's built-in autoscaling, which scales based on queue depth and ongoing requests, this API allows you to scale based on any external criteria you define.

#### Example: Predictive scaling

This example shows how to implement predictive scaling based on historical patterns or forecasts. You can preemptively scale up before anticipated traffic spikes by running an external script that adjusts replica counts based on the time of day.

##### Define the deployment

The following example creates a simple text processing deployment that you can scale externally. Save this code to a file named `external_scaler_predictive.py`:

```{literalinclude} ../doc_code/external_scaler_predictive.py
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
```

##### Configure external scaling

Before using the external scaling API, enable it in your application configuration by setting `external_scaler_enabled: true`. Save this configuration to a file named `external_scaler_config.yaml`:

```{literalinclude} ../doc_code/external_scaler_config.yaml
:language: yaml
:start-after: __external_scaler_config_begin__
:end-before: __external_scaler_config_end__
```

:::{warning}
External scaling and built-in autoscaling are mutually exclusive. You can't use both for the same application. If you set `external_scaler_enabled: true`, you **must not** configure `autoscaling_config` on any deployment in that application. Attempting to use both results in an error.
:::

##### Implement the scaling logic

The following script implements predictive scaling based on time of day and historical traffic patterns. Save this script to a file named `external_scaler_predictive_client.py`:

```{literalinclude} ../doc_code/external_scaler_predictive_client.py
:language: python
:start-after: __client_script_begin__
:end-before: __client_script_end__
```

The script uses the external scaling API endpoint to scale deployments:

- **API endpoint**: `POST http://localhost:8265/api/v1/applications/{application_name}/deployments/{deployment_name}/scale`
- **Request body**: `{"target_num_replicas": <number>}` (must conform to the [`ScaleDeploymentRequest`](../api/doc/ray.serve.schema.ScaleDeploymentRequest.rst) schema)
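As a rough illustration, the endpoint and body described above can be assembled like this. This is a minimal sketch; `build_scale_request` is a hypothetical helper for this example, not part of the Ray Serve API:

```python
# Hypothetical helper (not part of Ray Serve): assemble the scale request
# for a given application and deployment, as described above.


def build_scale_request(base_url: str, app_name: str, deployment_name: str, target: int):
    """Return the URL and JSON body for a ScaleDeploymentRequest-shaped call."""
    url = f"{base_url}/api/v1/applications/{app_name}/deployments/{deployment_name}/scale"
    body = {"target_num_replicas": target}
    return url, body


url, body = build_scale_request("http://localhost:8265", "my-app", "TextProcessor", 5)
# Send it with, for example: requests.post(url, json=body, timeout=10)
```
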
The scaling client continuously adjusts the number of replicas based on the time of day:

- Business hours (9 AM to 5 PM): 10 replicas
- Off-peak hours: 3 replicas
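The schedule above reduces to a small pure function; here's a sketch (the name `target_replicas_for_hour` is illustrative, not from the example code):

```python
# Illustrative sketch of the schedule above: 10 replicas during business
# hours (9 AM to 5 PM), 3 replicas off-peak.


def target_replicas_for_hour(hour: int) -> int:
    """Map an hour of day (0-23) to a target replica count."""
    return 10 if 9 <= hour < 17 else 3


print(target_replicas_for_hour(12))  # business hours -> 10
print(target_replicas_for_hour(22))  # off-peak -> 3
```
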
##### Run the example

Follow these steps to run the complete example:

1. Start the Ray Serve application with the configuration:

   ```bash
   serve run external_scaler_config.yaml
   ```

2. Run the predictive scaling client in a separate terminal:

   ```bash
   python external_scaler_predictive_client.py
   ```

The client adjusts replica counts automatically based on the time of day. You can monitor the scaling behavior in the Ray dashboard or by checking the application logs.

#### Important considerations

Understanding how the external scaler interacts with your deployments helps you build reliable scaling logic:

- **Idempotent API calls**: The scaling API is idempotent. You can safely call it multiple times with the same `target_num_replicas` value without side effects. This makes it safe to run your scaling logic on a schedule or in response to repeated metric updates.

- **Interaction with serve deploy**: When you upgrade your service with `serve deploy`, the number of replicas you set through the external scaler API stays intact. This behavior matches what you'd expect from Ray Serve's built-in autoscaler: deployment updates don't reset replica counts.

- **Query current replica count**: You can get the current number of replicas for any deployment by querying the GET `/applications` API:

  ```bash
  curl -X GET http://localhost:8265/api/serve/applications/
  ```

  The response follows the [`ServeInstanceDetails`](../api/doc/ray.serve.schema.ServeInstanceDetails.rst) schema, which includes an `applications` field containing a dictionary with application names as keys. Each application includes detailed information about all its deployments, including current replica counts. Use this information to make informed scaling decisions. For example, you might scale up gradually by adding a percentage of existing replicas rather than jumping to a fixed number.

- **Initial replica count**: When you deploy an application for the first time, Ray Serve creates the number of replicas specified in the `num_replicas` field of your deployment configuration. The external scaler can then adjust this count dynamically based on your scaling logic.
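The gradual scale-up mentioned above can be sketched as follows. This is a minimal illustration; the function name, the 20% step, and the replica cap are arbitrary assumptions, not part of the API:

```python
import math


def gradual_target(current: int, desired: int, step_fraction: float = 0.2,
                   max_replicas: int = 50) -> int:
    """Move toward `desired` by at most `step_fraction` of the current count per call.

    Illustrative only: the 20% step and the cap are arbitrary choices.
    """
    if current >= desired:
        return desired  # scale down directly (or apply a similar ramp)
    step = max(1, math.ceil(current * step_fraction))
    return min(desired, current + step, max_replicas)


# Each scaling cycle, POST the returned value as target_num_replicas.
print(gradual_target(10, 20))  # 12: one 20% step toward 20
```
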
doc/source/serve/doc_code/external_scaler_config.yaml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# __external_scaler_config_begin__
applications:
- name: my-app
  import_path: external_scaler_predictive:app
  external_scaler_enabled: true
  deployments:
  - name: TextProcessor
    num_replicas: 1
# __external_scaler_config_end__
doc/source/serve/doc_code/external_scaler_predictive.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# __serve_example_begin__
import time
from typing import Any

from ray import serve


@serve.deployment(num_replicas=3)
class TextProcessor:
    """A simple text processing deployment that can be scaled externally."""

    def __init__(self):
        self.request_count = 0

    def __call__(self, text: Any) -> dict:
        # Simulate text processing work
        time.sleep(0.1)
        self.request_count += 1
        return {
            "request_count": self.request_count,
        }


app = TextProcessor.bind()
# __serve_example_end__


def main():
    import requests

    serve.run(app)

    # Test the deployment
    resp = requests.post("http://localhost:8000/", json="hello world")
    print(f"Response: {resp.json()}")
doc/source/serve/doc_code/external_scaler_predictive_client.py

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# __client_script_begin__
import logging
import time
from datetime import datetime

import requests

APPLICATION_NAME = "my-app"
DEPLOYMENT_NAME = "TextProcessor"
SERVE_ENDPOINT = "http://localhost:8265"
SCALING_INTERVAL = 300  # Check every 5 minutes

logger = logging.getLogger(__name__)


def get_current_replicas(app_name: str, deployment_name: str) -> int:
    """Get current replica count. Returns -1 on error."""
    try:
        resp = requests.get(
            f"{SERVE_ENDPOINT}/api/serve/applications/",
            timeout=10,
        )
        if resp.status_code != 200:
            logger.error(f"Failed to get applications: {resp.status_code}")
            return -1

        apps = resp.json().get("applications", {})
        if app_name not in apps:
            logger.error(f"Application {app_name} not found")
            return -1

        deployments = apps[app_name].get("deployments", {})
        if deployment_name in deployments:
            return deployments[deployment_name]["target_num_replicas"]

        logger.error(f"Deployment {deployment_name} not found")
        return -1
    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed: {e}")
        return -1


def scale_deployment(app_name: str, deployment_name: str):
    """Scale deployment based on time of day."""
    hour = datetime.now().hour
    current = get_current_replicas(app_name, deployment_name)

    # Check if we successfully retrieved the current replica count
    if current == -1:
        logger.error("Failed to get current replicas, skipping scaling decision")
        return

    target = 10 if 9 <= hour < 17 else 3  # Peak hours: 9am-5pm

    delta = target - current
    if delta == 0:
        logger.info(f"Already at target ({current} replicas)")
        return

    action = "Adding" if delta > 0 else "Removing"
    logger.info(f"{action} {abs(delta)} replicas ({current} -> {target})")

    try:
        resp = requests.post(
            f"{SERVE_ENDPOINT}/api/v1/applications/{app_name}/deployments/{deployment_name}/scale",
            headers={"Content-Type": "application/json"},
            json={"target_num_replicas": target},
            timeout=10,
        )
        if resp.status_code == 200:
            logger.info("Successfully scaled deployment")
        else:
            logger.error(f"Scale failed: {resp.status_code} - {resp.text}")
    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed: {e}")


def main():
    logger.info(f"Starting predictive scaling for {APPLICATION_NAME}/{DEPLOYMENT_NAME}")
    while True:
        scale_deployment(APPLICATION_NAME, DEPLOYMENT_NAME)
        time.sleep(SCALING_INTERVAL)
# __client_script_end__

doc/source/serve/production-guide/config.md

Lines changed: 3 additions & 1 deletion
@@ -40,7 +40,8 @@ applications:
 - name: ...
   route_prefix: ...
   import_path: ...
-  runtime_env: ...
+  runtime_env: ...
+  external_scaler_enabled: ...
   deployments:
   - name: ...
     num_replicas: ...
@@ -99,6 +100,7 @@ These are the fields per `application`:
 - **`route_prefix`**: An application can be called via HTTP at the specified route prefix. It defaults to `/`. The route prefix for each application must be unique.
 - **`import_path`**: The path to your top-level Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`.
 - **`runtime_env`**: Defines the environment that the application runs in. Use this parameter to package application dependencies such as `pip` packages (see {ref}`Runtime Environments <runtime-environments>` for supported fields). The `import_path` must be available _within_ the `runtime_env` if it's specified. The Serve config's `runtime_env` can only use [remote URIs](remote-uris) in its `working_dir` and `py_modules`; it can't use local zip files or directories. [More details on runtime env](serve-runtime-env).
+- **`external_scaler_enabled`**: Enables the external scaling API, which lets you scale deployments from outside the Ray cluster using a REST API. When enabled, you can't use built-in autoscaling (`autoscaling_config`) for any deployment in this application. Defaults to `False`. See [External Scaling API](serve-external-scale-api) for details.
 - **`deployments (optional)`**: A list of deployment options that allows you to override the `@serve.deployment` settings specified in the deployment graph code. Each entry in this list must include the deployment `name`, which must match one in the code. If this section is omitted, Serve launches all deployments in the graph with the parameters specified in the code. See how to [configure serve deployment options](serve-configure-deployment).
 - **`args`**: Arguments that are passed to the [application builder](serve-app-builder-guide).
