Closed
Description
What happened + What you expected to happen
Issue: In Ray Serve, if target_num_ongoing_requests_per_replica is 1 and max_concurrent_queries is also 1, then autoscaling will never occur.
Expected: With the target set to 1, the deployment should scale up as I send more queries to it, while no single replica ever has more than 1 ongoing request.
See: https://discuss.ray.io/t/autoscaling-with-max-concurrent-queries-1/6121 and the source of autoscaling_policy.py
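Current workaround: Set max_concurrent_queries to 2 and target_num_ongoing_requests_per_replica to 1. This increases request latency, though, especially for long-running queries (such as inference on large models), because a replica can then be assigned a second request while the first is still in progress.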
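To illustrate why the two settings interact this way, here is a minimal sketch in plain Python. It is a simplified model, not the actual code in autoscaling_policy.py: the function names `desired_replicas` and `observed_load` are hypothetical, and it assumes (as the linked discussion suggests) that the policy only sees requests already assigned to replicas, so anything waiting beyond max_concurrent_queries is invisible to it.

```python
import math


def desired_replicas(ongoing_per_replica, target, smoothing_factor=1.0):
    """Simplified scaling rule (an assumption, not Ray's exact code):
    scale the replica count by (average ongoing requests / target)."""
    current = len(ongoing_per_replica)
    avg_ongoing = sum(ongoing_per_replica) / current
    error_ratio = avg_ongoing / target
    smoothed = 1 + (error_ratio - 1) * smoothing_factor
    return math.ceil(current * smoothed)


def observed_load(num_replicas, max_concurrent_queries, incoming):
    """Hypothetical helper: each replica reports at most
    max_concurrent_queries ongoing requests; the rest are not counted."""
    per_replica = []
    remaining = incoming
    for _ in range(num_replicas):
        taken = min(max_concurrent_queries, remaining)
        per_replica.append(taken)
        remaining -= taken
    return per_replica


# 10 incoming requests, 1 replica, cap of 1, target of 1:
# the reported average can never exceed the target, so no scale-up.
stuck = desired_replicas(observed_load(1, 1, 10), target=1)

# Same load with the workaround (cap of 2, target of 1):
# the reported average reaches 2, so the rule asks for more replicas.
workaround = desired_replicas(observed_load(1, 2, 10), target=1)

print(stuck, workaround)
```

Under this model the reported average is bounded by max_concurrent_queries, so a target equal to that cap yields an error ratio of at most 1 and the replica count never grows. Raising the cap above the target is what lets the ratio exceed 1.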
Versions / Dependencies
Ray 1.12
Python/OS: N/A
Reproduction script
N/A
Issue Severity
Medium: It is a significant difficulty but I can work around it.