[Serve] Can't autoscale deployment when target ongoing requests is 1 #24793

### What happened + What you expected to happen

Issue: In Ray Serve, if `target_num_ongoing_requests_per_replica` is 1 and `max_concurrent_queries` is also 1, then autoscaling will never occur.

Expected: With the target set to 1, autoscaling occurs as I send more queries to the deployment, and a single replica never has more than 1 ongoing request.

See: https://discuss.ray.io/t/autoscaling-with-max-concurrent-queries-1/6121 and the source of `autoscaling_policy.py`

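Current workaround: Set `max_concurrent_queries` to 2 and `target_num_ongoing_requests_per_replica` to 1. This increases request latency, though, because a replica can be assigned a second request while it is still busy with the first. The effect is worst for queries that take a while (like model inference for large models).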
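To illustrate why I think this happens, here is a minimal sketch of the scaling arithmetic. It assumes the policy scales the replica count by the ratio of average ongoing requests per replica to the target. `desired_replicas` is a hypothetical simplification I wrote for this issue, not the actual code in `autoscaling_policy.py`, and it ignores any smoothing or delay settings.

```python
import math


def desired_replicas(ongoing_per_replica, target, max_replicas=10):
    """Hypothetical ratio-based policy (an assumption, not Ray's real implementation).

    ongoing_per_replica: ongoing request count reported by each running replica.
    target: the configured target number of ongoing requests per replica.
    """
    current = len(ongoing_per_replica)
    average = sum(ongoing_per_replica) / current
    error_ratio = average / target
    # Scale the current replica count by how far the average is from the target.
    return min(max_replicas, math.ceil(current * error_ratio))


# With a concurrency cap of 1, a fully saturated replica reports at most 1
# ongoing request, so the average can never exceed a target of 1.
capped_at_one = desired_replicas([1, 1, 1], target=1)

# With a concurrency cap of 2 (the workaround), saturated replicas report 2,
# the ratio exceeds 1, and the policy asks for more replicas.
capped_at_two = desired_replicas([2, 2, 2], target=1)

print(capped_at_one, capped_at_two)
```

Under these assumptions, the first case never asks for more than the current 3 replicas no matter how many queries are waiting upstream, while the second asks for 6.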

### Versions / Dependencies

Ray 1.12
Python/OS: N/A

### Reproduction script

N/A

### Issue Severity

Medium: It is a significant difficulty but I can work around it.
