
[Core] Serve Controller crashing when the cluster's worker nodes are slow to become Running #57173

@jugalshah291

Description


What happened + What you expected to happen

What happened

Sometimes the Ray Serve controller becomes unstable and repeatedly crashes with the Check failed: objects_valid error:

[2025-08-16 16:28:53,335 C 1654 1703] task_receiver.cc:192:  Check failed: objects_valid
*** StackTrace Information ***
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0x148371a) [0x7fbc4095971a] ray::operator<<()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7fbc4095c199] ray::RayLog::~RayLog()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa36c0c) [0x7fbc3ff0cc0c] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa89a72) [0x7fbc3ff5fa72] ray::core::InboundRequest::Accept()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core30OutOfOrderActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x11c) [0x7fbc3ff548cc] ray::core::OutOfOrderActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa7f32b) [0x7fbc3ff5532b] std::_Function_handler<>::_M_invoke()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa3c607) [0x7fbc3ff12607] std::_Function_handler<>::_M_invoke()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa80fd5) [0x7fbc3ff56fd5] boost::fibers::worker_context<>::run_()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa80c20) [0x7fbc3ff56c20] boost::context::detail::fiber_entry<>()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa8970f) [0x7fbc3ff5f70f] make_fcontext

I think we've identified the scenarios below in which the Ray Serve controller becomes unstable and repeatedly crashes with the Check failed: objects_valid error. It seems to occur specifically during large-scale deployments when the cluster's worker nodes are slow to become Running (possibly due to Kubernetes node scheduling).

Scenarios:

  1. A new cluster is deployed with autoscaling enabled, min worker replicas set to 1, and 1000+ dynamic Serve apps. The controller attempts to place the replicas, but the new worker nodes take a significant amount of time to be scheduled and started (likely due to Kubernetes node scheduling and sidecars taking time to come up). Before the cluster can stabilize, the controller crashes.

  2. An existing Ray cluster with 1000+ dynamic Serve apps receives a traffic spike for a single model and that model's replicas scale up to their maximum; however, the new worker nodes take time to reach the Running state and the controller eventually crashes. (A representative per-model deployment is sketched below.)
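For reference, each of the dynamic apps looks roughly like the sketch below. This is a minimal, illustrative version: the autoscaling values and class body are made up, while the per-replica resource request mirrors the controller warning quoted later in this issue.

```python
# Minimal sketch of one of the 1000+ dynamic Serve apps. Autoscaling values
# and the class body are illustrative; the per-replica resources mirror the
# controller warning quoted below.
from ray import serve


@serve.deployment(
    ray_actor_options={
        "num_cpus": 0.25,
        "memory": 348127232,
        "resources": {"no_gpu": 0.001, "ports": 1},  # custom placement resources
    },
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 32,  # scenario 2: a traffic spike pushes replicas toward this
        "target_ongoing_requests": 5,
    },
)
class Model:
    def __init__(self, model_id: str):
        # In the real app this would pull weights from the model registry.
        self.model_id = model_id

    async def __call__(self, payload: dict) -> dict:
        return {"model": self.model_id, "result": "ok"}


def build_app(model_id: str):
    # One Serve application per model; the Control Application builds ~1000 of these.
    return Model.bind(model_id)
```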

Our Ray cluster setup, for context
We run two primary, static Serve applications on our Ray cluster:

  • Ingress Application: a simple FastAPI app that routes traffic to the dynamic Serve apps (i.e. the Serve apps deployed by the Control Application) via cached deployment handles (see the ingress sketch after this list)

  • Control Application:

    • Gets the list of models from our model registry
    • Deploys them as Serve apps via the run_many API, with all boolean params set to False (see the sketch after this list)
    • The Control Application itself is marked healthy only after all of the dynamic Serve apps become RUNNING
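Roughly, the control loop looks like the sketch below. Treat it as an approximation: the registry client and app naming are stand-ins, build_app is the per-model builder from the deployment sketch above, and since serve.run_many is a developer API, the RunTarget fields and run_many keyword names here are written from memory and should be read as assumptions rather than exact code.

```python
# Rough sketch of the Control Application's deploy step. The registry client,
# app naming, and the RunTarget / run_many keyword names are assumptions; the
# point is that the controller receives 1000+ application targets at once,
# without waiting for them to reach RUNNING.
from ray import serve
from ray.serve import RunTarget  # assumption about where RunTarget is exported

from model_registry import list_models  # hypothetical registry client


def deploy_all_models() -> None:
    targets = [
        RunTarget(
            app=build_app(model.model_id),  # build_app() from the deployment sketch above
            name=f"model-{model.model_id}",
            route_prefix=f"/model-{model.model_id}",
        )
        for model in list_models()  # 1000+ entries
    ]
    # "All bool params set to False": don't block, and don't wait for ingress
    # deployments to be created or for the apps to become RUNNING.
    serve.run_many(
        targets,
        blocking=False,
        wait_for_ingress_deployment_creation=False,
        wait_for_applications_running=False,
    )


def all_apps_running() -> bool:
    # Health gate: the Control Application reports healthy only once every
    # dynamic app is RUNNING (assumes the status field compares equal to "RUNNING").
    return all(a.status == "RUNNING" for a in serve.status().applications.values())
```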

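The Ingress Application follows roughly this pattern (also a sketch; the route shape and payload are illustrative, and serve.get_app_handle is shown as one way to resolve a handle by app name):

```python
# Simplified sketch of the Ingress Application: a FastAPI app that forwards
# requests to the dynamic Serve apps via cached DeploymentHandles. Routes and
# payload shape are illustrative.
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

fastapi_app = FastAPI()


@serve.deployment
@serve.ingress(fastapi_app)
class Ingress:
    def __init__(self):
        # Cache app name -> handle so we don't resolve it on every request.
        self._handles: dict[str, DeploymentHandle] = {}

    def _get_handle(self, app_name: str) -> DeploymentHandle:
        if app_name not in self._handles:
            self._handles[app_name] = serve.get_app_handle(app_name)
        return self._handles[app_name]

    @fastapi_app.post("/predict/{app_name}")
    async def predict(self, app_name: str, payload: dict) -> dict:
        return await self._get_handle(app_name).remote(payload)


ingress_app = Ingress.bind()
```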
Additionally, I see the errors/warnings below in the Serve controller logs multiple times before the crash (the second one is a bit concerning):

WARNING 2025-10-03 20:18:18,956 controller 968 -- Deployment 'model' in application '<test_serve_app>' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"no_gpu": 0.001, "ports": 1.0, "CPU": 0.25, "memory": 348127232}, total resources available: {"memory": 458769460648.0, "CPU": 650.11, "ports": 20973.0, "no_gpu": 1497.895}. Use `ray status` for more details.
Exception in callback _chain_future.<locals>._set_state(<Future finis...imeout 5.0s')>, <Future at 0x... returned str>) at /usr/lib64/python3.12/asyncio/futures.py:383
handle: <Handle _chain_future.<locals>._set_state(<Future finis...imeout 5.0s')>, <Future at 0x... returned str>) at /usr/lib64/python3.12/asyncio/futures.py:383>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib64/python3.12/asyncio/futures.py", line 385, in _set_state
    _copy_future_state(other, future)
  File "/usr/lib64/python3.12/asyncio/futures.py", line 355, in _copy_future_state
    assert not dest.done()
           ^^^^^^^^^^^^^^^
AssertionError
WARNING 2025-10-03 20:18:41,703 controller 968 -- Deployment '<dummy_deployment>' in application '<dummy_app>' has 1 replicas that have taken more than 30s to initialize.
This may be caused by a slow __init__ or reconfigure method.

What you expected to happen

The Serve controller should remain stable, even when faced with a large number of unschedulable actors due to a resource bottleneck.

It should gracefully queue the deployment requests and schedule the replicas as new nodes become available, without crashing.

Versions / Dependencies

Ray 2.47.1
Python 3.12

Reproduction script

I followed the steps below to reproduce this in my Kubernetes cluster.

For Scenario 1:

  1. Configure a Ray cluster with the autoscaler set to a minimum of 1 worker node and 1000+ Serve deployments.
  2. Ensure the autoscaler's ramp-up speed is slow, simulating a realistic delay in cloud node provisioning.
  3. Deploy the Ray cluster.
  4. Observe the Ray Serve controller logs on the head node; it crashes with the Check failed: objects_valid error. (A sketch for generating the 1000+ apps is below.)
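For step 1, something along the lines of the script below is enough to generate the 1000+ applications as a multi-app Serve config (to be applied with serve deploy, or embedded in a KubeRay RayService). The import path, resource requests, and app count are illustrative.

```python
# Generate a Serve config with 1000+ copies of a trivial app. The import_path
# points at a hypothetical module exposing a bound application; resource
# requests and counts are illustrative.
import yaml

NUM_APPS = 1000

config = {
    "applications": [
        {
            "name": f"app-{i:04d}",
            "route_prefix": f"/app-{i:04d}",
            "import_path": "dummy_model:app",  # hypothetical module with `app = Model.bind()`
            "deployments": [
                {
                    "name": "Model",
                    "num_replicas": 1,
                    "ray_actor_options": {
                        "num_cpus": 0.25,
                        "resources": {"no_gpu": 0.001, "ports": 1},
                    },
                }
            ],
        }
        for i in range(NUM_APPS)
    ]
}

with open("serve_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```

Applying this with serve deploy serve_config.yaml against a cluster whose autoscaler is still at 1 worker node leaves most replicas pending, which matches the state in which we see the crash.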

For Scenario 2:

  1. Configure a Ray cluster with

    • an autoscaler set to a minimum of 1 worker node
    • 1000+ Serve deployments with a high max_replicas
  2. Ensure the autoscaler's ramp-up speed is slow, simulating a realistic delay in cloud node provisioning.

  3. Deploy the Ray cluster.

  4. Use a Locust script to generate high traffic against one of the models, so that multiple worker nodes scale up to handle the load (a minimal locustfile sketch is below).
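For step 4, a minimal locustfile along these lines is enough; the route and payload are illustrative (in our setup the traffic would go through the ingress app's route for the hot model).

```python
# locustfile.py: hammer a single model's route so its replicas scale toward
# max_replicas. Route and payload are illustrative.
from locust import HttpUser, between, task


class SingleModelUser(HttpUser):
    wait_time = between(0.01, 0.05)

    @task
    def infer(self):
        self.client.post("/app-0042", json={"inputs": [1, 2, 3]})
```

Run it with something like locust -f locustfile.py --headless -u 500 -r 50 --host http://<head-node>:8000 and watch ray status while the new worker nodes come up.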

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P0, bug, community-backlog, core, stability
