
[Core] Serve Controller crashing when the cluster's worker nodes are slow to become Running #57173

@jugalshah291

Description


What happened + What you expected to happen

What happened

Sometimes the Ray Serve controller becomes unstable and repeatedly crashes with the Check failed: objects_valid error:

[2025-08-16 16:28:53,335 C 1654 1703] task_receiver.cc:192:  Check failed: objects_valid
*** StackTrace Information ***
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0x148371a) [0x7fbc4095971a] ray::operator<<()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7fbc4095c199] ray::RayLog::~RayLog()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa36c0c) [0x7fbc3ff0cc0c] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa89a72) [0x7fbc3ff5fa72] ray::core::InboundRequest::Accept()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core30OutOfOrderActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x11c) [0x7fbc3ff548cc] ray::core::OutOfOrderActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa7f32b) [0x7fbc3ff5532b] std::_Function_handler<>::_M_invoke()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa3c607) [0x7fbc3ff12607] std::_Function_handler<>::_M_invoke()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa80fd5) [0x7fbc3ff56fd5] boost::fibers::worker_context<>::run_()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa80c20) [0x7fbc3ff56c20] boost::context::detail::fiber_entry<>()
/app/virtualenv/lib64/python3.12/site-packages/ray/_raylet.so(+0xa8970f) [0x7fbc3ff5f70f] make_fcontext

I think we've identified the scenarios below in which the Ray Serve controller becomes unstable and repeatedly crashes with the Check failed: objects_valid error. It seems to occur specifically during large-scale deployments when the cluster's worker nodes are slow to become Running (possibly due to Kubernetes node scheduling).

Scenarios:

  1. A new cluster is deployed with autoscaling enabled, min worker replicas set to 1, and 1000+ dynamic Serve apps. The controller attempts to place the replicas, but the new worker nodes take a significant amount of time to be scheduled and started (likely due to Kubernetes node scheduling and sidecars taking time to come up). Before the cluster can stabilize, the controller crashes.

  2. An existing Ray cluster with 1000+ dynamic Serve apps receives a traffic spike for a single model and that model's replicas scale up to their maximum; however, the new worker nodes take time to reach the Running state and the controller eventually crashes. (A representative per-model deployment is sketched below.)
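For reference, each of the dynamic apps looks roughly like the sketch below. This is a minimal, illustrative version: the autoscaling values and class body are made up, while the per-replica resource request mirrors the controller warning quoted later in this issue.

```python
# Minimal sketch of one of the 1000+ dynamic Serve apps. Autoscaling values
# and the class body are illustrative; the per-replica resources mirror the
# controller warning quoted below.
from ray import serve


@serve.deployment(
    ray_actor_options={
        "num_cpus": 0.25,
        "memory": 348127232,
        "resources": {"no_gpu": 0.001, "ports": 1},  # custom placement resources
    },
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 32,  # scenario 2: a traffic spike pushes replicas toward this
        "target_ongoing_requests": 5,
    },
)
class Model:
    def __init__(self, model_id: str):
        # In the real app this would pull weights from the model registry.
        self.model_id = model_id

    async def __call__(self, payload: dict) -> dict:
        return {"model": self.model_id, "result": "ok"}


def build_app(model_id: str):
    # One Serve application per model; the Control Application builds ~1000 of these.
    return Model.bind(model_id)
```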

Our Ray cluster setup, for context
We run two primary, static Serve applications on our Ray cluster:

  • Ingress Application: a simple FastAPI app that routes traffic to the dynamic Serve apps (i.e. the Serve apps deployed by the Control Application) via cached deployment handles (see the ingress sketch after this list)

  • Control Application:

    • Gets the list of models from our model registry
    • Deploys them as Serve apps via the run_many API, with all boolean params set to False (see the sketch after this list)
    • The Control Application itself is marked healthy only after all of the dynamic Serve apps become RUNNING
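Roughly, the control loop looks like the sketch below. Treat it as an approximation: the registry client and app naming are stand-ins, build_app is the per-model builder from the deployment sketch above, and since serve.run_many is a developer API, the RunTarget fields and run_many keyword names here are written from memory and should be read as assumptions rather than exact code.

```python
# Rough sketch of the Control Application's deploy step. The registry client,
# app naming, and the RunTarget / run_many keyword names are assumptions; the
# point is that the controller receives 1000+ application targets at once,
# without waiting for them to reach RUNNING.
from ray import serve
from ray.serve import RunTarget  # assumption about where RunTarget is exported

from model_registry import list_models  # hypothetical registry client


def deploy_all_models() -> None:
    targets = [
        RunTarget(
            app=build_app(model.model_id),  # build_app() from the deployment sketch above
            name=f"model-{model.model_id}",
            route_prefix=f"/model-{model.model_id}",
        )
        for model in list_models()  # 1000+ entries
    ]
    # "All bool params set to False": don't block, and don't wait for ingress
    # deployments to be created or for the apps to become RUNNING.
    serve.run_many(
        targets,
        blocking=False,
        wait_for_ingress_deployment_creation=False,
        wait_for_applications_running=False,
    )


def all_apps_running() -> bool:
    # Health gate: the Control Application reports healthy only once every
    # dynamic app is RUNNING (assumes the status field compares equal to "RUNNING").
    return all(a.status == "RUNNING" for a in serve.status().applications.values())
```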

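The Ingress Application follows roughly this pattern (also a sketch; the route shape and payload are illustrative, and serve.get_app_handle is shown as one way to resolve a handle by app name):

```python
# Simplified sketch of the Ingress Application: a FastAPI app that forwards
# requests to the dynamic Serve apps via cached DeploymentHandles. Routes and
# payload shape are illustrative.
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

fastapi_app = FastAPI()


@serve.deployment
@serve.ingress(fastapi_app)
class Ingress:
    def __init__(self):
        # Cache app name -> handle so we don't resolve it on every request.
        self._handles: dict[str, DeploymentHandle] = {}

    def _get_handle(self, app_name: str) -> DeploymentHandle:
        if app_name not in self._handles:
            self._handles[app_name] = serve.get_app_handle(app_name)
        return self._handles[app_name]

    @fastapi_app.post("/predict/{app_name}")
    async def predict(self, app_name: str, payload: dict) -> dict:
        return await self._get_handle(app_name).remote(payload)


ingress_app = Ingress.bind()
```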
Additionally, I see the errors/warnings below in the Serve controller logs multiple times before the crash (the second one is a bit concerning):

WARNING 2025-10-03 20:18:18,956 controller 968 -- Deployment 'model' in application '<test_serve_app>' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"no_gpu": 0.001, "ports": 1.0, "CPU": 0.25, "memory": 348127232}, total resources available: {"memory": 458769460648.0, "CPU": 650.11, "ports": 20973.0, "no_gpu": 1497.895}. Use `ray status` for more details.
Exception in callback _chain_future.<locals>._set_state(<Future finis...imeout 5.0s')>, <Future at 0x... returned str>) at /usr/lib64/python3.12/asyncio/futures.py:383
handle: <Handle _chain_future.<locals>._set_state(<Future finis...imeout 5.0s')>, <Future at 0x... returned str>) at /usr/lib64/python3.12/asyncio/futures.py:383>
Traceback (most recent call last):
  File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib64/python3.12/asyncio/futures.py", line 385, in _set_state
    _copy_future_state(other, future)
  File "/usr/lib64/python3.12/asyncio/futures.py", line 355, in _copy_future_state
    assert not dest.done()
           ^^^^^^^^^^^^^^^
AssertionError
WARNING 2025-10-03 20:18:41,703 controller 968 -- Deployment '<dummy_deployment>' in application '<dummy_app>' has 1 replicas that have taken more than 30s to initialize.
This may be caused by a slow __init__ or reconfigure method.

What you expected to happen

The Serve controller should remain stable, even when faced with a large number of unschedulable actors due to a resource bottleneck.

It should gracefully queue the deployment requests and schedule the replicas as new nodes become available, without crashing.

Versions / Dependencies

Ray 2.47.1
Python 3.12

Reproduction script

I followed the steps below to reproduce this in my Kubernetes cluster.

For Scenario 1:

  1. Configure a Ray cluster with the autoscaler set to a minimum of 1 worker node and 1000+ Serve deployments.
  2. Ensure the autoscaler's ramp-up speed is slow, simulating a realistic delay in cloud node provisioning.
  3. Deploy the Ray cluster.
  4. Observe the Ray Serve controller logs on the head node; it crashes with the Check failed: objects_valid error. (A sketch for generating the 1000+ apps is below.)
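For step 1, something along the lines of the script below is enough to generate the 1000+ applications as a multi-app Serve config (to be applied with serve deploy, or embedded in a KubeRay RayService). The import path, resource requests, and app count are illustrative.

```python
# Generate a Serve config with 1000+ copies of a trivial app. The import_path
# points at a hypothetical module exposing a bound application; resource
# requests and counts are illustrative.
import yaml

NUM_APPS = 1000

config = {
    "applications": [
        {
            "name": f"app-{i:04d}",
            "route_prefix": f"/app-{i:04d}",
            "import_path": "dummy_model:app",  # hypothetical module with `app = Model.bind()`
            "deployments": [
                {
                    "name": "Model",
                    "num_replicas": 1,
                    "ray_actor_options": {
                        "num_cpus": 0.25,
                        "resources": {"no_gpu": 0.001, "ports": 1},
                    },
                }
            ],
        }
        for i in range(NUM_APPS)
    ]
}

with open("serve_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```

Applying this with serve deploy serve_config.yaml against a cluster whose autoscaler is still at 1 worker node leaves most replicas pending, which matches the state in which we see the crash.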

For Scenario 2:

  1. Configure a Ray cluster with

    • an autoscaler set to a minimum of 1 worker node
    • 1000+ Serve deployments with a high max_replicas
  2. Ensure the autoscaler's ramp-up speed is slow, simulating a realistic delay in cloud node provisioning.

  3. Deploy the Ray cluster.

  4. Use a Locust script to generate high traffic against one of the models, so that multiple worker nodes scale up to handle the load (a minimal locustfile sketch is below).
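For step 4, a minimal locustfile along these lines is enough; the route and payload are illustrative (in our setup the traffic would go through the ingress app's route for the hot model).

```python
# locustfile.py: hammer a single model's route so its replicas scale toward
# max_replicas. Route and payload are illustrative.
from locust import HttpUser, between, task


class SingleModelUser(HttpUser):
    wait_time = between(0.01, 0.05)

    @task
    def infer(self):
        self.client.post("/app-0042", json={"inputs": [1, 2, 3]})
```

Run it with something like locust -f locustfile.py --headless -u 500 -r 50 --host http://<head-node>:8000 and watch ray status while the new worker nodes come up.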

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P0, bug, community-backlog, core, stability
