
Commit acb6ed2

jinbum-kim authored and rueian committed
[core][autoscaler][v1] use cluster_resource_state for node state info to fix over-provisioning (#57130)
This PR is related to [https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864). The v1 autoscaler monitor currently pulls metrics from two different modules in GCS:

- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_` and updates it at two different intervals (`UpdateNodeResourceUsage` every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_` and updates it via `UpdateResourceLoadAndUsage`. This module is already the source for the v2 autoscaler.

| Data | Source | Update method | Update interval | Reference |
| -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- |
| Current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| Current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |

Because these two modules update asynchronously, the autoscaler can end up seeing inconsistent resource states. That causes a race condition where extra nodes may be launched before the updated availability actually shows up. In practice, clusters can become over-provisioned even though the demand was already satisfied.

In the long run, the right fix is to fully switch the v1 autoscaler over to `GcsAutoscalerStateManager::HandleGetClusterResourceState`, just as the v2 autoscaler already does. But since v1 will eventually be deprecated, this PR takes a practical interim step: it merges the necessary info from both `GcsResourceManager::HandleGetAllResourceUsage` and `GcsAutoscalerStateManager::HandleGetClusterResourceState` in a hybrid approach. This keeps v1 correct without big changes, while leaving the path open for a clean migration to v2 later on.

## Details

This PR follows the fix suggested by @rueian in #52864 by switching the v1 autoscaler's node state source from `GcsResourceManager::HandleGetAllResourceUsage` to `GcsAutoscalerStateManager::HandleGetClusterResourceState`.

Root cause: the v1 autoscaler previously pulled data from two asynchronous update cycles:

- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`

This created a race condition where newly allocated resources became visible before the demand metrics updated, causing the autoscaler to incorrectly perceive unmet demand and launch extra nodes.

The fix: by using v2's `HandleGetClusterResourceState` for node iteration, both current resources and pending demands now come from the same consistent snapshot (same tick), so the extra-node race condition goes away.
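The resulting read path in `update_load_metrics()` can be summarized roughly as below. This is a condensed sketch, not the PR's verbatim code: the `collect_cluster_metrics` helper is illustrative only, `get_cluster_resource_state` is the v2 helper the monitor imports (the same name the new test monkeypatches), and the `LoadMetrics` bookkeeping is elided.

```python
# Condensed sketch of the hybrid read path (illustrative, not the PR's exact code).
from ray.autoscaler._private.monitor import get_cluster_resource_state


def collect_cluster_metrics(gcs_client):
    # Legacy v1 reply: still the source of resource_load_by_shape,
    # placement_group_load, and the cluster_full flags.
    usage_reply = gcs_client.get_all_resource_usage(timeout=60)
    resources_batch_data = usage_reply.resource_usage_data

    # v2 snapshot: node IDs, IPs, total/available resources, and idle durations
    # all come from the same tick, so availability and demand cannot drift apart.
    cluster_resource_state = get_cluster_resource_state(gcs_client)

    nodes = []
    for node_state in cluster_resource_state.node_states:
        nodes.append(
            {
                "node_id": node_state.node_id.hex(),
                "ip": node_state.node_ip_address,
                "total": dict(node_state.total_resources),
                "available": dict(node_state.available_resources),
                "idle_s": node_state.idle_duration_ms / 1000.0,
            }
        )
    # Node data (from v2) and the legacy reply (for demand parsing) are then
    # fed into LoadMetrics.update() together.
    return nodes, resources_batch_data
```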
## Proposed Changes in update_load_metrics()

This PR updates how the v1 autoscaler collects cluster metrics. Most node state information is now taken from **v2 (`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while certain fields still rely on **v1 (`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 does not have an equivalent yet (a sketch of those legacy reads follows the notes below).

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | Yes | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | Same as before | No | v2 does not separate ready vs. infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | Same as before | No | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | Response flag (`cluster_full_of_actors_detected`) | Same as before | No | No replacement in v2 fields, so kept as is. |

### Additional Notes

- This hybrid approach addresses the race condition while still using legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the change is backward-compatible and does not break existing behavior.
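For the fields that stay on v1 (the "No" rows above), the monitor keeps reading them from the legacy `GetAllResourceUsage` reply. The sketch below illustrates those reads; the `collect_legacy_fields` helper is illustrative only, field names are assumed to follow `gcs.proto`, and the `LoadMetrics` update itself is elided.

```python
# Illustrative sketch of the legacy fields still read from the v1 reply.
from ray.autoscaler._private.monitor import parse_resource_demands


def collect_legacy_fields(usage_reply):
    # usage_reply: a GetAllResourceUsageReply from gcs_client.get_all_resource_usage().
    data = usage_reply.resource_usage_data

    # waiting_bundles / infeasible_bundles: v1 still separates ready vs. infeasible demand.
    waiting_bundles, infeasible_bundles = parse_resource_demands(
        data.resource_load_by_shape
    )

    # pending_placement_groups: no validated v2 equivalent yet.
    pending_placement_groups = list(data.placement_group_load.placement_group_data)

    # cluster_full: per-node flags combined with the GCS-level flag.
    cluster_full = any(
        getattr(entry, "cluster_full_of_actors_detected", False)
        for entry in data.batch
    ) or getattr(usage_reply, "cluster_full_of_actors_detected_by_gcs", False)

    return waiting_bundles, infeasible_bundles, pending_placement_groups, cluster_full
```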
## Changed Behavior (Observed)

(The autoscaler config and serving code are the same as in [https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864).)

After switching to the v2 cluster resource state, the issue no longer occurs:

- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, the Node Provider launches only a single `ray.worker.4090.standard` node. (No extra requests for additional nodes are observed.)

[debug.log](https://github.com/user-attachments/files/22659163/debug.log)

## Related issue number

Closes #52864

Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>

1 parent 5ecf9a5 commit acb6ed2

2 files changed (+98, -14 lines)


python/ray/autoscaler/_private/monitor.py

Lines changed: 17 additions & 14 deletions
@@ -243,6 +243,9 @@ def get_latest_readonly_config():
     def update_load_metrics(self):
         """Fetches resource usage data from GCS and updates load metrics."""

+        # TODO(jinbum-kim): Still needed since some fields aren't in cluster_resource_state.
+        # Remove after v1 autoscaler fully migrates to get_cluster_resource_state().
+        # ref: https://github.com/ray-project/ray/pull/57130
         response = self.gcs_client.get_all_resource_usage(timeout=60)
         resources_batch_data = response.resource_usage_data
         log_resource_batch_data_if_desired(resources_batch_data)
@@ -259,41 +262,41 @@ def update_load_metrics(self):
         # Tell the readonly node provider what nodes to report.
         if self.readonly_config:
             new_nodes = []
-            for msg in list(resources_batch_data.batch):
+            for msg in list(cluster_resource_state.node_states):
                 node_id = msg.node_id.hex()
-                new_nodes.append((node_id, msg.node_manager_address))
+                new_nodes.append((node_id, msg.node_ip_address))
             self.autoscaler.provider._set_nodes(new_nodes)

         mirror_node_types = {}
-        cluster_full = False
+        legacy_cluster_full_detected = any(
+            getattr(entry, "cluster_full_of_actors_detected", False)
+            for entry in resources_batch_data.batch
+        )
+        cluster_full = legacy_cluster_full_detected or getattr(
+            response, "cluster_full_of_actors_detected_by_gcs", False
+        )
         if (
             hasattr(response, "cluster_full_of_actors_detected_by_gcs")
             and response.cluster_full_of_actors_detected_by_gcs
         ):
             # GCS has detected the cluster full of actors.
             cluster_full = True
-        for resource_message in resources_batch_data.batch:
+        for resource_message in cluster_resource_state.node_states:
             node_id = resource_message.node_id
             # Generate node type config based on GCS reported node list.
             if self.readonly_config:
                 # Keep prefix in sync with ReadonlyNodeProvider.
                 node_type = format_readonly_node_type(node_id.hex())
                 resources = {}
-                for k, v in resource_message.resources_total.items():
+                for k, v in resource_message.total_resources.items():
                     resources[k] = v
                 mirror_node_types[node_type] = {
                     "resources": resources,
                     "node_config": {},
                     "max_workers": 1,
                 }
-            if (
-                hasattr(resource_message, "cluster_full_of_actors_detected")
-                and resource_message.cluster_full_of_actors_detected
-            ):
-                # A worker node has detected the cluster full of actors.
-                cluster_full = True
-            total_resources = dict(resource_message.resources_total)
-            available_resources = dict(resource_message.resources_available)
+            total_resources = dict(resource_message.total_resources)
+            available_resources = dict(resource_message.available_resources)

             waiting_bundles, infeasible_bundles = parse_resource_demands(
                 resources_batch_data.resource_load_by_shape
@@ -319,7 +322,7 @@ def update_load_metrics(self):
                 else:
                     ip = node_id.hex()
             else:
-                ip = resource_message.node_manager_address
+                ip = resource_message.node_ip_address

             idle_duration_s = 0.0
             if node_id in ray_nodes_idle_duration_ms_by_id:

python/ray/tests/test_monitor.py

Lines changed: 81 additions & 0 deletions
@@ -1,10 +1,17 @@
 import sys
+import types

 import pytest

 import ray
 import ray._private.gcs_utils as gcs_utils
+from ray.autoscaler._private import (
+    load_metrics as load_metrics_module,
+    monitor as monitor_module,
+)
+from ray.autoscaler._private.load_metrics import LoadMetrics
 from ray.autoscaler._private.monitor import parse_resource_demands
+from ray.core.generated import autoscaler_pb2, gcs_service_pb2

 ray.experimental.internal_kv.redis = False

@@ -51,5 +58,79 @@ def test_parse_resource_demands():
     assert len(waiting + infeasible) == 10


+def test_update_load_metrics_uses_cluster_state(monkeypatch):
+    """Ensure cluster_resource_state fields flow into LoadMetrics.
+
+    Verify node data comes from cluster_resource_state while demand parsing
+    still uses resource_load_by_shape.
+    """
+
+    monitor = monitor_module.Monitor.__new__(monitor_module.Monitor)
+    monitor.gcs_client = types.SimpleNamespace()
+    monitor.load_metrics = LoadMetrics()
+    monitor.autoscaler = types.SimpleNamespace(config={"provider": {}})
+    monitor.autoscaling_config = None
+    monitor.readonly_config = None
+    monitor.prom_metrics = None
+    monitor.event_summarizer = None
+
+    usage_reply = gcs_service_pb2.GetAllResourceUsageReply()
+    demand = (
+        usage_reply.resource_usage_data.resource_load_by_shape.resource_demands.add()
+    )
+    demand.shape["CPU"] = 1.0
+    demand.num_ready_requests_queued = 2
+    demand.backlog_size = 1
+
+    monitor.gcs_client.get_all_resource_usage = lambda timeout: usage_reply
+
+    cluster_state = autoscaler_pb2.ClusterResourceState()
+    node_state = cluster_state.node_states.add()
+    node_state.node_id = bytes.fromhex("ab" * 20)
+    node_state.node_ip_address = "1.2.3.4"
+    node_state.total_resources["CPU"] = 4.0
+    node_state.available_resources["CPU"] = 1.5
+    node_state.idle_duration_ms = 1500
+
+    monkeypatch.setattr(
+        monitor_module, "get_cluster_resource_state", lambda gcs_client: cluster_state
+    )
+
+    seen = {}
+    orig_parse = monitor_module.parse_resource_demands
+
+    def spy_parse(arg):
+        # Spy on the legacy parser to ensure resource_load_by_shape still feeds it.
+        seen["arg"] = arg
+        return orig_parse(arg)
+
+    monkeypatch.setattr(monitor_module, "parse_resource_demands", spy_parse)
+
+    fixed_time = 1000.0
+    monkeypatch.setattr(
+        load_metrics_module, "time", types.SimpleNamespace(time=lambda: fixed_time)
+    )
+
+    monitor.update_load_metrics()
+
+    resources = monitor.load_metrics.static_resources_by_ip
+    assert resources["1.2.3.4"]["CPU"] == pytest.approx(4.0)
+
+    usage = monitor.load_metrics.dynamic_resources_by_ip
+    assert usage["1.2.3.4"]["CPU"] == pytest.approx(1.5)
+
+    assert seen["arg"] is usage_reply.resource_usage_data.resource_load_by_shape
+
+    assert monitor.load_metrics.pending_placement_groups == []
+
+    waiting = monitor.load_metrics.waiting_bundles
+    infeasible = monitor.load_metrics.infeasible_bundles
+    assert waiting.count({"CPU": 1.0}) == 3
+    assert not infeasible
+
+    last_used = monitor.load_metrics.ray_nodes_last_used_time_by_ip["1.2.3.4"]
+    assert last_used == pytest.approx(fixed_time - 1.5)
+
+
 if __name__ == "__main__":
     sys.exit(pytest.main(["-sv", __file__]))
