[core][autoscaler][v1] use cluster_resource_state for node state info to fix over-provisioning #57130
Conversation
Code Review
This pull request addresses an over-provisioning issue in the v1 autoscaler by switching the source of node state information to cluster_resource_state for more consistent data. The changes in python/ray/autoscaler/_private/monitor.py correctly implement this hybrid approach, and the new test file python/ray/tests/test_monitor_cluster_state.py provides good coverage for the new logic. My review includes a recommendation to fix a pre-existing performance issue where cluster-wide computations are redundantly performed within a per-node loop, and a minor suggestion to improve code conciseness. The changes are a good step towards resolving the race condition.
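Hoisting the loop-invariant computation is the standard fix for that kind of issue; a generic sketch follows (the concrete computation in monitor.py may differ, and `parse_resource_demands` / `update_node_metrics` stand in for the real calls):

```python
# Before: a cluster-wide value is recomputed identically on every
# iteration of the per-node loop.
for node_state in cluster_resource_state.node_states:
    demands = parse_resource_demands(resource_load_by_shape)  # same result each time
    update_node_metrics(node_state, demands)

# After: compute the invariant once and reuse it inside the loop.
demands = parse_resource_demands(resource_load_by_shape)
for node_state in cluster_resource_state.node_states:
    update_node_metrics(node_state, demands)
```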
```diff
 resources = {}
-for k, v in resource_message.resources_total.items():
+for k, v in resource_message.total_resources.items():
     resources[k] = v
```
This loop to create the resources dictionary can be simplified by using the dict() constructor directly.
```diff
-resources = {}
-for k, v in resource_message.total_resources.items():
-    resources[k] = v
+resources = dict(resource_message.total_resources)
```
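(For reference, protobuf map fields implement Python's mapping protocol, so the `dict()` constructor copies them directly; the behavior is identical to the explicit loop.)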
… to fix over-provisioning
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
(force-pushed 73524c5 → b7a670b)
@rueian could you review?

Would be great to see it merged soon.

Sure!

Hi @nadongjun, by the way, we now recommend using autoscaler v2 directly.
```python
def update_load_metrics(self):
    """Fetches resource usage data from GCS and updates load metrics."""

    response = self.gcs_client.get_all_resource_usage(timeout=60)
```
Can we remove this?
No, monitor.py still needs the response from the v1 method because some fields rely on it.

As I mentioned in the "Proposed Changes in update_load_metrics()" section, several metrics passed into LoadMetrics.update() have no direct equivalent in cluster_resource_state (#L253), or would require a bigger refactor to align. To minimize changes, these fields continue to use the v1 method.
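For illustration, a minimal sketch of the hybrid read described here, using the field and helper names mentioned in this PR; the v2 accessor name and the LoadMetrics plumbing are simplified stand-ins, not the exact monitor.py code:

```python
from ray.autoscaler._private.util import parse_resource_demands

def update_load_metrics(self):
    """Sketch of the hybrid approach (simplified, not the exact code)."""
    # v1 (legacy) reply: still required for demand-related fields that
    # have no cluster_resource_state equivalent yet.
    response = self.gcs_client.get_all_resource_usage(timeout=60)
    resources_batch_data = response.resource_usage_data

    # v2 snapshot: node states (id, ip, resources, idle duration) come
    # from one consistent view, the same source the v2 autoscaler uses.
    state = get_cluster_resource_state(self.gcs_client)  # assumed accessor

    for node_state in state.node_states:
        resources = dict(node_state.total_resources)
        # ... feed node id / ip / resources / idle duration into LoadMetrics ...

    # Demand fields keep using the v1 reply (see the table in the PR body).
    waiting_bundles, infeasible_bundles = parse_resource_demands(
        resources_batch_data.resource_load_by_shape
    )
```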
Got it. Could you also add a TODO comment on top of it explaining why we still need it and what should be done to remove it?
Sure! I've just added a TODO comment above the line explaining why it's still needed and how it can be removed.
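Something along these lines (the exact wording in monitor.py may differ):

```python
# TODO: Remove this v1 call once waiting_bundles/infeasible_bundles,
# pending_placement_groups, and cluster_full_of_actors_detected have
# equivalents in cluster_resource_state; until then LoadMetrics.update()
# still needs these fields from the legacy reply.
response = self.gcs_client.get_all_resource_usage(timeout=60)
```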
…igration Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Pull Request Overview
This PR addresses a race condition in the v1 autoscaler that causes over-provisioning by switching the node state source from GcsResourceManager to GcsAutoscalerStateManager. The fix ensures consistent resource and demand data by using the same snapshot source (cluster_resource_state) that the v2 autoscaler already uses.
Key Changes:
- Switch the node state data source from `resources_batch_data.batch` to `cluster_resource_state.node_states`
- Use `cluster_resource_state` for node iteration while keeping legacy fields for resource demands
- Add comprehensive test coverage for the hybrid approach
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/ray/autoscaler/_private/monitor.py | Updated update_load_metrics() to use cluster_resource_state for node information while maintaining legacy resource demand parsing |
| python/ray/tests/test_monitor.py | Added test to verify the hybrid approach works correctly, ensuring node data comes from cluster_resource_state |
LGTM, but tests are failing.
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
@rueian I found an issue in the code and fixed it; all tests now pass. In detail, although I mentioned that the cluster_full flag was derived from the legacy v1 data, I discovered that the code was actually using the v2 data inside the loop. I fixed it to correctly use the v1 data and confirmed that all tests now pass.
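A two-line sketch of the corrected source, assuming the reply-level flag named in the PR description (the exact field placement is an assumption):

```python
# cluster_full has no v2 equivalent, so read it once from the legacy
# v1 reply instead of inside the cluster_resource_state node loop.
cluster_full = response.cluster_full_of_actors_detected
```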
All tests pass. @jjyao, please merge this when you get a chance. 🙏
Why are these changes needed?
This PR is related to #52864
The v1 autoscaler monitor currently pulls metrics from two different modules in GCS:

- `GcsResourceManager` (v1, legacy): manages `node_resource_usages_` and updates it at two different intervals (`UpdateNodeResourceUsage` every 0.1s, `UpdateResourceLoads` every 1s).
- `GcsAutoscalerStateManager` (v2): manages `node_resource_info_` and updates it via `UpdateResourceLoadAndUsage`. This module is already the source for the v2 autoscaler.

| Data | Reported via | Update method | Interval | Code |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |

Because these two modules update asynchronously, the autoscaler can end up seeing inconsistent resource states. That causes a race condition where extra nodes may be launched before the updated availability actually shows up. In practice, this means clusters can become over-provisioned even though the demand was already satisfied.
In the long run, the right fix is to fully switch the v1 autoscaler over to GcsAutoscalerStateManager::HandleGetClusterResourceState, just like v2 already does. But since v1 will eventually be deprecated, this PR takes a practical interim step: it merges the necessary info from both GcsResourceManager::HandleGetAllResourceUsage and GcsAutoscalerStateManager::HandleGetClusterResourceState in a hybrid approach.
This keeps v1 correct without big changes, while still leaving the path open for a clean migration to v2 later on.
Details
This PR follows the fix suggested by @rueian in #52864 by switching the v1 autoscaler's node state source from `GcsResourceManager::HandleGetAllResourceUsage` to `GcsAutoscalerStateManager::HandleGetClusterResourceState`.

Root Cause: The v1 autoscaler previously pulled its data from two asynchronous update cycles:

- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`

This created a race condition where newly allocated resources would be visible before demand metrics updated, causing the autoscaler to incorrectly perceive unmet demand and launch extra nodes.
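To make the race concrete, a toy model (made-up numbers; not the autoscaler's actual bin-packing logic):

```python
def needs_new_node(free_cpus: int, pending_demand: list) -> bool:
    """Toy policy: launch a node iff pending demand doesn't fit in free capacity."""
    return sum(d["CPU"] for d in pending_demand) > free_cpus

# t = 0.0s: a {"CPU": 4} request arrives, nothing is free -> first node (correct).
assert needs_new_node(0, [{"CPU": 4}]) is True

# t ~ 1.0s: the task now runs on the new node. The fast path (100ms) already
# shows its CPUs as used, but the slow path (1s) still reports the request as
# pending -> a second, unnecessary node is requested.
assert needs_new_node(0, [{"CPU": 4}]) is True  # over-provisioning

# Reading free capacity and pending demand from the same
# cluster_resource_state snapshot removes this stale pairing.
```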
The Fix: By using v2's `HandleGetClusterResourceState` for node iteration, both current resources and pending demands now come from the same consistent snapshot (same tick), so the extra-node race condition goes away.

Proposed Changes in update_load_metrics()
This PR updates how the v1 autoscaler collects cluster metrics.
Most node state information is now taken from v2 (`GcsAutoscalerStateManager::HandleGetClusterResourceState`), while certain fields still rely on v1 (`GcsResourceManager::HandleGetAllResourceUsage`) because v2 doesn't have an equivalent yet.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |

Additional Notes

- This hybrid approach addresses the race condition while still using legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the change is backward-compatible and does not break existing behavior.
Changed Behavior (Observed)
(Autoscaler config & serving code are the same as in #52864)
After switching to v2 autoscaler state (cluster resource), the issue no longer occurs:
- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, Node Provider only launches a single `ray.worker.4090.standard` node. (No extra requests for additional nodes are observed.) [debug.log](https://github.com/user-attachments/files/22659163/debug.log)
Related issue number
Closes #52864
Checks