
Conversation

Contributor

@jinbum-kim jinbum-kim commented Oct 2, 2025

Why are these changes needed?

This PR is related to #52864.
The v1 autoscaler monitor currently pulls metrics from two different modules in GCS:

  • GcsResourceManager (v1, legacy): manages node_resource_usages_ and updates it at two different intervals (UpdateNodeResourceUsage every 0.1s, UpdateResourceLoads every 1s).
  • GcsAutoscalerStateManager (v2): manages node_resource_info_ and updates it via UpdateResourceLoadAndUsage. This module is already the source for the v2 autoscaler.
| Field | Source | Update path | Interval | Code |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |

Because these two modules update asynchronously, the autoscaler can end up seeing inconsistent resource states. That causes a race condition where extra nodes may be launched before the updated availability actually shows up. In practice, this means clusters can become over-provisioned even though the demand was already satisfied.
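The mismatch between the two update cadences can be illustrated with a toy simulation (not Ray code; the tick granularity, the 4-CPU node, and its join time are all invented for illustration):

```python
# Toy model of the race: availability refreshes every tick (~100ms),
# demand only every 10 ticks (~1s). A hypothetical 4-CPU node joins at
# tick 5 and immediately satisfies the pending 4-CPU demand.

def autoscaler_view(tick, resource_period=1, load_period=10):
    """Return (available_cpus, pending_cpus) as seen at a given tick."""
    node_joins_at = 5
    # Availability snapshot: refreshed every resource_period ticks.
    last_resource_tick = tick - (tick % resource_period)
    available = 4 if last_resource_tick >= node_joins_at else 0
    # Demand snapshot: refreshed only every load_period ticks.
    last_load_tick = tick - (tick % load_period)
    pending = 0 if last_load_tick >= node_joins_at else 4
    return available, pending

# Between ticks 5 and 9 the autoscaler sees 4 free CPUs AND 4 pending
# CPUs at once, which looks like unmet demand and can trigger an extra
# node launch.
inconsistent_ticks = [t for t in range(20) if autoscaler_view(t) == (4, 4)]
print(inconsistent_ticks)  # [5, 6, 7, 8, 9]
```

Reading both values from a single snapshot (as the fix below does) makes this inconsistent window impossible by construction.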

In the long run, the right fix is to fully switch the v1 autoscaler over to GcsAutoscalerStateManager::HandleGetClusterResourceState, just like v2 already does. But since v1 will eventually be deprecated, this PR takes a practical interim step: it merges the necessary info from both GcsResourceManager::HandleGetAllResourceUsage and GcsAutoscalerStateManager::HandleGetClusterResourceState in a hybrid approach.

This keeps v1 correct without big changes, while still leaving the path open for a clean migration to v2 later on.

Details

This PR follows the fix suggested by @rueian in #52864 by switching the v1 autoscaler's node state source from GcsResourceManager::HandleGetAllResourceUsage to GcsAutoscalerStateManager::HandleGetClusterResourceState.

Root Cause: The v1 autoscaler previously got data from two asynchronous update cycles:

  • Node resources: updated every ~100ms via UpdateNodeResourceUsage
  • Resource demands: updated every ~1s via UpdateResourceLoads

This created a race condition where newly allocated resources would be visible before demand metrics updated, causing the autoscaler to incorrectly perceive unmet demand and launch extra nodes.

The Fix: By using v2's HandleGetClusterResourceState for node iteration, both current resources and pending demands now come from the same consistent snapshot (same tick), so the extra-node race condition goes away.

Proposed Changes in update_load_metrics()

This PR updates how the v1 autoscaler collects cluster metrics.

Most node state information is now taken from v2 (GcsAutoscalerStateManager::HandleGetClusterResourceState), while certain fields still rely on v1 (GcsResourceManager::HandleGetAllResourceUsage) because v2 doesn't have an equivalent yet.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |
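The hybrid split above can be sketched as a merge function (a hedged illustration using plain dicts; the real code reads protobuf responses, and the field names and structure here are simplified stand-ins):

```python
# Illustrative sketch of the hybrid approach: node state comes from one
# consistent v2 snapshot, while demand/placement-group fields still come
# from the legacy v1 response. Not the actual monitor.py code.

def merge_load_metrics(cluster_resource_state, all_resource_usage):
    """Combine v2 node states with v1 demand fields into one view."""
    nodes = {}
    # v2: every node's resources and idle time come from the same tick.
    for node in cluster_resource_state["node_states"]:
        nodes[node["node_id"]] = {
            "ip": node["node_ip_address"],
            "total": node["total_resources"],
            "available": node["available_resources"],
            "idle_ms": node["idle_duration_ms"],
        }
    # v1: fields with no validated v2 equivalent yet.
    demands = all_resource_usage["resource_load_by_shape"]
    pgs = all_resource_usage["placement_group_load"]
    cluster_full = all_resource_usage["cluster_full_of_actors_detected"]
    return nodes, demands, pgs, cluster_full
```

The key property is that everything in `nodes` is internally consistent, so the race only remains possible for the legacy demand fields, which is exactly the residual gap the table documents.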

Additional Notes

  • This hybrid approach addresses the race condition while still using legacy fields where v2 has no equivalent.
  • All existing autoscaler monitor tests still pass, which shows that the change is backward-compatible and does not break existing behavior.

Changed Behavior (Observed)

(Autoscaler config & serving code are the same as in #52864)

After switching to v2 autoscaler state (cluster resource), the issue no longer occurs:

  • Even with gcs_pull_resource_loads_period_milliseconds=20000, Node Provider only launches a single ray.worker.4090.standard node. (No extra requests for additional nodes are observed.)

debug.log

Related issue number

Closes #52864

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
  • Unit tests
  • Release tests
  • This PR is not tested :(

@jinbum-kim jinbum-kim requested a review from a team as a code owner October 2, 2025 12:44
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an over-provisioning issue in the v1 autoscaler by switching the source of node state information to cluster_resource_state for more consistent data. The changes in python/ray/autoscaler/_private/monitor.py correctly implement this hybrid approach, and the new test file python/ray/tests/test_monitor_cluster_state.py provides good coverage for the new logic. My review includes a recommendation to fix a pre-existing performance issue where cluster-wide computations are redundantly performed within a per-node loop, and a minor suggestion to improve code conciseness. The changes are a good step towards resolving the race condition.

Comment on lines 281 to 283

```diff
  resources = {}
- for k, v in resource_message.resources_total.items():
+ for k, v in resource_message.total_resources.items():
      resources[k] = v
```
Contributor

medium

This loop to create the resources dictionary can be simplified by using the dict() constructor directly.

Suggested change

```diff
- resources = {}
- for k, v in resource_message.total_resources.items():
-     resources[k] = v
+ resources = dict(resource_message.total_resources)
```
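For context on why the suggestion works: `dict()` accepts any mapping, and protobuf map fields behave like mappings, so the one-liner is equivalent to the loop. A sketch using a plain-Python stand-in for the message (the class below is illustrative, not the real protobuf type):

```python
# Stand-in for the protobuf message; in the real code, total_resources
# is a generated map field that also supports .items() and dict().
class FakeResourceMessage:
    def __init__(self):
        self.total_resources = {"CPU": 8.0, "GPU": 1.0}

resource_message = FakeResourceMessage()

# Loop form from the original code:
resources_loop = {}
for k, v in resource_message.total_resources.items():
    resources_loop[k] = v

# Equivalent one-liner from the suggestion:
resources = dict(resource_message.total_resources)

assert resources == resources_loop
```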

@jinbum-kim jinbum-kim marked this pull request as draft October 2, 2025 12:47

@jinbum-kim jinbum-kim marked this pull request as ready for review October 2, 2025 13:06
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Oct 2, 2025
… to fix over-provisioning

Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
@jinbum-kim jinbum-kim force-pushed the monitor-cluster-state-sync branch from 73524c5 to b7a670b Compare October 2, 2025 13:35
@jjyao
Collaborator

jjyao commented Oct 2, 2025

@rueian could you review?

@nadongjun
Contributor

Would be great to see it merged soon.

@rueian
Contributor

rueian commented Oct 3, 2025

@rueian could you review?

Sure!

Would be great to see it merged soon.

Hi @nadongjun, by the way, it is now recommended to use autoscaler v2 directly.

```python
def update_load_metrics(self):
    """Fetches resource usage data from GCS and updates load metrics."""

    response = self.gcs_client.get_all_resource_usage(timeout=60)
```
Contributor

Can we remove this?

Contributor Author

No, monitor.py still needs the response from the v1 method because some fields rely on it.

As I mentioned in the "Proposed Changes in update_load_metrics()" section, several metrics passed into LoadMetrics.update() don't have a direct equivalent in cluster_resource_state (#L253), or would require a bigger refactor to align. To minimize changes, these fields continue to use the v1 method.

Contributor

Got it. Could you also add a TODO comment on top of it explaining why we still need it and what should be done to remove it?

Contributor Author

Sure! I've added a TODO comment above the line explaining why it's still needed and how it can be removed.

@rueian rueian requested a review from Copilot October 10, 2025 22:32
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses a race condition in the v1 autoscaler that causes over-provisioning by switching the node state source from GcsResourceManager to GcsAutoscalerStateManager. The fix ensures consistent resource and demand data by using the same snapshot source (cluster_resource_state) that the v2 autoscaler already uses.

Key Changes:

  • Switch node state data source from resources_batch_data.batch to cluster_resource_state.node_states
  • Use cluster_resource_state for node iteration while keeping legacy fields for resource demands
  • Add comprehensive test coverage for the hybrid approach

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
python/ray/autoscaler/_private/monitor.py Updated update_load_metrics() to use cluster_resource_state for node information while maintaining legacy resource demand parsing
python/ray/tests/test_monitor.py Added test to verify the hybrid approach works correctly, ensuring node data comes from cluster_resource_state


@rueian
Contributor

rueian commented Oct 10, 2025

LGTM, but tests are failing.

@jinbum-kim
Contributor Author

@rueian I found an issue in the code and fixed it; all tests now pass.

Specifically, although I mentioned that the cluster_full flag was derived from the legacy v1 data, the code was actually using the v2 data inside the loop. I fixed it to correctly use the v1 data and confirmed that all tests pass.

@rueian rueian added the go add ONLY when ready to merge, run all tests label Oct 11, 2025
@rueian
Contributor

rueian commented Oct 11, 2025

All tests passed. @jjyao, please merge this when you get a chance. 🙏

@edoakes edoakes merged commit 216b780 into ray-project:master Oct 13, 2025
7 checks passed
@jinbum-kim jinbum-kim deleted the monitor-cluster-state-sync branch October 13, 2025 14:01
harshit-anyscale pushed a commit that referenced this pull request Oct 15, 2025
… to fix over-provisioning (#57130)

justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
… to fix over-provisioning (ray-project#57130)

xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
… to fix over-provisioning (ray-project#57130)

elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
… to fix over-provisioning (#57130)

This PR is related to
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864)
The v1 autoscaler monitor currently is pulling metrics from two
different modules in GCS:

- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_`
and updates it at two different intervals (`UpdateNodeResourceUsage`
every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_`
and updates it via `UpdateResourceLoadAndUsage`. This module is already
the source for the v2 autoscaler.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| -------------------------- | -------------------------- |
-------------------------- | -------------------------- |
-------------------------- |
| current cluster resources | RaySyncer |
`GcsResourceManager::UpdateNodeResourceUsage` | 100ms
(`raylet_report_resources_period_milliseconds`) |
[gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170)
|
| current pending resources | GcsServer |
`GcsResourceManager::UpdateResourceLoads` | 1s
(`gcs_pull_resource_loads_period_milliseconds`) |
[gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422)
|

Because these two modules update asynchronously, the autoscaler can end
up seeing inconsistent resource states. That causes a race condition
where extra nodes may be launched before the updated availability
actually shows up. In practice, this means clusters can become
over-provisioned even though the demand was already satisfied.

In the long run, the right fix is to fully switch the v1 autoscaler over
to GcsAutoscalerStateManager::HandleGetClusterResourceState, just like
v2 already does. But since v1 will eventually be deprecated, this PR
takes a practical interim step: it merges the necessary info from both
GcsResourceManager::HandleGetAllResourceUsage and
GcsAutoscalerStateManager::HandleGetClusterResourceState in a hybrid
approach.

This keeps v1 correct without big changes, while still leaving the path
open for a clean migration to v2 later on.

## Details

This PR follows the fix suggested by @rueian in #52864 by switching the
v1 autoscaler's node state source from
`GcsResourceManager::HandleGetAllResourceUsage` to
`GcsAutoscalerStateManager::HandleGetClusterResourceState`.

Root Cause: The v1 autoscaler previously getting data from two
asynchronous update cycles:

- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`

This created a race condition where newly allocated resources would be
visible before demand metrics updated, causing the autoscaler to
incorrectly perceive unmet demand and launch extra nodes.

The Fix: By using v2's `HandleGetClusterResourceState` for node
iteration, both current resources and pending demands now come from the
same consistent snapshot (same tick), so the extra-node race condition
goes away.

## Proposed Changes in update_load_metrics()

This PR updates how the v1 autoscaler collects cluster metrics.

Most node state information is now taken from **v2
(`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while
certain fields still rely on **v1
(`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 doesn't
have an equivalent yet.

| Field | Source (before, v1) | Source (after) | Changed? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | Yes | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | No | v2 does not separate ready vs. infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | No | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | No | No replacement in v2 fields, so kept as is. |
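
The hybrid merge can be sketched as follows. This is a simplified illustration, not the actual patch: the dicts stand in for the real protobuf replies, and the field names mirror the protos referenced above but the shapes are flattened for readability.

```python
# Stand-in for GcsAutoscalerStateManager::HandleGetClusterResourceState (v2).
v2_reply = {
    "node_states": [
        {"node_id": "abc", "node_ip_address": "10.0.0.1",
         "total_resources": {"CPU": 4.0}, "available_resources": {"CPU": 4.0},
         "idle_duration_ms": 5000},
    ],
}

# Stand-in for GcsResourceManager::HandleGetAllResourceUsage (v1).
v1_reply = {
    "resource_load_by_shape": [
        {"shape": {"CPU": 4.0},
         "num_ready_requests_queued": 1,
         "num_infeasible_requests_queued": 0},
    ],
    "placement_group_load": [],
    "cluster_full_of_actors_detected": False,
}

def update_load_metrics(v1_reply, v2_reply):
    # Node identity, usage, and idle time come from the v2 snapshot, so
    # resources and demand reflect a single consistent tick.
    nodes = [
        (n["node_id"], n["node_ip_address"], n["total_resources"],
         n["available_resources"], n["idle_duration_ms"])
        for n in v2_reply["node_states"]
    ]
    # Demand-side fields with no validated v2 equivalent yet stay on v1.
    waiting = [s for s in v1_reply["resource_load_by_shape"]
               if s["num_ready_requests_queued"] > 0]
    infeasible = [s for s in v1_reply["resource_load_by_shape"]
                  if s["num_infeasible_requests_queued"] > 0]
    return {
        "nodes": nodes,
        "waiting_bundles": waiting,
        "infeasible_bundles": infeasible,
        "pending_placement_groups": v1_reply["placement_group_load"],
        "cluster_full": v1_reply["cluster_full_of_actors_detected"],
    }
```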

### Additional Notes

- This hybrid approach addresses the race condition while still using
legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the
change is backward-compatible and does not break existing behavior.

## Changed Behavior (Observed)

(The autoscaler config and serving code are the same as in
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864).)

After switching to v2 autoscaler state (cluster resource), the issue no
longer occurs:

- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, Node
Provider only launches a single `ray.worker.4090.standard` node. (No
extra requests for additional nodes are observed.)


[debug.log](https://github.com/user-attachments/files/22659163/debug.log)

## Related issue number

Closes #52864 

Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>