Commit acb6ed2
[core][autoscaler][v1] use cluster_resource_state for node state info to fix over-provisioning (#57130)
This PR is related to
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864)
The v1 autoscaler monitor currently pulls metrics from two
different modules in GCS:
- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_`
and updates it at two different intervals (`UpdateNodeResourceUsage`
every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_`
and updates it via `UpdateResourceLoadAndUsage`. This module is already
the source for the v2 autoscaler.
| Field | Caller | Update path | Interval | Code |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |
Because these two modules update asynchronously, the autoscaler can end
up seeing inconsistent resource states. That causes a race condition
where extra nodes may be launched before the updated availability
actually shows up. In practice, this means clusters can become
over-provisioned even though the demand was already satisfied.
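As an illustrative sketch (not Ray code), the staleness window can be seen by computing when each feed last refreshed before a poll. The helper and the concrete timings below are hypothetical, chosen to mirror the ~100ms and ~1s periods above:

```python
def latest_refresh_ms(t_ms: int, period_ms: int) -> int:
    """Timestamp (ms) of the most recent feed refresh at or before t_ms."""
    return (t_ms // period_ms) * period_ms

# Suppose a new worker node comes up and absorbs the pending demand at
# t = 150 ms, and the monitor polls both feeds at t = 200 ms:
usage_seen = latest_refresh_ms(200, 100)    # node usage feed (~100 ms period)
load_seen = latest_refresh_ms(200, 1000)    # resource load feed (~1 s period)

# usage_seen == 200: the new node's resources are already visible.
# load_seen == 0: the demand snapshot predates the new node, so the
# already-satisfied demand still looks unmet to the autoscaler.
```

Any poll landing between the two refresh ticks sees this mixed view, which is exactly the window in which extra nodes get requested.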
In the long run, the right fix is to fully switch the v1 autoscaler over
to `GcsAutoscalerStateManager::HandleGetClusterResourceState`, just like
v2 already does. But since v1 will eventually be deprecated, this PR
takes a practical interim step: it merges the necessary info from both
`GcsResourceManager::HandleGetAllResourceUsage` and
`GcsAutoscalerStateManager::HandleGetClusterResourceState` in a hybrid
approach.
This keeps v1 correct without big changes, while still leaving the path
open for a clean migration to v2 later on.
## Details
This PR follows the fix suggested by @rueian in #52864 by switching the
v1 autoscaler's node state source from
`GcsResourceManager::HandleGetAllResourceUsage` to
`GcsAutoscalerStateManager::HandleGetClusterResourceState`.
**Root cause:** The v1 autoscaler previously pulled data from two
asynchronous update cycles:
- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`
This created a race condition where newly allocated resources would be
visible before demand metrics updated, causing the autoscaler to
incorrectly perceive unmet demand and launch extra nodes.
**The fix:** By using v2's `HandleGetClusterResourceState` for node
iteration, both current resources and pending demands now come from the
same consistent snapshot (same tick), so the extra-node race condition
goes away.
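A minimal sketch of why a single snapshot closes the window: when both fields are read from one reply, the sizing decision can no longer mix a fresh resource view with a stale demand view. The types and field names below are illustrative, not Ray's actual protobufs:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterSnapshot:
    """Both fields come from one GetClusterResourceState-style reply,
    i.e. they describe the cluster at the same tick (names are hypothetical)."""
    available_cpu: float
    pending_cpu: float

def nodes_to_launch(snap: ClusterSnapshot, cpu_per_node: float) -> int:
    # Demand and availability are from the same tick, so demand that a
    # freshly launched node already satisfied cannot be double-counted.
    unmet = max(0.0, snap.pending_cpu - snap.available_cpu)
    return math.ceil(unmet / cpu_per_node)

# Once the new node's 4 CPUs and the satisfied demand land together,
# no further node is requested:
assert nodes_to_launch(ClusterSnapshot(available_cpu=4, pending_cpu=4), 4) == 0
```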
## Proposed Changes in `update_load_metrics()`
This PR updates how the v1 autoscaler collects cluster metrics.
Most node state information is now taken from **v2
(`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while
certain fields still rely on **v1
(`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 doesn't
have an equivalent yet.
| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |
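A rough sketch of the hybrid merge the table describes: node states are taken from the v2 reply, while the fields with no v2 equivalent stay on the v1 reply. The dict shapes and key names here are illustrative, not the exact Ray protobuf accessors:

```python
def merge_cluster_metrics(v2_state: dict, v1_usage: dict) -> dict:
    """Combine the v2 snapshot with v1-only fields (hypothetical shapes)."""
    metrics = {}
    # Node states come from the v2 reply: one consistent snapshot per tick.
    metrics["nodes"] = [
        {
            "node_id": n["node_id"],
            "ip": n["node_ip_address"],
            "total": n["total_resources"],
            "available": n["available_resources"],
            "idle_ms": n["idle_duration_ms"],
        }
        for n in v2_state["node_states"]
    ]
    # Fields with no validated v2 equivalent keep using the v1 reply.
    metrics["waiting_bundles"] = v1_usage["resource_load_by_shape"]
    metrics["pending_placement_groups"] = v1_usage["placement_group_load"]
    metrics["cluster_full"] = v1_usage["cluster_full_of_actors_detected"]
    return metrics
```

The point of the split is that only the fields involved in the sizing race (node resources vs. demand) need the consistent v2 snapshot; the v1-only fields are used for metrics and debugging, where slight staleness is tolerable.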
### Additional Notes
- This hybrid approach addresses the race condition while still using
legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, showing the change
is backward-compatible.
## Changed Behavior (Observed)
(The autoscaler config and serving code are the same as in
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864).)
After switching to v2 autoscaler state (cluster resource), the issue no
longer occurs:
- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, Node
Provider only launches a single `ray.worker.4090.standard` node. (No
extra requests for additional nodes are observed.)
[debug.log](https://github.com/user-attachments/files/22659163/debug.log)
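To reproduce the original symptom, one possible setup is to stretch the v1 load-poll period so the stale window is easy to hit. The `RAY_<config_name>` environment-variable override for Ray system config is an assumption here (it is not part of this PR); verify it against your Ray version:

```shell
# Hypothetical repro sketch: widen the resource-load poll period to 20 s,
# then submit a workload and watch whether more than one worker node is
# requested for demand a single node could satisfy.
RAY_gcs_pull_resource_loads_period_milliseconds=20000 ray start --head
```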
## Related issue number
Closes #52864
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
## Files changed
2 files changed (+98, −14) under `python/ray`: `autoscaler/_private` and `tests`.