
[autoscaler] Autoscaler should avoid using ip address to identify nodes (as node id of node provider) #19086

Open
@ericl

Description

Currently, most node providers use a node's IP address as its node ID. The autoscaler relies on this IP to match the list of "running" nodes returned by the node provider against the utilization statistics reported by the GCS (this matching happens in LoadMetrics.prune_active_ips).
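To make the IP-based matching concrete, here is a minimal sketch. SimpleLoadMetrics is a hypothetical, heavily simplified stand-in for the autoscaler's LoadMetrics; it only assumes that GCS stats are stored in a dict keyed by IP and that anything not in the node provider's running set gets pruned. It is not Ray's actual implementation.

```python
from typing import Dict, Iterable


class SimpleLoadMetrics:
    """Hypothetical, simplified stand-in for the autoscaler's LoadMetrics.

    Resource stats reported by the GCS are stored keyed by node IP address.
    """

    def __init__(self) -> None:
        self.static_resources_by_ip: Dict[str, Dict[str, float]] = {}

    def update(self, ip: str, resources: Dict[str, float]) -> None:
        # Each report overwrites whatever was previously stored for that IP.
        self.static_resources_by_ip[ip] = resources

    def prune_active_ips(self, active_ips: Iterable[str]) -> None:
        # Drop stats for any IP the node provider no longer lists as running.
        active = set(active_ips)
        for ip in list(self.static_resources_by_ip):
            if ip not in active:
                del self.static_resources_by_ip[ip]


metrics = SimpleLoadMetrics()
metrics.update("10.0.0.1", {"CPU": 4.0})
metrics.update("10.0.0.2", {"CPU": 8.0})
# The node provider only lists 10.0.0.1 as running, so 10.0.0.2 is pruned.
metrics.prune_active_ips(["10.0.0.1"])
print(sorted(metrics.static_resources_by_ip))  # ['10.0.0.1']
```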

However, IP addresses break down as identifiers when multiple logical nodes run on a single machine. This happens with (1) on-prem cluster managers that allocate multiple containers with the same IP, and (2) local testing with multiple raylets. As a workaround, the autoscaler has a hacky use_node_ids_as_ip option that, in some cases, substitutes the raylet-generated NodeId for the IP address.
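The collision is easy to reproduce with plain dictionaries. In this sketch, the two reports, their node IDs ("node-a", "node-b"), and the shared IP are made-up placeholders representing two logical nodes (for example, two containers or two local raylets) on the same machine.

```python
from typing import Dict

# Two logical nodes sharing one IP address; the node IDs are placeholders.
reports = [
    {"node_id": "node-a", "ip": "10.0.0.5", "resources": {"CPU": 2.0}},
    {"node_id": "node-b", "ip": "10.0.0.5", "resources": {"CPU": 6.0}},
]

by_ip: Dict[str, Dict[str, float]] = {}
by_node_id: Dict[str, Dict[str, float]] = {}
for report in reports:
    # Keyed by IP: the second report silently overwrites the first.
    by_ip[report["ip"]] = report["resources"]
    # Keyed by node ID: each logical node keeps its own entry.
    by_node_id[report["node_id"]] = report["resources"]

print(len(by_ip))       # 1
print(len(by_node_id))  # 2
```

Keyed by IP, node-a's stats are lost entirely, which is the kind of mismatch use_node_ids_as_ip papers over.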

Many of these deployment inconsistencies would go away if the autoscaler used node IDs to identify nodes in the first place. This would require (1) generating a node ID when launching each node, and (2) passing that node ID to the ray start command so the node reports its resource stats under the assigned ID. We could then remove use_node_ids_as_ip and other misuses of IPs as identifiers.
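The two-step proposal might look roughly like the sketch below. Several assumptions here are loud ones: the node ID is just a uuid4 hex string (Ray's real NodeID format may differ), `--node-id` is a hypothetical flag used only to illustrate step (2) and is not claimed to be an existing `ray start` option, and generate_node_id, build_ray_start_command, and prune_inactive_node_ids are illustrative helper names.

```python
import shlex
import uuid
from typing import Dict, List, Set


def generate_node_id() -> str:
    # Assumption: any globally unique string works as the provider-side ID.
    return uuid.uuid4().hex


def build_ray_start_command(head_address: str, node_id: str) -> List[str]:
    # HYPOTHETICAL "--node-id" flag: illustrates propagating the assigned ID.
    return ["ray", "start", f"--address={head_address}", f"--node-id={node_id}"]


def prune_inactive_node_ids(
    stats: Dict[str, Dict[str, float]], running: Set[str]
) -> None:
    # Match on node IDs instead of IPs, so nodes sharing an IP stay distinct.
    for node_id in list(stats):
        if node_id not in running:
            del stats[node_id]


# Step (1): generate IDs for two nodes that may land on the same machine.
launched = {generate_node_id(), generate_node_id()}

# Step (2): pass each ID to the node's start command.
for node_id in sorted(launched):
    print(shlex.join(build_ray_start_command("10.0.0.1:6379", node_id)))

# Each node reports stats under its assigned ID; one stale entry remains.
stats = {node_id: {"CPU": 2.0} for node_id in launched}
stats["stale-node"] = {"CPU": 4.0}
prune_inactive_node_ids(stats, launched)
print(len(stats))  # 2
```

Because the ID is chosen by the autoscaler before launch, the provider's node list and the GCS reports share a key by construction, regardless of how many nodes share an IP.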

cc @AmeerHajAli @DmitriGekhtman @sasha-s

Metadata

Labels

P2: Important issue, but not time-critical
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
enhancement: Request for new feature and/or capability
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling