[autoscaler] Autoscaler should avoid using IP addresses to identify nodes (as the node id of the node provider) #19086

Currently, most node providers use the IP address to identify the node. This IP address is used to match the list of "running" nodes from the node provider with the utilization statistics reported by the GCS (you can find this matching function in `LoadMetrics.prune_active_ips`).
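To make the matching step concrete, here is a minimal sketch of IP-keyed pruning. It is not the actual `LoadMetrics.prune_active_ips` code; the function name `prune_by_active_ips`, the dict layout, and the sample values are assumptions made up for illustration.

```python
from typing import Dict, Iterable


def prune_by_active_ips(
    stats_by_ip: Dict[str, dict], active_ips: Iterable[str]
) -> Dict[str, dict]:
    """Keep only stats whose IP key is still listed as running (hypothetical helper)."""
    active = set(active_ips)
    return {ip: stats for ip, stats in stats_by_ip.items() if ip in active}


# Utilization stats as they might be reported by the GCS, keyed by IP (made-up values).
stats_by_ip = {
    "10.0.0.1": {"CPU": 4.0},
    "10.0.0.2": {"CPU": 8.0},
}
# IPs the node provider currently lists as "running".
running_ips = ["10.0.0.1"]

pruned = prune_by_active_ips(stats_by_ip, running_ips)
print(pruned)
```

The only join key between the two sources is the IP string, which is what the rest of this issue is about.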

However, using the IP address as the identifier breaks down when there are multiple logical nodes on a single machine. This happens with (1) on-prem cluster managers that allocate multiple containers to the same IP, and (2) local testing with multiple raylets. Hence, we have this hacky `use_node_ids_as_ip` option in the autoscaler, which sometimes uses the raylet-generated NodeId as the IP address.
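A small sketch of the failure mode, assuming stats are stored in a dict keyed by identifier. The tuple layout, the `index_reports` helper, and the node names are hypothetical; the point is only that two raylets sharing an IP collapse into one entry.

```python
from typing import Dict, List, Tuple

# (node_id, ip, available CPUs) for two raylets on the same machine (made-up values).
reports: List[Tuple[str, str, float]] = [
    ("raylet-a", "127.0.0.1", 2.0),
    ("raylet-b", "127.0.0.1", 6.0),
]


def index_reports(
    reports: List[Tuple[str, str, float]], key_index: int
) -> Dict[str, float]:
    """Index CPU stats by the chosen tuple field; later reports overwrite earlier ones."""
    table: Dict[str, float] = {}
    for report in reports:
        table[report[key_index]] = report[2]
    return table


by_ip = index_reports(reports, key_index=1)       # keyed by IP: entries collide
by_node_id = index_reports(reports, key_index=0)  # keyed by node id: both survive

print(len(by_ip), len(by_node_id))
```

Keyed by IP, the first raylet's stats are silently overwritten by the second; keyed by node id, both are kept.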

Many of these deployment inconsistencies would be resolved if the autoscaler used node ids to identify nodes in the first place. This would require (1) generating node ids when launching nodes, and (2) propagating the node id to the ray start command so the node will report resource stats under its assigned node id. We can then remove `use_node_ids_as_ip` and other misuses of IPs as identifiers.
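The two steps could look roughly like the sketch below. Everything here is hypothetical: `LaunchedNode`, `launch_node`, `start_node`, and `match_stats` are illustrative names, not existing Ray APIs, and `start_node` only stands in for passing the id to the start command (no real flag or option is implied).

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LaunchedNode:
    node_id: str  # assigned by the autoscaler at launch time
    ip: str       # kept for connectivity only, not for identity


def launch_node(ip: str) -> LaunchedNode:
    # Step (1): generate the node id when the node is launched.
    return LaunchedNode(node_id=uuid.uuid4().hex, ip=ip)


def start_node(node: LaunchedNode, cpus: float) -> Dict[str, object]:
    # Step (2): stand-in for propagating the id to the start command, so the
    # node reports its resource stats under the assigned id.
    return {"node_id": node.node_id, "CPU": cpus}


def match_stats(
    launched: List[LaunchedNode], reports: List[Dict[str, object]]
) -> Dict[str, object]:
    # Match provider nodes to reported stats by node id instead of by IP.
    by_id = {r["node_id"]: r for r in reports}
    return {n.node_id: by_id[n.node_id]["CPU"] for n in launched if n.node_id in by_id}


# Two logical nodes on the same IP remain distinguishable.
launched = [launch_node("127.0.0.1"), launch_node("127.0.0.1")]
reports = [start_node(launched[0], 2.0), start_node(launched[1], 6.0)]
matched = match_stats(launched, reports)
print(len(matched))
```

With the id assigned up front, the IP is no longer needed as a join key, so an option like `use_node_ids_as_ip` would have nothing left to work around.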

cc @AmeerHajAli @DmitriGekhtman @sasha-s 
