[autoscaler] Autoscaler should avoid using ip address to identify nodes (as node id of node provider) #19086
Description
Currently, most node providers use the IP address to identify the node. This IP address is used to match the list of "running" nodes from the node provider with the utilization statistics reported by the GCS (you can find this matching function in LoadMetrics.prune_active_ips).
However, using the IP address causes problems when multiple logical nodes share a single machine. This happens with (1) on-prem cluster managers that allocate multiple containers to the same IP, and (2) local testing with multiple raylets. Hence, we have this hacky use_node_ids_as_ip option in the autoscaler, which sometimes uses the raylet-generated NodeId as the IP address.
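To make the collision concrete, here is a minimal sketch of what goes wrong when utilization reports are keyed by IP. It is not the autoscaler's actual code: the report tuples, node names, and both helper functions are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical utilization reports: (node id, IP, CPUs in use).
# Two logical nodes share 10.0.0.5, as with multiple containers or
# multiple local raylets on one machine.
reports: List[Tuple[str, str, int]] = [
    ("node-a", "10.0.0.5", 4),
    ("node-b", "10.0.0.5", 2),
    ("node-c", "10.0.0.6", 8),
]


def usage_keyed_by_ip(reports: List[Tuple[str, str, int]]) -> Dict[str, int]:
    """Index usage by IP; a later report silently overwrites an earlier one."""
    usage: Dict[str, int] = {}
    for _node_id, ip, cpus in reports:
        usage[ip] = cpus
    return usage


def usage_keyed_by_node_id(reports: List[Tuple[str, str, int]]) -> Dict[str, int]:
    """Index usage by node id; every logical node keeps its own entry."""
    return {node_id: cpus for node_id, _ip, cpus in reports}


by_ip = usage_keyed_by_ip(reports)
by_node_id = usage_keyed_by_node_id(reports)
print(len(by_ip), len(by_node_id))  # prints "2 3"
```

Keyed by IP, the report from node-a is lost, so the autoscaler would see two nodes where three exist.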
Many of these deployment inconsistencies would be resolved if the autoscaler used node IDs to identify nodes in the first place. This would require (1) generating node IDs when launching nodes, and (2) propagating the node ID to the ray start command so that the node reports resource stats under its assigned node ID. We can then remove use_node_ids_as_ip and other misuses of IPs as identifiers.
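A rough sketch of the proposed flow, under stated assumptions: launch_node, prune_to_active, and the EXAMPLE_NODE_ID environment variable are hypothetical names, and how the ID would actually reach ray start is left open here. The point is only that the launcher picks the ID up front and that matching is then done on that ID rather than on the IP.

```python
import uuid
from typing import Dict, Set


def launch_node(launched: Dict[str, Dict[str, str]]) -> str:
    """Generate a node id at launch time and record how it would be propagated.

    EXAMPLE_NODE_ID is a made-up variable name standing in for whatever
    mechanism hands the id to the node's start command.
    """
    node_id = uuid.uuid4().hex
    launched[node_id] = {"EXAMPLE_NODE_ID": node_id}
    return node_id


def prune_to_active(
    usage_by_node_id: Dict[str, int], active_node_ids: Set[str]
) -> Dict[str, int]:
    """Keep only usage entries whose node id the provider still lists."""
    return {
        node_id: usage
        for node_id, usage in usage_by_node_id.items()
        if node_id in active_node_ids
    }


launched: Dict[str, Dict[str, str]] = {}
first = launch_node(launched)
second = launch_node(launched)

# Both nodes could sit behind the same IP; their reports stay distinct
# because each one reports under its assigned node id.
usage = {first: 4, second: 2, "stale-node": 1}
active = prune_to_active(usage, set(launched))
```

With IDs assigned at launch, the pruning step needs no special case for shared IPs.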