Reconnecting cluster slave nodes to the head node fails #336
Closed
Description
Steps to reproduce:
Start head node:
scripts/start_ray.sh --num-cpus 100 --num-gpus 0 --num-workers 100 --head
Log in to the second (slave) node and start Ray, pointing it at the head node:
./scripts/start_ray.sh --redis-address <headnode_ip:redis_port>
Then stop Ray on the slave node:
./scripts/stop_ray.sh
Now try to start Ray on the slave node again. This fails with the traceback shown after the command:
./scripts/start_ray.sh --redis-address <headnode_ip:redis_port>
Waiting for redis server at <headnode_ip:redis_port> to respond...
Using IP address ####### for this node.
Traceback (most recent call last):
  File "/data/atumanov/ray/scripts/start_ray.py", line 109, in <module>
    check_no_existing_redis_clients(node_ip_address, args.redis_address)
  File "/data/atumanov/ray/scripts/start_ray.py", line 34, in check_no_existing_redis_clients
    raise Exception("This Redis instance is already connected to clients with this IP address.")
Exception: This Redis instance is already connected to clients with this IP address.
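Takeaway: starting and stopping Ray on a slave node is not idempotent, and it should be. After stop_ray.sh, the head node's Redis instance still appears to hold clients registered under the slave node's IP address, so the check in start_ray.py rejects the reconnect.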
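
To illustrate the failure mode, here is a minimal sketch in plain Python. It does not use Ray or Redis: the client table is an in-memory list, and the record layout (`node_ip_address`, `alive`) and the helpers `register_client` and `deregister_clients` are hypothetical names invented for this example, not Ray's actual schema or API. It only shows why a start/stop/start cycle trips a check like the one in the traceback, and how cleaning up on stop could make the cycle idempotent.

```python
# Hypothetical stand-in for the client records held by the head node's Redis.
# The record layout and helper names are assumptions made for illustration.

def check_no_existing_clients(client_table, node_ip_address):
    """Raise if a live client record already uses this IP address."""
    for client in client_table:
        if client["node_ip_address"] == node_ip_address and client["alive"]:
            raise Exception("This Redis instance is already connected to "
                            "clients with this IP address.")


def register_client(client_table, node_ip_address):
    """Model of 'start': run the check, then record the new client."""
    check_no_existing_clients(client_table, node_ip_address)
    client_table.append({"node_ip_address": node_ip_address, "alive": True})


def deregister_clients(client_table, node_ip_address):
    """Model of an idempotent 'stop': mark this node's records as dead."""
    for client in client_table:
        if client["node_ip_address"] == node_ip_address:
            client["alive"] = False


def start_stop_start(cleanup_on_stop):
    """Run a start/stop/start cycle and report whether the restart works."""
    table = []
    register_client(table, "10.0.0.2")        # first start
    if cleanup_on_stop:
        deregister_clients(table, "10.0.0.2")  # stop that cleans up
    try:
        register_client(table, "10.0.0.2")    # second start
        return "ok"
    except Exception:
        return "failed"


if __name__ == "__main__":
    print("stop without cleanup:", start_stop_start(cleanup_on_stop=False))
    print("stop with cleanup:   ", start_stop_start(cleanup_on_stop=True))
```

In this model the restart only succeeds when the stop step clears the node's records. Whether the real fix belongs in stop_ray.sh or in relaxing the check in start_ray.py is a separate design question.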