reconnecting cluster slave nodes to the head node fails #336

Closed
@atumanov

Description

Steps to reproduce:
Start the head node:
scripts/start_ray.sh --num-cpus 100 --num-gpus 0 --num-workers 100 --head
Log in to the second (slave) node and start Ray, pointing it at the head node:
./scripts/start_ray.sh --redis-address <headnode_ip:redis_port>
Then stop Ray on the slave node:
./scripts/stop_ray.sh

And now try to start Ray on the slave node again:

./scripts/start_ray.sh --redis-address <headnode_ip:redis_port>
Waiting for redis server at <headnode_ip:redis_port> to respond...
Using IP address ####### for this node.
Traceback (most recent call last):
  File "/data/atumanov/ray/scripts/start_ray.py", line 109, in <module>
    check_no_existing_redis_clients(node_ip_address, args.redis_address)
  File "/data/atumanov/ray/scripts/start_ray.py", line 34, in check_no_existing_redis_clients
    raise Exception("This Redis instance is already connected to clients with this IP address.")
Exception: This Redis instance is already connected to clients with this IP address.

Takeaway: starting and stopping Ray on a slave node is not idempotent, but it should be.
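
To make the expected behavior concrete, here is a minimal sketch of what an idempotent stop/start cycle could look like. It does not reproduce the actual code in start_ray.py: the in-memory `clients` dict stands in for whatever client table Redis holds, and the field names (`node_ip_address`, `deleted`), the helper functions, the client ID, and the IP address are all hypothetical. The idea is only that stopping a node marks its entries as gone, so the check on restart ignores them.

```python
# Hypothetical in-memory stand-in for the client table kept in Redis.
# Field names ("node_ip_address", "deleted") are assumptions for illustration.

def register(clients, client_id, node_ip_address):
    """Record a client as live on the given node."""
    clients[client_id] = {"node_ip_address": node_ip_address, "deleted": False}


def deregister_node(clients, node_ip_address):
    """Mark every client on this node as gone; safe to call repeatedly."""
    for info in clients.values():
        if info["node_ip_address"] == node_ip_address:
            info["deleted"] = True


def check_no_live_clients(clients, node_ip_address):
    """Raise only if a client on this IP is still registered as live."""
    for info in clients.values():
        if info["node_ip_address"] == node_ip_address and not info["deleted"]:
            raise Exception("This Redis instance is already connected to "
                            "clients with this IP address.")


clients = {}
check_no_live_clients(clients, "10.0.0.2")         # first start: passes
register(clients, "plasma_manager-1", "10.0.0.2")  # node joins the cluster
deregister_node(clients, "10.0.0.2")               # stop on the slave node
deregister_node(clients, "10.0.0.2")               # repeated stop is a no-op
check_no_live_clients(clients, "10.0.0.2")         # restart: passes again
```

Without the `deregister_node` step (or an equivalent cleanup when the node stops), the final check would raise the same exception shown in the traceback above, since the stale entry for that IP would still look live.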

Labels: bug (Something that is supposed to be working; but isn't)