Conversation

@richardliaw (Contributor) commented Oct 2, 2018

Adds some utilities for creating and killing nodes in a cluster. Useful for fault tolerance tests.

TODOs:

  • Add basic tests for verifying functionality
    • Add kill/add node test
  • Documentation
  • Write a duplicate test for component_failure_test.py
  • Add sane defaults
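
For context, a rough sketch of how these utilities get exercised (the class and method names come from the test snippets later in this thread; the import path and constructor arguments are assumptions rather than the final API):

```python
import ray
from ray.test.cluster_utils import Cluster  # import path is an assumption

# Start a head node and connect this driver to it.
g = Cluster(initialize_head=True, connect=True)

# Add a worker node, wait for it to register, then remove it again.
worker = g.add_node()
g.wait_for_nodes()
g.remove_node(worker)
g.wait_for_nodes()

# Tear down the driver connection and all cluster processes.
ray.shutdown()
g.shutdown()
```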

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8507/

@robertnishihara (Collaborator)

Related to #609.

# Log monitor doesn't die for some reason
worker.kill_log_monitor()
worker.kill_plasma_store()
# TODO(rliaw): how to wait for raylet timeout?
@richardliaw (Contributor, author):

@robertnishihara any idea for this?
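
One possible answer, sketched under the assumption that process_dict holds subprocess.Popen handles (as the later test snippets suggest); wait_for_raylet_exit is a hypothetical helper and the timeout is arbitrary:

```python
import time

from ray import services  # services.PROCESS_TYPE_RAYLET is referenced elsewhere in this PR


def wait_for_raylet_exit(worker, timeout=30):
    """Polls the raylet subprocess until it exits or the deadline passes."""
    raylet = worker.process_dict[services.PROCESS_TYPE_RAYLET][0]
    deadline = time.time() + timeout
    while time.time() < deadline:
        if raylet.poll() is not None:  # Popen.poll() returns the exit code once the process is dead
            return
        time.sleep(0.1)
    raise RuntimeError("Raylet did not exit within %d seconds." % timeout)
```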

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8562/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8565/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8711/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8713/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8714/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8716/

@richardliaw changed the title from "[wip] Cluster Utilities for Fault Tolerance Tests" to "Cluster Utilities for Fault Tolerance Tests" on Oct 18, 2018
logger = logging.getLogger(__name__)


class Cluster():
Collaborator:

Cluster(object) to support Python 2
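
That is, declare the class with an explicit object base so it is a new-style class on Python 2 as well (on Python 3 the two spellings are equivalent):

```python
class Cluster(object):  # explicit base gives new-style class semantics on Python 2
    pass
```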

def __init__(self, initialize_head=False, connect=False, **head_node_args):
    """Initializes the cluster.

    Arguments:
Collaborator:

Args:
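
That is, the Google docstring convention uses the section header Args: instead of Arguments:. A sketch with illustrative parameter descriptions:

```python
def __init__(self, initialize_head=False, connect=False, **head_node_args):
    """Initializes the cluster.

    Args:
        initialize_head (bool): Whether to start a head node immediately.
        connect (bool): Whether to connect this driver via ray.init.
        head_node_args: Keyword arguments forwarded when starting the head node.
    """
```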

"""
self.head_node = None
self.worker_nodes = {}
self.redis_address = ""
Collaborator:

I think None would be better than ""

    self.add_node(**head_node_args)
    if connect:
        ray.init(redis_address=self.redis_address)
elif connect:
Collaborator:

Maybe just raise an exception earlier on if not initialize_head and connect
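
A minimal sketch of that suggestion, also folding in the earlier note about defaulting redis_address to None (the exception type and message are illustrative; add_node is the method defined elsewhere in this PR):

```python
import ray


class Cluster(object):
    def __init__(self, initialize_head=False, connect=False, **head_node_args):
        # Guard clause: reject the invalid combination up front rather than
        # handling it in a trailing elif branch.
        if connect and not initialize_head:
            raise RuntimeError("connect=True requires initialize_head=True.")

        self.head_node = None
        self.worker_nodes = {}
        self.redis_address = None

        if initialize_head:
            self.add_node(**head_node_args)
            if connect:
                ray.init(redis_address=self.redis_address)
```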

self.process_dict["log_monitor"][0].kill()
self.process_dict["log_monitor"][0].wait()

def kill_allprocesses(self):
Collaborator:

kill_all_processes

dies before worker nodes die?

Returns:
    List of all nodes, including the head node."""
Collaborator:

""" on the next line


def wait_for_nodes(self):
    """Waits for all nodes to be registered with global state."""
    try:
Collaborator:

What's the point of this try/except?

Collaborator:

If you want to avoid a bad error message when this is called before ray.init, you can check ray.is_connected
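
A sketch of where such a guard would sit (whether the check is spelled ray.is_connected depends on the Ray version, the error message is illustrative, and the rest of the method body is elided):

```python
import ray


def wait_for_nodes(self):
    """Waits for all nodes to be registered with global state."""
    # Fail with a clear message if the driver never connected, instead of
    # letting the global-state lookup raise something confusing.
    if not ray.is_connected():
        raise RuntimeError(
            "wait_for_nodes() requires a connected driver; "
            "call ray.init(redis_address=...) first.")
    ...
```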

self.remove_node(self.head_node)


class Node():
Collaborator:

Node(object)

worker.kill_plasma_store()
worker.process_dict[services.PROCESS_TYPE_RAYLET][0].wait()
assert not worker.any_processes_alive(), worker.live_processes()
print("Success")
Collaborator:

remove this line

worker2 = g.add_node()
g.wait_for_nodes()
g.remove_node(worker2)
g.wait_for_nodes()
Collaborator:

I see, so this is actually waiting for the node to be removed from the client table, right? I guess that's a good thing to test, but unfortunately it takes 10s for that to happen, per call, so I'm guessing these tests are really slow? This might be too much of a burden on our CI...

@richardliaw (Contributor, author):

So this test takes about 15 seconds to run... is there some way we can test this faster (maybe by reducing the 10s wait)?

"""Abstraction for a Ray node."""

def __init__(self, process_dict):
    # TODO(rliaw): Is there a unique identifier for a node?
Collaborator:

Each node has a client_id in the client table. What kind of identifier are you looking for?

@richardliaw (Contributor, author) commented Oct 18, 2018:

Yeah; something like that (this is a non-blocking issue though). What would be a way to get the client_id without first calling ray.init? Ideally, it would be part of the output of services.start_ray_node.

g.remove_node(worker2)
g.wait_for_nodes()
ray.shutdown()
g.shutdown()
Collaborator:

ray.shutdown and g.shutdown are typically the kind of thing you would put in a pytest fixture (after yield) so that they always get called in the event of a test failure. Not sure if it makes sense here, but something to consider.
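
For reference, a minimal sketch of that pattern (the fixture and test names are made up, and the Cluster import path is an assumption based on this PR):

```python
import pytest
import ray
from ray.test.cluster_utils import Cluster  # import path is an assumption


@pytest.fixture
def start_connected_cluster():
    # Setup: start a head node and connect the driver to it.
    cluster = Cluster(initialize_head=True, connect=True)
    yield cluster
    # Teardown: runs even if the test body raised, so processes never leak.
    ray.shutdown()
    cluster.shutdown()


def test_add_and_remove_node(start_connected_cluster):
    cluster = start_connected_cluster
    node = cluster.add_node()
    cluster.wait_for_nodes()
    cluster.remove_node(node)
    cluster.wait_for_nodes()
```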

worker.process_dict[services.PROCESS_TYPE_RAYLET][0].wait()
assert not worker.any_processes_alive(), worker.live_processes()
print("Success")
g.shutdown()
Collaborator:

You need a ray.shutdown in this test also, right?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8741/

@robertnishihara (Collaborator)

@richardliaw it looks like you're running the test for legacy Ray as well as xray. You probably want to remove it from the legacy Ray test suite.

@richardliaw (Contributor, author)

Done.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8749/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8753/

@richardliaw (Contributor, author)

@robertnishihara tests passing

@robertnishihara merged commit 40c4148 into ray-project:master on Oct 21, 2018
@robertnishihara (Collaborator)

Thanks @richardliaw!

@robertnishihara deleted the cluster_testing branch on October 21, 2018 at 05:56