[xray] Implement actor reconstruction. #3063
Conversation
Test PASSed.
Thanks, this is looking pretty good! I'll leave more detailed comments later, but for now, just wanted to clarify a couple things. It looks like this will only handle reconstruction for failed actor processes, but not for node death, is that right? Also, it looks like this will always reconstruct the actor, even if the actor is no longer needed for anything. I would probably err on the side of only reconstructing the actor if a method on it is called.
To fix the first issue, I would probably rely on the ReconstructionPolicy to determine when the actor should be reconstructed (i.e. if the node where the actor lived dies and a method is called on the actor). More precisely, I would call ReconstructionPolicy::ListenAndMaybeReconstruct on the actor's creation object if: (a) we fail to forward an actor task to a node, or (b) we can't find a location for the actor.
For the second issue, I think we only need to handle it in the case where the actor process dies but the node is still alive. It may make sense to broadcast an actor death notification, rather than broadcasting the BEING_RECONSTRUCTED notification right away. Other nodes can determine whether the actor can be reconstructed based on the length of the log. There are many ways to do this, so I'm open to suggestions as well.
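The decision rule proposed above can be sketched in plain Python (illustrative names only; the actual logic would live in the C++ ReconstructionPolicy, and `should_reconstruct` is a hypothetical helper, not Ray code):

```python
# Sketch of the proposed rule: reconstruct an actor only when a method call
# is pending AND either forwarding a task to the actor's node failed or the
# actor's location is unknown. Names here are illustrative assumptions.

def should_reconstruct(pending_method_call: bool,
                       forward_failed: bool,
                       location_known: bool) -> bool:
    """Return True if the raylet should trigger actor reconstruction."""
    if not pending_method_call:
        # No one needs the actor, so don't reconstruct it eagerly.
        return False
    return forward_failed or not location_known

# A dead actor with no pending callers is left alone.
assert not should_reconstruct(False, True, False)
# A pending call plus a failed forward triggers reconstruction.
assert should_reconstruct(True, True, True)
# A pending call with no known actor location also triggers it.
assert should_reconstruct(True, False, False)
```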
src/common/format/common.fbs
Outdated
What if you want it to be infinite?
You can set it to INT_MAX, which essentially means infinite. (Even if the actor reconstructs every second, it would take about 68 years to exceed this number.)
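As a quick sanity check on that arithmetic (plain Python, nothing Ray-specific):

```python
# At one reconstruction per second, how long until a signed 32-bit counter
# capped at INT_MAX would be exhausted?
INT_MAX = 2**31 - 1                # maximum value of a signed 32-bit int
SECONDS_PER_YEAR = 365 * 24 * 3600
years_to_exhaust = INT_MAX / SECONDS_PER_YEAR
print(round(years_to_exhaust, 1))  # ~68.1 years
```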
src/ray/gcs/format/gcs.fbs
Outdated
I don't think "ready" is the right word.
Typo, I meant 'already'.
src/ray/raylet/node_manager.cc
Outdated
remove the .hex
@raulchen this PR seems to override the current actor behavior (don't reconstruct the actor and instead raise exceptions). However, I think we want to keep that behavior as an option (possibly as the default).
@robertnishihara the default behavior doesn't change. By default,
Force-pushed from b89f439 to f9ececf.
Test FAILed.
Hi @stephanie-wang @robertnishihara. Since this PR is already pretty large, I'd like to implement the two unfinished TODOs in follow-up PRs. Could you help take a look at this PR? Thank you.
Test PASSed.
Also, any idea about how to test the node failure case? I'm trying to use two docker containers to manually mimic the process. Is there an easier way?
Test PASSed.
The best way to test the multi-node case is probably through this tool, which @richardliaw just built: #3008
Test FAILed.
This looks fine to me to merge on an experimental basis, since the default is still to not reconstruct.
python/ray/worker.py
Outdated
third-party libraries or to reclaim resources that cannot easily be
released, e.g., GPU memory that was acquired by TensorFlow). By
default this is infinite.
* **max_reconstructions**: Only for *actors*. This sepcifies the maximum
specifies
@Test
public void testActorReconstruction() throws InterruptedException, IOException {
  ActorCreationOptions options = new ActorCreationOptions(new HashMap<>(), 1);
I assume there's a test already that checks we don't attempt the zero case?
There's a test in Python that checks we don't attempt twice when max_reconstructions=1. I think that's also okay?
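For illustration, the counting semantics under discussion can be sketched in plain Python (`reconstruct_attempts` is a hypothetical helper, not Ray code): with the default of 0 the actor is never reconstructed, and with max_reconstructions=1 a second failure is not retried.

```python
def reconstruct_attempts(max_reconstructions, failures):
    """Simulate how many reconstructions happen across repeated failures."""
    remaining = max_reconstructions
    attempts = 0
    for _ in range(failures):
        if remaining == 0:
            break  # budget exhausted: further failures are not retried
        remaining -= 1
        attempts += 1
    return attempts

assert reconstruct_attempts(0, 3) == 0           # default: never reconstruct
assert reconstruct_attempts(1, 3) == 1           # once, not twice
assert reconstruct_attempts(2**31 - 1, 3) == 3   # INT_MAX: effectively infinite
```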
src/ray/raylet/node_manager.cc
Outdated
for (const auto &actor_entry : actor_registry_) {
  if (actor_entry.second.GetNodeManagerId() == client_id &&
      actor_entry.second.GetState() == ActorState::ALIVE) {
    RAY_LOG(DEBUG) << "Actor " << actor_entry.first
RAY_LOG(WARN)
Probably INFO is better, because it's expected that actors will sometimes fail.
I changed the log in HandleDisconnectedActor, because that function will be called in both cases (actor process dies & node dies).
Test PASSed.
On Travis it looks like
Test FAILed.
@stephanie-wang this is the PR that will fix the following, right?

def test_actor_start_failure(start_connected_cluster):
    """Kill node during actor start."""
    cluster = start_connected_cluster
    node = cluster.add_node(resources=dict(CPU=1))

    @ray.remote(num_cpus=1)
    class TestActor(object):
        def sleeps(self):
            time.sleep(5)
            return 123

    two_actors = [TestActor.remote() for i in range(2)]
    two_actors_ret = [act.sleeps.remote() for act in two_actors]
    cluster.remove_node(node)
    cluster.wait_for_nodes()  # should take less than 3 seconds
    assert ray.global_state.cluster_resources()["CPU"] == 1

    start = time.time()
    res, remain = ray.wait(two_actors_ret)
    res2, _ = ray.wait(remain)
    duration = time.time() - start
    assert duration < 6
Test PASSed.
This PR has been a bit too messy. I'm closing this and using #3332 instead.
What do these changes do?
This PR implements actor reconstruction for raylet mode.
Adds a max_actor_reconstructions option in @ray.remote(), which indicates how many times this actor should be reconstructed.

TODOs
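A usage sketch of the new option (the decorator below is a local stub so the snippet is self-contained without Ray installed; the parameter name follows this PR's description, and the merged API may differ):

```python
# Stub standing in for ray.remote, for illustration only.
def remote(max_actor_reconstructions=0):
    def wrap(cls):
        cls.max_actor_reconstructions = max_actor_reconstructions
        return cls
    return wrap

# Hypothetical actor that the raylet may recreate up to 3 times on failure;
# with the default of 0, a dead actor raises exceptions instead.
@remote(max_actor_reconstructions=3)
class Counter(object):
    def __init__(self):
        self.value = 0

assert Counter.max_actor_reconstructions == 3
```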
Related issue number
#2868