Conversation

@raulchen (Contributor) commented on Oct 15, 2018

What do these changes do?

This PR implements actor reconstruction for raylet mode.

  • When an actor dies accidentally (either because the process dies or because the whole node dies), the raylet backend will automatically reconstruct the actor by replaying its creation task.
  • Reconstruction is turned off by default; users can enable it by specifying the max_actor_reconstructions option in @ray.remote(), which indicates how many times the actor should be reconstructed (see the usage sketch below).
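For illustration, a minimal sketch of the intended front-end usage. This is hypothetical: only the Java front-end is implemented in this PR, and the option name follows the PR description above.

import ray

ray.init()

# Hypothetical Python front-end usage; this keyword argument is not part of
# this PR (only the Java front-end is implemented here).
@ray.remote(max_actor_reconstructions=3)
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
# If the actor process dies, the raylet backend recreates the actor by
# replaying its creation task, at most 3 times here; with the default of 0
# the actor stays dead.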

TODOs

  • This PR currently only supports the case where the actor process is dead; handling the case where the whole node is down is left for a follow-up.
  • Java front-end support is implemented; Python front-end support is left for a follow-up.
  • Undo changes that were only needed for local debugging.
  • Document the code.

Related issue number

#2868

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8650/

@stephanie-wang (Contributor) left a comment

Thanks, this is looking pretty good! I'll leave more detailed comments later, but for now, just wanted to clarify a couple things. It looks like this will only handle reconstruction for failed actor processes, but not for node death, is that right? Also, it looks like this will always reconstruct the actor, even if the actor is no longer needed for anything. I would probably err on the side of only reconstructing the actor if a method on it is called.

To fix the first issue, I would probably rely on the ReconstructionPolicy to determine when the actor should be reconstructed (i.e. if the node where the actor lived dies and a method is called on the actor). More precisely, I would call ReconstructionPolicy::ListenAndMaybeReconstruct on the actor's creation object if: (a) we fail to forward an actor task to a node, or (b) we can't find a location for the actor.

For the second issue, I think we only need to handle it in the case where the actor process dies, but the node is still alive. It may make sense to broadcast an actor death notification, rather than broadcasting the BEING_RECONSTRUCTED notification right away. Other nodes can determine whether the actor can be reconstructed based on the length of the log. There are many ways to do this, so I'm open to suggestions as well.

Collaborator

What if you want it to be infinite?

Contributor Author

You can set it to INT_MAX, which essentially means infinite. (Even if the actor reconstructed once per second, it would take about 68 years to exceed this number.)
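For instance (hypothetical Python front-end usage, with the option name from the PR description; 2**31 - 1 is INT_MAX):

import ray

ray.init()

# Passing INT_MAX as the reconstruction count makes it effectively infinite:
# at one reconstruction per second this budget lasts roughly 68 years.
@ray.remote(max_actor_reconstructions=2**31 - 1)
class ForeverActor(object):
    def ping(self):
        return "pong"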

Collaborator

I don't think "ready" is the right word.

Contributor Author

Typo, I meant 'already'.

Collaborator

remove the .hex

@robertnishihara (Collaborator)

@raulchen this PR seems to be overriding the current actor behavior (don't reconstruct the actor and instead raise exceptions). However, I think we want to keep that behavior as an option (possibly as the default).

@raulchen (Contributor Author)

@robertnishihara the default behavior doesn't change. By default, max_actor_reconstructions is set to 0, in which case the actor won't be reconstructed. If you then call a method on this actor, an exception will be raised.
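As a minimal sketch of this default behavior (hypothetical names; the exact exception type depends on the Ray version):

import os
import signal

import ray

ray.init()

@ray.remote
class Worker(object):
    def pid(self):
        return os.getpid()

w = Worker.remote()
pid = ray.get(w.pid.remote())

# Kill the actor process. With max_actor_reconstructions left at its default
# of 0, the backend does not recreate the actor.
os.kill(pid, signal.SIGKILL)

try:
    ray.get(w.pid.remote())
except Exception as e:  # exact exception type depends on the Ray version
    print("actor is dead:", e)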

@raulchen force-pushed the reconstruct_actor branch 2 times, most recently from b89f439 to f9ececf, on October 20, 2018 at 07:59
@raulchen changed the title from "[WIP] [xray] Implement actor reconstruction." to "[xray] Implement actor reconstruction." on Oct 20, 2018
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8764/

@raulchen (Contributor Author)

Hi @stephanie-wang @robertnishihara. Since this PR is already pretty large, I'd like to implement the two unfinished TODOs in follow-up PRs. Could you help take a look at this PR? Thank you.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8765/

@raulchen (Contributor Author)

Also, any ideas on how to test the node-failure case? I'm trying to use two Docker containers to manually mimic the process. Is there an easier way?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8766/

@robertnishihara (Collaborator)

The best way to test the multi-node case is probably through the tool that @richardliaw just built: #3008.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8989/

@ericl (Contributor) left a comment

This looks fine to me to merge on an experimental basis, since the default is still to not reconstruct.

third-party libraries or to reclaim resources that cannot easily be
released, e.g., GPU memory that was acquired by TensorFlow). By
default this is infinite.
* **max_reconstructions**: Only for *actors*. This sepcifies the maximum
Contributor

specifies


@Test
public void testActorReconstruction() throws InterruptedException, IOException {
  ActorCreationOptions options = new ActorCreationOptions(new HashMap<>(), 1);
Contributor

I assume there's already a test that checks we don't attempt reconstruction in the zero case?

Contributor Author

There's a test in Python that checks we don't attempt reconstruction twice when max_reconstructions=1. I think that's also okay?
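For reference, a minimal sketch of what such a check could look like (hypothetical test, not the one from this PR; it assumes the option is exposed to Python as max_reconstructions and that a reconstructed actor runs in a new worker process):

import os
import signal
import time

import pytest
import ray

@ray.remote(max_reconstructions=1)
class Pinger(object):
    def pid(self):
        return os.getpid()

def test_reconstructed_at_most_once():
    ray.init()
    actor = Pinger.remote()
    pid1 = ray.get(actor.pid.remote())

    # First failure: the backend should recreate the actor once, in a new
    # worker process.
    os.kill(pid1, signal.SIGKILL)
    time.sleep(1)
    pid2 = ray.get(actor.pid.remote())
    assert pid2 != pid1

    # Second failure: the reconstruction budget is exhausted, so the next
    # call should fail instead of coming back from a new process.
    os.kill(pid2, signal.SIGKILL)
    time.sleep(1)
    with pytest.raises(Exception):
        ray.get(actor.pid.remote())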

for (const auto &actor_entry : actor_registry_) {
  if (actor_entry.second.GetNodeManagerId() == client_id &&
      actor_entry.second.GetState() == ActorState::ALIVE) {
    RAY_LOG(DEBUG) << "Actor " << actor_entry.first
Contributor

RAY_LOG(WARN)

Contributor Author

Probably INFO is better, because it's expected that actors will sometimes fail.
I changed the log in HandleDisconnectedActor, because that function will be called in both cases (actor process dies and node dies).

@AmplabJenkins

Test PASSed.
Refer to these links for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9071/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9072/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9092/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9093/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9095/

@robertnishihara (Collaborator)

On Travis it looks like test/actor_test.py::test_local_scheduler_dying is hanging in two of the Travis jobs and test/actor_test.py::test_reconstruction_suppression is hanging in one.

@AmplabJenkins

Test FAILed.
Refer to these links for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9127/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9142/

@richardliaw (Contributor)

@stephanie-wang this is the PR that will fix the following, right?

import time

import ray

# start_connected_cluster is a pytest fixture from Ray's multi-node test utilities.
def test_actor_start_failure(start_connected_cluster):
    """Kill node during actor start."""
    cluster = start_connected_cluster
    node = cluster.add_node(resources=dict(CPU=1))

    @ray.remote(num_cpus=1)
    class TestActor(object):
        def sleeps(self):
            time.sleep(5)
            return 123

    two_actors = [TestActor.remote() for i in range(2)]
    two_actors_ret = [act.sleeps.remote() for act in two_actors]
    cluster.remove_node(node)
    cluster.wait_for_nodes()  # should take less than 3 seconds
    assert ray.global_state.cluster_resources()["CPU"] == 1

    start = time.time()
    res, remain = ray.wait(two_actors_ret)
    res2, _ = ray.wait(remain)
    duration = time.time() - start
    assert duration < 6

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9245/

@raulchen (Contributor Author)

This PR has become a bit too messy. I'm closing it and using #3332 instead.

@raulchen closed this on Nov 15, 2018