Conversation

@raulchen (Contributor) commented on Oct 15, 2018

What do these changes do?

This PR implements actor reconstruction for raylet mode.

  • When an actor dies accidentally (either because the process dies or because the whole node dies), the raylet backend will automatically reconstruct the actor by replaying its creation task.
  • Reconstruction is turned off by default; users can enable it by specifying the max_actor_reconstructions option in @ray.remote(), which indicates how many times the actor should be reconstructed (see the usage sketch below).
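For illustration, a minimal sketch of the intended front-end usage. This is hypothetical: only the Java front-end is implemented in this PR, and the option name follows the PR description above.

import ray

ray.init()

# Hypothetical Python front-end usage; this keyword argument is not part of
# this PR (only the Java front-end is implemented here).
@ray.remote(max_actor_reconstructions=3)
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
# If the actor process dies, the raylet backend recreates the actor by
# replaying its creation task, at most 3 times here; with the default of 0
# the actor stays dead.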

TODOs

  • This PR currently only supports the case where the actor process is dead; handling the case where the whole node is down is left for a follow-up.
  • Java front-end support is implemented; Python front-end support is left for a follow-up.
  • Undo changes that were only needed for local debugging.
  • Document the code.

Related issue number

#2868

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8650/

@stephanie-wang (Contributor) left a comment

Thanks, this is looking pretty good! I'll leave more detailed comments later, but for now, just wanted to clarify a couple things. It looks like this will only handle reconstruction for failed actor processes, but not for node death, is that right? Also, it looks like this will always reconstruct the actor, even if the actor is no longer needed for anything. I would probably err on the side of only reconstructing the actor if a method on it is called.

To fix the first issue, I would probably rely on the ReconstructionPolicy to determine when the actor should be reconstructed (i.e. if the node where the actor lived dies and a method is called on the actor). More precisely, I would call ReconstructionPolicy::ListenAndMaybeReconstruct on the actor's creation object if: (a) we fail to forward an actor task to a node, or (b) we can't find a location for the actor.

For the second issue, I think we only need to handle it in the case where the actor process dies, but the node is still alive. It may make sense to broadcast an actor death notification, rather than broadcasting the BEING_RECONSTRUCTED notification right away. Other nodes can determine whether the actor can be reconstructed based on the length of the log. There are many ways to do this, so I'm open to suggestions as well.

Collaborator

What if you want it to be infinite?

Contributor Author

You can set it to INT_MAX, which essentially means infinite. (Even if the actor reconstructed once per second, it would take about 68 years to exceed this number.)
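For instance (hypothetical Python front-end usage, with the option name from the PR description; 2**31 - 1 is INT_MAX):

import ray

ray.init()

# Passing INT_MAX as the reconstruction count makes it effectively infinite:
# at one reconstruction per second this budget lasts roughly 68 years.
@ray.remote(max_actor_reconstructions=2**31 - 1)
class ForeverActor(object):
    def ping(self):
        return "pong"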

Collaborator

I don't think "ready" is the right word.

Contributor Author

Typo, I meant 'already'.

Collaborator

remove the .hex

@robertnishihara (Collaborator)

@raulchen this PR seems to be overriding the current actor behavior (don't reconstruct the actor and instead raise exceptions). However, I think we want to keep that behavior as an option (possibly as the default).

@raulchen (Contributor Author)

@robertnishihara the default behavior doesn't change. By default, max_actor_reconstructions is set to 0, in which case the actor won't be reconstructed. If you then call a method on this actor, an exception will be raised.
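As a minimal sketch of this default behavior (hypothetical names; the exact exception type depends on the Ray version):

import os
import signal

import ray

ray.init()

@ray.remote
class Worker(object):
    def pid(self):
        return os.getpid()

w = Worker.remote()
pid = ray.get(w.pid.remote())

# Kill the actor process. With max_actor_reconstructions left at its default
# of 0, the backend does not recreate the actor.
os.kill(pid, signal.SIGKILL)

try:
    ray.get(w.pid.remote())
except Exception as e:  # exact exception type depends on the Ray version
    print("actor is dead:", e)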

@raulchen force-pushed the reconstruct_actor branch 2 times, most recently from b89f439 to f9ececf, on October 20, 2018 at 07:59
@raulchen changed the title from "[WIP] [xray] Implement actor reconstruction." to "[xray] Implement actor reconstruction." on Oct 20, 2018
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8764/

@raulchen (Contributor Author)

Hi @stephanie-wang @robertnishihara. Since this PR is already pretty large, I'd like to implement the two unfinished TODOs in follow-up PRs. Could you help take a look at this PR? Thank you.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8765/

@raulchen (Contributor Author)

Also, any ideas on how to test the node-failure case? I'm trying to use two Docker containers to manually mimic the process. Is there an easier way?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8766/

@robertnishihara (Collaborator)

The best way to test the multi-node case is probably through the tool that @richardliaw just built: #3008.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8989/

@ericl (Contributor) left a comment

This looks fine to me to merge on an experimental basis, since the default is still to not reconstruct.

third-party libraries or to reclaim resources that cannot easily be
released, e.g., GPU memory that was acquired by TensorFlow). By
default this is infinite.
* **max_reconstructions**: Only for *actors*. This sepcifies the maximum
Contributor

specifies


@Test
public void testActorReconstruction() throws InterruptedException, IOException {
  ActorCreationOptions options = new ActorCreationOptions(new HashMap<>(), 1);
Contributor

I assume there's already a test that checks we don't attempt reconstruction in the zero case?

Contributor Author

There's a test in Python that checks we don't attempt reconstruction twice when max_reconstructions=1. I think that's also okay?
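For reference, a minimal sketch of what such a check could look like (hypothetical test, not the one from this PR; it assumes the option is exposed to Python as max_reconstructions and that a reconstructed actor runs in a new worker process):

import os
import signal
import time

import pytest
import ray

@ray.remote(max_reconstructions=1)
class Pinger(object):
    def pid(self):
        return os.getpid()

def test_reconstructed_at_most_once():
    ray.init()
    actor = Pinger.remote()
    pid1 = ray.get(actor.pid.remote())

    # First failure: the backend should recreate the actor once, in a new
    # worker process.
    os.kill(pid1, signal.SIGKILL)
    time.sleep(1)
    pid2 = ray.get(actor.pid.remote())
    assert pid2 != pid1

    # Second failure: the reconstruction budget is exhausted, so the next
    # call should fail instead of coming back from a new process.
    os.kill(pid2, signal.SIGKILL)
    time.sleep(1)
    with pytest.raises(Exception):
        ray.get(actor.pid.remote())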

for (const auto &actor_entry : actor_registry_) {
  if (actor_entry.second.GetNodeManagerId() == client_id &&
      actor_entry.second.GetState() == ActorState::ALIVE) {
    RAY_LOG(DEBUG) << "Actor " << actor_entry.first
Contributor

RAY_LOG(WARN)

Contributor Author

Probably INFO is better, because it's expected that actors will sometimes fail.
I changed the log in HandleDisconnectedActor, because that function will be called in both cases (actor process dies and node dies).

@AmplabJenkins

Test PASSed.
Refer to these links for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9071/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9072/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9092/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9093/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9095/

@robertnishihara (Collaborator)

On Travis it looks like test/actor_test.py::test_local_scheduler_dying is hanging in two of the Travis jobs and test/actor_test.py::test_reconstruction_suppression is hanging in one.

@AmplabJenkins

Test FAILed.
Refer to these links for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9127/
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9142/

@richardliaw (Contributor)

@stephanie-wang this is the PR that will fix the following, right?

import time

import ray

# start_connected_cluster is a pytest fixture from Ray's multi-node test utilities.
def test_actor_start_failure(start_connected_cluster):
    """Kill node during actor start."""
    cluster = start_connected_cluster
    node = cluster.add_node(resources=dict(CPU=1))

    @ray.remote(num_cpus=1)
    class TestActor(object):
        def sleeps(self):
            time.sleep(5)
            return 123

    two_actors = [TestActor.remote() for i in range(2)]
    two_actors_ret = [act.sleeps.remote() for act in two_actors]
    cluster.remove_node(node)
    cluster.wait_for_nodes()  # should take less than 3 seconds
    assert ray.global_state.cluster_resources()["CPU"] == 1

    start = time.time()
    res, remain = ray.wait(two_actors_ret)
    res2, _ = ray.wait(remain)
    duration = time.time() - start
    assert duration < 6

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9245/

@raulchen (Contributor Author)

This PR has become a bit too messy. I'm closing it and using #3332 instead.

@raulchen closed this on Nov 15, 2018