Plasma and worker node failure. #373

stephanie-wang · 2017-03-16T02:22:36Z

This replaces #347.

AmplabJenkins · 2017-03-16T02:37:11Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-16T02:37:12Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/306/
Test FAILed.

pcmoritz · 2017-03-16T04:11:50Z

python/ray/monitor.py

+    table. Insertions are ignored. Cleanup of the associate state in the state
+    tables should be handled by the caller.
+
+    As documented in common/redis_module/ray_redis_module.c, the format for the


This comment is outdated now!

pcmoritz · 2017-03-16T05:23:55Z

src/common/redis_module/ray_redis_module.cc

@@ -73,6 +73,11 @@ flatbuffers::Offset<flatbuffers::String> RedisStringToFlatbuf(
 * Publish a notification to a client's notification channel about an insertion
 * or deletion to the db client table.
 *
+ * The format for the published notification is:


TODO: think about flatbufferizing this!

pcmoritz · 2017-03-16T05:26:44Z

src/common/redis_module/ray_redis_module.cc

-  if (!published) {
-    return RedisModule_ReplyWithError(ctx, "PUBLISH unsuccessful");
+    if (!published) {
+      RedisModule_CloseKey(db_client_table_key);


Move this RedisModule_CloseKey call out of the if (to just before it) and combine it with the one below the if.

pcmoritz · 2017-03-16T05:29:20Z

src/common/redis_module/ray_redis_module.cc

@@ -159,14 +164,20 @@ int Connect_RedisCommand(RedisModuleCtx *ctx,
  RedisModuleKey *db_client_table_key =
      OpenPrefixedKey(ctx, DB_CLIENT_PREFIX, ray_client_id, REDISMODULE_WRITE);

+  if (RedisModule_KeyType(db_client_table_key) != REDISMODULE_KEYTYPE_EMPTY) {


pcmoritz · 2017-03-16T05:31:09Z

src/common/redis_module/ray_redis_module.cc

+  RedisModule_HashGet(db_client_table_key, REDISMODULE_HASH_CFIELDS, "deleted",
+                      &deleted_string, NULL);
+  long long deleted;
+  int parsed = RedisModule_StringToLongLong(deleted_string, &deleted);


I have the feeling a bunch of this code could be improved if we move more of it to flatbuffers. Do you want to do this as a followup PR?

Yes, good point! I'll leave a TODO and do it in another PR.

pcmoritz · 2017-03-16T05:39:48Z

src/plasma/plasma_manager.cc

@@ -765,35 +779,27 @@ void process_transfer_request(event_loop *loop,
    return;
  }

+  /* Allocate and append the request to the transfer queue. */
+  ObjectBuffer obj_buffer;


Now that we have the new PascalCase naming convention, let's try to move away from these abbreviations and write object_buffer instead!

pcmoritz · 2017-03-16T05:41:59Z

src/plasma/plasma_manager.cc

+  /* We pass in 0 to indicate that the command should return immediately. */
+  plasma_get(conn->manager_state->plasma_conn, &obj_id, 1, 0, &obj_buffer);
+  if (obj_buffer.data_size == -1) {
+    /* If the object wasn't locally available, exit immediately. If the object


pcmoritz

Looks great, I made some small comments.

AmplabJenkins · 2017-03-17T00:04:17Z

Build finished. Test FAILed.

AmplabJenkins · 2017-03-17T00:04:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/312/
Test FAILed.

robertnishihara · 2017-03-17T05:15:47Z

src/common/state/table.h

@@ -44,6 +44,11 @@ typedef struct {
  table_fail_callback fail_callback;
 } RetryInfo;

+static const RetryInfo heartbeat_retry = {


We aren't using this struct initialization anywhere else in the code base (because normally it doesn't compile in C++, I have no idea why it's compiling here).

E.g., http://stackoverflow.com/questions/5790534/static-structure-initialization-with-tags-in-c or http://stackoverflow.com/questions/11516657/c-structure-initialization. Obviously it's compiling so I'm mistaken about something, but it seems best to avoid this.

Can we just define it in plasma_manager_send_heartbeat? Or better yet, just pass in NULL to use the default retry info?

Oh, we're using it in db_client_table.cc, but yeah, we can define it there instead. I don't want to pass in NULL since the default retry is to try things infinitely (we should just try heartbeats once).

robertnishihara · 2017-03-17T05:20:00Z

src/plasma/plasma_manager.cc

-    event_loop_add_file(manager_state->loop, manager_conn->fd, EVENT_LOOP_WRITE,
-                        send_queued_request, manager_conn);
+    if (manager_conn->transfer_queue == NULL) {
+      /* If we already have a connection to this manager and its inactive,


its -> it's

AmplabJenkins · 2017-03-17T08:08:23Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-17T08:08:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/324/
Test FAILed.

AmplabJenkins · 2017-03-17T08:30:11Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-17T08:30:11Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/325/
Test FAILed.

AmplabJenkins · 2017-03-17T19:03:24Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-17T19:03:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/326/
Test FAILed.

robertnishihara · 2017-03-17T20:23:52Z

src/local_scheduler/local_scheduler.cc

@@ -472,9 +471,10 @@ void process_plasma_notification(event_loop *loop,
  uint8_t *notification = read_message_async(loop, client_sock);
  if (!notification) {
    /* The store has closed the socket. */
-    LocalSchedulerState_free(state);
-    LOG_FATAL(
+    kill(getpid(), SIGTERM);


Nice job tracking this down!

This was a pretty nasty bug, so it's probably worth documenting exactly what was going on.

The problem is the following, right?

When a Python script finished, Ray would call cleanup in services.py, which would kill the plasma store. That would cause this code to run: the local scheduler would call LocalSchedulerState_free and begin cleaning up its state (e.g., killing its workers). But services.py would also send a kill signal to the local scheduler, which would cause LocalSchedulerState_free to start running a second time. By then the array of workers had already been freed, so it was looking at invalid data and would try to call kill on PID 0, which caused problems (e.g., it looked like it caused the driver to die).

It's not completely clear to me that this solves the problem. I think you can't interrupt the SIGINT handler with another SIGINT (https://www.gnu.org/software/libc/manual/html_node/Signals-in-Handler.html#Signals-in-Handler), but the SIGINT handler may still get run twice.
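
One way to make a cleanup routine safe against running twice is to guard it with a flag. The sketch below is in Python rather than the C++ of local_scheduler.cc, purely for brevity, and it is not taken from this PR: `SchedulerState`, `free`, `make_sigterm_handler`, and `kill_fn` are all hypothetical names standing in for the local scheduler state, LocalSchedulerState_free, and the signal handler.

```python
import signal


class SchedulerState:
    """Hypothetical stand-in for the local scheduler's state."""

    def __init__(self, worker_pids):
        self.worker_pids = list(worker_pids)
        self.freed = False

    def free(self, kill_fn):
        """Kill the workers and release state; a repeated call is a no-op."""
        if self.freed:
            # Cleanup already ran, so the worker list must not be read again.
            return False
        self.freed = True
        for pid in self.worker_pids:
            kill_fn(pid)
        self.worker_pids = []
        return True


def make_sigterm_handler(state, kill_fn):
    """Build a handler that routes SIGTERM through the guarded cleanup."""

    def handler(signum, frame):
        if signum == signal.SIGTERM:
            state.free(kill_fn)

    return handler
```

In a real process the handler could be installed with `signal.signal(signal.SIGTERM, handler)`, with `kill_fn` set to something like `lambda pid: os.kill(pid, signal.SIGTERM)`. With the guard in place, a second SIGTERM arriving after cleanup has started does nothing instead of walking freed state.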

robertnishihara · 2017-03-17T20:24:57Z

src/local_scheduler/local_scheduler.cc

@@ -252,8 +252,7 @@ void start_worker(LocalSchedulerState *state, ActorID actor_id) {
  execvp(start_actor_worker_command[0],
         (char *const *) start_actor_worker_command);
  free(start_actor_worker_command);
-  LocalSchedulerState_free(state);
-  LOG_FATAL("Failed to start worker");
+  kill(getpid(), SIGTERM);


Probably worth adding some description here (see comment below).

robertnishihara · 2017-03-17T20:26:10Z

test/component_failures_test.py

+        ray.services.all_processes[ray.services.PROCESS_TYPE_GLOBAL_SCHEDULER][0],
+        ]:
+      process.terminate()
+      process.wait()


Want to add another test that does the same thing as this, but does all of the process.terminate() calls and then all of the process.wait() calls instead of serializing them?
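
The two orderings can be sketched as follows. This is a minimal illustration, not the test code from this PR: it uses plain sleeping child processes in place of the entries in ray.services.all_processes, and the helper names (`spawn_sleepers`, `terminate_serially`, `terminate_then_wait`) are hypothetical.

```python
import subprocess
import sys


def spawn_sleepers(count):
    """Start `count` children that just sleep (stand-ins for Ray processes)."""
    command = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(command) for _ in range(count)]


def terminate_serially(processes, timeout=10):
    """Terminate and reap each process before touching the next one."""
    for process in processes:
        process.terminate()
        process.wait(timeout=timeout)


def terminate_then_wait(processes, timeout=10):
    """Signal every process first, then reap them all."""
    for process in processes:
        process.terminate()
    for process in processes:
        process.wait(timeout=timeout)
```

The second variant means all components are shutting down at the same time, which is closer to what happens when a whole node goes away and may exercise different interleavings than the serialized version.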

robertnishihara · 2017-03-17T20:31:57Z

src/common/state/db_client_table.cc

+void plasma_manager_send_heartbeat(DBHandle *db_handle) {
+  RetryInfo heartbeat_retry = {.num_retries = 0,
+                               .timeout = HEARTBEAT_TIMEOUT_MILLISECONDS,
+                               .fail_callback = NULL};


Would you mind replacing this with

RetryInfo heartbeat_retry;
heartbeat_retry.num_retries = 0;
heartbeat_retry.timeout = HEARTBEAT_TIMEOUT_MILLISECONDS;
heartbeat_retry.fail_callback = NULL;

Note that we use the default retry info for the local scheduler heartbeats. I agree retrying doesn't make sense (anyway, I'm not suggesting that we change it here).

AmplabJenkins · 2017-03-17T21:39:53Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-17T21:39:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/327/
Test FAILed.

AmplabJenkins · 2017-03-17T22:25:15Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-03-17T22:25:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/328/
Test FAILed.

robertnishihara · 2017-03-17T22:26:12Z

test/component_failures_test.py

+        ]:
+      process.terminate()
+      time.sleep(0.1)
+      process.kill()


Why was it necessary to add the process.kill()? Why wasn't process.terminate() enough?

In earlier commits, some of these tests were hanging (example logs). I wasn't able to confirm it locally, but I'm pretty sure it was hanging at one of the process.wait() lines.
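
A variant of this pattern that escalates to SIGKILL only when the process has not exited within a grace period could look like the sketch below. It differs from the diff above, which sleeps for 0.1 seconds and then calls process.kill() unconditionally. The helper names and the default `grace_seconds` value are assumptions for illustration.

```python
import subprocess
import sys


def spawn_term_ignoring_child():
    """Start a child that ignores SIGTERM (POSIX), to exercise the fallback."""
    script = (
        "import signal, time; "
        "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
        "time.sleep(60)"
    )
    return subprocess.Popen([sys.executable, "-c", script])


def stop_process(process, grace_seconds=0.1):
    """Ask a process to exit, force it down if it does not, then reap it."""
    process.terminate()
    try:
        process.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process has not exited after SIGTERM, so send SIGKILL.
        process.kill()
        process.wait()
    return process.returncode
```

Because the final wait only happens after either a clean exit or a SIGKILL, it cannot block indefinitely on a process that ignores SIGTERM, which is the kind of hang suspected here.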

AmplabJenkins · 2017-03-17T23:47:34Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-03-17T23:47:34Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/330/
Test PASSed.

robertnishihara · 2017-03-18T00:05:09Z

This should address #264.

pcmoritz reviewed Mar 16, 2017

stephanie-wang and others added 17 commits March 16, 2017 16:31

Failing test case

e7c9d78

Local scheduler exits cleanly after plasma store dies

b533119

Tolerate one plasma store failure

1aef870

Tolerate plasma store failures on all nodes except head node

85ecc9b

Plasma manager heartbeats

7fba7e9

Component failure tests

8e90e8c

Don't run the helper for Python testing

0533833

Fix C test

da00bf0

Fix hanging plasma transfer test

6beed99

Fix python3

c07b953

Consolidate ClientConnection code

ea339d7

Fix valgrind test

bdda37c

fix c test

ba8b92d

We can restart worker nodes!

b32c1ae

Fix flatbuffers bug

3118c0d

Address comments

3058b1d

Only register actual workers with the local scheduler

d090c33

stephanie-wang force-pushed the node-failure branch from 37db5c4 to d090c33 Compare March 16, 2017 23:35

robertnishihara mentioned this pull request Mar 17, 2017

Plasma manager blocks on an object transfer when the object is evicted. #264

Closed

robertnishihara reviewed Mar 17, 2017

Add test case that tests for driver liveness, fix local scheduler bug

470fdcd

Clean up after tests

2e4983b

Allocate retry info on the stack

8f3110a

robertnishihara reviewed Mar 17, 2017

Send SIGKILL before waiting

abe9ea8

Relax unit test conditions

477868d

robertnishihara reviewed Mar 17, 2017

robertnishihara changed the title from "[WIP] Plasma and worker node failure" to "Plasma and worker node failure." Mar 17, 2017

Driver liveness test case and documentation

2f3a3e2

robertnishihara merged commit 12c9618 into ray-project:master Mar 18, 2017

robertnishihara deleted the node-failure branch March 18, 2017 00:04
