
Plasma and worker node failure #347

Closed

Conversation

stephanie-wang
Contributor

This provides fault tolerance in the case of a Plasma store or manager failure on a worker node. As a result, it also provides fault tolerance for the failure of an entire worker node.

To provide fault detection, Plasma managers send heartbeats to the Python monitoring process. Plasma managers that time out are deleted (using tombstones) from the db_client table by the monitor process. Once the deletion is published, the monitor process cleans up the object table by removing the dead Plasma manager from any location entries.

This also fixes a couple of potential bugs in the Plasma manager, including:

  • Looking up new locations for an object, if the manager has already tried requesting a transfer from all known locations.
  • Canceling transfer requests for objects that aren't local, rather than blocking forever.
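
For illustration, here is a minimal sketch of the monitor-side cleanup path described above, written against redis-py. The constant values, Redis key layout, and channel names used here (HEARTBEAT_TIMEOUT_MILLISECONDS, NUM_HEARTBEATS_TIMEOUT, "db_client:", "object_location:", "db_client_deletions") are assumptions for this sketch, not the exact schema introduced by this PR.

import time

import redis

# Illustrative values; the real constants live in the C code and the Python
# monitor and may differ.
HEARTBEAT_TIMEOUT_MILLISECONDS = 100
NUM_HEARTBEATS_TIMEOUT = 100


class MonitorSketch(object):
    """Rough sketch of heartbeat-timeout detection and object table cleanup."""

    def __init__(self, redis_address, redis_port):
        self.redis = redis.StrictRedis(host=redis_address, port=redis_port)
        # Maps a plasma manager's db_client ID to the number of heartbeat
        # periods that have passed since we last heard from it.
        self.missed_heartbeats = {}

    def record_heartbeat(self, db_client_id):
        # Called from a pubsub handler subscribed to manager heartbeats.
        self.missed_heartbeats[db_client_id] = 0

    def mark_manager_dead(self, db_client_id):
        # Tombstone the entry in the db_client table and publish the
        # deletion so other components can react to it.
        self.redis.hset("db_client:" + db_client_id, "deleted", 1)
        self.redis.publish("db_client_deletions", db_client_id)
        # Remove the dead manager from every object location entry so that
        # live managers stop requesting transfers from it.
        for key in self.redis.scan_iter(match="object_location:*"):
            self.redis.srem(key, db_client_id)

    def check_for_dead_managers(self):
        for db_client_id in list(self.missed_heartbeats):
            self.missed_heartbeats[db_client_id] += 1
            if self.missed_heartbeats[db_client_id] > NUM_HEARTBEATS_TIMEOUT:
                self.mark_manager_dead(db_client_id)
                del self.missed_heartbeats[db_client_id]

    def run(self):
        while True:
            self.check_for_dead_managers()
            time.sleep(HEARTBEAT_TIMEOUT_MILLISECONDS / 1000.0)

In the actual design, record_heartbeat would be driven by a Redis pubsub subscription to the plasma manager heartbeat channel.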

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/229/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/233/
Test FAILed.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/234/
Test PASSed.

@stephanie-wang stephanie-wang force-pushed the plasma-manager-failure branch from 5ea0a48 to 2b54b3e Compare March 8, 2017 23:27
@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/237/
Test PASSed.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/241/
Test PASSed.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/242/
Test PASSed.

@pcmoritz pcmoritz force-pushed the plasma-manager-failure branch from aadf37d to 957d9b5 Compare March 10, 2017 21:30
@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/256/
Test FAILed.

ok = self.redis.execute_command("RAY.TASK_TABLE_UPDATE",
                                task_id,
                                TASK_STATUS_LOST,
                                NIL_ID)
if ok != b"OK":
    log.warn("Failed to update lost task for dead scheduler.")

def cleanup_object_table(self):
Collaborator


Just to make sure I understand what's happening, the following scenario is possible, right?

  1. The monitor mistakenly marks a plasma manager as dead.
  2. The monitor removes the manager from the DB client table.
  3. The monitor removes the manager from all entries of the object table.
  4. The manager adds a new object to the object table.

That could lead to a get hanging, right?

Contributor Author


Yeah, this will be an issue if the manager then dies after the add completes (if the manager is still alive, it will still be able to serve requests). Right now the failure scenario that we're handling is that components are actually dead when we time out their heartbeats.

self.client2.transfer("127.0.0.1", self.port1, object_id2)
# Transfer the buffer to the other Plasma store. There is a race
# condition on the create and transfer of the object, so keep trying
# until the object appears on the second Plasma store.
Collaborator


Oh, good catch.

@@ -20,6 +20,14 @@ extern "C" {
}
#endif

/* The duration between heartbeats. These are sent by the plasma manager and
Collaborator

@robertnishihara robertnishihara Mar 10, 2017


This should probably be

/** The duration ...
 *  local scheduler. */

And similarly below.

@@ -44,6 +44,11 @@ typedef struct {
table_fail_callback fail_callback;
} RetryInfo;

static const RetryInfo heartbeat_retry = {
Collaborator


this is not valid C++, right? how did this compile?

Collaborator


Also, we probably shouldn't retry heartbeats, right?

Collaborator


Oh I see that you're using these retries as the actual mechanism for sending heartbeats.

Contributor Author


I think it's valid?

Yeah, it's a bit of a hack, but I thought it was better than having to allocate/deallocate the memory every time. Probably we should rethink this once we do the redis.c redux. :)

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/257/
Test PASSed.

@@ -81,8 +81,13 @@ int64_t table_timeout_handler(event_loop *loop,

  CHECK(callback_data->retry.num_retries >= 0 ||
        callback_data->retry.num_retries == -1);
  LOG_WARN("retrying operation %s, retry_count = %d", callback_data->label,
           callback_data->retry.num_retries);
  if (callback_data->retry.timeout > HEARTBEAT_TIMEOUT_MILLISECONDS) {
Collaborator


won't this suppress retry messages from things other than heartbeats?

Contributor Author


Yes. If you have any ideas on how to do this better, I'm open. Maybe comparing the label string?

* heartbeat contains this database client's ID. Heartbeats can be subscribed
* to through the plasma_managers channel. Once called, this "retries" the
* heartbeat operation forever, every HEARTBEAT_TIMEOUT_MILLISECONDS
* milliseconds.
Collaborator


I see, all of this is to avoid a malloc? It feels cleaner to me to have the timer in the manager instead of in redis code. Especially since that's what we're doing with the local scheduler and we'll need to do it if we want to include load information or other information in the heartbeats.

Contributor Author


Hmm the original reason was actually that I wanted to use a RetryInfo with 0 retries, and I'd thought that I couldn't do that, but I think I can work around it now.

time.sleep(HEARTBEAT_TIMEOUT_MILLISECONDS * 1e-3)


if __name__ == '__main__':
Collaborator


"__main__"

# If the data was an integer, then the message was a response to an
# initial subscription request.
is_subscribe = int(data)
message_handler = self.subscribe_handler
Collaborator


Should we also add assert(not self.subscribed[channel]) here?

Collaborator


or just get rid of the try/except and instead do an

if not self.subscribed[channel]:
  is_subscribe = int(data)
  ...
else:
  ...
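
For illustration, here is a self-contained sketch of that branching, assuming a redis-py style message dict and hypothetical names (notification_handlers, process_message); it is not the exact code in this PR.

class SubscribeSketch(object):
    """Rough sketch of the suggested if/else instead of try/except."""

    def __init__(self, notification_handlers):
        # Maps channel name -> handler function; names here are hypothetical.
        self.notification_handlers = notification_handlers
        # Tracks which channels have completed the subscription handshake.
        self.subscribed = {channel: False for channel in notification_handlers}

    def process_message(self, message):
        channel = message["channel"]
        data = message["data"]
        if not self.subscribed[channel]:
            # The first message on a channel is Redis's acknowledgment of the
            # subscription request; its data field is an integer count.
            is_subscribe = int(data)
            if is_subscribe:
                self.subscribed[channel] = True
        else:
            # Every later message is a real notification; dispatch it to the
            # handler registered for this channel.
            self.notification_handlers[channel](channel, data)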

"""Handle a notification from the db_client table from Redis.

This handler processes any notifications for deletions from the db_client
table. Insertions are ignored. Cleanup of the associate state in the state
Collaborator


associate -> associated

manager.
"""
# The first DB_CLIENT_ID_SIZE characters are the client ID.
db_client_id = data[:DB_CLIENT_ID_SIZE]
Collaborator


We should also assert that len(data) == DB_CLIENT_ID_SIZE, right?

Contributor Author


I think I will fix this in another PR that switches to flatbuffers for plasma manager, like @pcmoritz suggested.

@stephanie-wang stephanie-wang deleted the plasma-manager-failure branch February 19, 2019 21:36