Fix bugs in plasma manager transfer #1188

stephanie-wang · 2017-11-07T02:29:45Z

This fixes error handling between plasma managers when one manager dies during a transfer. Previously, we were not detecting and handling errors on the receiving end. This detects the error by looking for an EOF returned by read, and aborts the object that was being received so that we can later recreate the object. This also refactors the send and receive code (in write_object_chunk and read_object_chunk, respectively) to be more similar.

AmplabJenkins · 2017-11-07T02:56:41Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-11-07T02:56:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2280/
Test FAILed.

robertnishihara

Nice fixes!

robertnishihara · 2017-11-07T02:45:04Z

src/plasma/plasma_manager.cc

       * yet, so send the initial data request. */
      err = handle_sigpipe(
          plasma::SendDataReply(conn->fd, buf->object_id.to_plasma_id(),
                                buf->data_size, buf->metadata_size),
          conn->fd);
+      conn->cursor = 0;


It seems that by providing ClientConnection_request_finished and ClientConnection_finish_request, you're trying to hide conn->cursor as an implementation detail. But that seems inconsistent with directly accessing conn->cursor here and in other places in this file.

Ah it was more to make sure that we could keep the values consistent, in case we have to change it again and don't want to change the -1 check everywhere.
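
A minimal sketch of the idea behind the two helpers, under stated assumptions: the struct below is a hypothetical stand-in for the real connection state (only the cursor field comes from the diff), and the helper names are deliberately shortened so they are not mistaken for the actual functions in plasma_manager.cc. It illustrates how keeping the -1 sentinel behind a pair of functions means a later change to the encoding only touches one place.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified connection state; the real struct has more fields.
struct ConnSketch {
  // Bytes transferred so far for the current object, or kIdle when no
  // transfer is in progress.
  int64_t cursor;
};

// Sentinel meaning "no request in flight" (the thread mentions -1).
constexpr int64_t kIdle = -1;

// True when the connection is not in the middle of a transfer.
inline bool request_finished(const ConnSketch &conn) {
  return conn.cursor == kIdle;
}

// Mark the current transfer as complete.
inline void finish_request(ConnSketch &conn) { conn.cursor = kIdle; }

// Begin a new transfer at byte offset 0. Only valid on an idle connection.
inline void start_request(ConnSketch &conn) {
  assert(request_finished(conn));
  conn.cursor = 0;
}
```

Callers that only ask "is this connection idle?" never need to know the sentinel value; code that advances the cursor during a transfer still touches the field directly, which is the inconsistency the review comment points out.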

AmplabJenkins · 2017-11-09T06:34:50Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-11-09T06:34:50Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2321/
Test FAILed.

AmplabJenkins · 2017-11-09T20:30:22Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-11-09T20:30:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2328/
Test FAILed.

AmplabJenkins · 2017-11-09T22:44:10Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-11-09T22:44:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2333/
Test PASSed.

AmplabJenkins · 2017-11-09T23:14:16Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-11-09T23:14:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2331/
Test FAILed.

AmplabJenkins · 2017-11-11T03:08:11Z

Build finished. Test PASSed.

AmplabJenkins · 2017-11-11T03:08:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2352/
Test PASSed.

AmplabJenkins · 2017-11-12T22:10:12Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-11-12T22:10:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2398/
Test PASSed.

AmplabJenkins · 2017-11-13T22:45:35Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-11-13T22:45:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2404/
Test PASSed.

AmplabJenkins · 2017-11-14T05:25:23Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-11-14T05:25:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2408/
Test PASSed.

robertnishihara

This looks really great! I left some questions.

It seems like the right testing setup for the managers is to start a bunch of stores/managers, have them continually request objects from each other, and continually kill connections to make sure they properly clean up and reinitiate the connections.

However, I'm not sure how to go about "killing connections" other than just killing the managers themselves.

robertnishihara · 2017-11-14T18:08:50Z

src/plasma/plasma_manager.cc

@@ -589,13 +592,14 @@ void send_queued_request(event_loop *loop,
    break;
  case MessageType_PlasmaDataReply:
    LOG_DEBUG("Transferring object to manager");
-    if (conn->cursor == 0) {
-      /* If the cursor is zero, we haven't sent any requests for this object
+    if (ClientConnection_request_finished(conn)) {


We used to be checking conn->cursor == 0, and now we're checking conn->cursor == -1. Is the danger that, if 0 is the only sentinel (rather than 0 and -1), we may accidentally go through this code block twice when write_object_chunk fails to write anything?

Or did it just seem cleaner to do it this way?

Yeah, I was worried about the case you mentioned happening. I guess the chances are small, but it seems cleaner to do it this way anyway.
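
The scenario discussed here can be modelled with a small simulation. This is only a sketch: SenderSketch, send_step and headers_after_stalled_write are hypothetical names, and the model assumes that a write which makes no progress leaves the cursor where it was. It compares using 0 versus -1 as the "transfer not started" value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the sender side; not the real plasma manager code.
struct SenderSketch {
  int64_t cursor = 0;    // bytes of the object written so far
  int headers_sent = 0;  // how many times the initial data reply was sent
};

// One pass of the send loop. `bytes_written` stands in for the result of a
// write that may make no progress (0 bytes). `idle_value` is the cursor value
// that means "transfer not started yet": 0 in the old scheme, -1 in the new.
void send_step(SenderSketch &s, int64_t idle_value, int64_t bytes_written) {
  if (s.cursor == idle_value) {
    s.headers_sent++;  // send the initial data reply
    s.cursor = 0;      // the transfer has now started
  }
  s.cursor += bytes_written;
}

// Run two passes where the first write makes no progress, and report how
// many headers went out in total.
int headers_after_stalled_write(int64_t idle_value) {
  SenderSketch s;
  s.cursor = idle_value;
  send_step(s, idle_value, 0);    // the write stalls, cursor stays at 0
  send_step(s, idle_value, 100);  // next pass of the event loop
  return s.headers_sent;
}
```

In this model, headers_after_stalled_write(0) reports two headers, because a started-but-stalled transfer is indistinguishable from one that has not started, while headers_after_stalled_write(-1) reports one.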

robertnishihara · 2017-11-14T18:22:48Z

src/plasma/plasma_manager.cc

-    LOG_ERROR("read error");
-  } else if (r == 0) {
-    LOG_DEBUG("end of file");
+  if (r <= 0) {


Before, it looks like r == 0 meant we read an EOF. Is that an error?

Yeah, it would be EOF for a normal file, but since this is over the network, and we stop reading after reading exactly the right number of bytes, we should never get 0 here, unless there was a network error.
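
A hedged sketch of that reasoning, using a POSIX socket pair to stand in for the connection between two managers. read_chunk_sketch and read_after_peer_close are hypothetical helpers, not the real read_object_chunk. Because read() does not set errno when it returns 0, the sketch maps EOF to an explicit nonzero code; ECONNRESET is this sketch's choice, not something taken from the PR.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Read up to `want` bytes (want > 0) from `fd` into `dst`, advancing
// `*cursor` by the number of bytes received. Returns 0 on progress, or a
// nonzero errno-style code on failure. Hypothetical helper.
int read_chunk_sketch(int fd, uint8_t *dst, size_t want, int64_t *cursor) {
  assert(want > 0);
  ssize_t r = read(fd, dst, want);
  if (r < 0) {
    return errno;  // genuine read error
  }
  if (r == 0) {
    // The peer closed the connection before sending everything we expected.
    // read() leaves errno untouched on EOF, so return an explicit code.
    return ECONNRESET;
  }
  *cursor += r;
  return 0;
}

// Helper for trying the sketch: returns the code produced when the peer
// hangs up without sending anything, or -1 if the socket pair cannot be made.
int read_after_peer_close() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
  close(fds[1]);  // simulate the remote manager dying
  uint8_t buf[16];
  int64_t cursor = 0;
  int err = read_chunk_sketch(fds[0], buf, sizeof(buf), &cursor);
  close(fds[0]);
  return err;
}
```

Since the receiver only ever asks for bytes it still expects, a zero-length result cannot mean "transfer complete" and is treated the same way as a failed read.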

robertnishihara · 2017-11-14T18:23:41Z

src/plasma/plasma_manager.cc

-    LOG_DEBUG("end of file");
+  if (r <= 0) {
+    LOG_ERROR("Read error");
+    return errno;
  } else {


Given that we return in the if block, I have a slight preference for removing this else block.

robertnishihara · 2017-11-14T18:26:55Z

src/plasma/plasma_manager.cc

+     * we're done. */
+    if (conn->cursor == buf->data_size + buf->metadata_size) {
+      ClientConnection_finish_request(conn);
+    }
    return 0;


The prior code looks like it was just returning 0 (meaning "not done with the transfer") in the event of an error. That makes it seem like it could keep trying to process the same object without ever finishing, so the connection between two managers would effectively be blocked. Was that happening? I'm just wondering how this could have been working if we weren't propagating errors from the call to read back to the caller.

Yeah, this was the bug that we uncovered when testing with a live cluster. I don't think we run into it normally because the chance of a disconnection is small.
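
To make the difference concrete, here is a small model of one receive step before and after the change. It is a sketch with hypothetical names (Outcome, ReceiveSketch and the two step functions); `r` mimics the return value of read, and `total` plays the role of data_size + metadata_size.

```cpp
#include <cassert>
#include <cstdint>

// Possible results of handling one readable event on the connection.
enum class Outcome { kInProgress, kDone, kAborted };

// Hypothetical receive state for a single object.
struct ReceiveSketch {
  int64_t cursor;  // bytes received so far
  int64_t total;   // bytes expected in total
};

// Behaviour described in the review for the prior code: a failed read is
// reported as "not done", so the caller keeps waiting on the same object.
Outcome step_swallowing_errors(ReceiveSketch &rx, int64_t r) {
  if (r <= 0) return Outcome::kInProgress;
  rx.cursor += r;
  return rx.cursor == rx.total ? Outcome::kDone : Outcome::kInProgress;
}

// Behaviour after the change: a failed read is surfaced, so the caller can
// abort the partially received object and drop the connection.
Outcome step_propagating_errors(ReceiveSketch &rx, int64_t r) {
  if (r <= 0) return Outcome::kAborted;
  rx.cursor += r;
  return rx.cursor == rx.total ? Outcome::kDone : Outcome::kInProgress;
}
```

With the first variant, a dead peer produces an endless run of kInProgress results. With the second, the first failed read yields kAborted, which is the signal the caller needs to clean up so the object can be recreated later.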

robertnishihara · 2017-11-14T18:32:14Z

src/plasma/plasma_manager.cc

-  int done = read_object_chunk(conn, buf);
-  if (!done) {
-    return;
+  int err = read_object_chunk(conn, buf);


Now that we're handling dropped connections, we don't really need ignore_data_chunk anymore, right? E.g., the receiver could simply hang up on the sender if the receiver doesn't want the object that the sender is sending, right?

Not sure which would be more efficient.

Hmm, I think we still need it so that we can get any objects that were queued behind the object we want to ignore. I think this case is if we already have the object that's getting transferred, so we just want to ignore the new copy.
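
The framing argument can be sketched with an in-memory stream. Everything here is hypothetical (StreamSketch, stream_read, skip_object, read_object), and it assumes objects are sent back to back on one connection with their sizes known in advance. Skipping an unwanted object by reading and discarding exactly its bytes keeps the next object aligned, whereas hanging up would also lose whatever was queued behind it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical in-memory stand-in for a connection carrying back-to-back
// objects with no delimiter other than their known sizes.
struct StreamSketch {
  std::string bytes;
  size_t pos;
};

// Copy up to `n` bytes out of the stream; returns how many were available.
size_t stream_read(StreamSketch &s, char *dst, size_t n) {
  size_t avail = std::min(n, s.bytes.size() - s.pos);
  std::copy_n(s.bytes.data() + s.pos, avail, dst);
  s.pos += avail;
  return avail;
}

// Discard exactly `size` bytes belonging to an object we do not want, in
// small chunks, so that the next object starts at the right offset.
// Returns false if the stream ends first.
bool skip_object(StreamSketch &s, size_t size) {
  char scratch[4];  // tiny on purpose, to exercise the chunking loop
  while (size > 0) {
    size_t got = stream_read(s, scratch, std::min(size, sizeof(scratch)));
    if (got == 0) return false;
    size -= got;
  }
  return true;
}

// Read an object of known size into a string (shorter if the stream ends).
std::string read_object(StreamSketch &s, size_t size) {
  std::string out(size, '\0');
  size_t got = stream_read(s, &out[0], size);
  out.resize(got);
  return out;
}
```

For a stream holding a 9-byte duplicate followed by a 6-byte object, calling skip_object(s, 9) and then read_object(s, 6) returns the second object intact.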

robertnishihara · 2017-11-14T18:39:30Z

src/plasma/plasma_manager.cc

+     * we're done. */
+    if (conn->cursor == buf->data_size + buf->metadata_size) {
+      ClientConnection_finish_request(conn);
+    }


This should be impossible since we pass s into the read call above, but it might be worth having a sanity check:

CHECK(conn->cursor <= buf->data_size + buf->metadata_size);
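
A short sketch of why the invariant holds, assuming the requested read size is capped at the number of bytes still expected. The chunk size and function names are hypothetical; `total` stands for data_size + metadata_size, and a plain assert plays the role of the suggested CHECK.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical chunk size; the real buffer size is not shown in the thread.
constexpr int64_t kChunkSketch = 4096;

// Number of bytes to request on the next read: never more than what is left.
int64_t next_read_size(int64_t cursor, int64_t total) {
  assert(cursor >= 0 && cursor <= total);
  return std::min(kChunkSketch, total - cursor);
}

// Advance the cursor by `r` bytes actually read (0 < r <= requested size)
// and verify that it has not run past the end of the object.
int64_t advance_cursor(int64_t cursor, int64_t r, int64_t total) {
  cursor += r;
  assert(cursor <= total);  // plays the role of the suggested CHECK
  return cursor;
}
```

Because a read never returns more than the requested count, capping the request at total - cursor keeps the cursor at or below total, and the check only fires if that assumption is broken elsewhere.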

AmplabJenkins · 2017-11-14T21:35:17Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-11-14T21:35:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2412/
Test PASSed.

pcmoritz

+1 LGTM

AmplabJenkins · 2017-11-16T01:15:38Z

Merged build finished. Test PASSed.

AmplabJenkins · 2017-11-16T01:15:38Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2424/
Test PASSed.

robertnishihara reviewed Nov 7, 2017

stephanie-wang force-pushed the plasma-manager-failures branch from 8623506 to e481ccc Compare November 9, 2017 21:11

stephanie-wang added 11 commits November 12, 2017 13:34

Use ray-project/arrow:abort-objects branch

bf9ead7

Plasma client test for plasma abort

5ddce2e

Set plasma manager connection cursor to -1 when not in use

4d2db5c

Handle transfer errors between plasma managers, abort unsealed objects

4c04d09

Add TODO for local scheduler exiting on plasma manager death

c9a640b

Revert "Plasma client test for plasma abort"

73f5940

This reverts commit e00fbd58dc4a632f58383549b19fb9057b305a14.

Upgrade arrow to version with PlasmaClient::Abort

299a22d

Fix plasma manager test

2fb08d8

Fix plasma test

d246c63

Temporarily use arrow fork for testing

5c669ef

fix and set arrow commit

63c7f45

stephanie-wang force-pushed the plasma-manager-failures branch from d1776ef to 63c7f45 Compare November 12, 2017 21:37

Fix plasma test

ed35319

Fix plasma manager test and make write_object_chunk consistent with r…

dfb0f02

…ead_object_chunk

robertnishihara reviewed Nov 14, 2017

style

c728e2b

robertnishihara approved these changes Nov 14, 2017

upgrade arrow

8dca63b

pcmoritz approved these changes Nov 16, 2017

pcmoritz merged commit c70430f into ray-project:master Nov 16, 2017

pcmoritz deleted the plasma-manager-failures branch November 16, 2017 06:32
