Skip to content

Fix bugs in plasma manager transfer #1188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Nov 16, 2017

Conversation

stephanie-wang
Copy link
Contributor

@stephanie-wang stephanie-wang commented Nov 7, 2017

This fixes error handling between plasma managers when one manager dies during a transfer. Previously, we were not detecting and handling errors on the receiving end. This detects the error by looking for an EOF returned by read, and aborts the object that was being received so that we can later recreate the object. This also refactors the send and receive code (in write_object_chunk and read_object_chunk, respectively) to be more similar.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2280/
Test FAILed.

Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fixes!

* yet, so send the initial data request. */
err = handle_sigpipe(
plasma::SendDataReply(conn->fd, buf->object_id.to_plasma_id(),
buf->data_size, buf->metadata_size),
conn->fd);
conn->cursor = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that by providing ClientConnection_request_finished and ClientConnection_finish_request, you're trying to hide conn->cursor as an implementation detail. But that seems inconsistent with directly accessing conn->cursor here and in other places in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it was more to make sure that we could keep the values consistent, in case we have to change it again and don't want to change the -1 check everywhere.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2321/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2328/
Test FAILed.

@stephanie-wang stephanie-wang force-pushed the plasma-manager-failures branch from 8623506 to e481ccc Compare November 9, 2017 21:11
@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2333/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2331/
Test FAILed.

@AmplabJenkins
Copy link

Build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2352/
Test PASSed.

@stephanie-wang stephanie-wang force-pushed the plasma-manager-failures branch from d1776ef to 63c7f45 Compare November 12, 2017 21:37
@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2398/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2404/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2408/
Test PASSed.

Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really great! I left some questions.

It seems like the right testing set up for the managers is to start a bunch of stores/managers and have them continually request objects from each other and then to continually kill connections and make sure they properly clean up and reinitiate the connections.

However, I'm not sure how to go about "killing connections" other than just killing the managers themselves.

@@ -589,13 +592,14 @@ void send_queued_request(event_loop *loop,
break;
case MessageType_PlasmaDataReply:
LOG_DEBUG("Transferring object to manager");
if (conn->cursor == 0) {
/* If the cursor is zero, we haven't sent any requests for this object
if (ClientConnection_request_finished(conn)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to be checking conn->cursor == 0, and now we're checking conn->cursor == -1. Is the danger that if we only use 0 instead of 0 and -1 then we may accidentally go through this code block twice if write_object_chunk fails to write something?

Or did it just seem cleaner to do it this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was worried about the case you mentioned happening. I guess the chances are small, but it seems cleaner to do it this way anyway.

LOG_ERROR("read error");
} else if (r == 0) {
LOG_DEBUG("end of file");
if (r <= 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before it looks like r == 0 meant we read an EOF. Is that an error?

Copy link
Contributor Author

@stephanie-wang stephanie-wang Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would be EOF for a normal file, but since this is over the network, and we stop reading after reading exactly the right number of bytes, we should never get 0 here, unless there was a network error.

LOG_DEBUG("end of file");
if (r <= 0) {
LOG_ERROR("Read error");
return errno;
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we return in the if block, I have a slight preference for removing this else block.

* we're done. */
if (conn->cursor == buf->data_size + buf->metadata_size) {
ClientConnection_finish_request(conn);
}
return 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prior code looks like it was just returning 0 (meaning "not done with the transfer") in the event of an error. This makes it seem like it could just get into a situation where it kept trying to process the same object but never finished and so a connection between two managers would get blocked in some sense. Was that happening? I'm just wondering how this could possibly have been working if we weren't propagating errors from the call to read back to the caller..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was the bug that we uncovered when testing with a live cluster. I don't think we run into it normally because the chance of a disconnection was small.

int done = read_object_chunk(conn, buf);
if (!done) {
return;
int err = read_object_chunk(conn, buf);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're handling dropped connections, we don't really need ignore_data_chunk anymore, right? E.g., the receiver could simply hang up on the sender if the receiver doesn't want the object that the sender is sending, right?

Not sure which would be more efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think we still need it so that we can get any objects that were queued behind the object we want to ignore. I think this case is if we already have the object that's getting transferred, so we just want to ignore the new copy.

* we're done. */
if (conn->cursor == buf->data_size + buf->metadata_size) {
ClientConnection_finish_request(conn);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be impossible since we pass s into the read call above, but might be worth having a sanity check

CHECK(conn->cursor <= buf->data_size + buf->metadata_size);

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2412/
Test PASSed.

Copy link
Contributor

@pcmoritz pcmoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2424/
Test PASSed.

@pcmoritz pcmoritz merged commit c70430f into ray-project:master Nov 16, 2017
@pcmoritz pcmoritz deleted the plasma-manager-failures branch November 16, 2017 06:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants