protect raylet against bad messages #4003

Merged


zhijunfu
Contributor

@zhijunfu zhijunfu commented Feb 9, 2019

This PR addresses issue #3915: it fixes the raylet crashes caused by malformed messages.

This change updates the node manager so that, when it discovers a new node manager, it first sends that node manager a ConnectClient message containing its own client ID before sending any other messages. The remote node manager then knows that further messages on this connection come from a valid source.

This change then modifies client_connection. When we receive a malformed message (one with an incorrect protocol version), we check whether it comes from a valid source, i.e. one that has previously sent us a ConnectClient message. If it does, the mismatch is likely a real bug. Otherwise we know the sender is not legitimate, so we just print a message and stop processing this connection.
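The decision described above can be sketched roughly as follows. This is only an illustration and not the code in this PR: `Connection`, `HeaderAction`, `ClassifyHeader`, `HandleHeader`, and the placeholder `kExpectedVersion` are hypothetical names, and what "report a bug" means in practice is left as a log line here.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-ins; none of these names are taken from the Ray source.
constexpr int64_t kExpectedVersion = 1;  // placeholder value, an assumption

enum class HeaderAction { kProcess, kReportBug, kDropConnection };

struct Connection {
  bool registered = false;  // set once a ConnectClient message was received
  bool closed = false;      // set once we stop processing the connection
};

// Decide what to do with a message header, given the version it carries.
HeaderAction ClassifyHeader(const Connection &conn, int64_t read_version) {
  if (read_version == kExpectedVersion) {
    return HeaderAction::kProcess;
  }
  // A mismatch from a peer that already registered is likely a real bug.
  if (conn.registered) {
    return HeaderAction::kReportBug;
  }
  // A mismatch from an unknown source: log it and stop processing.
  return HeaderAction::kDropConnection;
}

void HandleHeader(Connection &conn, int64_t read_version) {
  switch (ClassifyHeader(conn, read_version)) {
    case HeaderAction::kProcess:
      break;
    case HeaderAction::kReportBug:
      std::cerr << "protocol mismatch from a registered client\n";
      break;
    case HeaderAction::kDropConnection:
      std::cerr << "ignoring malformed message from an unregistered source\n";
      conn.closed = true;
      break;
  }
}
```

Keeping the classification separate from the side effects makes the two cases (registered vs. unregistered sender) easy to test without a real socket.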

@zhijunfu
Contributor Author

zhijunfu commented Feb 9, 2019

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11722/

@@ -242,7 +244,11 @@ void ClientConnection<T>::ProcessMessageHeader(const boost::system::error_code &
   }

   // If there was no error, make sure the protocol version matches.
-  RAY_CHECK(read_version_ == RayConfig::instance().ray_protocol_version());
+  if (!CheckProtocolVersion()) {

Contributor

It would be slightly clearer to just call CheckProtocolVersion() here (not returning anything), and have CheckProtocolVersion itself call ServerConnection<T>::Close() at the point where the comment says "and stop processing the connection".

Contributor Author

I think we need to skip the rest of this function if CheckProtocolVersion fails, so it still needs to return a value.
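For illustration, the two points in this exchange can be combined: the check closes the connection itself and still returns a value so the caller can return early. This is a minimal sketch on a hypothetical `SketchConnection` struct, not the real ClientConnection<T>; the fields and the placeholder version value are assumptions.

```cpp
#include <cstdint>

// Hypothetical miniature of a connection; not the real ClientConnection<T>.
struct SketchConnection {
  int64_t expected_version = 1;  // placeholder value, an assumption
  int64_t read_version = 0;      // value read from the incoming header
  bool closed = false;
  bool body_read_started = false;

  void Close() { closed = true; }

  // Closes the connection on a mismatch and reports the outcome to the caller.
  bool CheckProtocolVersion() {
    if (read_version != expected_version) {
      Close();
      return false;
    }
    return true;
  }

  void ProcessMessageHeader() {
    if (!CheckProtocolVersion()) {
      return;  // skip everything below for a rejected header
    }
    body_read_started = true;  // stands in for reading the message body
  }
};
```

Without the return value, ProcessMessageHeader would go on to start reading a body from a connection that was just closed.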

@pcmoritz
Contributor

pcmoritz commented Feb 9, 2019

Thanks, I left some comments.

Let's also change the ray_protocol_version to 0x5241590000000000 so that it starts with some magic characters (hex for "RAY" in this case), and add a comment about it. The rationale is that some random program could still send an int64_t of zero at the beginning (zeros are quite common), but it is much less likely that a program sends this particular number. There is still plenty of space to encode the protocol version in the lower bits.
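As a rough sketch of that layout: the value 0x5241590000000000 comes from the comment above, while the split into three magic bytes and five version bytes, the unsigned type, and the helper names `MakeCookie`, `HasMagic`, and `VersionOf` are assumptions made for illustration only.

```cpp
#include <cstdint>

// 0x52 'R', 0x41 'A', 0x59 'Y' in the three most significant bytes.
constexpr uint64_t kMagic = 0x5241590000000000ULL;
// Assumed split: top 3 bytes hold the magic, low 5 bytes hold the version.
constexpr uint64_t kMagicMask = 0xFFFFFF0000000000ULL;

// Combine the magic prefix with a version number in the lower bits.
constexpr uint64_t MakeCookie(uint64_t version) {
  return kMagic | (version & ~kMagicMask);
}

// True if the value starts with the "RAY" magic bytes.
constexpr bool HasMagic(uint64_t value) {
  return (value & kMagicMask) == kMagic;
}

// Extract the version stored below the magic prefix.
constexpr uint64_t VersionOf(uint64_t value) {
  return value & ~kMagicMask;
}

static_assert(MakeCookie(0) == 0x5241590000000000ULL, "version 0 is the bare magic");
static_assert(!HasMagic(0), "an all-zero header is rejected");
```

Because the top byte is 0x52, the value is also positive when stored in an int64_t, so it does not collide with a zero or negative header.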

@robertnishihara
Collaborator

@pcmoritz I don't think we should rely on ray_protocol_version for anything because we could easily get rid of it in the future.

@pcmoritz
Contributor

pcmoritz commented Feb 9, 2019

If we get rid of it in the future, we will need to find a different mechanism for this. A magic cookie has to be sent and checked before the raylet tries to interpret the next int64_t as the size of the next message (and potentially crashes).
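To make the ordering concrete, here is a small sketch that validates the cookie before the length field is ever looked at. The wire layout (cookie, type, length as three consecutive int64_t values in host byte order), the `Header` struct, and `ParseHeader` are assumptions for illustration, not the actual raylet code.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int64_t kCookie = 0x5241590000000000;  // value proposed in this thread

struct Header {
  int64_t type;
  int64_t length;
};

// Assumed layout: cookie, type, length as three int64_t in host byte order.
// Returns false without touching the length if the cookie does not match.
bool ParseHeader(const std::vector<uint8_t> &buf, Header *out) {
  int64_t fields[3];
  if (buf.size() < sizeof(fields)) {
    return false;  // not enough bytes for a full header
  }
  std::memcpy(fields, buf.data(), sizeof(fields));
  if (fields[0] != kCookie) {
    return false;  // unknown sender: never interpret the length
  }
  if (fields[2] < 0) {
    return false;  // a negative size cannot describe a valid message
  }
  out->type = fields[1];
  out->length = fields[2];
  return true;
}
```

A caller would only allocate a buffer of `length` bytes after ParseHeader returns true, which is what keeps arbitrary bytes from being treated as a message size.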

@robertnishihara
Collaborator

robertnishihara commented Feb 10, 2019 via email

@pcmoritz
Contributor

pcmoritz commented Feb 10, 2019

Let's just rename the protocol_version to ray_cookie.

@zhijunfu
Contributor Author

Updated according to the comments, and also renamed ray_protocol_version to ray_cookie.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11738/

@pcmoritz
Contributor

Thanks, I added a Python end-to-end test and fixed the linting.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11750/

@pcmoritz pcmoritz force-pushed the protect-ray-against-bad-message branch from 664e627 to 636a5a7 Compare February 11, 2019 03:41
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11759/

@guoyuhong
Contributor

The Travis failure is not related to this PR. I will merge it.

@guoyuhong guoyuhong merged commit 7097ba3 into ray-project:master Feb 11, 2019
@guoyuhong guoyuhong deleted the protect-ray-against-bad-message branch February 11, 2019 16:39