Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raylet memory corruption fixes #2591

Merged
merged 4 commits into from
Aug 9, 2018

Conversation

atumanov
Copy link
Contributor

@atumanov atumanov commented Aug 7, 2018

What do these changes do?

There are two memory corruption fixes here:

  1. boost::system::error_code message() call returns a corrupted string, even immediately after initializing the error as follows:
    boost::system::error_code error;
    The (temp?) fix is just to replace it with a generic message indicating failure to write message to the socket.
  2. an erase is called on an unordered_map iterator, which is then subsequently accessed.
    Iterators to the erased element are invalidated:
    https://stackoverflow.com/questions/38854265/unordered-map-element-being-deleted
    Valgrind was reporting a bad read, which is fixed by this PR.

Related issue number

fixes : #2584
subsumes: #2548

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7301/
Test PASSed.

@@ -84,7 +84,7 @@ ray::Status ServerConnection<T>::WriteMessage(int64_t type, int64_t length,
boost::system::error_code error;
WriteBuffer(message_buffers, error);
if (error) {
return ray::Status::IOError(error.message());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks strange why error.message() is corrupted. May we could print error.value() as well since that's valid? That might also give some hints on what kind of errors encountered.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it'd be preferable to do strerror(ec.value()) as in #2548 so that we have the error message, even despite the fact that this is probably not platform independent.

We can address this by factoring out all of the calls to strerror(ec.value()) in a separate function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not certain that ec.value() is consistent with the errno numbers expected by strerror. There may be another way that's specific to boost. I am looking into it.

@robertnishihara
Copy link
Collaborator

@atumanov so are you saying that it is pretty certain that the bug is in boost (as opposed to our code)?

@atumanov
Copy link
Contributor Author

atumanov commented Aug 8, 2018

@robertnishihara , I am not 100% sure, but:

  1. without this fix, calls to boost::system::error_code::message() return a corrupted string even when the boost error code object has just been instantiated.
  2. valgrind throws an error on access to boost::system::error_code::message()
  3. with this fix, valgrind doesn't throw any errors (for the task forwarding unit test in this issue), which means that valgrind doesn't detect any memory corruption issues leading up to the code path where the boost error code is instantiated.

From that, I have some degree of confidence that it's not our code, but not 100% confidence.

I added a utility function to convert boost error codes to ray status objects and cleaned up the code in several places to use this utility function in this PR.

@atumanov
Copy link
Contributor Author

atumanov commented Aug 8, 2018

this PR now subsumes #2548

Copy link
Contributor

@pcmoritz pcmoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks :)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7346/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7355/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7362/
Test FAILed.

@pcmoritz
Copy link
Contributor

pcmoritz commented Aug 9, 2018

jenkins retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7370/
Test PASSed.

@pcmoritz pcmoritz merged commit df7ee7f into ray-project:master Aug 9, 2018
@atumanov atumanov deleted the raylet-memory-corruption-fix branch August 28, 2018 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[xray] Valgrind errors during fault scenario
5 participants