
Conversation

@stephanie-wang
Contributor

Upon a clean exit, the local scheduler sends a null heartbeat to the global scheduler to notify that it is about to die. This allows the global scheduler to avoid waiting for the heartbeat timeout before marking the local scheduler as dead.
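To make the mechanism concrete, here is a minimal sketch of the clean-exit path. The names below (LocalSchedulerState, local_scheduler_table_disconnect, signal_handler) follow the local scheduler's naming conventions but are placeholders rather than the exact code in this diff, and the print stands in for the synchronous Redis publish.

```cpp
// Sketch only: placeholder types and a print instead of the real Redis call.
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the real LocalSchedulerState / DBHandle. */
struct LocalSchedulerState {
  int db_handle;
};

static LocalSchedulerState *g_state = NULL;

/* Stand-in for redis_local_scheduler_table_disconnect(): synchronously
 * publish an info message with is_dead = true and no other fields set, so
 * the global scheduler can mark this local scheduler dead immediately
 * instead of waiting for the heartbeat timeout. */
void local_scheduler_table_disconnect(LocalSchedulerState *state) {
  printf("publishing null heartbeat via db handle %d\n", state->db_handle);
}

void signal_handler(int sig) {
  if (sig == SIGTERM && g_state != NULL) {
    /* Clean exit: notify the global scheduler before tearing down state. */
    local_scheduler_table_disconnect(g_state);
    /* ... free state, close connections ... */
    exit(0);
  }
}

int main() {
  LocalSchedulerState state = {42};
  g_state = &state;
  signal(SIGTERM, signal_handler);
  raise(SIGTERM); /* simulate a clean shutdown request */
}
```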

stephanie-wang force-pushed the local-scheduler-null-heartbeat branch from 0c5dd6c to 4e29392 (September 11, 2017 18:55)
@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1817/

stephanie-wang force-pushed the local-scheduler-null-heartbeat branch from 4e29392 to 3dca693 (September 11, 2017 19:57)
@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1818/

@pcmoritz
Contributor

Thanks, this looks good! Did you check whether the tests get any faster as a result of this?

* scheduler. */
double dynamic_resources[ResourceIndex_MAX];
/** Whether the local scheduler is dead. If true, then all other fields
* should be ignored. */
Collaborator

missing space before should


void redis_local_scheduler_table_disconnect(DBHandle *db) {
  flatbuffers::FlatBufferBuilder fbb;
  LocalSchedulerInfoMessageBuilder builder(fbb);
Collaborator

Everywhere else I think we are constructing these messages with just a single big call to the CreateLocalSchedulerInfoMessage method (or equivalent). Any particular reason for doing it this way instead? Just curious.

Contributor Author

Hmm, I did this to make it explicit that the other fields are unset, but we could also set them to an empty LocalSchedulerInfo. Do you have a preference?

Collaborator

Oh, I see. I don't have a preference; the current way seems fine to me.

Collaborator

Ok, it turns out this was a bug. It was causing the local scheduler valgrind error that we've been seeing. Hopefully fixed by #1037.

It was causing an assertion failure

local_scheduler_tests: /home/travis/build/ray-project/ray/python/ray/core/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h:890: void flatbuffers::FlatBufferBuilder::NotNested(): Assertion `!nested' failed.

in the test

bash ../../../src/local_scheduler/test/run_valgrind.sh
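For future reference, the FlatBuffers rule behind that assertion is that child objects (strings, vectors, nested tables) must be created before a table builder is started; creating one while a builder is open trips NotNested(). Here is a sketch of the two construction styles discussed above, assuming a toy schema like table InfoMessage { client_id: string; is_dead: bool; } compiled with flatc; the generated names below are hypothetical stand-ins for LocalSchedulerInfoMessage, not the real ones.

```cpp
// Illustrative only. Assumes flatc-generated code for a toy schema
//   table InfoMessage { client_id: string; is_dead: bool; }
// so InfoMessageBuilder / CreateInfoMessage are hypothetical stand-ins for
// the real LocalSchedulerInfoMessage helpers.
#include "flatbuffers/flatbuffers.h"
#include "info_message_generated.h"  // hypothetical generated header

void build_single_call(flatbuffers::FlatBufferBuilder &fbb) {
  // Style used elsewhere in the codebase: create child objects first, then
  // build the whole table in one call.
  auto client_id = fbb.CreateString("client-0");
  fbb.Finish(CreateInfoMessage(fbb, client_id, /*is_dead=*/false));
}

void build_disconnect(flatbuffers::FlatBufferBuilder &fbb) {
  // Builder style from the disconnect path: only is_dead is set; this is
  // fine as long as no child object is created while the builder is open.
  InfoMessageBuilder builder(fbb);
  builder.add_is_dead(true);
  fbb.Finish(builder.Finish());
}

void build_broken(flatbuffers::FlatBufferBuilder &fbb) {
  InfoMessageBuilder builder(fbb);         // starts a table (sets `nested`)
  auto client_id = fbb.CreateString("x");  // fires Assertion `!nested' failed
  builder.add_client_id(client_id);
  fbb.Finish(builder.Finish());
}
```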

@robertnishihara
Collaborator

I just ran

python test/component_failures_test.py ComponentFailureTest.testLocalSchedulerFailed

expecting it to be faster (that's where this code is being tested, right?), but it took 18 seconds both with and without this PR.

Should this code speed up that test?

@stephanie-wang
Contributor Author

Ah yeah, @pcmoritz and @robertnishihara, it turns out that since we're making a synchronous Redis call in the SIGTERM handler, it takes longer for the local schedulers to exit. I reordered the calls in the component failures test so that it's more likely the local scheduler exits before the SIGKILL.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1831/

@robertnishihara
Collaborator

That makes sense, but the change doesn't seem to make a difference for me.

I just looked into this a little by putting some print statements in LocalSchedulerState_free, running ray.init() in Python, manually killing the local scheduler from the command line with kill, and then seeing what gets printed.

It seems to hang at the first call to https://github.com/ray-project/ray/blob/master/src/local_scheduler/local_scheduler.cc#L166.

If I instead start Ray with ray.init(num_workers=0), then the code path seems to work.

@stephanie-wang
Contributor Author

Hmm, I'm not able to reproduce that behavior. Can you try printing the return value of waitpid in kill_worker and checking whether there is an error?
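Something along these lines in kill_worker should show whether waitpid is actually blocking or returning an error (illustrative snippet, not the real call site):

```cpp
// Illustrative check, not the actual kill_worker code: log what waitpid
// returns so an error is distinguishable from a genuine hang.
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

void reap_worker(pid_t pid) {
  int status = 0;
  pid_t ret = waitpid(pid, &status, 0);
  if (ret == -1) {
    /* ECHILD would mean the child was already reaped elsewhere; EINTR means
     * a signal interrupted the wait. */
    fprintf(stderr, "waitpid(%d) failed: %s\n", (int) pid, strerror(errno));
  } else {
    fprintf(stderr, "waitpid(%d) returned %d, exit status %d\n", (int) pid,
            (int) ret, status);
  }
}
```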

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1837/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1829/

@robertnishihara
Collaborator

retest this please

@robertnishihara
Collaborator

Well the call to waitpid seems to be the one that is hanging. It's unrelated to this PR since it also happens for me on the master.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1838/

@robertnishihara
Collaborator

I think I've seen the valgrind error before. I don't remember the exact problem, but sending SIGINTs to a process seems like it might cause valgrind to complain. Anyway, the valgrind error may become more common after this PR, so if that's the case we should either change the test or figure out how to fix it.

robertnishihara merged commit 74ac806 into ray-project:master on September 12, 2017
robertnishihara deleted the local-scheduler-null-heartbeat branch on September 12, 2017 17:45