-
Notifications
You must be signed in to change notification settings - Fork 7k
Local scheduler sends a null heartbeat to global scheduler #962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Local scheduler sends a null heartbeat to global scheduler #962
Conversation
0c5dd6c to
4e29392
Compare
|
Merged build finished. Test PASSed. |
|
Test PASSed. |
4e29392 to
3dca693
Compare
|
Merged build finished. Test PASSed. |
|
Test PASSed. |
|
Thanks, this looks good! Did you try if the tests get any faster as a result of this? |
| * scheduler. */ | ||
| double dynamic_resources[ResourceIndex_MAX]; | ||
| /** Whether the local scheduler is dead. If true, then all other fields | ||
| * should be ignored. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing space before should
|
|
||
| void redis_local_scheduler_table_disconnect(DBHandle *db) { | ||
| flatbuffers::FlatBufferBuilder fbb; | ||
| LocalSchedulerInfoMessageBuilder builder(fbb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere else I think we are constructing these messages with just a single big call to the CreateLocalSchedulerInfoMessage method (or equivalent). Any particular reason for doing it this way instead? Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I did this to make it explicit that the other fields are unset, but we could also set them to an empty LocalSchedulerInfo. Do you have a preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, I don't have a preference, the current way seems fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it turns out this was a bug. It was causing the local scheduler valgrind error that we've been seeing. Hopefully fixed by #1037.
It was causing an assertion failure
local_scheduler_tests: /home/travis/build/ray-project/ray/python/ray/core/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h:890: void flatbuffers::FlatBufferBuilder::NotNested(): Assertion `!nested' failed.
in the test
bash ../../../src/local_scheduler/test/run_valgrind.sh
|
I just ran expecting it to be faster (that's where this code is being tested, right?), but it took 18 seconds both with and without this PR. Should this code speed up that test? |
|
Ah yeah, @pcmoritz and @robertnishihara, turns out since we're making a synchronous Redis call in the |
|
Merged build finished. Test PASSed. |
|
Test PASSed. |
|
That makes sense, but the change doesn't seem to make a difference for me. I just looked into this a little by putting some print statements in It seems to hang at the first call to If I instead start Ray with |
|
Hmm I'm not able to reproduce that behavior. Can you try printing the return value of |
|
Merged build finished. Test PASSed. |
|
Test PASSed. |
|
Merged build finished. Test FAILed. |
|
Test FAILed. |
|
retest this please |
|
Well the call to |
|
Merged build finished. Test PASSed. |
|
Test PASSed. |
|
I think I've seen the valgrind error before. I don't remember the exact problem, but sending SIGINTs to a process seems like it might cause valgrind to complain. Anyway, the valgrind error may become more common after this PR, so if that's the case we should either change the test or figure out how to fix it. |
Upon a clean exit, the local scheduler sends a null heartbeat to the global scheduler to notify that it is about to die. This allows the global scheduler to avoid waiting for the heartbeat timeout before marking the local scheduler as dead.