[xray] Log warnings for asio handlers that take too long #2601

Merged
merged 3 commits into from
Aug 9, 2018

@stephanie-wang stephanie-wang commented Aug 8, 2018

What do these changes do?

This adds warnings when an event handler takes too long to process a message from a ClientConnection. Ideally, we would also be able to do this in the future for all handlers on the event loop (e.g., timer expiration handlers).
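As a rough illustration of the idea (not the code in this PR), a handler can be wrapped so that its run time is measured on a monotonic clock and a warning is logged when it exceeds a threshold. The helper name `RunWithTimingWarning`, the threshold parameter, and the use of `std::cerr` for logging are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <thread>
#include <utility>

// Hypothetical helper (not part of Ray): runs `handler`, measures how long it
// took on the monotonic steady_clock, and logs a warning if the elapsed time
// exceeds `warn_threshold_ms`. Returns the elapsed time in milliseconds.
template <typename Handler>
int64_t RunWithTimingWarning(const std::string &name, int64_t warn_threshold_ms,
                             Handler &&handler) {
  assert(warn_threshold_ms >= 0);
  const auto start = std::chrono::steady_clock::now();
  std::forward<Handler>(handler)();
  const auto end = std::chrono::steady_clock::now();
  const int64_t elapsed_ms =
      std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  if (elapsed_ms > warn_threshold_ms) {
    // Stand-in for the project's logging macro, which is not shown here.
    std::cerr << "WARNING: handler " << name << " took " << elapsed_ms
              << " ms (threshold " << warn_threshold_ms << " ms)" << std::endl;
  }
  return elapsed_ms;
}
```

A message handler for a client connection could then be invoked as `RunWithTimingWarning("ProcessMessage", 100, [&] { /* handle message */ });`, where the 100 ms threshold is an arbitrary example value.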

This also changes the timer used for heartbeats to one that uses the std::chrono::steady_clock, and logs a warning if the last heartbeat was more than 500ms ago.
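The heartbeat check described above might look roughly like the following sketch, which records the previous heartbeat time on `std::chrono::steady_clock` and warns when the gap exceeds a limit. The class name `HeartbeatDriftChecker`, its interface, and the `std::cerr` logging are hypothetical; only the 500ms figure and the choice of clock come from the description.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>

// Hypothetical sketch (not the raylet implementation): tracks the time of the
// previous heartbeat on the monotonic steady_clock and warns when the gap
// between consecutive heartbeats is larger than `warn_after`.
class HeartbeatDriftChecker {
 public:
  using Clock = std::chrono::steady_clock;

  explicit HeartbeatDriftChecker(std::chrono::milliseconds warn_after)
      : warn_after_(warn_after), has_last_(false) {
    assert(warn_after.count() >= 0);
  }

  // Call each time a heartbeat is sent. Returns true if a warning was logged.
  bool OnHeartbeat(Clock::time_point now) {
    bool warned = false;
    if (has_last_) {
      const auto gap =
          std::chrono::duration_cast<std::chrono::milliseconds>(now - last_);
      if (gap > warn_after_) {
        // Stand-in for the project's logging macro.
        std::cerr << "WARNING: last heartbeat was sent " << gap.count()
                  << " ms ago" << std::endl;
        warned = true;
      }
    }
    last_ = now;
    has_last_ = true;
    return warned;
  }

 private:
  std::chrono::milliseconds warn_after_;
  Clock::time_point last_;
  bool has_last_;
};
```

Because `steady_clock` never jumps backwards or forwards when the system time is adjusted, a large gap here reflects real delay on the event loop rather than a wall-clock change.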

@stephanie-wang stephanie-wang requested review from robertnishihara and elibol and removed request for elibol August 8, 2018 01:41
@atumanov atumanov self-requested a review August 8, 2018 01:58

atumanov commented Aug 8, 2018

What are the advantages of killing the raylet when it has missed num_heartbeats heartbeats? I understand that the monitor will mark it as dead, but I don't understand why killing the raylet is better than letting it make progress. Is it to stay consistent with the monitor's action? Could we instead have the monitor mark the node as live again once it starts receiving heartbeats from that node again?

Switching to the monotonic clock (as we've done with legacy Ray) makes sense to me.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7326/

@stephanie-wang stephanie-wang changed the title from "[xray] Add fatal check for heartbeat timer drift" to "[xray] Log warnings for asio handlers that take too long" Aug 8, 2018
@stephanie-wang
Contributor Author

@atumanov, yeah, the fatal check probably isn't that helpful. I changed it to log a warning if the last heartbeat was too long ago.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7330/

@pcmoritz pcmoritz merged commit 2de9bfc into ray-project:master Aug 9, 2018
@pcmoritz pcmoritz deleted the timer-improvements branch August 9, 2018 21:39