Add keep-alive message between worker and scheduler #2907

Merged (5 commits) on Aug 2, 2019


mrocklin
Member

This is effectively a heartbeat, but much simpler and less frequent than
our current heartbeats.

Fixes #2524

@lr4d
Contributor

lr4d commented Aug 1, 2019

Getting this error on the scheduler side:

2019-08-01 13:30:48,297 ERROR    <lambda>() got an unexpected keyword argument 'worker' (distributed.core)
Traceback (most recent call last):
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/core.py", line 477, in handle_stream
    handler(**merge(extra, msg))
TypeError: <lambda>() got an unexpected keyword argument 'worker'
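
The traceback suggests that the stream loop merges some per-connection context (`extra`) into each message and passes the result to the handler as keyword arguments, so a handler that accepts no keywords fails as soon as a `worker` key is present. A minimal sketch of that failure and one possible fix follows. It assumes the `worker` key comes from `extra`; the local `merge` is a stand-in for the one in the traceback, and the handler names, address, and `dispatch` helper are hypothetical, not the code from this PR.

```python
def merge(*dicts):
    """Stand-in for the merge() in the traceback: later dicts win."""
    out = {}
    for d in dicts:
        out.update(d)
    return out


# Hypothetical handlers for a keep-alive message.
strict_handler = lambda: None                     # accepts no keyword arguments
tolerant_handler = lambda *args, **kwargs: None   # ignores whatever it is given

# Assumption: the stream loop attaches the worker address to every message.
extra = {"worker": "tcp://127.0.0.1:1234"}
msg = {}  # a keep-alive message carries no payload of its own


def dispatch(handler, extra, msg):
    # Mirrors the call shown in the traceback: handler(**merge(extra, msg))
    return handler(**merge(extra, msg))


try:
    dispatch(strict_handler, extra, msg)
    error = None
except TypeError as exc:
    error = str(exc)  # same kind of TypeError as in the log above

result = dispatch(tolerant_handler, extra, msg)  # succeeds and returns None
```

Accepting `*args, **kwargs` keeps a no-op handler robust to whatever context the dispatcher adds, at the cost of silently ignoring unexpected arguments.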

@mrocklin
Member Author

mrocklin commented Aug 1, 2019

Thanks @lr4d. Handled.

Also, what is a good time for the frequency here? Every minute? Every ten minutes? Every hour?

@lr4d
Contributor

lr4d commented Aug 1, 2019

I'd keep it below 5-10 minutes. For HAProxy, the threshold for killing a connection on which no data is sent is 60 minutes, but I don't know what it is for similar tools.
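
For illustration, a keep-alive sender can be as small as a loop that sleeps for a fixed interval and then sends a tiny message, with the interval chosen well below the proxy's idle timeout. The sketch below uses plain `asyncio` rather than whatever the PR uses; the `keep_alive` function, the `{"op": "keep-alive"}` payload, and the `count` parameter (there only so the example terminates) are assumptions, not the library's API.

```python
import asyncio


async def keep_alive(send, interval, count):
    """Call send() with a small message every `interval` seconds, `count` times.

    A real sender would loop until the connection closes; `count` exists
    only so this example finishes.
    """
    for _ in range(count):
        await asyncio.sleep(interval)
        send({"op": "keep-alive"})  # hypothetical message format


sent = []
# A very short interval keeps the demo fast. In practice the interval would be
# something like 60 seconds, far below a 60-minute proxy idle timeout.
asyncio.run(keep_alive(sent.append, interval=0.01, count=3))
```

Any interval comfortably under the shortest idle timeout on the path should work; shorter intervals cost slightly more traffic and tolerate stricter proxies.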

@lr4d
Contributor

lr4d commented Aug 2, 2019

This appears to be working fine now for our HAProxy setup. I left the cluster alive for 6 hours and saw no disconnections or error messages.
Thanks @mrocklin

@mrocklin mrocklin merged commit 4dc3d19 into dask:master Aug 2, 2019
@mrocklin mrocklin deleted the scheduler-worker-keep-alive branch August 2, 2019 16:51
Successfully merging this pull request may close these issues.

Keep alive for inactive worker/scheduler connection