Keep alive for inactive worker/scheduler connection #2524
Thank you for the excellently worded issue, and my apologies for the delay in response. I'm hearing two possible solutions:

1. Move the heartbeat onto the long-running worker-scheduler connection.
2. Periodically send a trivial keep-alive message over that connection.

Either is fine. If you were interested and have the time to investigate option 1, I think that would be best, but it's also more work.
I've been taking a look at the codebase and I found this function in distributed/comm/tcp.py (line 47 at commit 0a5b8da), which configures TCP keep-alives on new connections. Hence, it seems like TCP keep-alives should already be enabled on scheduler-worker connections. Is there anything I may be missing here?
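For context, enabling TCP keep-alives at the socket level generally looks like the sketch below. This is not the actual helper from distributed/comm/tcp.py, just the standard socket options involved; the TCP_KEEP* constants and the chosen intervals are Linux-specific assumptions.

```python
import socket

def enable_tcp_keepalive(sock: socket.socket) -> None:
    # Ask the kernel to send keep-alive probes on this TCP socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-only tuning knobs (guarded because they are not portable):
    # start probing after 10s of idleness, probe every 2s, and give up
    # after 10 unanswered probes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 10)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_tcp_keepalive(sock)
```

Note that these probes are empty TCP segments: they keep the kernel-level connection alive but carry no application payload, which matters for the HAProxy discussion below.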
I have confirmed TCP keep-alives are being sent on the long running connection by looking at the TCP traffic when the heartbeat is disabled.
Sending TCP keep-alives ensures that firewalls and networking equipment do not close the Dask connection. HAProxy, however, closes a connection when there is no traffic at the application layer, so it will close the connection even with TCP keep-alives enabled (https://stackoverflow.com/questions/32634980/haproxy-closes-long-living-tcp-connections-ignoring-tcp-keepalive). Matthew's proposal to move the heartbeat to the long-running connection sounds sane and should solve this problem.
thanks @StephanErb
We could solve this by moving the heartbeat as suggested, but it seems like this might have other effects. As an alternative, maybe just a periodic callback that sends a trivial message across every minute or so? We could make a new route that does nothing and send an operation that just hits that route.
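A minimal sketch of that idea, assuming Tornado's PeriodicCallback (which distributed uses for other periodic tasks). The `keep-alive` op name, the `batched_stream` attribute, and its `send` method are hypothetical stand-ins here, not the actual distributed API:

```python
from tornado.ioloop import PeriodicCallback

class KeepAliveMixin:
    """Hypothetical sketch: periodically push a no-op message over the
    long-lived worker->scheduler stream so it never looks idle."""

    def start_keep_alive(self, interval_ms=60_000):
        # PeriodicCallback takes its interval in milliseconds.
        self._keep_alive_pc = PeriodicCallback(self._send_keep_alive, interval_ms)
        self._keep_alive_pc.start()

    def _send_keep_alive(self):
        # self.batched_stream stands in for the dedicated scheduler
        # connection; the message routes to a handler that does nothing.
        self.batched_stream.send({"op": "keep-alive"})
```

On the receiving side, the corresponding stream handler would simply ignore the message; the application-layer traffic itself is the point.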
This is effectively a heartbeat, but much simpler and less frequent than our current heartbeats. Fixes dask#2524
Would this work? #2907
I noticed that the connection between workers and schedulers (and effectively all `Server` subclasses using `handle_stream`) is kept open to continuously listen for incoming requests. This connection is used solely to listen for `stream_handler` requests, while, for instance, the heartbeat uses a different connection from the connection pool. This means that during a longer period of cluster inactivity, the primary connection is inactive. The issue with inactive connections is that some external tools are inclined to kill them (for instance, we're using HAProxy and we kill inactive connections after 1h), which requires the worker to reconnect and register with the scheduler every hour during inactivity.

I was wondering if this is an issue for anybody else, and whether it would be desirable to either send a keep-alive over the open connection every X seconds (a simple message to keep the connection active) or even send the heartbeat over this connection. I believe the first option could be implemented fairly easily in `handle_stream`, while the latter is a bit more difficult. In either case, I wanted to get some feedback before implementing anything.

(dedicated worker-scheduler connection creation see here, reconnect if connection is broken see here)