Keep alive for inactive worker/scheduler connection #2524

Closed
fjetter opened this issue Feb 12, 2019 · 7 comments · Fixed by #2907

Comments

@fjetter
Member

fjetter commented Feb 12, 2019

I noticed that the connection between workers and the scheduler (and effectively between all Server subclasses using handle_stream) is kept open to continuously listen for incoming requests. This connection is used solely to listen for stream_handler requests, while the heartbeat, for instance, uses a different connection from the connection pool. This means that during a longer period of cluster inactivity, the primary connection sits idle. The problem with idle connections is that some external tools are inclined to kill them (for instance, we're using haproxy and kill inactive connections after 1h), which forces the worker to reconnect and re-register with the scheduler every hour during inactivity.

I was wondering whether this is an issue for anybody else, and whether it would be desirable to either send a keep-alive over the open connection every X seconds (a simple message to keep the connection active) or even send the heartbeat over this connection. I believe the first option could be implemented fairly easily in handle_stream, while the latter is a bit more involved. In either case, I wanted to get some feedback before implementing anything.

(Dedicated worker-scheduler connection creation: see here; reconnecting when the connection is broken: see here.)

@mrocklin
Member

Thank you for the excellently worded issue, and my apologies for the delay in response.

I'm hearing two possible solutions:

  1. Move the heartbeat to the long-running connection.

    I suspect that the current reason for the heartbeat being on a separate connection is to trigger a reconnect if something goes wrong with the long-running connection. However, as you point out in #2525 (Race condition between worker heartbeat and reconnect), we're already handling this, so perhaps this reason is not sufficient (or is, in fact, problematic).

    If so, then moving the heartbeat to the long-running connection seems like a good idea.

  2. Add a second keep-alive route to the server (probably in the handlers in core.py::Server) and add a periodic callback that sends a message every minute or so. This is easy to do and has low impact, but slightly increases complexity.

Either is fine. If you're interested and have the time to investigate option 1, I think that would be best, but it's also more work.

@lr4d
Contributor

lr4d commented Jul 16, 2019

I've been taking a look at the codebase and found the function set_tcp_timeout(stream), which appears to do what @fjetter suggests: it enables TCP keep-alives, sending a packet every x seconds if the connection is idle.

Hence, it seems like TCP keep-alives should already be enabled on scheduler-worker connections.

Is there anything I may be missing here?
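For reference, here is a minimal sketch of what enabling TCP keep-alive on a raw socket involves, using the standard Linux socket options. The function name and defaults are illustrative (chosen to match the values in the debug log in the next comment) and are not a copy of distributed's set_tcp_timeout:

import socket

def enable_tcp_keepalive(sock, idle=10, interval=2, nprobes=10):
    # Ask the kernel to probe an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds the connection may sit idle before the first probe is sent
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between subsequent probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the kernel drops the connection
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, nprobes)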

@lr4d
Contributor

lr4d commented Jul 16, 2019

I have confirmed that TCP keep-alives are being sent on the long-running connection by looking at the TCP traffic with the heartbeat disabled.

scheduler_1_4aa062432c65 | distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
worker_1_ac8a9b1b2af2 | distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2

The scheduler is at tcp://172.22.0.2:8786:

root@f07be9f53108:/# tcpdump -pn "host 172.22.0.2 and tcp"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:57:47.361159 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1838822340, win 229, options [nop,nop,TS val 48374592 ecr 48373568], length 0
15:57:47.361326 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48374592 ecr 48373568], length 0
15:57:47.361359 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48374592 ecr 48373568], length 0
15:57:47.361385 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48374592 ecr 48373568], length 0
15:57:57.600890 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48375616 ecr 48374592], length 0
15:57:57.600986 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48375616 ecr 48374592], length 0
15:57:57.601001 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48375616 ecr 48374592], length 0
15:57:57.601025 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48375616 ecr 48375616], length 0
15:58:07.808206 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48376640 ecr 48375616], length 0
15:58:07.808296 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48376640 ecr 48375616], length 0
15:58:07.816519 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48376641 ecr 48375616], length 0
15:58:07.816616 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48376641 ecr 48376640], length 0
15:58:18.047129 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48377664 ecr 48376640], length 0
15:58:18.047129 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48377664 ecr 48376641], length 0
15:58:18.047184 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48377664 ecr 48376641], length 0
15:58:18.047232 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48377664 ecr 48377664], length 0
15:58:28.286598 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48378688 ecr 48377664], length 0
15:58:28.286623 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48378688 ecr 48377664], length 0
15:58:28.286652 IP 172.22.0.4.46782 > 172.22.0.2.8786: Flags [.], ack 1, win 229, options [nop,nop,TS val 48378688 ecr 48377664], length 0
15:58:28.286745 IP 172.22.0.2.8786 > 172.22.0.4.46782: Flags [.], ack 1, win 235, options [nop,nop,TS val 48378688 ecr 48378688], length 0

@StephanErb
Contributor

Sending TCP keep-alives ensures that firewalls and networking equipment do not close the Dask connection. HAProxy, however, closes the connection if there is no traffic at the application layer, so it will close the connection even with TCP keep-alives enabled (https://stackoverflow.com/questions/32634980/haproxy-closes-long-living-tcp-connections-ignoring-tcp-keepalive).

Matthew's proposal to move the heartbeat to the long-running connection sounds sane and should solve this problem.
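To illustrate, a hypothetical HAProxy configuration along these lines would drop the proxied stream after an hour without application-layer data, no matter how many TCP keep-alive probes are exchanged (values are made up for the example):

defaults
    mode tcp
    timeout client 1h   # close if the client side sends no data for 1h
    timeout server 1h   # close if the server side sends no data for 1h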

@lr4d
Contributor

lr4d commented Jul 17, 2019

thanks @StephanErb

@mrocklin
Member

We could solve this by moving the heartbeat as suggested, but it seems like this might have other effects?

As an alternative, maybe just a periodic callback that sends a trivial message across every minute or so? We could make a new route that did nothing and send an operation that just hits that route.
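A minimal sketch of that idea, using plain asyncio streams rather than distributed's comm machinery (the handler and function names are hypothetical; the actual change landed in #2907 and may look different):

import asyncio

KEEP_ALIVE_INTERVAL = 60  # seconds; "every minute or so"

async def server_handler(reader, writer):
    # Server side: a route that does nothing useful; merely receiving traffic
    # keeps application-layer proxies such as HAProxy from closing the stream.
    while True:
        line = await reader.readline()
        if not line:
            break
        if line.strip() == b"keep-alive":
            writer.write(b"ok\n")
            await writer.drain()
    writer.close()

async def worker_keep_alive(writer):
    # Worker side: periodically send a trivial message over the long-running
    # connection, analogous to adding a PeriodicCallback in the worker.
    while True:
        await asyncio.sleep(KEEP_ALIVE_INTERVAL)
        writer.write(b"keep-alive\n")
        await writer.drain()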

mrocklin added a commit to mrocklin/distributed that referenced this issue Jul 29, 2019
This is effectively a heartbeat, but much simpler and less frequent than
our current heartbeats

Fixes dask#2524
@mrocklin
Member

Would this work? #2907

mrocklin added a commit that referenced this issue Aug 2, 2019
This is effectively a heartbeat, but much simpler and less frequent than
our current heartbeats

Fixes #2524