Conditions under which a TCP connection may fail / close? #5678
Or a network failure somewhere along the line (which happens often enough that we'll want to be robust to it). This is why things like gRPC implement reconnect within the channel object (note that the reconnect isn't transparent; a network failure can still show up at the application layer during active RPCs). Note that the inverse claim (a dead remote will always show up as a closed TCP connection) is definitely false - in some setups an idle connection can close on one side without the other side detecting it immediately - this is one reason why some active application-level heartbeating is necessary.
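For illustration, a minimal asyncio sketch of application-level heartbeating (the wire format, interval, and timeout values here are hypothetical, not distributed's actual protocol):

```python
import asyncio

HEARTBEAT_INTERVAL = 5   # seconds between pings (hypothetical value)
HEARTBEAT_TIMEOUT = 2    # how long to wait for the reply (hypothetical)

async def heartbeat(reader: asyncio.StreamReader,
                    writer: asyncio.StreamWriter) -> None:
    """Ping the peer periodically; raise if it stops answering.

    A half-dead connection (remote gone, TCP still "established")
    is caught here long before any OS-level keepalive would fire.
    """
    while True:
        writer.write(b"ping\n")
        await writer.drain()
        try:
            pong = await asyncio.wait_for(reader.readline(),
                                          timeout=HEARTBEAT_TIMEOUT)
        except asyncio.TimeoutError:
            raise ConnectionError("peer stopped responding to heartbeats")
        if pong != b"pong\n":
            raise ConnectionError("unexpected heartbeat reply")
        await asyncio.sleep(HEARTBEAT_INTERVAL)
```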
TCP connection failures like this are detected at the OS level (you can even pause the process and keep the connection open). Python sockets are a pretty thin wrapper around the system socket, so the only real source of accidental connection closures would be at the distributed application layer.
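For reference, this is roughly how a remote close or failure surfaces through the Python socket layer (a generic sketch, not distributed code):

```python
import socket

def read_or_detect_close(sock: socket.socket) -> bytes:
    """Read from a connected TCP socket, translating the two ways
    a dead peer shows up at the application."""
    try:
        data = sock.recv(4096)
    except (ConnectionResetError, TimeoutError) as exc:
        # The OS already declared the connection dead (RST received,
        # keepalive or user-timeout expired, ...).
        raise ConnectionError(f"connection failed: {exc}") from exc
    if not data:
        # Orderly shutdown: the peer sent FIN and recv() returns b"".
        raise ConnectionError("connection closed by peer")
    return data
```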
TCP keepalives are at the OS layer, not the application layer, so from the TCP level the connection will appear fine. However, it is possible that a sufficiently overloaded Python process may hit application-level timeouts in the comms for connect/respond/whatever, in which case it may appear "dead" (depending on how the application-level protocol handles this).
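For context, OS-level keepalive is configured per socket; a sketch using the Linux-only constants (TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are not available on all platforms, and the values here are illustrative):

```python
import socket

def enable_keepalive(sock: socket.socket,
                     idle: int = 10, interval: int = 2, count: int = 5) -> None:
    """Turn on TCP keepalive probing for a connected socket (Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Start probing after `idle` seconds of silence ...
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # ... probe every `interval` seconds ...
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # ... and declare the connection dead after `count` failed probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```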
Yes, that's what I had in mind when asking question 3. In a discussion we had yesterday, there was some uncertainty about whether something like this can actually happen.
It does sound like we should look into something similar. Do you have an opinion there? I would imagine that if our Comm objects (or RPC, or whatever) dealt with this, we would need a lot less application code.
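A hedged sketch of what reconnect-inside-the-comm could look like, in the spirit of a gRPC channel (this wrapper is hypothetical, not an existing distributed API):

```python
import asyncio

class ReconnectingComm:
    """Hypothetical wrapper: retry the connect on failure, so callers
    only see an error after several attempts have been exhausted."""

    def __init__(self, connect, retries: int = 3, backoff: float = 0.5):
        self._connect = connect      # coroutine returning a fresh comm
        self._retries = retries
        self._backoff = backoff
        self._comm = None

    async def send(self, msg) -> None:
        for attempt in range(self._retries):
            try:
                if self._comm is None:
                    self._comm = await self._connect()
                await self._comm.write(msg)
                return
            except (ConnectionError, OSError):
                self._comm = None    # drop the broken comm and retry
                await asyncio.sleep(self._backoff * 2 ** attempt)
        raise ConnectionError("send failed after reconnect attempts")
```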
Isn't this what the TCP timeout should catch without another application-level timeout? TCP keepalive sends (empty) packets when the connection has been idle for a certain amount of time. If, after N attempts, no packet is acknowledged, the connection is declared dead.
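For a sense of scale, the time to declare an idle dead peer is roughly idle time + interval × probe count; with the stock Linux defaults that is over two hours, which is one reason applications configure their own values:

```python
# Worked example: how long until the OS declares an idle, dead peer.
# Linux defaults (see tcp(7)): tcp_keepalive_time=7200s,
# tcp_keepalive_intvl=75s, tcp_keepalive_probes=9.
idle, interval, probes = 7200, 75, 9
detection_time = idle + interval * probes   # 7200 + 675 = 7875 s
print(f"~{detection_time / 60:.0f} minutes")  # ~131 minutes
```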
Right, I think we're handling these application-level timeouts properly. At least I am not aware of any issues there.
It sounds like we need something like this as well?
Historically we used to try to wait to see if a worker would reconnect. What I found was that this introduced enough complexity that it made more sense to just assume that CommClosedError implied a dead worker. If the worker shows up a second later then great! We'll treat it as a new worker (or possibly a new worker that has some existing data). This wasn't entirely true, but was safe, and resulted in a class of consistency errors going away.

Are there advantages to trying to keep things alive? The only advantage I see is that things might run a little faster in that case. If this is the only advantage then I'd suggest that we just let broken comms imply dead workers, and deal with the slowdown.
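A schematic of that policy (the handler and its signature are illustrative, not the scheduler's actual code path; CommClosedError itself is real and lives in distributed.comm.core):

```python
from distributed.comm.core import CommClosedError

async def handle_worker(self, comm, worker_address):
    """Illustrative policy: any broken comm means the worker is gone."""
    try:
        await self.handle_stream(comm)      # serve messages until failure
    except CommClosedError:
        # No reconnect window, no limbo state: declare the worker dead.
        # If the same worker reappears later, it is treated as brand new.
        await self.remove_worker(worker_address)
```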
In a recent discussion around reconnecting clients, the question was raised of how reliable a TCP connection is and how reliably an unexpectedly closed connection can be interpreted as a dead remote.
In particular: should we implement any stateful reconnect logic at all, or is the network layer reliable enough for us to skip it?
This question assumes that TCP keepalive (distributed config, linux docs) is configured and that the TCP User Timeout is sufficiently large, such that increased latencies etc. can be effectively ignored.
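For reference, these knobs live in the distributed config; a sketch of setting them programmatically (the key names and values reflect my understanding of the current config layout and are worth double-checking):

```python
import dask

# distributed derives its TCP keepalive / TCP_USER_TIMEOUT settings from
# this timeout, so raising it makes the OS slower to declare peers dead.
dask.config.set({
    "distributed.comm.timeouts.tcp": "30s",      # assumed default-style value
    "distributed.comm.timeouts.connect": "10s",  # assumed default-style value
})
```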
I am aware of situations where firewall rules close idle connections regardless of TCP keepalive if only empty packets are being sent (see #2524 / #2907 / application-level keep-alive).
Therefore, I would like to answer the following questions:
Answers to these questions will have a major impact on ongoing and future tickets. Below are a few references:

- BatchedSend #5481
- E.g. Server.handle_stream: close the comm? #5483

If all of the above is answered with "TCP is reliable enough, we should not worry", we might have to investigate whether it is something we do that makes our connections unreliable, but before diving into this I would like to get an answer about our network infrastructure.

cc @crusaderky, @gjoseph92, @graingert, @jcrist