Description
If a node leaves the cluster with `reason: disconnected`, this means that something closed one of the TCP channels that the master initiated to the departing node. Nodes never deliberately close incoming connections, and the master doesn't spontaneously close outgoing connections, so we usually consider `reason: disconnected` to be evidence of a network issue.
However, I recently observed a `node-left` event with `reason: disconnected` on a cluster running entirely on `localhost`, which rules out a network issue. This was the second of two `node-left` events for the same node within a few seconds; the first had reason `followers check retry count exceeded`. I have so far failed to reproduce this situation.
My best guess regarding the cause is that we hadn't finished closing everything by the time the node rejoined, so the `node-join` task executed first, and only then did the channel get closed and the master get notified that this node had disconnected. (`io.netty.channel.AbstractChannelHandlerContext#close()` is fire-and-forget: it just adds a task to be processed by the event loop, but that particular event loop was blocked, so the close was delayed.)
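
To illustrate the suspected ordering, here is a minimal sketch that uses a plain single-threaded `ExecutorService` as a stand-in for the blocked event loop. It does not use Netty or any Elasticsearch code; the class name, the latches, and the event strings are all hypothetical and only model the sequence guessed at above.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DelayedCloseSketch {

    // Returns the order in which the simulated master observes events.
    static List<String> run() throws InterruptedException {
        List<String> events = new CopyOnWriteArrayList<>();
        // Assumption: a single-threaded executor models one event loop.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        CountDownLatch loopBusy = new CountDownLatch(1);
        CountDownLatch unblock = new CountDownLatch(1);

        // Some other work occupies the event loop, so queued tasks cannot run.
        eventLoop.execute(() -> {
            loopBusy.countDown();
            try {
                unblock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loopBusy.await();

        // The master removes the node after repeated follower check failures...
        events.add("node-left: followers check retry count exceeded");
        // ...and asks for the channel to be closed. Like a fire-and-forget
        // close(), this only enqueues a task; nothing is closed yet.
        eventLoop.execute(() -> events.add("node-left: disconnected"));

        // The node rejoins before the queued close has been processed.
        events.add("node-join");

        // The event loop unblocks; the stale close finally runs and is
        // reported as a disconnection of the freshly rejoined node.
        unblock.countDown();
        eventLoop.shutdown();
        eventLoop.awaitTermination(5, TimeUnit.SECONDS);
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        run().forEach(System.out::println);
    }
}
```

In this sketch the `disconnected` event is recorded after `node-join`, even though the close was requested before the node rejoined and no network fault is involved.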
I don't think the behaviour is particularly bad, since the node had already left the cluster. The bug is in how we report this as a network event when really the network is fine.