node-left ... reason: disconnected triggered by earlier node-left event rather than network issue #67873

Closed
@DaveCTurner

If a node leaves the cluster with reason: disconnected then this means something closed one of the TCP channels initiated by the master targeting the departing node. Nodes never deliberately close any incoming connections, and the master doesn't spontaneously close any outgoing connections, so we usually consider reason: disconnected to be evidence of a network issue.

However, I recently observed a node-left event with reason: disconnected on a cluster running entirely on localhost, which rules out a network issue. This was the second of two node-left events for the same node within a few seconds; the first had reason: followers check retry count exceeded. I have so far failed to reproduce this situation.

My best guess regarding the cause is that we hadn't finished closing everything by the time the node rejoined, so the node-join task executed first, and only then did the channel get closed and the master get notified that this node had disconnected. (io.netty.channel.AbstractChannelHandlerContext#close() is fire-and-forget: it just adds a task to be processed by the event loop. That particular event loop was blocked, so the closing was delayed.)
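To illustrate the ordering I'm guessing at, here is a minimal sketch. It is not Elasticsearch or Netty code: a plain single-threaded executor stands in for the blocked event loop, a queued task stands in for the deferred close, and the class name, method name and event strings are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DelayedCloseSketch {

    // Returns the events in the order the (simulated) master observes them.
    static List<String> simulate() throws InterruptedException {
        List<String> events = new CopyOnWriteArrayList<>();

        // Assumption: one thread processing queued tasks in order, standing in
        // for the event loop that owns the channel.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        CountDownLatch loopBusy = new CountDownLatch(1);
        CountDownLatch unblock = new CountDownLatch(1);

        // Something else is occupying the event loop.
        eventLoop.execute(() -> {
            loopBusy.countDown();
            try {
                unblock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loopBusy.await();

        // The master removes the node after the follower checks fail ...
        events.add("node-left: followers check retry count exceeded");

        // ... and asks for the channel to be closed. This only enqueues a task;
        // the close listener cannot run while the loop is blocked.
        eventLoop.execute(() -> events.add("node-left: disconnected"));

        // The node rejoins before the queued close has been processed.
        events.add("node-join");

        // The loop unblocks, the close finally runs, and the master is told
        // that the (already rejoined) node disconnected.
        unblock.countDown();
        eventLoop.shutdown();
        eventLoop.awaitTermination(5, TimeUnit.SECONDS);
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        simulate().forEach(System.out::println);
    }
}
```

In this sketch the disconnected event lands after the node-join purely because of task ordering on one thread, with no network involved, which matches what I saw on localhost.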

I don't think the behaviour is particularly bad, since the node had already left the cluster. The bug is in how we report this as a network event when really the network is fine.

Labels: :Distributed Coordination/Cluster Coordination, >bug, Team:Distributed (Obsolete)
