`node-left ... reason: disconnected` triggered by earlier `node-left` event rather than network issue

If a node leaves the cluster with `reason: disconnected` then this means something closed one of the TCP channels initiated by the master targeting the departing node. Nodes never deliberately close any incoming connections, and the master doesn't spontaneously close any outgoing connections, so we usually consider `reason: disconnected` to be evidence of a network issue.

However, I recently observed a `node-left` event with `reason: disconnected` on a cluster running entirely on `localhost`, which rules out a network issue. This was the second of two `node-left` events for the same node within a few seconds; the first had reason `followers check retry count exceeded`. I have so far failed to reproduce this situation.

My best guess regarding the cause is that we hadn't finished closing everything by the time the node rejoined, so the `node-join` task executed first, and only then did the channel get closed and the master get notified that this node had disconnected. (`io.netty.channel.AbstractChannelHandlerContext#close()` is fire-and-forget: it just adds a task to be processed by the event loop, but that particular event loop was blocked, so the closing was delayed.)
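To illustrate the suspected ordering, here is a minimal sketch that does not use Netty or any Elasticsearch code. It assumes the event loop can be modelled as a single-threaded executor and that "closing the channel" is just a task queued on it; the class name `DelayedCloseSketch`, the `simulate()` method and the event strings are all hypothetical and exist only for this illustration. It models the guess above and is not a confirmed reproduction.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the suspected race; not Netty or Elasticsearch code.
class DelayedCloseSketch {

    static List<String> simulate() {
        List<String> events = new CopyOnWriteArrayList<>();
        // Stand-in for the channel's event loop: one thread, tasks run in order.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        CountDownLatch loopIsBlocked = new CountDownLatch(1);
        CountDownLatch unblockLoop = new CountDownLatch(1);
        try {
            // Something keeps the event loop busy (assumption: cause unknown).
            eventLoop.execute(() -> {
                loopIsBlocked.countDown();
                try {
                    unblockLoop.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            loopIsBlocked.await();

            // First removal: the master drops the node and requests a close.
            events.add("node-left reason: followers check retry count exceeded");
            // Fire-and-forget close: only queued, because the loop is blocked.
            // When it finally runs, the master is told the node disconnected.
            eventLoop.execute(() -> events.add("node-left reason: disconnected"));

            // The node rejoins before the queued close has been processed.
            events.add("node-join");

            // The event loop becomes free and processes the stale close.
            unblockLoop.countDown();
            eventLoop.shutdown();
            if (!eventLoop.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("event loop did not terminate");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this model the `disconnected` removal is recorded after the `node-join`, even though nothing network-related happened: the only cause is the delayed processing of a close requested by the first removal.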

I don't think the behaviour is particularly bad, since the node had already left the cluster; the bug is that we report this as a network event when really the network is fine.

`node-left ... reason: disconnected` triggered by earlier `node-left` event rather than network issue #67873

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

node-left ... reason: disconnected triggered by earlier node-left event rather than network issue #67873

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`node-left ... reason: disconnected` triggered by earlier `node-left` event rather than network issue #67873