Improve resilience to network disconnects #95100

Today if a single transport channel between two nodes in a cluster disconnects due to a network issue then we close all the other channels between the nodes and fail any ongoing transport actions with some kind of ConnectTransportException. If the node which opened the channel is the elected master then the other node is removed from the cluster, failing all its shards, and it must rejoin from scratch. In other cases the nodes will attempt to reconnect without any disruption apart from the transport exceptions.
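
For illustration, here is a minimal sketch of the all-or-nothing behaviour described above, assuming hypothetical `NodeChannels` and `Channel` types rather than the real transport classes: the first channel to close tears down its siblings and fails every in-flight request on the connection, whichever channel it was actually using.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical sketch: one logical node-to-node connection backed by several channels.
final class NodeChannels {

    interface Channel {
        void close();
        void addCloseListener(Runnable listener);
    }

    private final List<Channel> channels;
    private final Map<Long, Consumer<Exception>> pendingRequests = new ConcurrentHashMap<>();
    private final AtomicBoolean closed = new AtomicBoolean();

    NodeChannels(List<Channel> channels) {
        this.channels = channels;
        // Today's behaviour: any single channel closing closes the whole bundle.
        for (Channel channel : channels) {
            channel.addCloseListener(this::closeAndFailPending);
        }
    }

    void registerRequest(long requestId, Consumer<Exception> onFailure) {
        pendingRequests.put(requestId, onFailure);
    }

    private void closeAndFailPending() {
        if (closed.compareAndSet(false, true) == false) {
            return; // already torn down
        }
        channels.forEach(Channel::close);
        // Every outstanding request on this connection fails, regardless of
        // which channel it was using (a ConnectTransportException in practice).
        pendingRequests.values().forEach(handler ->
            handler.accept(new IOException("one channel closed, so the whole connection is closed")));
        pendingRequests.clear();
    }
}
```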

This behaviour dates back a very long way and enables some useful simplifications:

  • When a node shuts down, we promptly remove it from the cluster because it disconnects from the master.

  • When a task is cancelled, we can send the cancellation message on any of the channels between the nodes. The only¹ way a cancellation can be lost en route is if the chosen channel closes, and this means that the channel for the original task also closes, which automatically cancels the task anyway (see Cancel task and descendants on channel disconnects #56620 and the sketch below).
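
To make the reasoning in the second point concrete, here is a hedged sketch (a hypothetical `Connection` interface, not the real transport API) of why it is acceptable today for the cancellation message to be lost when the chosen channel has closed:

```java
import java.io.IOException;

// Hypothetical sketch of why losing a cancellation message is tolerable today. If
// sendCancellation fails because the chosen channel closed, then (since all channels
// share one fate) the channel carrying the original request has closed too, and the
// remote node cancels the task from its own channel-close handling (#56620).
final class AnyChannelCancellation {

    interface Connection {
        /** Sends the cancellation over an arbitrary open channel to the other node. */
        void sendCancellation(long taskId) throws IOException;
    }

    void cancel(Connection connection, long taskId) {
        try {
            connection.sendCancellation(taskId);
        } catch (IOException e) {
            // Safe to drop: the whole connection is gone, so the remote task's own
            // channel has closed and the close-triggered cancellation kicks in.
        }
    }
}
```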

However there are some drawbacks too. Occasional single-channel disconnects are a fact of life in a cloud environment, and today's behaviour is arguably an overreaction:

  • If a node is not shutting down then promptly removing it from the cluster on a disconnect will require it to rejoin the cluster from scratch and reinitialise all its shards, which seems unnecessary.

  • We fail (and cancel) all outstanding tasks between the nodes, not just the ones that relate to the affected channel.

It would be preferable to try to re-establish individual lost channels rather than tearing down the whole connection on any disconnect. Several other aspects would need to change before we could do this, including:

  • Single-channel disconnects would no longer directly remove nodes from the cluster. We may want to fail the bundle of channels if the broken channel cannot be re-established reasonably quickly, but it may also be reasonable to keep trying to reconnect until some other fault-detection mechanism tears things down (see the reconnect sketch after this list).

  • The LeaderChecker would have to retry on a ConnectTransportException rather than treating it as grounds for immediate removal of the node. If we used the same retry strategy as for other exceptions, we'd potentially remove disconnected nodes after just 2-3 seconds, so more lenience may be appropriate (see the retry-policy sketch after this list).

  • Instead of relying on the disconnect, nodes would have to ensure that they are removed promptly from the cluster on a graceful shutdown by notifying the master before closing any connections (see the shutdown sketch after this list).

  • The PublicationTransportHandler would also have to retry on a ConnectTransportException, to prevent a temporarily-disconnected node from missing a single cluster state update and thereby tripping the lag detector. If these retries continued until the publication timeout then that could be a 30s wait (by default), so a shorter timeout may be appropriate (the retry-policy sketch after this list applies here as well).

  • Task cancellations would have to use the same channel as the request being cancelled to avoid the possibility of loss (see the channel-affinity sketch after this list).
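
As a starting point for the first item, a hedged sketch of re-establishing a single lost channel within a bounded reconnect budget before falling back to failing the whole connection. All names here (`ChannelReopener`, `openChannel`, the timeouts) are hypothetical, not an existing Elasticsearch API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: try to replace a single broken channel for a bounded period
// before declaring the whole connection (and hence the node) unreachable.
final class ChannelReopener {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /**
     * @param openChannel        attempts to open one replacement channel, returning true on success
     * @param reconnectTimeoutMs give up and tear down the whole connection after this long
     * @param retryDelayMs       delay between attempts
     */
    CompletableFuture<Void> reopen(Supplier<Boolean> openChannel, long reconnectTimeoutMs, long retryDelayMs) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(reconnectTimeoutMs);
        attempt(openChannel, deadlineNanos, retryDelayMs, result);
        return result;
    }

    private void attempt(Supplier<Boolean> openChannel, long deadlineNanos, long retryDelayMs, CompletableFuture<Void> result) {
        if (openChannel.get()) {
            result.complete(null); // channel restored, nothing above the transport layer notices
        } else if (System.nanoTime() >= deadlineNanos) {
            // Could not restore the channel quickly enough: fall back to today's
            // behaviour and fail the whole bundle of channels.
            result.completeExceptionally(new java.io.IOException("could not re-establish channel in time"));
        } else {
            scheduler.schedule(() -> attempt(openChannel, deadlineNanos, retryDelayMs, result), retryDelayMs, TimeUnit.MILLISECONDS);
        }
    }
}
```

If the channel comes back within the budget, callers see no disruption; if it does not, the failure degrades to exactly today's whole-connection teardown.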
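
The LeaderChecker and PublicationTransportHandler items both amount to classifying a disconnect as retryable and giving it its own, more lenient but still bounded, budget. A hedged sketch, where `DisconnectException` stands in for ConnectTransportException and all counters and timeouts are illustrative:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a retry budget that treats "the channel dropped" differently
// from "the node responded with an error".
final class RetryPolicy {

    static final class DisconnectException extends RuntimeException {}

    private final int maxErrorRetries;          // e.g. today's small leader-check retry count
    private final long disconnectDeadlineNanos; // separate, more lenient budget for disconnects
    private int errorCount = 0;

    RetryPolicy(int maxErrorRetries, long disconnectBudgetMillis) {
        this.maxErrorRetries = maxErrorRetries;
        this.disconnectDeadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(disconnectBudgetMillis);
    }

    /** Returns true if the caller (leader check or publication) should retry, false if it should give up. */
    boolean shouldRetry(Exception failure) {
        if (failure instanceof DisconnectException) {
            // A dropped channel is no longer immediate grounds for removal: keep retrying
            // until the (shorter-than-publication-timeout) disconnect budget runs out.
            return System.nanoTime() < disconnectDeadlineNanos;
        }
        // Other failures keep today's behaviour: a small fixed number of retries.
        return ++errorCount <= maxErrorRetries;
    }
}
```

The point of the split is that a brief network blip no longer consumes the small error-retry budget, while a genuinely unreachable node is still removed once the disconnect budget expires.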
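
The graceful-shutdown item might look roughly like the following sketch, with hypothetical `MasterClient` and `Transport` interfaces standing in for the real cluster-coordination and transport layers:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: shut down by telling the master first, so the node is removed
// from the cluster promptly even though disconnects alone no longer remove it.
final class GracefulShutdown {

    interface MasterClient {
        /** Asks the elected master to remove this node; blocks until acknowledged or the timeout expires. */
        void requestRemoval(String nodeId, long timeout, TimeUnit unit) throws TimeoutException;
    }

    interface Transport {
        void closeAllConnections();
    }

    void shutdown(MasterClient master, Transport transport, String localNodeId) {
        try {
            // Step 1: be removed from the cluster while connections are still open.
            master.requestRemoval(localNodeId, 30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Fall back to relying on fault detection if the master cannot be reached.
        } finally {
            // Step 2: only now drop the connections.
            transport.closeAllConnections();
        }
    }
}
```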
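
Finally, pinning cancellations to the originating channel could be sketched as below; the `Channel` interface and the action name are hypothetical:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remember which channel carried each request so that the
// corresponding cancellation travels over the same channel. If that channel has
// closed, the channel-close handling has already cancelled the task remotely,
// so there is nothing left to send.
final class ChannelAffinityCancellation {

    interface Channel {
        boolean isOpen();
        void send(String action, long taskId) throws IOException;
    }

    private final Map<Long, Channel> channelByTask = new ConcurrentHashMap<>();

    void onRequestSent(long taskId, Channel channel) {
        channelByTask.put(taskId, channel);
    }

    void cancel(long taskId) throws IOException {
        Channel channel = channelByTask.remove(taskId);
        if (channel != null && channel.isOpen()) {
            channel.send("cancel_task", taskId); // illustrative action name, not the real one
        }
        // else: the original channel closed, which already cancels the task remotely.
    }
}
```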

Footnotes

  1. This reasoning doesn't work for cross-cluster connections, and even within a cluster it suffers from a potential race condition.
