Introduce dedicated threadpool for establishing connections #29023

Closed
@DaveCTurner

Today, we attempt to connect to nodes concurrently using the management threadpool:

threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(new AbstractRunnable() {
    @Override
    public void onFailure(Exception e) {
        // both errors and rejections are logged here. the service
        // will try again after `cluster.nodes.reconnect_interval` on all nodes but the current master.
        // On the master, node fault detection will remove these nodes from the cluster as they are not
        // connected. Note that it is very rare that we end up here on the master.
        logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to connect to {}", node), e);
    }

    @Override
    protected void doRun() throws Exception {
        try (Releasable ignored = nodeLocks.acquire(node)) {
            validateAndConnectIfNeeded(node);
        }
    }

    @Override
    public void onAfter() {
        latch.countDown();
    }
});

Connection establishment can be time-consuming if the remote node is unresponsive, and the management threadpool is small and important, so saturating it with attempts to connect to unresponsive nodes is undesirable.

The suggested fix is to create a separate threadpool purely for establishing node-to-node connections. Because such connections are mostly long-lived, the new-connection threadpool will mostly be idle, but after a network partition it would be good for each node to try to re-establish connections to its peers using a lot more concurrency than the management threadpool can support.
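
To make the idea concrete, here is a minimal sketch of such a pool using only `java.util.concurrent`. It is not Elasticsearch's `ThreadPool` API: the class `ConnectionPoolSketch`, the methods `newConnectionExecutor` and `connectToAll`, the 30 second keep-alive, and the `Consumer<String>` standing in for `validateAndConnectIfNeeded` are all assumptions made for illustration. The executor allows up to `maxThreads` concurrent attempts and lets every thread time out, so it holds no threads while idle.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: a pool reserved for connection attempts, kept apart
// from any pool that serves management work.
final class ConnectionPoolSketch {

    // Up to maxThreads attempts run concurrently. Core threads are allowed
    // to time out, so an idle pool shrinks to zero threads.
    static ThreadPoolExecutor newConnectionExecutor(int maxThreads) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            maxThreads, maxThreads,
            30, TimeUnit.SECONDS,          // assumed keep-alive, not from the issue
            new LinkedBlockingQueue<>());  // extra attempts queue up, never rejected
        executor.allowCoreThreadTimeOut(true);
        return executor;
    }

    // Submits one connection attempt per node and waits for all of them to
    // finish. Returns false if the timeout elapses first. A failed attempt
    // is logged and still counts down the latch, mirroring onFailure/onAfter.
    static boolean connectToAll(ThreadPoolExecutor executor,
                                List<String> nodes,
                                Consumer<String> connect,
                                long timeoutMillis) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(nodes.size());
        for (String node : nodes) {
            executor.execute(() -> {
                try {
                    connect.accept(node);
                } catch (RuntimeException e) {
                    System.err.println("failed to connect to " + node + ": " + e);
                } finally {
                    latch.countDown();
                }
            });
        }
        return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

With this shape, a slow attempt against an unresponsive node occupies a thread in the dedicated pool only, and the number of attempts that can be in flight at once is set by `maxThreads` rather than by the size of the management threadpool.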

Relates #28920, in which cluster state application is blocked for multiple minutes, partly because of insufficient concurrency when attempting to connect to unresponsive peers.
