Description
Today, we attempt to connect to nodes concurrently using the management threadpool:
elasticsearch/server/src/main/java/org/elasticsearch/cluster/NodeConnectionsService.java
Lines 94 to 115 in 46e16b6
    threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(new AbstractRunnable() {
        @Override
        public void onFailure(Exception e) {
            // both errors and rejections are logged here. the service
            // will try again after `cluster.nodes.reconnect_interval` on all nodes but the current master.
            // On the master, node fault detection will remove these nodes from the cluster as their are not
            // connected. Note that it is very rare that we end up here on the master.
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to connect to {}", node), e);
        }

        @Override
        protected void doRun() throws Exception {
            try (Releasable ignored = nodeLocks.acquire(node)) {
                validateAndConnectIfNeeded(node);
            }
        }

        @Override
        public void onAfter() {
            latch.countDown();
        }
    });
Connection establishment can be time-consuming if the remote node is unresponsive, and the management threadpool is small and important, so saturating it with attempts to connect to unresponsive nodes is undesirable.
The suggested fix is to create a separate threadpool purely for establishing node-to-node connections. Since such connections are mostly long-lived, the new-connection threadpool will mostly be idle, but after a network partition it would be good for each node to try to re-establish connections to its peers using a lot more concurrency than the management threadpool can support.
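As a rough illustration of the idea, here is a minimal standalone sketch using only `java.util.concurrent`. It is not Elasticsearch code: the class `ConnectPoolSketch`, the methods `newConnectExecutor` and `connectToAll`, the pool size, and the 30-second keep-alive are all hypothetical, and the `connector` callback merely stands in for something like `validateAndConnectIfNeeded(node)`.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch, not the Elasticsearch implementation.
final class ConnectPoolSketch {

    // Builds a dedicated pool for connection attempts (assumed sizing).
    // Threads are created on demand up to maxThreads and are allowed to
    // time out, so the pool holds no threads while connections are stable.
    static ThreadPoolExecutor newConnectExecutor(int maxThreads) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            maxThreads, maxThreads, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        executor.allowCoreThreadTimeOut(true);
        return executor;
    }

    // Attempts to connect to every node concurrently on the dedicated pool,
    // waits for all attempts to finish, and returns the number of failures.
    static int connectToAll(List<String> nodes, ThreadPoolExecutor executor,
                            Consumer<String> connector) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(nodes.size());
        AtomicInteger failures = new AtomicInteger();
        for (String node : nodes) {
            executor.execute(() -> {
                try {
                    connector.accept(node); // stand-in for the real connect call
                } catch (Exception e) {
                    failures.incrementAndGet(); // a real service would log and retry later
                } finally {
                    latch.countDown(); // always release the waiter, as onAfter() does above
                }
            });
        }
        latch.await();
        return failures.get();
    }
}
```

With a pool like this, a slow or unresponsive peer only occupies a thread in the dedicated pool, leaving the management threadpool free for its other work.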
Relates #28920, in which cluster state application is blocked for multiple minutes, in part because of insufficient concurrency when attempting to connect to unresponsive peers.