Description
Today we block a thread waiting for connections to open. Threads are a precious resource, and opening a connection can be time-consuming if the remote node is unresponsive. Although #39629 mostly alleviates the effects seen in #28920, it is still possible that a poorly-timed attempt by the `NodeConnectionsService` to reconnect to all the known nodes in the cluster state could saturate the small-yet-important management threadpool in a network partition.
In #29023 we suggested creating a dedicated threadpool for connections, but then the work in #35144 brought us closer to being able to open these connections asynchronously, and the idea of introducing a dedicated threadpool was dropped. However, it's not yet possible to open a connection fully asynchronously, so there is still a risk of saturating a threadpool during a network partition.
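To illustrate the difference, here is a minimal sketch of the listener-based style this issue is aiming for. It is not the Elasticsearch API: `FakeTransport`, `openConnection`, `complete` and `poolStaysFree` are hypothetical names invented for this example, and a single-threaded executor stands in for the management threadpool. The point is only that when the connection attempt is registered with a callback rather than awaited, the pool thread returns immediately and can run other work even if the remote node never responds.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

final class AsyncConnectSketch {

    /** Hypothetical stand-in for a transport: connections complete only when complete() is called. */
    static final class FakeTransport {
        private final Map<String, BiConsumer<String, Exception>> pending = new ConcurrentHashMap<>();

        /** Non-blocking: registers the listener and returns immediately. */
        void openConnection(String node, BiConsumer<String, Exception> listener) {
            pending.put(node, listener);
        }

        /** Simulates the remote node finally responding; notifies the registered listener. */
        void complete(String node) {
            BiConsumer<String, Exception> listener = pending.remove(node);
            if (listener != null) {
                listener.accept("connection to " + node, null);
            }
        }
    }

    /**
     * Starts a connection attempt on a single-threaded pool (a stand-in for the
     * management threadpool) and reports whether the pool can still run other
     * work while that attempt is outstanding.
     */
    static boolean poolStaysFree(FakeTransport transport, String node,
                                 CompletableFuture<String> connection) throws Exception {
        ExecutorService management = Executors.newSingleThreadExecutor();
        try {
            // The pool thread only registers a callback; it never waits for the connection.
            management.execute(() -> transport.openConnection(node, (conn, e) -> {
                if (e != null) {
                    connection.completeExceptionally(e);
                } else {
                    connection.complete(conn);
                }
            }));
            // Unrelated work queued behind the connection attempt on the same thread.
            CompletableFuture<Boolean> otherWork = new CompletableFuture<>();
            management.execute(() -> otherWork.complete(true));
            // A blocking connect to an unresponsive node would make this wait time out.
            return otherWork.get(5, TimeUnit.SECONDS);
        } finally {
            management.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        FakeTransport transport = new FakeTransport();
        CompletableFuture<String> connection = new CompletableFuture<>();
        boolean free = poolStaysFree(transport, "unresponsive-node", connection);
        System.out.println("pool free while connecting: " + free);
        System.out.println("connected yet: " + connection.isDone());
        transport.complete("unresponsive-node"); // the node eventually responds
        System.out.println("result: " + connection.get(1, TimeUnit.SECONDS));
    }
}
```

Had `openConnection` instead waited for the result on the pool thread, the second task could not have started until the remote node responded, which is the saturation risk described above.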
To avoid losing track of this, here is a meta-issue which tracks the remaining places that need to work asynchronously:
- `ConnectionManager#internalOpenConnection`, `ConnectionManager#openConnection` and `ConnectionManager#connectToNode` (Move ConnectionManager to async APIs #42636)
- `TransportService#connectToNode` (Move ConnectionManager to async APIs #42636)
- `HandshakingTransportAddressConnector#connectToRemoteMasterNode` (Move ConnectionManager to async APIs #42636)
- `NodeConnectionsService#ConnectionTarget` (Make NodeConnectionsService non-blocking #44211)
- `Coordinator#handleJoinRequest` (Move ConnectionManager to async APIs #42636)
- `RemoteClusterConnection#ConnectHandler` (Asynchronously connect to remote clusters #44825)
In each case there are quite a few tests that will need adjusting, so I think it makes sense to break the work up like this.
Connections are also opened by the transport client, but it seems less important to open these connections asynchronously.