Description
Issue
Currently, the coordinating node sends the Query and Fetch network actions to remote data nodes without any timeout option:
public <T extends TransportResponse> void sendChildRequest(final Transport.Connection connection, final String action,
                                                           final TransportRequest request, final Task parentTask,
                                                           final TransportResponseHandler<T> handler) {
    sendChildRequest(connection, action, request, parentTask, TransportRequestOptions.EMPTY, handler);
}
This has a severe impact. When the disk on one data node's machine fails, the node can no longer handle I/O operations such as reading or writing data on disk, yet it stays connected to the other nodes. The node then acts as a black hole in the cluster: every shard search request sent to it by the coordinating node gets stuck. The pending requests accumulate and consume a lot of memory on the coordinating node, which soon pushes it into full GC.
We maintain a production environment of about 300 nodes, and disk failures are very common. We tried setting a timeout in the search request body, as described in https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html#global-search-timeout. But it does not take effect in our situation, since that timeout is mainly used for Lucene, as discussed in #9156.
So I have applied the request body timeout to the query, fetch, and write network actions. It seems to greatly improve the cluster's resilience when a node is overloaded or has a disk failure. Is this solution good enough, or is there a better one?
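To illustrate the idea outside of Elasticsearch, here is a minimal, self-contained sketch of how a timeout on the child request releases the caller when a node never answers. It is not the actual patch and does not use the Elasticsearch transport API: `DataNode`, `HEALTHY`, `BLACK_HOLE`, and `sendWithTimeout` are hypothetical names, and the blocking `get()` is only there to keep the example short (real transport code is asynchronous). It assumes Java 9 or later for `CompletableFuture.orTimeout`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ChildRequestTimeoutSketch {

    /** Hypothetical stand-in for a remote data node; not an Elasticsearch type. */
    interface DataNode {
        CompletableFuture<String> executeShardQuery(String query);
    }

    /** A healthy node answers immediately. */
    static final DataNode HEALTHY =
        query -> CompletableFuture.completedFuture("hits for " + query);

    /** A "black hole" node accepts the request but never completes it. */
    static final DataNode BLACK_HOLE = query -> new CompletableFuture<>();

    /**
     * Sends the shard request and gives up after timeoutMillis.
     * Without orTimeout, the call to the black hole node would block forever
     * and the pending request would stay in memory.
     */
    static String sendWithTimeout(DataNode node, String query, long timeoutMillis) {
        try {
            return node.executeShardQuery(query)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .get();
        } catch (ExecutionException e) {
            if (e.getCause() instanceof TimeoutException) {
                // The coordinating side can now fail this shard and free its resources.
                return "timed out";
            }
            throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sendWithTimeout(HEALTHY, "q1", 200));    // prints "hits for q1"
        System.out.println(sendWithTimeout(BLACK_HOLE, "q1", 200)); // prints "timed out"
    }
}
```

In the real code path, the equivalent change would be to pass options carrying a timeout instead of `TransportRequestOptions.EMPTY` in `sendChildRequest`; how the timeout value is derived from the request body is left open here.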