Add timeout for Search Network Action to Improve Cluster Resistance

## Issue
Currently, the coordinating node sends the [Query](https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/search/SearchTransportService.java#L138) and [Fetch](https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/search/SearchTransportService.java#L163) network actions to remote data nodes without any timeout options.

```
    public <T extends TransportResponse> void sendChildRequest(final Transport.Connection connection, final String action,
                                                               final TransportRequest request, final Task parentTask,
                                                               final TransportResponseHandler<T> handler) {
        sendChildRequest(connection, action, request, parentTask, TransportRequestOptions.EMPTY, handler);
    }
```
This has a severe impact. When the disk on one data node's machine fails, the node can no longer handle I/O operations such as reading or writing data on disk, but it stays connected to the other nodes. The node then acts as a black hole in the cluster: every shard search request sent to it from the coordinating node gets stuck. The pending requests pile up and consume a lot of memory on the coordinating node, which soon pushes the coordinating node into full GC.

We maintain a production environment of about 300 nodes, where disk failures are very common. We tried setting a timeout in the search request body, as described in https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html#global-search-timeout. It does not take effect in our situation, because that timeout mainly applies on the Lucene side, as discussed in https://github.com/elastic/elasticsearch/issues/9156.

So I have applied the request body timeout to the query, fetch, and write network actions. It seems to greatly improve the cluster's resistance to an overloaded node or a node with a failed disk. Is this solution good enough, or is there a better one?
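To make the idea concrete, here is a minimal sketch that uses only the JDK, not Elasticsearch code. The class and method names (`ShardRequestTimeoutSketch`, `sendToUnresponsiveNode`, `withTimeout`, and so on) are hypothetical and exist only for illustration. In Elasticsearch itself, the equivalent change would presumably be to pass options that carry a timeout to `sendChildRequest` instead of `TransportRequestOptions.EMPTY`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a coordinating node that sends shard-level requests. */
final class ShardRequestTimeoutSketch {

    /** Simulates a data node that accepts the request but never answers (the "black hole"). */
    static CompletableFuture<String> sendToUnresponsiveNode() {
        return new CompletableFuture<>(); // intentionally never completed
    }

    /** Simulates a healthy data node that answers immediately. */
    static CompletableFuture<String> sendToHealthyNode() {
        return CompletableFuture.completedFuture("shard-result");
    }

    /**
     * Attaches a network-level timeout to a pending response. If no answer arrives
     * in time, the future fails with a TimeoutException, so the pending request is
     * released instead of being held indefinitely on the coordinating node.
     */
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> response, long timeoutMillis) {
        return response.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS); // Java 9+
    }

    /** Waits for the response and maps it to a result, "TIMED_OUT", or "FAILED". */
    static String outcome(CompletableFuture<String> response) {
        try {
            return response.join();
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException ? "TIMED_OUT" : "FAILED";
        }
    }
}
```

With a 50 ms timeout, `outcome(withTimeout(sendToUnresponsiveNode(), 50))` yields `"TIMED_OUT"`, while the healthy node's response is returned unchanged. Without the wrapper, the first call would block forever, which mirrors how requests accumulate on the coordinating node.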

Add timeout for Search Network Action to Improve Cluster Resistance #60037
