Description
Elasticsearch version (bin/elasticsearch --version): 7.15.1 (likely others, this mechanism has existed for quite a while)
Plugins installed: []
JVM version (java -version): Bundled
OS version (uname -a if on a Unix-like system): ESS
Description of the problem including expected versus actual behavior:
I saw a cluster with symptoms of stuck transport threads, in which most of the transport threads had the following stack trace:
app//org.elasticsearch.cluster.service.TaskBatcher.submitTasks(TaskBatcher.java:59)
app//org.elasticsearch.cluster.service.MasterService.submitStateUpdateTasks(MasterService.java:802)
app//org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTasks(ClusterService.java:271)
app//org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTask(ClusterService.java:252)
app//org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:271)
app//org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:255)
This cluster had been unhealthy for other reasons for an extended period of time, and the nodes had apparently accumulated a rather large collection of unprocessed shard-failed tasks. When a new master is elected, these tasks are all retried individually, which forms a thundering herd that blocks on the tasksPerBatchingKey mutex and keeps the transport workers from doing more meaningful work. Consequently the nodes start to fail their follower checks and drop out of the cluster, ultimately causing the master to stand down. Another master is elected and the cycle begins again.
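To make the contention concrete, here is a simplified sketch of the per-task submission pattern (not the actual TaskBatcher code, names and details are illustrative only): every transport worker that delivers a single shard-failed retry has to take the same shared lock before it can enqueue its task, so a large backlog of retries serializes all the transport threads on that one mutex.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchingSketch {
    // One shared map of pending batches, guarded by a single mutex.
    private final Map<Object, List<Runnable>> tasksPerBatchingKey = new HashMap<>();

    // Called once per incoming shard-failed message, i.e. on a transport thread.
    void submitTask(Object batchingKey, Runnable task) {
        synchronized (tasksPerBatchingKey) { // the thundering herd queues up here
            tasksPerBatchingKey.computeIfAbsent(batchingKey, k -> new ArrayList<>()).add(task);
        }
        // ... a threadpool task is then scheduled to drain and process the batch ...
    }
}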
I see room for improvement in a couple of areas. Firstly, I think we should batch the shard-failed requests - at least the retries - on the data nodes. Secondly, I think we should streamline the batching of tasks on the master. Rather than submitting every single task to a PrioritizedEsThreadPoolExecutor and maintaining a separate map of the batches, I think we should have a queue of ClusterStateTaskExecutor instances, each of which maintains its own queue to which we can add tasks without blocking on a mutex. A rough sketch of that shape follows.
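This is only a hedged sketch of the proposed shape, not an existing Elasticsearch API; the class and method names are hypothetical. The point is that each executor owns a lock-free queue, so transport threads can enqueue tasks without contending on a shared mutex, and the master drains a whole queue at once and processes it as a single batch.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

class PerExecutorTaskQueueSketch<T> {
    private final ConcurrentLinkedQueue<T> pendingTasks = new ConcurrentLinkedQueue<>();

    // Non-blocking: safe to call from a transport worker thread.
    void submit(T task) {
        pendingTasks.add(task);
        // ... then signal the master service that this executor has pending work,
        // e.g. by adding the executor itself to a queue of executors to run ...
    }

    // Called on the master update thread: drain everything queued so far and
    // execute it as one batch against the current cluster state.
    List<T> drainBatch() {
        List<T> batch = new ArrayList<>();
        for (T task; (task = pendingTasks.poll()) != null; ) {
            batch.add(task);
        }
        return batch;
    }
}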
Relates #77466