Description
Elasticsearch version (bin/elasticsearch --version): 7.15.1 (likely others, this mechanism has existed for quite a while)
Plugins installed: []
JVM version (java -version): Bundled
OS version (uname -a if on a Unix-like system): ESS
Description of the problem including expected versus actual behavior:
I saw a cluster with symptoms of stuck transport threads, in which most of the transport threads had the following stack trace:
app//org.elasticsearch.cluster.service.TaskBatcher.submitTasks(TaskBatcher.java:59)
app//org.elasticsearch.cluster.service.MasterService.submitStateUpdateTasks(MasterService.java:802)
app//org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTasks(ClusterService.java:271)
app//org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTask(ClusterService.java:252)
app//org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:271)
app//org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:255)
This cluster had been unhealthy for other reasons for an extended period of time, and the nodes had apparently accumulated a rather large collection of unprocessed shard-failed tasks. When a new master is elected, these tasks are all retried individually, which forms a thundering herd that blocks on the tasksPerBatchingKey mutex and keeps the transport workers from doing more meaningful work. Consequently the nodes start to fail their follower checks and drop out of the cluster, ultimately causing the master to stand down. Another master is elected and the cycle begins again.
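To make the contention concrete, here is a simplified sketch of the per-task submission pattern (not the actual TaskBatcher code, names and details are illustrative only): every transport worker that delivers a single shard-failed retry has to take the same shared lock before it can enqueue its task, so a large backlog of retries serializes all the transport threads on that one mutex.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchingSketch {
    // One shared map of pending batches, guarded by a single mutex.
    private final Map<Object, List<Runnable>> tasksPerBatchingKey = new HashMap<>();

    // Called once per incoming shard-failed message, i.e. on a transport thread.
    void submitTask(Object batchingKey, Runnable task) {
        synchronized (tasksPerBatchingKey) { // the thundering herd queues up here
            tasksPerBatchingKey.computeIfAbsent(batchingKey, k -> new ArrayList<>()).add(task);
        }
        // ... a threadpool task is then scheduled to drain and process the batch ...
    }
}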
I see room for improvement in a couple of areas. Firstly, I think we should batch the shard-failed requests - at least the retries - on the data nodes. Secondly, I think we should streamline the batching of tasks on the master. Rather than submitting every single task to a PrioritizedEsThreadPoolExecutor and maintaining a separate map of the batches, I think we should have a queue of ClusterStateTaskExecutor instances, each of which maintains its own queue to which we can add tasks without blocking on a mutex. A rough sketch of that shape follows.
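This is only a hedged sketch of the proposed shape, not an existing Elasticsearch API; the class and method names are hypothetical. The point is that each executor owns a lock-free queue, so transport threads can enqueue tasks without contending on a shared mutex, and the master drains a whole queue at once and processes it as a single batch.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

class PerExecutorTaskQueueSketch<T> {
    private final ConcurrentLinkedQueue<T> pendingTasks = new ConcurrentLinkedQueue<>();

    // Non-blocking: safe to call from a transport worker thread.
    void submit(T task) {
        pendingTasks.add(task);
        // ... then signal the master service that this executor has pending work,
        // e.g. by adding the executor itself to a queue of executors to run ...
    }

    // Called on the master update thread: drain everything queued so far and
    // execute it as one batch against the current cluster state.
    List<T> drainBatch() {
        List<T> batch = new ArrayList<>();
        for (T task; (task = pendingTasks.poll()) != null; ) {
            batch.add(task);
        }
        return batch;
    }
}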
Relates #77466