
Pending task batching can be a bottleneck #81626

Closed
@DaveCTurner

Description


Elasticsearch version (bin/elasticsearch --version): 7.15.1 (likely others, this mechanism has existed for quite a while)

Plugins installed: []

JVM version (java -version): Bundled

OS version (uname -a if on a Unix-like system): ESS

Description of the problem including expected versus actual behavior:

I saw a cluster with symptoms of stuck transport threads; most of the transport threads had the following stack trace:

       app//org.elasticsearch.cluster.service.TaskBatcher.submitTasks(TaskBatcher.java:59)
       app//org.elasticsearch.cluster.service.MasterService.submitStateUpdateTasks(MasterService.java:802)
       app//org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTasks(ClusterService.java:271)
       app//org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTask(ClusterService.java:252)
       app//org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:271)
       app//org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:255)

This cluster had been unhealthy for other reasons for an extended period of time, and the nodes had apparently accumulated a rather large collection of unprocessed shard-failed tasks. When a new master is elected, these tasks are all retried individually, which forms a thundering herd that blocks on the tasksPerBatchingKey mutex and keeps the transport workers from doing more meaningful things. Consequently the nodes start to fail their follower checks and drop out of the cluster, ultimately causing the master to stand down. Another master is elected and the cycle begins again.
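To illustrate the contention pattern (this is a simplified sketch, not the actual TaskBatcher code; the class and method names here are hypothetical): every submission, from any transport thread, synchronizes on one shared map of batches, so thousands of retried shard-failed tasks arriving at once serialize the transport workers on a single monitor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-batching-key submission guarded by one mutex.
// Every transport thread submitting a task must acquire the same monitor, so a
// burst of retries makes the threads queue up here instead of serving traffic.
class ContendedBatcher {
    private final Map<Object, List<Runnable>> tasksPerBatchingKey = new HashMap<>();

    void submitTask(Object batchingKey, Runnable task) {
        synchronized (tasksPerBatchingKey) { // single point of contention
            tasksPerBatchingKey.computeIfAbsent(batchingKey, k -> new ArrayList<>()).add(task);
        }
    }

    int pendingFor(Object batchingKey) {
        synchronized (tasksPerBatchingKey) {
            List<Runnable> pending = tasksPerBatchingKey.get(batchingKey);
            return pending == null ? 0 : pending.size();
        }
    }
}
```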

I see room for improvement in a couple of areas. Firstly, I think we should be batching the shard-failed requests - at least the retries - on the data nodes. Secondly, I think we should streamline the batching of tasks on the master. Rather than submitting every single task to a PrioritizedEsThreadPoolExecutor and maintaining a separate map of the batches, I think we should have a queue of the ClusterStateTaskExecutor instances, each of which maintains its own queue to which we can add tasks without blocking on a mutex.
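A minimal sketch of the second idea (names and structure are hypothetical, not a proposed patch): each executor owns a lock-free queue, so enqueueing a task from a transport thread never blocks on a shared mutex, and a single drain pass on the master service thread batches whatever has accumulated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical per-executor task queue with non-blocking submission.
class PerExecutorQueue<T> {
    private final Queue<T> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean();

    // Called from transport threads: lock-free enqueue. Returns true when the
    // caller is the first submitter and should schedule a drain of this queue.
    boolean submit(T task) {
        tasks.add(task);
        return scheduled.compareAndSet(false, true);
    }

    // Called on the master service thread: drain everything queued so far into
    // one batch. A production version would need to re-check the queue after
    // clearing the flag and reschedule if tasks raced in (elided for brevity).
    List<T> drainBatch() {
        List<T> batch = new ArrayList<>();
        for (T t; (t = tasks.poll()) != null; ) {
            batch.add(t);
        }
        scheduled.set(false);
        return batch;
    }
}
```

The key property is that `submit` only touches a ConcurrentLinkedQueue and an AtomicBoolean, so a thundering herd of retries costs CAS loops rather than monitor contention on the transport workers.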

Relates #77466

Labels

:Distributed Coordination/Cluster Coordination, >bug, Team:Distributed (Obsolete)
