Slow recovery of write availability after partition of a large cluster #28920

Closed
@djjsindy

We have a very large cluster with 128 nodes and many indices. There are about 20,000 shards: 10,000 are primaries and the rest are replicas. Each primary and its replica are located in different racks. Write operations are running at all times. During a network partition, writes are blocked because each one has to wait for the cluster state that marks the replica shard as failed to be committed. Write availability takes about 10 minutes or longer to recover.

In my opinion, the slow recovery of writes is caused by the following three factors:

  1. Each node's disconnect is detected independently. In the network partition scenario, 64 nodes disconnect. Because of how cluster state updates are batched, the first cluster state update removes only the first disconnected node. The prepare and commit phases of this cluster state must time out, because it is still sent to the remaining 63 disconnected nodes.
  2. Too many failed shards make the task summary's toString very slow. With 20,000 failed shards, computing the task summary takes about 15 seconds.
  3. Shard-failed requests for the same shard, with the same primary term and the same allocationId, are not deduplicated, because ShardEntry does not override the equals and hashCode methods.
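
To illustrate factor 3, here is a minimal sketch of value-based equality for shard-failed requests. `FailedShardKey` and `ShardFailedDedup` are hypothetical names used only for this example; this is not the real Elasticsearch `ShardEntry` class, and the field set is an assumption based on the description above (shard, primary term, allocationId).

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a shard-failed request (NOT the real ShardEntry).
final class FailedShardKey {
    final String index;
    final int shardId;
    final String allocationId;
    final long primaryTerm;

    FailedShardKey(String index, int shardId, String allocationId, long primaryTerm) {
        this.index = Objects.requireNonNull(index);
        this.shardId = shardId;
        this.allocationId = Objects.requireNonNull(allocationId);
        this.primaryTerm = primaryTerm;
    }

    // Two requests are equal when they name the same shard copy and primary term.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FailedShardKey)) return false;
        FailedShardKey other = (FailedShardKey) o;
        return shardId == other.shardId
            && primaryTerm == other.primaryTerm
            && index.equals(other.index)
            && allocationId.equals(other.allocationId);
    }

    // Must be consistent with equals so hash-based collections can deduplicate.
    @Override
    public int hashCode() {
        return Objects.hash(index, shardId, allocationId, primaryTerm);
    }
}

final class ShardFailedDedup {
    // Collapses duplicate requests while keeping first-seen order.
    static Set<FailedShardKey> dedupe(List<FailedShardKey> requests) {
        return new LinkedHashSet<>(requests);
    }
}
```

Without the two overrides, the set would fall back to identity comparison and keep every duplicate request.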

In my scenario, I tried optimizations based on the three factors above. Write recovery time dropped from 10 minutes to less than 1 minute, so they seem to work.
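
For factor 2, one possible way to bound the cost of the summary is to describe only the first few tasks and count the rest. This is a sketch of that idea, not necessarily the optimization that was applied here; `TaskSummary.describe` and its `limit` parameter are hypothetical and not part of Elasticsearch.

```java
import java.util.List;

final class TaskSummary {
    // Hypothetical helper: joins at most `limit` task descriptions and
    // reports how many were left out, so the cost no longer grows with
    // the total number of failed shards.
    static String describe(List<String> tasks, int limit) {
        StringBuilder sb = new StringBuilder();
        int shown = Math.min(Math.max(limit, 0), tasks.size());
        for (int i = 0; i < shown; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(tasks.get(i));
        }
        int omitted = tasks.size() - shown;
        if (omitted > 0) {
            if (shown > 0) {
                sb.append(", ");
            }
            sb.append("... (").append(omitted).append(" more)");
        }
        return sb.toString();
    }
}
```

With three tasks `a`, `b`, `c` and a limit of 2, the summary is `a, b, ... (1 more)`.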

Could you please check whether these three factors can be improved?

Elasticsearch version (bin/elasticsearch --version):
5.3.1
Plugins installed: []

JVM version (java -version):
1.8.0_112
OS version (uname -a if on a Unix-like system):
2.6.32-220.23.2.xxxxx.el6.x86_64
Description of the problem including expected versus actual behavior:
Write operations are running at all times. During a network partition, writes are blocked because each one has to wait for the shard-failed cluster state to be committed. Write availability takes about 10 minutes or longer to recover.
Expected behavior: write availability recovers faster.
Steps to reproduce:

my config:
cluster.routing.allocation.awareness.attributes: rack_id
node.attr.rack_id: xxxxx
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts:
     - xxxxx:9300
     - xxxxx:9300
     - xxxxx:9300
cluster.routing.allocation.awareness.force.rack_id: xxxx

network partition operation:
sudo iptables -D INPUT 1 ;sudo iptables -D OUTPUT 1 ;sudo iptables -L -n

Provide logs (if relevant):
