Waiting for all shards to be active after a cluster restart may never be possible for a shrink step #35321

Closed
@dakrone

Consider the following scenario:

An index with at least 1 replica is just about to start its Shrink step, so it does the following:

  1. sets the index to read-only
  2. sets the index to be allocated only on node_id:123XYZ
  3. waits for a copy of each shard on node_id:123XYZ
  4. performs the shrink step
  5. etc

If the user restarts the cluster after step 2 has completed but before step 3 is done, then when the cluster comes back up the replicas for the index cannot be allocated, because the _id allocation filter set in step 2 restricts the index to a single node. As a result, step 3 never passes, because of this check:

    if (ActiveShardCount.ALL.enoughShardsActive(clusterState, index.getName()) == false) {
        logger.debug("[{}] shrink action for [{}] cannot make progress because not all shards are active",
            getKey().getAction(), index.getName());
        return new Result(false, new CheckShrinkReadyStep.Info("", expectedShardCount, -1));
    }

The step then perpetually reports the following step info:

    "test-000039" : {
      "step" : "check-shrink-allocation",
      "step_time" : "2018-11-06T22:54:39.805Z",
      "step_time_millis" : 1541544879805,
      "step_info" : {
        "message" : "Waiting for all shards to become active",
        "node_id" : "",
        "shards_left_to_allocate" : -1,
        "expected_shards" : 2
      }
    },

Since shrink does not require all copies of the shard to be active, we should remove this check.
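To illustrate why the check quoted above blocks forever while a per-node check would not, here is a minimal, self-contained sketch. It does not use Elasticsearch classes: `ShardCopy`, `allCopiesActive`, and `oneCopyPerShardOnNode` are hypothetical names invented for this example, and the two-shard, one-replica layout is an assumption chosen to resemble the scenario described.

```java
import java.util.List;

public class ShrinkReadinessSketch {

    /** Hypothetical, simplified stand-in for one shard copy in the cluster state. */
    record ShardCopy(int shardId, boolean primary, String nodeId, boolean active) {}

    /** Strict check, similar in spirit to requiring ALL active shards: every copy must be active. */
    static boolean allCopiesActive(List<ShardCopy> copies) {
        return copies.stream().allMatch(ShardCopy::active);
    }

    /** Relaxed check: each shard only needs one active copy on the target node. */
    static boolean oneCopyPerShardOnNode(List<ShardCopy> copies, String nodeId, int numberOfShards) {
        long shardsOnNode = copies.stream()
            .filter(c -> c.active() && nodeId.equals(c.nodeId()))
            .map(ShardCopy::shardId)
            .distinct()
            .count();
        return shardsOnNode == numberOfShards;
    }

    public static void main(String[] args) {
        String target = "123XYZ";
        // Assumed state after the restart: the _id filter allows only the target node,
        // so the primaries are assigned there and the replicas stay unassigned
        // (a replica cannot be placed on the same node as its primary).
        List<ShardCopy> afterRestart = List.of(
            new ShardCopy(0, true, target, true),
            new ShardCopy(1, true, target, true),
            new ShardCopy(0, false, null, false),
            new ShardCopy(1, false, null, false));

        System.out.println("all copies active: " + allCopiesActive(afterRestart));
        System.out.println("one copy per shard on target: "
            + oneCopyPerShardOnNode(afterRestart, target, 2));
    }
}
```

In this modeled state the strict check is false and the relaxed check is true, which matches the argument that the shrink could proceed even though the replicas can never become active.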
