Stop unnecessary retries of shard-started tasks #81628

@DaveCTurner


When a data node finishes recovering a shard it notifies the master to move it to state STARTED. Today we repeat this request every time we receive a cluster state that hasn't updated the shard state yet:

    if (shardRouting.initializing() && (state == IndexShardState.STARTED || state == IndexShardState.POST_RECOVERY)) {
        // the master thinks we are initializing, but we are already started or on POST_RECOVERY and waiting
        // for master to confirm a shard started message (either master failover, or a cluster event before
        // we managed to tell the master we started), mark us as started
        if (logger.isTraceEnabled()) {
            logger.trace("{} master marked shard as initializing, but shard has state [{}], resending shard started to {}",
                shardRouting.shardId(), state, nodes.getMasterNode());
        }
        if (nodes.getMasterNode() != null) {
            shardStateAction.shardStarted(
                shardRouting,
                primaryTerm,
                "master " + nodes.getMasterNode() + " marked shard as initializing, but shard state is [" + state +
                    "], mark shard as started",
                shard.getTimestampRange(),
                SHARD_STATE_ACTION_LISTENER,
                clusterState);
        }
    }

This behaviour means that if the master is busy processing other URGENT tasks (potentially thousands of them) then we'll submit the same task repeatedly (potentially thousands of times). The retry logic dates back a long time but is no longer necessary: we can trust that the master will process our original request first (or we get notified that it failed). We should stop sending these unnecessary retries.
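One way to picture the proposed behaviour is a small guard that remembers which shard-started requests are still outstanding and skips resends until the master has answered (or the request has failed). The following is only a minimal sketch under stated assumptions, not the change made in Elasticsearch: `ShardStartedDeduplicator`, `Sender`, `sendIfAbsent`, and the use of an allocation ID string as the key are all hypothetical names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: suppresses duplicate shard-started requests while one is in flight. */
final class ShardStartedDeduplicator {

    /**
     * Hypothetical stand-in for the call that notifies the master.
     * Implementations must run onCompletion once the request succeeds or fails.
     */
    interface Sender {
        void send(String allocationId, Runnable onCompletion);
    }

    // Keys of requests that were sent and have not completed yet.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Sends the request unless one for the same key is pending; returns true if it was sent. */
    boolean sendIfAbsent(String allocationId, Sender sender) {
        if (!inFlight.add(allocationId)) {
            return false; // an earlier request is still outstanding, so skip the resend
        }
        try {
            // Clearing the key on completion allows a later retry, e.g. after a reported failure.
            sender.send(allocationId, () -> inFlight.remove(allocationId));
        } catch (RuntimeException e) {
            inFlight.remove(allocationId); // sending itself failed, so nothing is outstanding
            throw e;
        }
        return true;
    }

    public static void main(String[] args) {
        ShardStartedDeduplicator deduplicator = new ShardStartedDeduplicator();
        List<Runnable> pendingCompletions = new ArrayList<>();
        AtomicInteger sent = new AtomicInteger();

        // Simulated master that records requests but does not answer straight away.
        Sender slowMaster = (id, onCompletion) -> {
            sent.incrementAndGet();
            pendingCompletions.add(onCompletion);
        };

        // Three cluster states arrive before the master has processed the first request.
        for (int i = 0; i < 3; i++) {
            deduplicator.sendIfAbsent("alloc-1", slowMaster);
        }
        System.out.println("requests sent while pending: " + sent.get());

        // The master answers (or reports a failure); a later request may go out again.
        pendingCompletions.remove(0).run();
        deduplicator.sendIfAbsent("alloc-1", slowMaster);
        System.out.println("requests sent after completion: " + sent.get());
    }
}
```

In this sketch the three simulated cluster-state updates produce a single request, and a new request is only possible once the first one has completed. The key is cleared on both success and failure, so a failed notification can still be retried when the next cluster state arrives.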

Relates #77466

Metadata

Labels

:Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source.)
>bug
Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.)
