Retry ILM steps when transient or recoverable errors are encountered #48183

This is a meta-issue to track and discuss which ILM steps should be retryable, and under which circumstances. It relates to the effort to make the rollover action retryable (#44135) and to the more general strategy ILM will employ to make actions more resilient and self-healing (https://github.com/elastic/elasticsearch/issues/42824).
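
To make the discussion concrete, here is a minimal sketch of bounded retries around a step. All names (`TransientStepException`, `RetryableStep`, `StepRunner`) are hypothetical and heavily simplified; they are not the actual ILM classes, and the real implementation may schedule retries quite differently.

```java
/** Hypothetical marker for failures worth retrying; not an Elasticsearch class. */
class TransientStepException extends RuntimeException {
    TransientStepException(String message) {
        super(message);
    }
}

/** Hypothetical, heavily simplified stand-in for an ILM step. */
interface RetryableStep {
    /** Performs the step's work, throwing TransientStepException on a recoverable failure. */
    void execute();

    /** Whether this step is safe to run again after a failure. */
    boolean isRetryable();
}

/** Runs a step, retrying transient failures up to a fixed number of attempts. */
final class StepRunner {
    private final int maxAttempts;

    StepRunner(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /**
     * Returns the number of attempts used. Rethrows the transient failure when the
     * step is not retryable or the attempts are exhausted; any other exception
     * propagates immediately without a retry.
     */
    int run(RetryableStep step) {
        for (int attempt = 1; ; attempt++) {
            try {
                step.execute();
                return attempt;
            } catch (TransientStepException e) {
                if (!step.isRetryable() || attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

In this sketch only failures explicitly marked as transient are retried, so a step that fails for a permanent reason still surfaces its error on the first attempt.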

Below are all the steps we use, grouped by action. Since we will likely not treat a step differently depending on which action it occurs in, each step is listed only once, under the first action (in alphabetical order) that uses it. The marker Terminal/Error steps are not listed.

# Steps
- [x] InitializePolicyContextStep - Initializes the LifecycleExecutionState for an index. This should be the first Step called on an index

## AllocateAction
- [x] AllocationRoutedStep - Checks whether all shards have been correctly routed in response to an update to the allocation rules for an index
- [x] UpdateSettingsStep (also used in ForceMergeAction, ReadOnlyAction, RolloverAction, SetPriorityAction and ShrinkAction)

## DeleteAction
- [x] WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
- [x] DeleteStep - deletes a single index

## ForceMergeAction
- [x] CloseIndexStep
- [x] OpenIndexStep
- [x] ForceMergeStep - Invokes a force merge on a single index
- [x] SegmentCountStep - evaluates whether force_merge was successful by checking the segment count
- [x] WaitForIndexColorStep

## FreezeAction
- [x] FreezeStep - freezes an index

## RolloverAction
- [x] CheckNotDataStreamWriteIndexStep - This step checks if the managed index is part of a data stream, in which case it will check it's not the write index
- [x] WaitForRolloverReadyStep - Waits for at least one rollover condition to be satisfied, using the Rollover API's dry_run option
- [x] RolloverStep - Unconditionally rolls over an index using the Rollover API
- [x] WaitForActiveShardsStep - Waits for the shards of the newly created index to become active
- [x] UpdateRolloverLifecycleDateStep - Copies the lifecycle reference date to a new index created by rolling over an alias.
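
As an illustration of the "at least one rollover condition" check described for WaitForRolloverReadyStep, here is a minimal sketch. The class `RolloverReadiness` and the shape of its input (a map from condition name to whether it was met, as a dry run might report) are assumptions for illustration, not the actual step code.

```java
import java.util.Map;

/** Hypothetical helper; not part of Elasticsearch. */
final class RolloverReadiness {
    private RolloverReadiness() {}

    /**
     * Returns true when at least one condition is reported as met.
     * An empty map yields false in this sketch.
     */
    static boolean isReady(Map<String, Boolean> conditionResults) {
        return conditionResults.values().stream().anyMatch(Boolean.TRUE::equals);
    }
}
```

Because the check only reads the reported results and changes nothing, repeating it after a transient failure should be harmless.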

## ShrinkAction
- [x] BranchingStep - This step changes its `getNextStepKey()` depending on the outcome of a defined predicate. It performs no changes to the cluster state
- [x] WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
- [x] SetSingleNodeAllocateStep - Allocates all shards in a single index to one node
- [x] CheckShrinkReadyStep - used prior to running a shrink step in order to ensure that the index being shrunk has a copy of each shard allocated on one particular node (the node used by the require parameter) and that the shards are not relocating
- [x] ShrinkStep - Shrinks an index, using a prefix prepended to the original index name for the name of the shrunken index
- [x] ShrunkShardsAllocatedStep - Checks whether all shards in a shrunken index have been successfully allocated
- [x] CopyExecutionStateStep - Copies the execution state data from one index to another
- [x] ShrinkSetAliasStep - Following shrinking an index and deleting the original index, this step creates an alias with the same name as the original index which points to the new shrunken index
- [x] ShrunkenIndexCheckStep - Verifies that an index was created through a shrink operation, rather than created some other way
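
The branching behaviour described for BranchingStep can be sketched as follows. `BranchingStepSketch` is a hypothetical simplification that uses plain strings as step keys and a generic state type; the real class works on the cluster state and has a different signature.

```java
import java.util.function.Predicate;

/** Hypothetical simplification of a step whose next step depends on a predicate. */
final class BranchingStepSketch<T> {
    private final String nextStepKeyOnFalse;
    private final String nextStepKeyOnTrue;
    private final Predicate<T> predicate;
    private Boolean outcome; // null until the predicate has been evaluated

    BranchingStepSketch(String nextStepKeyOnFalse, String nextStepKeyOnTrue, Predicate<T> predicate) {
        this.nextStepKeyOnFalse = nextStepKeyOnFalse;
        this.nextStepKeyOnTrue = nextStepKeyOnTrue;
        this.predicate = predicate;
    }

    /** Evaluates the predicate against the given state; the state itself is not modified. */
    void evaluate(T state) {
        outcome = predicate.test(state);
    }

    /** Returns the key of the step to run next, based on the last evaluation. */
    String getNextStepKey() {
        if (outcome == null) {
            throw new IllegalStateException("predicate has not been evaluated yet");
        }
        return outcome ? nextStepKeyOnTrue : nextStepKeyOnFalse;
    }
}
```

Since evaluating the predicate has no side effects in this sketch, running it again after a failure gives the same decision for the same state.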

## UnfollowAction
- [x] WaitForIndexingCompleteStep
- [x] WaitForFollowShardTasksStep
- [x] PauseFollowerIndexStep
- [x] CloseFollowerIndexStep
- [x] UnfollowFollowerIndexStep
- [x] OpenFollowerIndexStep


## Scope

Any action or step that can be retried after a failure.

## Duration

~ 2 months
