Retry ILM steps when transient or recoverable errors are encountered  #48183

Closed
@andreidan

This is a meta-issue to track and discuss which ILM steps should be retryable, and under which circumstances. It relates to the effort to make the rollover action retryable (#44135) and to the more general strategy ILM will employ to make actions more resilient and self-healing (#42824).

Below are all the steps we use, grouped by action. Because we are unlikely to treat a step differently depending on the action it occurs in, steps are listed under the first action (in alphabetical order) that uses them. The marker Terminal/Error steps are not listed.

Steps

  • InitializePolicyContextStep - Initializes the LifecycleExecutionState for an index. This should be the first Step called on an index

AllocateAction

  • AllocationRoutedStep - Checks whether all shards have been correctly routed in response to an update to the allocation rules for an index
  • UpdateSettingsStep (also used in ForceMergeAction, ReadOnlyAction, RolloverAction, SetPriorityAction and ShrinkAction)

DeleteAction

  • WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
  • DeleteStep - Deletes a single index

ForceMergeAction

  • CloseIndexStep
  • OpenIndexStep
  • ForceMergeStep - Invokes a force merge on a single index
  • SegmentCountStep - Evaluates whether force_merge was successful by checking the segment count
  • WaitForIndexColorStep

FreezeAction

  • FreezeStep - Freezes an index

RolloverAction

  • CheckNotDataStreamWriteIndexStep - This step checks if the managed index is part of a data stream, in which case it will check it's not the write index
  • WaitForRolloverReadyStep - Waits for at least one rollover condition to be satisfied, using the Rollover API's dry_run option
  • RolloverStep - Unconditionally rolls over an index using the Rollover API
  • WaitForActiveShardsStep - Waits for the shards of the newly created index to become active
  • UpdateRolloverLifecycleDateStep - Copies the lifecycle reference date to a new index created by rolling over an alias.

ShrinkAction

  • BranchingStep - This step changes its getNextStepKey() depending on the outcome of a defined predicate. It performs no changes to the cluster state
  • WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
  • SetSingleNodeAllocateStep - Allocates all shards in a single index to one node
  • CheckShrinkReadyStep - Used before running a shrink step to ensure that the index being shrunk has a copy of each shard allocated on one particular node (the node used by the require parameter) and that the shards are not relocating
  • ShrinkStep - Shrinks an index, using a prefix prepended to the original index name for the name of the shrunken index
  • ShrunkShardsAllocatedStep - Checks whether all shards in a shrunken index have been successfully allocated
  • CopyExecutionStateStep - Copies the execution state data from one index to another
  • ShrinkSetAliasStep - After an index has been shrunk and the original index deleted, this step creates an alias with the same name as the original index, pointing to the new shrunken index
  • ShrunkenIndexCheckStep - Verifies that an index was created through a shrink operation, rather than created some other way
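
As described above, BranchingStep picks its next step from the outcome of a predicate and does not change the cluster state. The following is a minimal sketch of that idea in Python, not the Elasticsearch implementation: the class layout, the method names, the `dict`-based index metadata and the step keys `"shrink"` and `"skip-shrink"` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BranchingStep:
    """Illustrative stand-in: chooses the next step key from a predicate."""
    next_key_on_false: str
    next_key_on_true: str
    predicate: Callable[[dict], bool]
    _outcome: Optional[bool] = None  # unset until evaluate() has run

    def evaluate(self, index_metadata: dict) -> None:
        # Only reads the metadata and records the outcome; nothing is modified.
        self._outcome = self.predicate(index_metadata)

    def get_next_step_key(self) -> str:
        if self._outcome is None:
            raise RuntimeError("predicate has not been evaluated yet")
        return self.next_key_on_true if self._outcome else self.next_key_on_false


# Hypothetical usage: skip shrinking when the index already has one shard.
step = BranchingStep(
    next_key_on_false="shrink",
    next_key_on_true="skip-shrink",
    predicate=lambda meta: meta["number_of_shards"] == 1,
)
step.evaluate({"number_of_shards": 1})
next_key = step.get_next_step_key()
```

Because a step of this shape only reads state, re-running it after a failure should be safe, which is one reason it may be an easy candidate for retrying.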

UnfollowAction

  • WaitForIndexingCompleteStep
  • WaitForFollowShardTasksStep
  • PauseFollowerIndexStep
  • CloseFollowerIndexStep
  • UnfollowFollowerIndexStep
  • OpenFollowerIndexStep

Scope

Any action or step that can be retried after a failure.
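
To make the discussion concrete, here is a minimal sketch of what retrying a step on a transient error could look like. It is written in Python for brevity and is not how ILM implements retries: `TransientStepError`, `run_step_with_retries`, the attempt limit and the exponential backoff are all hypothetical choices made for illustration.

```python
import time
from typing import Callable


class TransientStepError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def run_step_with_retries(
    step: Callable[[], None],
    max_attempts: int = 3,
    base_delay_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `step`; return "completed" on success, "error" otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            step()
            return "completed"
        except TransientStepError:
            if attempt == max_attempts:
                break  # retries exhausted, fall through to the error state
            # Exponential backoff between attempts (assumed policy).
            sleep(base_delay_s * 2 ** (attempt - 1))
        except Exception:
            # Anything not marked transient is treated as non-recoverable.
            return "error"
    return "error"
```

The key design question this issue tracks is the classification itself: which failures of which steps count as transient, and which should still move the index to the error state right away.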

Duration

~ 2 months
