ILM retryable steps #48010


Closed

andreidan wants to merge 3 commits

Conversation

andreidan
Contributor

Sketch for retrying failed ILM steps.

This is very much a draft meant to illustrate what identifying retryable steps and attempting the retry would look like if we handled them in the IndexLifecycleRunner.

@dakrone
Member

dakrone commented Oct 14, 2019

@andreidan I think this goes a little too deep down the rabbit hole with the implementation. We should start with a minimum workable version and go from there. My idea of how this would work is:

  • Add the isRetryable setting on Step like you have, defaulting it to false.
  • Flip this to true for a subset of steps, for instance starting with just the rollover steps, since rollover is the most crucial step that needs to be retried.

Then, when we notice that the index is in the ERROR step, we move it back to the step that failed using the existing moveClusterStateToFailedStep. We don't do any actual execution at the point when we move the step back.
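
As a rough illustration of the shape of this, here is a minimal, self-contained sketch; apart from Step, WaitForRolloverReadyStep, and moveClusterStateToFailedStep (which the thread already mentions), the names are simplified stand-ins, not the real ILM classes:

```java
// Simplified sketch: a per-step retryable flag, defaulting to false, flipped on
// for the rollover step, and a runner hook that only moves the cluster state
// back to the failed step without executing anything.
abstract class Step {
    /** Steps are not retryable unless a subclass explicitly opts in. */
    public boolean isRetryable() {
        return false;
    }
}

class WaitForRolloverReadyStep extends Step {
    /** Rollover is the first step to opt in to automatic retries. */
    @Override
    public boolean isRetryable() {
        return true;
    }
}

class IndexLifecycleRunnerSketch {
    /** Hypothetical hook invoked when the runner notices an index sitting in the ERROR step. */
    void onErrorStep(String index, Step failedStep) {
        if (failedStep.isRetryable()) {
            // Only move the cluster state back to the step that failed; no execution
            // happens here. The step runs again on the next poll interval or
            // cluster state update.
            moveClusterStateToFailedStep(index);
        }
    }

    void moveClusterStateToFailedStep(String index) {
        // Placeholder for the existing cluster state update mentioned above.
    }
}
```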

This leaves us with something that works for some of the steps, but doesn't solve the AsyncActionStep problem (since those are not triggered by either cluster state updates or periodic intervals).

For now though, I think we can add the retry only to the WaitForRolloverReadyStep as a start. We should then open a meta-issue where we can discuss which steps should be retryable.

I also don't think we need the exponential back-off (at least not in the first iteration), since the poll interval and the cluster state updates already space out the retries a bit.

I also think we need to expose that ILM is on a retried step somewhere in the explain API, as well as put a message in the logs about moving out of the ERROR step because the step is retryable.
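
For illustration only, a sketch of what that surfacing could look like; the log wording and the field names (is_auto_retryable, failed_step_retry_count) are invented for this example, not the real explain response format:

```java
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch of the two surfacing points: a log line when leaving the
// ERROR step because the step is retryable, and extra fields for the explain API.
class RetrySurfacing {
    private static final Logger logger = Logger.getLogger(RetrySurfacing.class.getName());

    /** Log the automatic retry and return the extra fields an explain response could carry. */
    static Map<String, Object> onAutomaticRetry(String index, String failedStep, int retryCount) {
        logger.info("index [" + index + "] failed on step [" + failedStep
                + "]; step is retryable, moving back out of the ERROR step (retry " + retryCount + ")");
        return Map.of(
                "is_auto_retryable", true,
                "failed_step_retry_count", retryCount);
    }
}
```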

@andreidan @gwbrown - thoughts?

@gwbrown
Contributor

gwbrown commented Oct 14, 2019

I agree with @dakrone that I don't think we need the exponential backoff at this point - waiting for the poll interval and/or waiting on cluster state updates is probably good enough. Given that, I think we can keep any retry metadata in the LifecycleExecutionState.

(Aside: now that I think about it, we could implement exponential backoff that way too, without having to keep local state: if we keep the retry count in the LifecycleExecutionState, we already have the timestamp of the most recent failure, so it's just a bit of math to figure out the next time it should retry. I still don't think we need it for a first cut.)
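
For example, a self-contained sketch of that math; the base delay, the cap, and the method names are arbitrary assumptions for illustration:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: exponential backoff derived purely from state that is already
// persisted (the retry count and the timestamp of the most recent failure),
// with no local in-memory state.
final class RetryBackoff {
    private static final Duration BASE = Duration.ofMinutes(1);
    private static final Duration CAP = Duration.ofHours(1);

    /** Earliest time the next retry should be attempted. */
    static Instant nextRetryTime(Instant lastFailure, int retryCount) {
        long multiplier = 1L << Math.min(retryCount, 10); // 2^retryCount, capped to avoid overflow
        Duration delay = BASE.multipliedBy(multiplier);
        return lastFailure.plus(delay.compareTo(CAP) > 0 ? CAP : delay);
    }

    /** A periodic check (e.g. on the poll interval) retries only once this returns true. */
    static boolean shouldRetryNow(Instant lastFailure, int retryCount, Instant now) {
        return !now.isBefore(nextRetryTime(lastFailure, retryCount));
    }
}
```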

Regarding a concern you brought up elsewhere about having to parse the exception back out of the serialized version: one way to avoid that would be to make the decision at the time we move into the error step in the first place. For example, we could add a Boolean automaticallyRetryable field to LifecycleExecutionState and set it when we move into the error state based on some logic - for now a boolean isRetryable() method on Step, perhaps something like a boolean isRetryable(Exception ex) method in the future. We could also easily keep a retry count in the LifecycleExecutionState at that point, and it shouldn't add additional cluster state update load since we're already doing a cluster state update to move the index to the error step.
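
A rough sketch of that idea; ExecutionStateSketch stands in for LifecycleExecutionState, and the field names plus the exception-aware isRetryable(Exception) variant are assumptions rather than existing APIs:

```java
// Sketch of deciding retryability at error-transition time, inside the same
// cluster state update that moves the index into ERROR.
interface RetryableStep {
    boolean isRetryable();                       // per-step flag, as in the earlier sketch

    default boolean isRetryable(Exception ex) {  // possible future, exception-aware refinement
        return isRetryable();
    }
}

record ExecutionStateSketch(String step, Boolean automaticallyRetryable, int retryCount) {}

class ErrorTransition {
    /** Runs as part of the same cluster state update that moves the index into ERROR. */
    static ExecutionStateSketch moveToError(ExecutionStateSketch current, RetryableStep failedStep, Exception failure) {
        // Decide once, here, rather than re-parsing the serialized exception later.
        boolean retryable = failedStep.isRetryable(failure);
        return new ExecutionStateSketch("ERROR", retryable, current.retryCount() + 1);
    }
}
```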

@andreidan
Contributor Author

Thanks for your input @dakrone @gwbrown. I think moving the cluster state to the failed step only, without explicitly executing the step, and using the LifecycleExecutionState to keep track of retries (and possibly of whether the step is retryable, given the source exception) are better ways to move this forward.

@alpar-t added the WIP label Oct 18, 2019
@andreidan
Contributor Author

Superseded by #48256
