ILM retryable steps #48010


Closed

andreidan wants to merge 3 commits

Conversation

andreidan
Contributor

Sketch for retrying failed ILM steps.

This is very much a draft meant to illustrate what identifying retryable steps and attempting the retry would look like if we handled them in the IndexLifecycleRunner.

@dakrone
Member

dakrone commented Oct 14, 2019

@andreidan I think this goes a little too deep down the rabbit hole with the implementation. We should start with a minimum workable version and go from there. My idea of how this would work is:

  • Add the isRetryable setting on Step like you have, defaulting it to false.
  • Flip this to true for a subset of steps, for instance starting with just the rollover steps, since rollover is the most crucial step that needs to be retried.

Then, when we notice that the index is in the ERROR step, we move it back to the step that failed using the existing moveClusterStateToFailedStep. We don't do any actual execution at the point when we move the step back.
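
As a rough illustration of the shape of this, here is a minimal, self-contained sketch; apart from Step, WaitForRolloverReadyStep, and moveClusterStateToFailedStep (which the thread already mentions), the names are simplified stand-ins, not the real ILM classes:

```java
// Simplified sketch: a per-step retryable flag, defaulting to false, flipped on
// for the rollover step, and a runner hook that only moves the cluster state
// back to the failed step without executing anything.
abstract class Step {
    /** Steps are not retryable unless a subclass explicitly opts in. */
    public boolean isRetryable() {
        return false;
    }
}

class WaitForRolloverReadyStep extends Step {
    /** Rollover is the first step to opt in to automatic retries. */
    @Override
    public boolean isRetryable() {
        return true;
    }
}

class IndexLifecycleRunnerSketch {
    /** Hypothetical hook invoked when the runner notices an index sitting in the ERROR step. */
    void onErrorStep(String index, Step failedStep) {
        if (failedStep.isRetryable()) {
            // Only move the cluster state back to the step that failed; no execution
            // happens here. The step runs again on the next poll interval or
            // cluster state update.
            moveClusterStateToFailedStep(index);
        }
    }

    void moveClusterStateToFailedStep(String index) {
        // Placeholder for the existing cluster state update mentioned above.
    }
}
```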

This leaves us with something that works for some of the steps, but doesn't solve the AsyncActionStep problem (since those are not triggered by either cluster state updates or periodic intervals).

For now though, I think we can add the retry only to the WaitForRolloverReadyStep as a start. We should then open a meta-issue where we can discuss which steps should be retryable.

I also don't think we need the exponential back-off (at least not in the first iteration), since the poll interval and the cluster state updates already space out the retries a bit.

I also think we need to expose that ILM is on a retried step somewhere in the explain API, as well as put a message in the logs about moving out of the ERROR step because the step is retryable.
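
For illustration only, a sketch of what that surfacing could look like; the log wording and the field names (is_auto_retryable, failed_step_retry_count) are invented for this example, not the real explain response format:

```java
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch of the two surfacing points: a log line when leaving the
// ERROR step because the step is retryable, and extra fields for the explain API.
class RetrySurfacing {
    private static final Logger logger = Logger.getLogger(RetrySurfacing.class.getName());

    /** Log the automatic retry and return the extra fields an explain response could carry. */
    static Map<String, Object> onAutomaticRetry(String index, String failedStep, int retryCount) {
        logger.info("index [" + index + "] failed on step [" + failedStep
                + "]; step is retryable, moving back out of the ERROR step (retry " + retryCount + ")");
        return Map.of(
                "is_auto_retryable", true,
                "failed_step_retry_count", retryCount);
    }
}
```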

@andreidan @gwbrown - thoughts?

@gwbrown
Contributor

gwbrown commented Oct 14, 2019

I agree with @dakrone that I don't think we need the exponential backoff at this point - waiting for the poll interval and/or waiting on cluster state updates is probably good enough. Given that, I think we can keep any retry metadata in the LifecycleExecutionState.

(Aside: now that I think about it, we could implement exponential backoff that way too, without having to keep local state: if we keep the retry count in the LifecycleExecutionState, we already have the timestamp of the most recent failure, so it's just a bit of math to figure out the next time it should retry. I still don't think we need it for a first cut.)
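
For example, a self-contained sketch of that math; the base delay, the cap, and the method names are arbitrary assumptions for illustration:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: exponential backoff derived purely from state that is already
// persisted (the retry count and the timestamp of the most recent failure),
// with no local in-memory state.
final class RetryBackoff {
    private static final Duration BASE = Duration.ofMinutes(1);
    private static final Duration CAP = Duration.ofHours(1);

    /** Earliest time the next retry should be attempted. */
    static Instant nextRetryTime(Instant lastFailure, int retryCount) {
        long multiplier = 1L << Math.min(retryCount, 10); // 2^retryCount, capped to avoid overflow
        Duration delay = BASE.multipliedBy(multiplier);
        return lastFailure.plus(delay.compareTo(CAP) > 0 ? CAP : delay);
    }

    /** A periodic check (e.g. on the poll interval) retries only once this returns true. */
    static boolean shouldRetryNow(Instant lastFailure, int retryCount, Instant now) {
        return !now.isBefore(nextRetryTime(lastFailure, retryCount));
    }
}
```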

Regarding a concern you brought up elsewhere about having to parse the exception back out of the serialized version: one way to avoid that would be to make the decision at the time we move into the error step in the first place. For example, we could add a Boolean automaticallyRetryable field to LifecycleExecutionState and set it when we move into the error state based on some logic - for now a boolean isRetryable() method on Step, perhaps something like a boolean isRetryable(Exception ex) method in the future. We could also easily keep a retry count in the LifecycleExecutionState at that point, and it shouldn't add additional cluster state update load since we're already doing a cluster state update to move the index to the error step.
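
A rough sketch of that idea; ExecutionStateSketch stands in for LifecycleExecutionState, and the field names plus the exception-aware isRetryable(Exception) variant are assumptions rather than existing APIs:

```java
// Sketch of deciding retryability at error-transition time, inside the same
// cluster state update that moves the index into ERROR.
interface RetryableStep {
    boolean isRetryable();                       // per-step flag, as in the earlier sketch

    default boolean isRetryable(Exception ex) {  // possible future, exception-aware refinement
        return isRetryable();
    }
}

record ExecutionStateSketch(String step, Boolean automaticallyRetryable, int retryCount) {}

class ErrorTransition {
    /** Runs as part of the same cluster state update that moves the index into ERROR. */
    static ExecutionStateSketch moveToError(ExecutionStateSketch current, RetryableStep failedStep, Exception failure) {
        // Decide once, here, rather than re-parsing the serialized exception later.
        boolean retryable = failedStep.isRetryable(failure);
        return new ExecutionStateSketch("ERROR", retryable, current.retryCount() + 1);
    }
}
```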

@andreidan
Contributor Author

Thanks for your input @dakrone @gwbrown. I think moving the cluster state to the failed step only, without explicitly executing the step, and using the LifecycleExecutionState to keep track of retries (and possibly of whether the step is retryable, given the source exception) are better ways to move this forward.

@alpar-t added the WIP label Oct 18, 2019
@andreidan
Contributor Author

Superseded by #48256
