Skip to content

Change step execution flow to be deliberate about type #34126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

dakrone
Copy link
Member

@dakrone dakrone commented Sep 27, 2018

This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.

AsyncWaitStep is run at a periodic manner, ie, every 10 minutes by default
ClusterStateActionStep and ClusterStateWaitStep are run every time the
cluster state changes.
AsyncActionStep is now run only after the cluster state has been transitioned
into a new step. This prevents these non-idempotent steps from running at the
same time. It addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.

This also changes the RolloverStep from an AsyncActionStep to an
AsyncWaitStep so that it can run periodically.

Relates to #29823

This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.

`AsyncWaitStep` is run at a periodic manner, ie, every 10 minutes by default
`ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the
cluster state changes.
`AsyncActionStep` is now run only after the cluster state has been transitioned
into a new step. This prevents these non-idempotent steps from running at the
same time. It addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.

This also changes the `RolloverStep` from an `AsyncActionStep` to an
`AsyncWaitStep` so that it can run periodically.

Relates to elastic#29823
@dakrone dakrone added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Sep 27, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@dakrone
Copy link
Member Author

dakrone commented Sep 27, 2018

I checked and this does fix the issues @talevy was seeing in #33402

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dakrone I left some comments

@@ -89,4 +91,13 @@ public boolean equals(Object obj) {
Objects.equals(maxDocs, other.maxDocs);
}

// TODO: expand the information we provide?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any information from the current RolloverResponse that would be helpful here. The closest is the conditionStatus but since this is a Map<String, Boolean> it doesn't provide anything useful when the index is not rolled over because all the conditions will be false. Maybe we should open an issue for discussion suggesting to add more information to the conditionStatus in the RolloverResponse so it exposes the age, number of documents, and store size it found?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open a separate issue for that, good idea.

@@ -88,6 +92,7 @@ public ClusterState execute(ClusterState currentState) throws IOException {
if (currentStep.getNextStepKey() == null) {
return currentState;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not need to set nextStepKey to null here so in the clusterStateProcessed method we don't end up doing the wrong thing? I think for now it would actually work because the nextStepKey would never be an AsyncAction in this scenario but probably worth correcting this so bugs don't creep in the future becuase the nextStepKey is still pointing at the previous step?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextStepKey starts out by being null, so this will already do the right thing. I'll add an explicit initialization to null in the method though, so it's more apparent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this task potentially loops through multiple steps right? So if you have a cluster state step , another cluster state step and then null next step will get step on the first step and then remain set here when the second step's next step is null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good point, I'll fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think the behavior might be okay,

  • CSAS (cluster state action step) executes with nextStep A, moves cluster state to next step (which is also a CSAS)
  • CSAS executes with nextStep null, returns cluster state at that point
  • clusterStateProcessed call is invoked, nextStepKey is A, but A isn't an async action so it's a no-op.

Can you describe the scenario you're worried about in more detail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you say is correct right now, I am worried that if the logic in clusterStateProcessed changes in the future we could inadvertently introduce a bug that will be tricky to track down because the nextStepKey is not actually set to the nextStep at this point. I think we should make it so nextStepKey is correct so we avoid this in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've re-organized this so it is hopefully clearer

@@ -104,6 +109,7 @@ public ClusterState execute(ClusterState currentState) throws IOException {
if (currentStep.getNextStepKey() == null) {
return currentState;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same answer as above :)

logger.warn("current step [" + getCurrentStepKey(lifecycleState) + "] for index [" + indexMetaData.getIndex().getName()
+ "] with policy [" + policy + "] is not recognized");
logger.warn("current step [{}] for index [{}] with policy [{}] is not recognized",
getCurrentStepKey(lifecycleState), index, policy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a change for this PR but this should never happen now right as we aren't caching the steps int eh step registry? So we should probably throw an exception if it does?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get here if the index specifies a lifecycle.name that doesn't exist - I'll be opening a separate PR to handle that case probably today.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can revisit this when we remove the policy steps registry

} else if (currentStep instanceof ClusterStateActionStep || currentStep instanceof ClusterStateWaitStep) {
logger.debug("[{}] running policy with current-step [{}]", indexMetaData.getIndex().getName(), currentStep.getKey());
clusterService.submitStateUpdateTask("ILM",
new ExecuteStepsUpdateTask(policy, indexMetaData.getIndex(), currentStep, stepRegistry, this, nowSupplier));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we said we don't need ExecuteStepsUpdateTask, but rather run ClusterStateWaitStep on the incoming cluster state and if it matches move to the next step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of whether the ClusterStateWaitStep is marked as "complete" or not, we have to issue a cluster state update, so to me we might as well keep it in the midst of the cluster state update?

@dakrone
Copy link
Member Author

dakrone commented Oct 1, 2018

@elasticmachine run the packaging tests please

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment but assuming that is addressed this LGTM

String policyName = LifecycleSettings.LIFECYCLE_NAME_SETTING.get(idxMeta.getSettings());
if (Strings.isNullOrEmpty(policyName) == false) {
StepKey stepKey = IndexLifecycleRunner.getCurrentStepKey(LifecycleExecutionState.fromIndexMetadata(idxMeta));
if (OperationMode.STOPPING == currentMode &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have lost the bit that sets the currentMode to STOPPED if no indices are in the ignore actions list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it below in triggerPolicies() but I think we need it here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I'll update that to add this.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@dakrone dakrone merged commit 388f754 into elastic:index-lifecycle Oct 3, 2018
dakrone added a commit that referenced this pull request Oct 3, 2018
This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.

`AsyncWaitStep` is run at a periodic manner, ie, every 10 minutes by default
`ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the
cluster state changes.
`AsyncActionStep` is now run only after the cluster state has been transitioned
into a new step. This prevents these non-idempotent steps from running at the
same time. It addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.

This also changes the `RolloverStep` from an `AsyncActionStep` to an
`AsyncWaitStep` so that it can run periodically.

Relates to #29823
@dakrone dakrone deleted the ilm-separate-step-execution branch February 4, 2019 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants