Change step execution flow to be deliberate about type #34126

dakrone · 2018-09-27T21:52:20Z

This commit changes the way that step execution flows. Rather than running any
step whenever the cluster state changes or the periodic scheduler fires, it now
runs the different types of steps at different times:

- AsyncWaitStep is run periodically, i.e., every 10 minutes by default.
- ClusterStateActionStep and ClusterStateWaitStep are run every time the
  cluster state changes.
- AsyncActionStep is now run only after the cluster state has been transitioned
  into a new step. This prevents these non-idempotent steps from running at the
  same time. In addition to being run when transitioned into, it is also run
  when a node is newly elected master (only if set as the current step) so that
  a master failover does not prevent the step from running.

This also changes the RolloverStep from an AsyncActionStep to an
AsyncWaitStep so that it can run periodically.
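The split described above can be pictured as a mapping from step type to the events allowed to run it. The following is only a minimal sketch: `StepDispatchSketch`, `StepType`, `Trigger`, and `shouldRun` are hypothetical names invented for illustration, not the classes or methods touched by this PR.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-ins for the step types named in the description;
// these are NOT the real Elasticsearch classes.
class StepDispatchSketch {
    enum StepType { ASYNC_WAIT, CLUSTER_STATE_ACTION, CLUSTER_STATE_WAIT, ASYNC_ACTION }

    // Events that may cause a step to run (names are assumptions).
    enum Trigger { PERIODIC_SCHEDULER, CLUSTER_STATE_CHANGE, STEP_TRANSITION, NEW_MASTER_ELECTED }

    // Which triggers are allowed to run a step of the given type.
    static Set<Trigger> triggersFor(StepType type) {
        switch (type) {
            case ASYNC_WAIT:
                // Polled periodically (every 10 minutes by default per the description).
                return EnumSet.of(Trigger.PERIODIC_SCHEDULER);
            case CLUSTER_STATE_ACTION:
            case CLUSTER_STATE_WAIT:
                // Evaluated whenever the cluster state changes.
                return EnumSet.of(Trigger.CLUSTER_STATE_CHANGE);
            case ASYNC_ACTION:
                // Non-idempotent: run once on transition, or again after master failover.
                return EnumSet.of(Trigger.STEP_TRANSITION, Trigger.NEW_MASTER_ELECTED);
            default:
                throw new IllegalArgumentException("unknown step type: " + type);
        }
    }

    static boolean shouldRun(StepType type, Trigger trigger) {
        return triggersFor(type).contains(trigger);
    }

    public static void main(String[] args) {
        for (StepType type : StepType.values()) {
            System.out.println(type + " runs on " + triggersFor(type));
        }
    }
}
```

Keeping the mapping in one place makes it easy to see that an async action is never started by the periodic scheduler or by an unrelated cluster state change.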

Relates to #29823

elasticmachine · 2018-09-27T21:52:22Z

Pinging @elastic/es-core-infra

dakrone · 2018-09-27T22:18:53Z

I checked and this does fix the issues @talevy was seeing in #33402

colings86

@dakrone I left some comments

colings86 · 2018-09-28T08:01:33Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/indexlifecycle/RolloverStep.java

@@ -89,4 +91,13 @@ public boolean equals(Object obj) {
                Objects.equals(maxDocs, other.maxDocs);
    }

+    // TODO: expand the information we provide?


I don't think there is any information from the current RolloverResponse that would be helpful here. The closest is the conditionStatus but since this is a Map<String, Boolean> it doesn't provide anything useful when the index is not rolled over because all the conditions will be false. Maybe we should open an issue for discussion suggesting to add more information to the conditionStatus in the RolloverResponse so it exposes the age, number of documents, and store size it found?

I'll open a separate issue for that, good idea.

colings86 · 2018-09-28T08:14:36Z

.../plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/ExecuteStepsUpdateTask.java

@@ -88,6 +92,7 @@ public ClusterState execute(ClusterState currentState) throws IOException {
                    if (currentStep.getNextStepKey() == null) {
                        return currentState;


Do we not need to set nextStepKey to null here so that the clusterStateProcessed method doesn't end up doing the wrong thing? I think for now it would actually work, because the nextStepKey would never be an AsyncAction in this scenario, but it is probably worth correcting so that bugs don't creep in later because the nextStepKey is still pointing at the previous step.

nextStepKey starts out by being null, so this will already do the right thing. I'll add an explicit initialization to null in the method though, so it's more apparent.

But this task potentially loops through multiple steps, right? So if you have a cluster state step, another cluster state step, and then a null next step, nextStepKey will get set on the first step and then remain set here when the second step's next step is null?

Oh good point, I'll fix

Actually I think the behavior might be okay:

1. CSAS (cluster state action step) executes with nextStep A and moves the cluster state to the next step (which is also a CSAS).
2. CSAS executes with nextStep null and returns the cluster state at that point.
3. The clusterStateProcessed call is invoked; nextStepKey is A, but A isn't an async action, so it's a no-op.

Can you describe the scenario you're worried about in more detail?

What you say is correct right now. I am worried that if the logic in clusterStateProcessed changes in the future, we could inadvertently introduce a bug that will be tricky to track down, because the nextStepKey is not actually set to the nextStep at this point. I think we should make sure nextStepKey is correct so we avoid this in the future.

Okay, I've re-organized this so it is hopefully clearer
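To make the concern in this thread concrete, here is a minimal sketch of a loop that recomputes nextStepKey on every iteration, so the value seen afterwards always matches the last step that actually ran. `NextStepKeySketch`, `Step`, `Result`, and the string keys are hypothetical and are not the code in ExecuteStepsUpdateTask.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of looping through consecutive cluster state steps.
class NextStepKeySketch {
    static final class Step {
        final String key;
        final String nextKey;            // null when there is no following step
        final boolean clusterStateStep;  // false models an async step

        Step(String key, String nextKey, boolean clusterStateStep) {
            this.key = key;
            this.nextKey = nextKey;
            this.clusterStateStep = clusterStateStep;
        }
    }

    static final class Result {
        final String currentKey;   // step the loop stopped on
        final String nextStepKey;  // what a post-processing callback would see

        Result(String currentKey, String nextStepKey) {
            this.currentKey = currentKey;
            this.nextStepKey = nextStepKey;
        }
    }

    static Result execute(Map<String, Step> registry, String startKey) {
        Step current = registry.get(startKey);
        String nextStepKey = null; // explicit initialization, as discussed in the thread
        while (current != null && current.clusterStateStep) {
            // Overwrite on every iteration so a key from an earlier step never lingers.
            nextStepKey = current.nextKey;
            if (nextStepKey == null) {
                return new Result(current.key, null);
            }
            current = registry.get(nextStepKey);
        }
        return new Result(current == null ? null : current.key, nextStepKey);
    }

    public static void main(String[] args) {
        Map<String, Step> registry = new HashMap<>();
        registry.put("first", new Step("first", "second", true));
        registry.put("second", new Step("second", null, true));
        Result r = execute(registry, "first");
        System.out.println("stopped on " + r.currentKey + ", nextStepKey=" + r.nextStepKey);
    }
}
```

With two consecutive cluster state steps where the second has no successor, the result carries a null nextStepKey rather than the key left over from the first step, which is the property the reviewer asked for.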

colings86 · 2018-09-28T08:14:45Z

.../plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/ExecuteStepsUpdateTask.java

@@ -104,6 +109,7 @@ public ClusterState execute(ClusterState currentState) throws IOException {
                        if (currentStep.getNextStepKey() == null) {
                            return currentState;


same as above

same answer as above :)

colings86 · 2018-09-28T10:12:23Z

...ck/plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleRunner.java

-            logger.warn("current step [" + getCurrentStepKey(lifecycleState) + "] for index [" + indexMetaData.getIndex().getName()
-                + "] with policy [" + policy + "] is not recognized");
+            logger.warn("current step [{}] for index [{}] with policy [{}] is not recognized",
+                getCurrentStepKey(lifecycleState), index, policy);


Probably not a change for this PR, but this should never happen now, right, since we aren't caching the steps in the step registry? So we should probably throw an exception if it does?

We get here if the index specifies a lifecycle.name that doesn't exist - I'll be opening a separate PR to handle that case probably today.

Sure, I can revisit this when we remove the policy steps registry

...k/plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleService.java

bleskes · 2018-09-28T19:31:18Z

...ck/plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleRunner.java

+        } else if (currentStep instanceof ClusterStateActionStep || currentStep instanceof ClusterStateWaitStep) {
+            logger.debug("[{}] running policy with current-step [{}]", indexMetaData.getIndex().getName(), currentStep.getKey());
+            clusterService.submitStateUpdateTask("ILM",
+                new ExecuteStepsUpdateTask(policy, indexMetaData.getIndex(), currentStep, stepRegistry, this, nowSupplier));


I think we said we don't need ExecuteStepsUpdateTask, but rather run ClusterStateWaitStep on the incoming cluster state and if it matches move to the next step.

Regardless of whether the ClusterStateWaitStep is marked as "complete" or not, we have to issue a cluster state update, so to me we might as well keep it in the midst of the cluster state update?

dakrone · 2018-10-01T18:09:09Z

@elasticmachine run the packaging tests please

colings86

I left a comment but assuming that is addressed this LGTM

colings86 · 2018-10-02T18:12:56Z

...k/plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleService.java

+                String policyName = LifecycleSettings.LIFECYCLE_NAME_SETTING.get(idxMeta.getSettings());
+                if (Strings.isNullOrEmpty(policyName) == false) {
+                    StepKey stepKey = IndexLifecycleRunner.getCurrentStepKey(LifecycleExecutionState.fromIndexMetadata(idxMeta));
+                    if (OperationMode.STOPPING == currentMode &&


I think we have lost the bit that sets the currentMode to STOPPED if no indices are in the ignore actions list?

I see it below in triggerPolicies() but I think we need it here too?

Good call, I'll update that to add this.
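The check being asked for can be sketched roughly as follows. This is an illustration under stated assumptions: `StoppingModeSketch`, `nextMode`, and `IGNORE_ACTIONS` are hypothetical names, and the single "shrink" entry is inferred from the commit title "Check for shrink method when stopping", not taken from the actual implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical model of moving from STOPPING to STOPPED.
class StoppingModeSketch {
    enum OperationMode { RUNNING, STOPPING, STOPPED }

    // Assumption: actions that must be allowed to finish before ILM may stop.
    static final Set<String> IGNORE_ACTIONS = Collections.singleton("shrink");

    // Given the current mode and the action each managed index is in,
    // decide which mode to be in next.
    static OperationMode nextMode(OperationMode currentMode, List<String> currentActions) {
        if (currentMode != OperationMode.STOPPING) {
            return currentMode; // only STOPPING can advance to STOPPED here
        }
        for (String action : currentActions) {
            if (IGNORE_ACTIONS.contains(action)) {
                return OperationMode.STOPPING; // something still has to finish
            }
        }
        return OperationMode.STOPPED; // nothing left in an ignored action
    }

    public static void main(String[] args) {
        System.out.println(nextMode(OperationMode.STOPPING, Arrays.asList("rollover", "shrink")));
        System.out.println(nextMode(OperationMode.STOPPING, Arrays.asList("rollover")));
    }
}
```

The point raised in the review is that this decision has to be reachable from every path that iterates over the indices, not only from triggerPolicies().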

colings86

Lgtm

dakrone added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Sep 27, 2018

dakrone requested review from colings86 and gwbrown September 27, 2018 21:52

colings86 reviewed Sep 28, 2018

dakrone added 3 commits September 28, 2018 11:01

Remove TODO

85fd74d

Initialize nextStepKey to null

13a6633

Check for shrink method when stopping so that we can move to stopped

771c5db

bleskes reviewed Sep 28, 2018

re-organize ExecuteStepsUpdateTask

36ee149

elasticmachine mentioned this pull request Oct 1, 2018

[meta] Index Lifecycle Management Plan #29823

Closed

Merge remote-tracking branch 'origin/index-lifecycle' into ilm-separa…

c3f2b78

…te-step-execution

colings86 approved these changes Oct 2, 2018

Add missing "stop" triggering check

80d368b

colings86 approved these changes Oct 2, 2018

Merge remote-tracking branch 'origin/index-lifecycle' into ilm-separa…

85b3268

…te-step-execution

dakrone merged commit 388f754 into elastic:index-lifecycle Oct 3, 2018

dakrone deleted the ilm-separate-step-execution branch February 4, 2019 14:45
