-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Change step execution flow to be deliberate about type #34126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change step execution flow to be deliberate about type #34126
Conversation
This commit changes the way that step execution flows. Rather than have any step run when the cluster state changes or the periodic scheduler fires, this now runs the different types of steps at different times. `AsyncWaitStep` is run at a periodic manner, ie, every 10 minutes by default `ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the cluster state changes. `AsyncActionStep` is now run only after the cluster state has been transitioned into a new step. This prevents these non-idempotent steps from running at the same time. It addition to being run when transitioned into, this is also run when a node is newly elected master (only if set as the current step) so that master failover does not fail to run the step. This also changes the `RolloverStep` from an `AsyncActionStep` to an `AsyncWaitStep` so that it can run periodically. Relates to elastic#29823
Pinging @elastic/es-core-infra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dakrone I left some comments
@@ -89,4 +91,13 @@ public boolean equals(Object obj) { | |||
Objects.equals(maxDocs, other.maxDocs); | |||
} | |||
|
|||
// TODO: expand the information we provide? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any information from the current RolloverResponse
that would be helpful here. The closest is the conditionStatus
but since this is a Map<String, Boolean>
it doesn't provide anything useful when the index is not rolled over because all the conditions will be false
. Maybe we should open an issue for discussion suggesting to add more information to the conditionStatus
in the RolloverResponse
so it exposes the age, number of documents, and store size it found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open a separate issue for that, good idea.
@@ -88,6 +92,7 @@ public ClusterState execute(ClusterState currentState) throws IOException { | |||
if (currentStep.getNextStepKey() == null) { | |||
return currentState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not need to set nextStepKey
to null
here so in the clusterStateProcessed
method we don't end up doing the wrong thing? I think for now it would actually work because the nextStepKey
would never be an AsyncAction
in this scenario but probably worth correcting this so bugs don't creep in the future becuase the nextStepKey is still pointing at the previous step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nextStepKey
starts out by being null, so this will already do the right thing. I'll add an explicit initialization to null
in the method though, so it's more apparent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this task potentially loops through multiple steps right? So if you have a cluster state step , another cluster state step and then null
next step will get step on the first step and then remain set here when the second step's next step is null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good point, I'll fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think the behavior might be okay,
- CSAS (cluster state action step) executes with nextStep
A
, moves cluster state to next step (which is also a CSAS) - CSAS executes with nextStep
null
, returns cluster state at that point clusterStateProcessed
call is invoked, nextStepKey isA
, butA
isn't an async action so it's a no-op.
Can you describe the scenario you're worried about in more detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you say is correct right now, I am worried that if the logic in clusterStateProcessed
changes in the future we could inadvertently introduce a bug that will be tricky to track down because the nextStepKey is not actually set to the nextStep at this point. I think we should make it so nextStepKey is correct so we avoid this in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've re-organized this so it is hopefully clearer
@@ -104,6 +109,7 @@ public ClusterState execute(ClusterState currentState) throws IOException { | |||
if (currentStep.getNextStepKey() == null) { | |||
return currentState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same answer as above :)
logger.warn("current step [" + getCurrentStepKey(lifecycleState) + "] for index [" + indexMetaData.getIndex().getName() | ||
+ "] with policy [" + policy + "] is not recognized"); | ||
logger.warn("current step [{}] for index [{}] with policy [{}] is not recognized", | ||
getCurrentStepKey(lifecycleState), index, policy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not a change for this PR but this should never happen now right as we aren't caching the steps int eh step registry? So we should probably throw an exception if it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We get here if the index specifies a lifecycle.name
that doesn't exist - I'll be opening a separate PR to handle that case probably today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can revisit this when we remove the policy steps registry
...k/plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleService.java
Show resolved
Hide resolved
} else if (currentStep instanceof ClusterStateActionStep || currentStep instanceof ClusterStateWaitStep) { | ||
logger.debug("[{}] running policy with current-step [{}]", indexMetaData.getIndex().getName(), currentStep.getKey()); | ||
clusterService.submitStateUpdateTask("ILM", | ||
new ExecuteStepsUpdateTask(policy, indexMetaData.getIndex(), currentStep, stepRegistry, this, nowSupplier)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we said we don't need ExecuteStepsUpdateTask, but rather run ClusterStateWaitStep on the incoming cluster state and if it matches move to the next step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of whether the ClusterStateWaitStep
is marked as "complete" or not, we have to issue a cluster state update, so to me we might as well keep it in the midst of the cluster state update?
@elasticmachine run the packaging tests please |
…te-step-execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment but assuming that is addressed this LGTM
String policyName = LifecycleSettings.LIFECYCLE_NAME_SETTING.get(idxMeta.getSettings()); | ||
if (Strings.isNullOrEmpty(policyName) == false) { | ||
StepKey stepKey = IndexLifecycleRunner.getCurrentStepKey(LifecycleExecutionState.fromIndexMetadata(idxMeta)); | ||
if (OperationMode.STOPPING == currentMode && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have lost the bit that sets the currentMode
to STOPPED
if no indices are in the ignore actions list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it below in triggerPolicies()
but I think we need it here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I'll update that to add this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
…te-step-execution
This commit changes the way that step execution flows. Rather than have any step run when the cluster state changes or the periodic scheduler fires, this now runs the different types of steps at different times. `AsyncWaitStep` is run at a periodic manner, ie, every 10 minutes by default `ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the cluster state changes. `AsyncActionStep` is now run only after the cluster state has been transitioned into a new step. This prevents these non-idempotent steps from running at the same time. It addition to being run when transitioned into, this is also run when a node is newly elected master (only if set as the current step) so that master failover does not fail to run the step. This also changes the `RolloverStep` from an `AsyncActionStep` to an `AsyncWaitStep` so that it can run periodically. Relates to #29823
This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.
AsyncWaitStep
is run at a periodic manner, ie, every 10 minutes by defaultClusterStateActionStep
andClusterStateWaitStep
are run every time thecluster state changes.
AsyncActionStep
is now run only after the cluster state has been transitionedinto a new step. This prevents these non-idempotent steps from running at the
same time. It addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.
This also changes the
RolloverStep
from anAsyncActionStep
to anAsyncWaitStep
so that it can run periodically.Relates to #29823