-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix replication issues with nexus operations and cancelation requests #6457
Fix replication issues with nexus operations and cancelation requests #6457
Conversation
5336d15
to
02efd41
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But please have someone from nexus to take a look.
Message: "operation hasn't started yet, dropping cancel request", | ||
Response: &http.Response{ | ||
// Make up a non retryable error code to consider this error non-retryable. | ||
StatusCode: http.StatusPreconditionFailed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it use bad request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH it doesn't matter much, it's not recorded so any non-retryable error code would do.
02efd41
to
37cadea
Compare
…#6457) ## What changed? Changed the operation cancelation logic to not cancel the operation if the operation hasn't been started yet. Before this change canceling an operation would add a NEXUS_OPERATION_CANCELED event and immediately transition to the `Canceled` state, which had issues with replication and buffered events. E.g. the `Started` event would be applied before the `CancelRequested` event on one cluster and then applied in reverse order when replicating to another cluster. Example history batch taken from a test cluster contains these events: ``` EVENT_TYPE_WORKFLOW_TASK_COMPLETED EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUESTED EVENT_TYPE_REQUEST_CANCEL_EXTERNAL_WORKFLOW_EXECUTION_INITIATED EVENT_TYPE_NEXUS_OPERATION_STARTED // <-- this had to be buffered EVENT_TYPE_WORKFLOW_TASK_SCHEDULED EVENT_TYPE_WORKFLOW_TASK_STARTED ``` The new code path simply creates the cancelation state machine but doesn't actually cancel it, instead it lets the task executor attempt to resolve the machine as canceled. Eventually we'll want to properly support cancel-before-started. At the moment, we may end up starting an operation and leave it oblivious of the cancelation request. ## How did you test it? I didn't test the replication issue but the new code path avoids ordering issues by not transitioning to canceled state. Added unit tests for the executor changes and modified the state machine tests. ## Is hotfix candidate? We should patch 120 and maybe even 119. The latter will be promoted to OSS 1.25.0 but since we explicitly state that Nexus in 1.25.0 isn't suited for multi cluster deployments, we could skip this patch.
…#6457) Changed the operation cancelation logic to not cancel the operation if the operation hasn't been started yet. Before this change canceling an operation would add a NEXUS_OPERATION_CANCELED event and immediately transition to the `Canceled` state, which had issues with replication and buffered events. E.g. the `Started` event would be applied before the `CancelRequested` event on one cluster and then applied in reverse order when replicating to another cluster. Example history batch taken from a test cluster contains these events: ``` EVENT_TYPE_WORKFLOW_TASK_COMPLETED EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUESTED EVENT_TYPE_REQUEST_CANCEL_EXTERNAL_WORKFLOW_EXECUTION_INITIATED EVENT_TYPE_NEXUS_OPERATION_STARTED // <-- this had to be buffered EVENT_TYPE_WORKFLOW_TASK_SCHEDULED EVENT_TYPE_WORKFLOW_TASK_STARTED ``` The new code path simply creates the cancelation state machine but doesn't actually cancel it, instead it lets the task executor attempt to resolve the machine as canceled. Eventually we'll want to properly support cancel-before-started. At the moment, we may end up starting an operation and leave it oblivious of the cancelation request. I didn't test the replication issue but the new code path avoids ordering issues by not transitioning to canceled state. Added unit tests for the executor changes and modified the state machine tests. We should patch 120 and maybe even 119. The latter will be promoted to OSS 1.25.0 but since we explicitly state that Nexus in 1.25.0 isn't suited for multi cluster deployments, we could skip this patch.
What changed?
Changed the operation cancelation logic to not cancel the operation if the operation hasn't been started yet.
Before this change canceling an operation would add a NEXUS_OPERATION_CANCELED event and immediately transition to the
Canceled
state, which had issues with replication and buffered events. E.g. theStarted
event would be applied before theCancelRequested
event on one cluster and then applied in reverse order when replicating to another cluster.Example history batch taken from a test cluster contains these events:
The new code path simply creates the cancelation state machine but doesn't actually cancel it, instead it lets the task executor attempt to resolve the machine as canceled.
Eventually we'll want to properly support cancel-before-started. At the moment, we may end up starting an operation and leave it oblivious of the cancelation request.
How did you test it?
I didn't test the replication issue but the new code path avoids ordering issues by not transitioning to canceled state.
Added unit tests for the executor changes and modified the state machine tests.
Is hotfix candidate?
We should patch 120 and maybe even 119. The latter will be promoted to OSS 1.25.0 but since we explicitly state that Nexus in 1.25.0 isn't suited for multi cluster deployments, we could skip this patch.