Fix replication issues with nexus operations and cancelation requests #6457

bergundy · 2024-08-28T03:50:29Z

What changed?

Changed the operation cancelation logic so that an operation is not canceled if it hasn't been started yet.
Before this change, canceling an operation would add a NEXUS_OPERATION_CANCELED event and immediately transition to the Canceled state, which caused issues with replication and buffered events. For example, the Started event would be applied before the CancelRequested event on one cluster and then in the reverse order when replicated to another cluster.

An example history batch taken from a test cluster contains these events:

EVENT_TYPE_WORKFLOW_TASK_COMPLETED
EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUESTED
EVENT_TYPE_REQUEST_CANCEL_EXTERNAL_WORKFLOW_EXECUTION_INITIATED
EVENT_TYPE_NEXUS_OPERATION_STARTED // <-- this had to be buffered
EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
EVENT_TYPE_WORKFLOW_TASK_STARTED
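
The ordering problem can be illustrated with a toy state machine. This is a sketch, not the server's code: `State`, `Event`, `applyOld`, and `replayOld` are hypothetical names, and the model assumes that a Started event cannot be applied once the operation is Canceled. It replays the same two events in both orders, using the pre-fix rule that a cancel request on an unstarted operation transitions straight to Canceled.

```go
package main

import "fmt"

// State and Event are simplified, hypothetical stand-ins; they do not
// mirror the server's actual types.
type State string

const (
	Scheduled State = "Scheduled"
	Started   State = "Started"
	Canceled  State = "Canceled"
)

type Event string

const (
	OperationStarted Event = "NEXUS_OPERATION_STARTED"
	CancelRequested  Event = "NEXUS_OPERATION_CANCEL_REQUESTED"
)

// applyOld models the pre-fix rule: a cancel request on an operation that
// has not started yet transitions it straight to Canceled.
func applyOld(s State, e Event) (State, error) {
	switch e {
	case OperationStarted:
		// Assumption: Started is only valid from Scheduled.
		if s != Scheduled {
			return s, fmt.Errorf("cannot apply %s in state %s", e, s)
		}
		return Started, nil
	case CancelRequested:
		if s == Scheduled {
			return Canceled, nil
		}
		return s, nil // already started: the operation state is unchanged
	default:
		return s, fmt.Errorf("unknown event %s", e)
	}
}

// replayOld applies events in order and stops at the first error.
func replayOld(events ...Event) (State, error) {
	s := Scheduled
	for _, e := range events {
		next, err := applyOld(s, e)
		if err != nil {
			return s, err
		}
		s = next
	}
	return s, nil
}

func main() {
	// Order in which the events were applied on the original cluster.
	original, errOriginal := replayOld(OperationStarted, CancelRequested)
	// Order in the history batch, as applied when replicating.
	replica, errReplica := replayOld(CancelRequested, OperationStarted)
	fmt.Println("original:", original, errOriginal)
	fmt.Println("replica: ", replica, errReplica)
}
```

Under these assumptions the first order ends in Started, while the second reaches Canceled and then rejects the Started event, so the two clusters diverge.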

The new code path creates the cancelation state machine but doesn't actually cancel it; instead, it lets the task executor attempt to resolve the machine as canceled.

Eventually we'll want to properly support cancel-before-started. At the moment, we may end up starting an operation and leave it oblivious of the cancelation request.
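
A minimal sketch of why skipping the transition makes the result independent of event order. It is a toy model, not the actual state machine code: `Operation`, `CancelationPending` (standing in for the child cancelation state machine), `Apply`, and `replay` are hypothetical names.

```go
package main

import "fmt"

type State string

const (
	Scheduled State = "Scheduled"
	Started   State = "Started"
)

type Event string

const (
	OperationStarted Event = "NEXUS_OPERATION_STARTED"
	CancelRequested  Event = "NEXUS_OPERATION_CANCEL_REQUESTED"
)

// Operation is a hypothetical, simplified model of an operation.
type Operation struct {
	State State
	// CancelationPending stands in for the child cancelation state machine.
	CancelationPending bool
}

// Apply models the post-fix rule: a cancel request only records that
// cancelation is pending and never changes the operation state.
func (o *Operation) Apply(e Event) error {
	switch e {
	case OperationStarted:
		if o.State != Scheduled {
			return fmt.Errorf("cannot apply %s in state %s", e, o.State)
		}
		o.State = Started
		return nil
	case CancelRequested:
		o.CancelationPending = true
		return nil
	default:
		return fmt.Errorf("unknown event %s", e)
	}
}

// replay applies events in order and stops at the first error.
func replay(events ...Event) (Operation, error) {
	op := Operation{State: Scheduled}
	for _, e := range events {
		if err := op.Apply(e); err != nil {
			return op, err
		}
	}
	return op, nil
}

func main() {
	a, errA := replay(OperationStarted, CancelRequested)
	b, errB := replay(CancelRequested, OperationStarted)
	fmt.Println("same result in both orders:", a == b, errA, errB)
}
```

In this model both orders end with a Started operation and a pending cancelation. It also shows the caveat above: the operation still starts even though the cancel request arrived first.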

How did you test it?

I didn't test the replication issue, but the new code path avoids ordering issues by not transitioning to the Canceled state.
Added unit tests for the executor changes and modified the state machine tests.

Is hotfix candidate?

We should patch 120 and maybe even 119. The latter will be promoted to OSS 1.25.0, but since we explicitly state that Nexus in 1.25.0 isn't suited for multi-cluster deployments, we could skip this patch.

yux0

LGTM, but please have someone from Nexus take a look.

yux0 · 2024-08-28T16:58:54Z

components/nexusoperations/executors.go

+			Message: "operation hasn't started yet, dropping cancel request",
+			Response: &http.Response{
+				// Make up a non retryable error code to consider this error non-retryable.
+				StatusCode: http.StatusPreconditionFailed,


Should it use bad request?

bergundy

TBH it doesn't matter much; it's not recorded, so any non-retryable error code would do.
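
To illustrate why the exact code matters little, here is a sketch under a loudly assumed retry policy: 408, 429, and 5xx responses are retried, and everything else is not. The server's real classification may differ. `handlerError` and `isRetryable` are hypothetical names; `handlerError` only mimics the shape of the error in the diff above.

```go
package main

import (
	"fmt"
	"net/http"
)

// handlerError is a hypothetical stand-in with the same two fields that
// appear in the diff above; it is not the real error type.
type handlerError struct {
	Message  string
	Response *http.Response
}

func (e *handlerError) Error() string { return e.Message }

// isRetryable encodes an assumed policy for illustration only:
// 408, 429, and 5xx are retried; every other status is not.
func isRetryable(statusCode int) bool {
	switch statusCode {
	case http.StatusRequestTimeout, http.StatusTooManyRequests:
		return true
	}
	return statusCode >= 500
}

func main() {
	codes := []int{http.StatusPreconditionFailed, http.StatusBadRequest}
	for _, code := range codes {
		err := &handlerError{
			Message:  "operation hasn't started yet, dropping cancel request",
			Response: &http.Response{StatusCode: code},
		}
		fmt.Printf("%d %s retryable=%t\n",
			code, http.StatusText(code), isRetryable(err.Response.StatusCode))
	}
}
```

Under this assumed policy, 412 Precondition Failed and 400 Bad Request are both treated as non-retryable, so either would drop the cancel request in the same way.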


bergundy requested a review from a team as a code owner August 28, 2024 03:50

bergundy force-pushed the nexus-operation-cancel-requested-replication-fix branch from 5336d15 to 02efd41 Compare August 28, 2024 15:47

yux0 approved these changes Aug 28, 2024


pdoerner approved these changes Aug 28, 2024


yycptt approved these changes Aug 28, 2024


yycptt added the release/1.26.0-120 label Aug 28, 2024

Fix replication issues with nexus operations and cancelation requests

37cadea

bergundy force-pushed the nexus-operation-cancel-requested-replication-fix branch from 02efd41 to 37cadea Compare August 28, 2024 22:35

bergundy merged commit 2551bd7 into temporalio:main Aug 28, 2024
41 of 42 checks passed

bergundy deleted the nexus-operation-cancel-requested-replication-fix branch August 28, 2024 23:30

alexshtin added the release/1.25.2 label Nov 5, 2024
