Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix replication issues with nexus operations and cancelation requests #6457

Conversation

bergundy
Copy link
Member

What changed?

Changed the operation cancelation logic to not cancel the operation if the operation hasn't been started yet.
Before this change canceling an operation would add a NEXUS_OPERATION_CANCELED event and immediately transition to the Canceled state, which had issues with replication and buffered events. E.g. the Started event would be applied before the CancelRequested event on one cluster and then applied in reverse order when replicating to another cluster.

Example history batch taken from a test cluster contains these events:

EVENT_TYPE_WORKFLOW_TASK_COMPLETED
EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUESTED
EVENT_TYPE_REQUEST_CANCEL_EXTERNAL_WORKFLOW_EXECUTION_INITIATED
EVENT_TYPE_NEXUS_OPERATION_STARTED // <-- this had to be buffered
EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
EVENT_TYPE_WORKFLOW_TASK_STARTED

The new code path simply creates the cancelation state machine but doesn't actually cancel it, instead it lets the task executor attempt to resolve the machine as canceled.

Eventually we'll want to properly support cancel-before-started. At the moment, we may end up starting an operation and leave it oblivious of the cancelation request.

How did you test it?

I didn't test the replication issue but the new code path avoids ordering issues by not transitioning to canceled state.
Added unit tests for the executor changes and modified the state machine tests.

Is hotfix candidate?

We should patch 120 and maybe even 119. The latter will be promoted to OSS 1.25.0 but since we explicitly state that Nexus in 1.25.0 isn't suited for multi cluster deployments, we could skip this patch.

@bergundy bergundy requested a review from a team as a code owner August 28, 2024 03:50
@bergundy bergundy force-pushed the nexus-operation-cancel-requested-replication-fix branch from 5336d15 to 02efd41 Compare August 28, 2024 15:47
Copy link
Contributor

@yux0 yux0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But please have someone from nexus to take a look.

Message: "operation hasn't started yet, dropping cancel request",
Response: &http.Response{
// Make up a non retryable error code to consider this error non-retryable.
StatusCode: http.StatusPreconditionFailed,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it use bad request?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH it doesn't matter much, it's not recorded so any non-retryable error code would do.

@bergundy bergundy force-pushed the nexus-operation-cancel-requested-replication-fix branch from 02efd41 to 37cadea Compare August 28, 2024 22:35
@bergundy bergundy merged commit 2551bd7 into temporalio:main Aug 28, 2024
41 of 42 checks passed
@bergundy bergundy deleted the nexus-operation-cancel-requested-replication-fix branch August 28, 2024 23:30
yycptt pushed a commit that referenced this pull request Aug 29, 2024
…#6457)

## What changed?

Changed the operation cancelation logic to not cancel the operation if
the operation hasn't been started yet.
Before this change canceling an operation would add a
NEXUS_OPERATION_CANCELED event and immediately transition to the
`Canceled` state, which had issues with replication and buffered events.
E.g. the `Started` event would be applied before the `CancelRequested`
event on one cluster and then applied in reverse order when replicating
to another cluster.

Example history batch taken from a test cluster contains these events:

```
EVENT_TYPE_WORKFLOW_TASK_COMPLETED
EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUESTED
EVENT_TYPE_REQUEST_CANCEL_EXTERNAL_WORKFLOW_EXECUTION_INITIATED
EVENT_TYPE_NEXUS_OPERATION_STARTED // <-- this had to be buffered
EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
EVENT_TYPE_WORKFLOW_TASK_STARTED
```

The new code path simply creates the cancelation state machine but
doesn't actually cancel it, instead it lets the task executor attempt to
resolve the machine as canceled.

Eventually we'll want to properly support cancel-before-started. At the
moment, we may end up starting an operation and leave it oblivious of
the cancelation request.

## How did you test it?

I didn't test the replication issue but the new code path avoids
ordering issues by not transitioning to canceled state.
Added unit tests for the executor changes and modified the state machine
tests.

## Is hotfix candidate?

We should patch 120 and maybe even 119. The latter will be promoted to
OSS 1.25.0 but since we explicitly state that Nexus in 1.25.0 isn't
suited for multi cluster deployments, we could skip this patch.
alexshtin pushed a commit that referenced this pull request Nov 5, 2024
…#6457)

Changed the operation cancelation logic to not cancel the operation if
the operation hasn't been started yet.
Before this change canceling an operation would add a
NEXUS_OPERATION_CANCELED event and immediately transition to the
`Canceled` state, which had issues with replication and buffered events.
E.g. the `Started` event would be applied before the `CancelRequested`
event on one cluster and then applied in reverse order when replicating
to another cluster.

Example history batch taken from a test cluster contains these events:

```
EVENT_TYPE_WORKFLOW_TASK_COMPLETED
EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUESTED
EVENT_TYPE_REQUEST_CANCEL_EXTERNAL_WORKFLOW_EXECUTION_INITIATED
EVENT_TYPE_NEXUS_OPERATION_STARTED // <-- this had to be buffered
EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
EVENT_TYPE_WORKFLOW_TASK_STARTED
```

The new code path simply creates the cancelation state machine but
doesn't actually cancel it, instead it lets the task executor attempt to
resolve the machine as canceled.

Eventually we'll want to properly support cancel-before-started. At the
moment, we may end up starting an operation and leave it oblivious of
the cancelation request.

I didn't test the replication issue but the new code path avoids
ordering issues by not transitioning to canceled state.
Added unit tests for the executor changes and modified the state machine
tests.

We should patch 120 and maybe even 119. The latter will be promoted to
OSS 1.25.0 but since we explicitly state that Nexus in 1.25.0 isn't
suited for multi cluster deployments, we could skip this patch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants