Fix possible NDEs caused by LAs & immediate cancels resolving in different order upon replay #808

Merged: 4 commits, Sep 3, 2024


Sushisource
Member

What was changed

Always resolve LAs last within one activation

I am confident this does not need a flag, for the reasons explained in the docstring for the test. If you were in this situation before, you were experiencing undefined behavior that would likely lead to a nondeterminism error (NDE). That situation is the only one this change affects. The test docstring also explains why you can't force a different ordering to happen.

However, we should still test this change in lang SDKs with the fuzzer before releasing them with it.
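For readers skimming the thread, here is a minimal sketch of the idea, not the actual core implementation. The `Job` enum, its variants, and `order_jobs` are hypothetical names invented for illustration. The only ordering taken from this PR is that local activity resolutions come after other resolutions and that queries come after those. A stable sort keeps the relative order of everything else untouched, so an immediate cancel resolution always lands before an LA resolution in the same activation, both on first execution and on replay.

```rust
/// Hypothetical stand-in for the jobs delivered in one activation.
/// These names are illustrative and are not the real core types.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Job {
    /// Any job that is neither an LA resolution nor a query,
    /// e.g. the resolution of an immediately cancelled command.
    Other(&'static str),
    /// Resolution of a local activity, identified by a sequence number.
    LocalActivityResolution(u32),
    /// A query, identified by a number.
    Query(u32),
}

/// Sort key: lower values are applied first.
/// Assumption for this sketch: everything else shares priority 0.
fn priority(job: &Job) -> u8 {
    match job {
        Job::Other(_) => 0,
        Job::LocalActivityResolution(_) => 1,
        Job::Query(_) => 2,
    }
}

/// Reorder jobs so LA resolutions follow all other resolutions and
/// queries come last. `sort_by_key` is a stable sort, so jobs with
/// equal priority keep their original relative order.
fn order_jobs(jobs: &mut [Job]) {
    jobs.sort_by_key(priority);
}
```

With this ordering, an activation that arrives as `[LocalActivityResolution(1), Other("cancel")]` is applied as `[Other("cancel"), LocalActivityResolution(1)]`, regardless of the order in which the two were produced.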

Why?

The docstrings explain this well.

Checklist

  1. Closes [Bug] Local activity combined with cancellation of something that doesn't wait can get out of order on replay #803

  2. How was this tested:
    Added a test that reproduces the issue

  3. Any docs updates needed?

@Sushisource Sushisource requested a review from a team as a code owner August 30, 2024 23:59
Comment on lines +2003 to +2004
Attributes::WorkflowExecutionStartedEventAttributes(_) => { EventType::WorkflowExecutionStarted }
Attributes::WorkflowExecutionCompletedEventAttributes(_) => { EventType::WorkflowExecutionCompleted }
Member Author

Rustfmt decided to go hard on some stuff I guess

Comment on lines +1331 to +1332
/// 6. local activity resolutions
/// 7. queries
Member

So Python still does its own job ordering: https://github.com/temporalio/sdk-python/blob/4aef4bfda19e56c7e7abac36b69e98f60861468c/temporalio/worker/_workflow_instance.py#L343-L356. But it's more primitive, so this should still work (its idea of 5 and 6 is combined into one step). The problem is that we tick and run conditions once per "job set", so we cannot easily remove the Python logic. We probably do need to make sure to handle workflow init first. I don't think it has even been updated to handle that message.

Member Author

Yes, we should refactor it to work the way I describe in the one comment in here, which is the way it now works in TS, but that'll take a little time.

@Sushisource Sushisource merged commit 6ff21fd into master Sep 3, 2024
6 checks passed
@Sushisource Sushisource deleted the la-resolve-same-time-as-cancel-bug branch September 3, 2024 21:20
Sushisource added a commit that referenced this pull request Sep 3, 2024
…erent order upon replay (#808)

(cherry picked from commit 6ff21fd)