Handle shard ownership lost when reading history tasks #7101
	ctx context.Context,
	request *persistence.GetHistoryTasksRequest,
) (*persistence.GetHistoryTasksResponse, error) {
	if err := s.errorByState(); err != nil {
If I understand correctly, the main change is that you "wrap" s.executionManager.GetHistoryTasks with s.errorByState() and s.handleReadError()?
Yes. s.handleReadError is the key part: it checks for a shard ownership lost error and puts the shard context into an invalid state. A new shard will then be created once the shard controller realizes the shard context is invalid.
## What changed?
- Wrap ExecutionManager.GetHistoryTasks in the shard context to handle shardOwnershipLost errors.
- Will need to follow up and audit all usages of this method and of other methods on the execution manager. This PR only fixes the issue we are seeing.

## Why?
- If a shard has no (API) traffic and its ownership has already been lost in the background (with some persistence implementations), task processing can get stuck forever trying to load tasks and hitting this shard ownership lost error.

## How did you test it?
- Unit test

## Potential risks

## Documentation

## Is hotfix candidate?