Handle shard ownership lost when reading history tasks #7101

Merged: 1 commit, Jan 17, 2025

yycptt (Member) commented Jan 17, 2025

What changed?

  • Wrap ExecutionManager.GetHistoryTasks in the shard context to handle shardOwnershipLost errors.
  • A follow-up is needed to audit all usages of this method and of the other methods on the execution manager. This PR only fixes the issue we are currently seeing.

Why?

  • If a shard has no (API) traffic and its ownership has already been lost in the background (as can happen with some persistence implementations), task processing can get stuck forever: it keeps trying to load tasks and keeps hitting this shard ownership lost error.

How did you test it?

  • Unit test

Potential risks

Documentation

Is hotfix candidate?

yycptt requested a review from alfred-landrum January 17, 2025 00:47
yycptt requested a review from a team as a code owner January 17, 2025 00:47
ctx context.Context,
request *persistence.GetHistoryTasksRequest,
) (*persistence.GetHistoryTasksResponse, error) {
if err := s.errorByState(); err != nil {
Contributor

If I understand correctly, the main change is that you "wrap" s.executionManager.GetHistoryTasks with s.errorByState() and s.handleReadError()?

Member Author

Yes. s.handleReadError is the key part: it checks for the shardOwnershipLost error and puts the shard context into an invalid state. A new shard is then created once the shard controller realizes the shard context is invalid.
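
The recovery step in the reply above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual shard controller: the controller, shard, and getOrCreateShard names are hypothetical, and how and when the real controller notices an invalid context is not described in this conversation. The sketch assumes the controller checks validity on lookup and replaces an invalid entry with a fresh one.

```go
package main

import "sync"

// shard is a hypothetical, minimal shard context with a validity flag.
type shard struct {
	mu         sync.Mutex
	id         int32
	generation int
	valid      bool
}

func (s *shard) isValid() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.valid
}

// invalidate models what a read-error handler would do on ownership lost.
func (s *shard) invalidate() {
	s.mu.Lock()
	s.valid = false
	s.mu.Unlock()
}

// controller is a hypothetical stand-in for the shard controller.
type controller struct {
	mu      sync.Mutex
	shards  map[int32]*shard
	created int
}

func newController() *controller {
	return &controller{shards: make(map[int32]*shard)}
}

// getOrCreateShard returns the cached shard if it is still valid and
// otherwise replaces it with a newly created one.
func (c *controller) getOrCreateShard(id int32) *shard {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.shards[id]; ok && s.isValid() {
		return s
	}
	c.created++
	s := &shard{id: id, generation: c.created, valid: true}
	c.shards[id] = s
	return s
}
```

Without something marking the old context invalid, a controller shaped like this would keep handing back the same stale shard, which matches the stuck behavior described in the PR.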

yycptt enabled auto-merge (squash) January 17, 2025 01:00
yycptt merged commit 72fa968 into temporalio:main Jan 17, 2025
49 checks passed
yycptt deleted the sol-get-history-tasks branch January 17, 2025 01:11
stephanos pushed a commit to stephanos/temporal that referenced this pull request Jan 17, 2025