Enable retry on mutable state checksum verification failure #5691

Merged
2 commits merged on Feb 26, 2024

Shaddoll
Contributor

What changed?

  • On a mutable state checksum verification failure, retry by reloading the mutable state from the database
  • Improve logging when a mutable state checksum verification fails
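The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `State`, `Loader`, `ErrChecksumMismatch`, and `LoadWithRetry` are hypothetical names, and the attempt limit is an assumed parameter.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrChecksumMismatch is a hypothetical sentinel error for a failed checksum check.
var ErrChecksumMismatch = errors.New("mutable state checksum mismatch")

// State is a hypothetical stand-in for a workflow's mutable state.
type State struct {
	WorkflowID string
	Version    int64
}

// Loader reads the state from the database and verifies its checksum.
type Loader func(workflowID string) (*State, error)

// LoadWithRetry calls load up to maxAttempts times. It reloads only when the
// failure is a checksum mismatch; any other error is returned immediately.
func LoadWithRetry(load Loader, workflowID string, maxAttempts int) (*State, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		state, err := load(workflowID)
		if err == nil {
			return state, nil
		}
		if !errors.Is(err, ErrChecksumMismatch) {
			return nil, err
		}
		lastErr = err
		// Log enough context to investigate the failure later.
		log.Printf("checksum verification failed: workflowID=%s attempt=%d/%d err=%v",
			workflowID, attempt, maxAttempts, err)
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// Simulated loader: the first read fails verification, the second succeeds.
	flaky := func(id string) (*State, error) {
		calls++
		if calls == 1 {
			return nil, ErrChecksumMismatch
		}
		return &State{WorkflowID: id, Version: 1}, nil
	}
	state, err := LoadWithRetry(flaky, "wf-1", 3)
	fmt.Println(state != nil, err, calls)
}
```

Bounding the number of attempts keeps a persistently corrupted record from causing an endless reload loop; the final error wraps the last mismatch so callers can still detect it with `errors.Is`.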

Why?
Over the past 3 months, we have been troubled by corrupted workflows in our internal SQL database. We suspect that, in some edge cases, the mutable state we read from the database comes from an inconsistent view of the database. We enabled checksums and confirmed that checksum failures do occur in production, but the logs lack detail and we still don't know the root cause.
We are adding a retry mechanism in the hope that it temporarily mitigates the issue.
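To illustrate how a checksum can expose a state read from an inconsistent view, here is a minimal sketch. It assumes a CRC32 over a JSON encoding of a few fields; the `Snapshot` type, its fields, and the `Checksum` and `Verify` helpers are hypothetical and are not taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/crc32"
)

// Snapshot is a hypothetical subset of the fields a checksum might cover.
type Snapshot struct {
	WorkflowID         string  `json:"workflow_id"`
	NextEventID        int64   `json:"next_event_id"`
	PendingActivityIDs []int64 `json:"pending_activity_ids"`
}

// Checksum encodes the snapshot and returns its CRC32 (IEEE polynomial).
// Struct fields are marshaled in declaration order, so the encoding is stable.
func Checksum(s Snapshot) (uint32, error) {
	payload, err := json.Marshal(s)
	if err != nil {
		return 0, err
	}
	return crc32.ChecksumIEEE(payload), nil
}

// Verify recomputes the checksum and compares it with the stored value.
// The error carries the fields needed to investigate a mismatch.
func Verify(s Snapshot, stored uint32) error {
	computed, err := Checksum(s)
	if err != nil {
		return err
	}
	if computed != stored {
		return fmt.Errorf("checksum mismatch: workflowID=%s nextEventID=%d stored=%d computed=%d",
			s.WorkflowID, s.NextEventID, stored, computed)
	}
	return nil
}

func main() {
	s := Snapshot{WorkflowID: "wf-1", NextEventID: 10, PendingActivityIDs: []int64{3, 5}}
	stored, _ := Checksum(s)

	// Simulate reading a row whose fields no longer agree with the stored checksum.
	s.NextEventID = 11
	fmt.Println(Verify(s, stored) != nil)
}
```

Including the identifying fields and both checksum values in the error is one way to get the detail that was missing from the logs.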

How did you test it?

Potential risks
This change is protected by a feature flag; we can disable the flag if any issue arises.
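One way to gate such a retry behind a flag is sketched below. It is an illustration under assumptions: `BoolFlag`, `ErrChecksum`, and `LoadGuarded` are hypothetical names, and the behavior with the flag off (return the first result unchanged) is assumed rather than taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrChecksum is a hypothetical sentinel error for a checksum mismatch.
var ErrChecksum = errors.New("checksum mismatch")

// BoolFlag is evaluated on every call, so the flag can be flipped at runtime
// without redeploying.
type BoolFlag func() bool

// LoadGuarded loads once and reloads a single time on a checksum mismatch,
// but only while retryEnabled returns true.
func LoadGuarded(retryEnabled BoolFlag, load func() (string, error)) (string, error) {
	state, err := load()
	if err == nil || !errors.Is(err, ErrChecksum) {
		return state, err
	}
	if !retryEnabled() {
		// Flag off (assumed behavior): no retry, the first result is returned as-is.
		return state, err
	}
	return load()
}

func main() {
	enabled := true
	calls := 0
	load := func() (string, error) {
		calls++
		if calls == 1 {
			return "", ErrChecksum
		}
		return "ok", nil
	}
	state, err := LoadGuarded(func() bool { return enabled }, load)
	fmt.Println(state, err, calls)
}
```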

Release notes

Documentation Changes

coveralls commented Feb 24, 2024

Pull Request Test Coverage Report for Build 018de695-881b-4bfa-bfe2-711d7ec1aabb

Details

  • 38 of 46 (82.61%) changed or added relevant lines in 3 files are covered.
  • 94 unchanged lines in 14 files lost coverage.
  • Overall coverage increased (+0.001%) to 62.901%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| service/history/execution/mutable_state_builder.go | 18 | 20 | 90.0% |
| service/history/execution/context.go | 19 | 25 | 76.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| client/history/client.go | 2 | 38.1% |
| common/membership/hashring.go | 2 | 82.23% |
| common/peerprovider/ringpopprovider/config.go | 2 | 81.58% |
| common/persistence/historyManager.go | 2 | 66.67% |
| common/task/weighted_round_robin_task_scheduler.go | 2 | 89.05% |
| service/history/execution/mutable_state_util.go | 2 | 37.63% |
| service/history/handler/handler.go | 2 | 49.97% |
| service/matching/taskListManager.go | 2 | 80.2% |
| service/frontend/api/handler.go | 4 | 62.11% |
| common/task/fifo_task_scheduler.go | 5 | 84.54% |

| Totals | |
|---|---|
| Change from base Build 018de68a-6e55-47a9-8a47-2075a2b23e51 | 0.001% |
| Covered Lines | 92970 |
| Relevant Lines | 147804 |

💛 - Coveralls

Shaddoll merged commit 169b8f1 into uber:master on Feb 26, 2024
17 checks passed
Shaddoll deleted the checksum branch on February 26, 2024 18:23