
Add Watchdog Workflow with Corrupt Workflow Fix #4713

Merged (17 commits) on Mar 2, 2022

Conversation

@demirkayaender (Contributor) commented Jan 26, 2022

What changed?
Added a watchdog workflow to automatically fix known issues without requiring oncall intervention.

Depends on uber/cadence-idl#99

Why?
There are some known issues outside of Cadence's codebase. For example, some deleted records may resurrect when using Cassandra as storage. This causes inconsistencies in the server, creates infinite task retries, and eventually triggers oncall alerts. The more workflows we run, the more likely this is to happen, so we should automate the fix to avoid oncall disruptions.

How did you test it?

go test -v github.com/uber/cadence/service/frontend -run TestAdminHandlerSuite -testify.m TestMaintainCorruptWorkflow*

Also with local C*, ESv6 and ESv7

Potential risks

Release notes

Documentation Changes

@demirkayaender demirkayaender requested a review from a team January 26, 2022 17:54
@demirkayaender (Contributor Author) commented:

Looks like my repo is actually contaminated with my previous change. I will update this PR when I clean it up.

@demirkayaender (Contributor Author) commented:

This is now ready for early review.

@demirkayaender demirkayaender changed the title [WIP] Add Watchdog Workflow with Corrupt Workflow Fix Add Watchdog Workflow with Corrupt Workflow Fix Feb 1, 2022
@coveralls commented Feb 2, 2022

Pull Request Test Coverage Report for Build 8f0132ad-e89d-4515-b48e-ad761a375288

  • 218 of 847 (25.74%) changed or added relevant lines in 28 files are covered.
  • 52 unchanged lines in 14 files lost coverage.
  • Overall coverage decreased (-0.1%) to 56.776%

Changes missing coverage:

| File | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/historyManager.go | 0 | 1 | 0.0% |
| common/elasticsearch/client_v6.go | 1 | 3 | 33.33% |
| common/elasticsearch/client_v7.go | 0 | 3 | 0.0% |
| common/elasticsearch/common.go | 9 | 12 | 75.0% |
| tools/cli/admin.go | 0 | 3 | 0.0% |
| tools/cli/adminElasticSearchCommands.go | 0 | 5 | 0.0% |
| client/admin/grpcClient.go | 0 | 8 | 0.0% |
| client/admin/thriftClient.go | 0 | 8 | 0.0% |
| service/frontend/adminGrpcHandler.go | 0 | 8 | 0.0% |
| service/frontend/adminThriftHandler.go | 0 | 8 | 0.0% |
Files with coverage reduction:

| File | New Missed Lines | % |
|---|---|---|
| common/types/shared.go | 1 | 26.8% |
| service/history/execution/mutable_state_task_refresher.go | 1 | 73.82% |
| service/history/task/task.go | 1 | 78.65% |
| common/membership/hashring.go | 2 | 83.54% |
| service/history/execution/mutable_state_builder.go | 2 | 69.69% |
| service/history/task/transfer_active_task_executor.go | 2 | 71.93% |
| common/cache/lru.go | 3 | 90.73% |
| common/persistence/nosql/nosqlplugin/cassandra/workflow.go | 3 | 55.49% |
| common/task/fifoTaskScheduler.go | 3 | 84.54% |
| common/types/mapper/thrift/shared.go | 4 | 63.25% |
Totals Coverage Status
Change from base Build 6fd8dcb9-d6cb-4b69-8112-7c825968d989: -0.1%
Covered Lines: 83359
Relevant Lines: 146820

💛 - Coveralls

Resolved review threads on:
- proto/internal/uber/cadence/admin/v1/service.proto (2)
- common/types/admin.go
- service/frontend/adminHandler.go (3)
- client/admin/metricClient.go (3)
- service/history/task/task.go
- service/worker/watchdog/client.go
@@ -224,6 +242,7 @@ func (t *taskImpl) HandleErr(
t.scope.RecordTimer(metrics.TaskAttemptTimerPerDomain, time.Duration(t.attempt))
t.logger.Error("Critical error processing task, retrying.",
tag.Error(err), tag.OperationCritical, tag.TaskType(t.GetTaskType()))
t.ReportCorruptWorkflowToWatchDog()
A Contributor commented:
If the watchdog isn't able to repair in a couple of retries, its chances of succeeding later are too slim, so we should back off from retrying this too many times in order to free up the workflow and the DB. Calling this for each attempt after the criticalRetryCount() is too much. Perhaps run it for every N multiples of the criticalRetryCount to retry more than once but less than a reasonable limit.
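The throttling the reviewer suggests could be sketched like this (a minimal illustration with made-up constants; the real `criticalRetryCount()` is a Cadence dynamic-config knob, not shown here):

```go
package main

import "fmt"

// criticalRetryCount and reportInterval are illustrative values, not the
// actual Cadence configuration. The idea: report on every Nth multiple of
// the critical threshold, so the watchdog is invoked more than once but
// far less often than on every retry.
const criticalRetryCount = 10
const reportInterval = 3

// shouldReportToWatchdog returns true only on selected attempts past the
// critical threshold.
func shouldReportToWatchdog(attempt int) bool {
	if attempt < criticalRetryCount {
		return false
	}
	return attempt%(criticalRetryCount*reportInterval) == 0
}

func main() {
	for _, attempt := range []int{5, 10, 30, 45, 60} {
		fmt.Printf("attempt=%d report=%v\n", attempt, shouldReportToWatchdog(attempt))
	}
}
```

With these constants the report fires on attempts 30, 60, 90, ..., bounding the load on the workflow and the DB.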

The Contributor Author replied:
I have an LRU cache in the watchdog: if removal of a workflow has already been attempted, it won't be tried again, so we effectively have 0 retries.
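The dedup behavior described above could look roughly like this (a stdlib stand-in with simple FIFO eviction; the real watchdog uses Cadence's `common/cache` LRU, and the type and method names here are hypothetical):

```go
package main

import "fmt"

// reportedOnce is a minimal stand-in for the watchdog's cache of
// already-reported workflows. Once a workflow ID is recorded, later
// reports for it are dropped, giving effectively zero retries.
type reportedOnce struct {
	capacity int
	order    []string // insertion order, for simple FIFO eviction
	seen     map[string]bool
}

func newReportedOnce(capacity int) *reportedOnce {
	return &reportedOnce{capacity: capacity, seen: make(map[string]bool)}
}

// shouldReport returns true the first time a workflow ID is seen and
// false afterwards, evicting the oldest entry when at capacity.
func (r *reportedOnce) shouldReport(workflowID string) bool {
	if r.seen[workflowID] {
		return false
	}
	if len(r.order) == r.capacity {
		oldest := r.order[0]
		r.order = r.order[1:]
		delete(r.seen, oldest)
	}
	r.order = append(r.order, workflowID)
	r.seen[workflowID] = true
	return true
}

func main() {
	gate := newReportedOnce(2)
	fmt.Println(gate.shouldReport("wf-1")) // true: first report
	fmt.Println(gate.shouldReport("wf-1")) // false: already reported
}
```

A bounded cache means an evicted workflow could be reported again much later, which is a reasonable trade-off against unbounded memory growth.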

service/worker/watchdog/workflow.go (2 resolved review threads)
@demirkayaender (Contributor Author) left a comment:

Thanks for the review. I replied to some of the comments and will address the rest.

service/frontend/adminHandler.go (resolved review thread)

service/worker/watchdog/client.go (resolved review thread)
service/worker/watchdog/workflow.go (resolved review thread)
wfOptions = cclient.StartWorkflowOptions{
ID: WatchdogWFID,
TaskList: taskListName,
WorkflowIDReusePolicy: cclient.WorkflowIDReusePolicyTerminateIfRunning,
@mantas-sidlauskas (Contributor) commented Feb 14, 2022:
With this policy, will the last deployed worker win?

The Contributor Author replied:
My main goal was to restart the workflow at each deployment. Not sure if there's a better way to do it.
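The semantics of `WorkflowIDReusePolicyTerminateIfRunning` that make this work could be simulated like this (a stdlib sketch of the policy's observable behavior, not the Cadence client itself; `starter` and its methods are hypothetical):

```go
package main

import "fmt"

// starter simulates WorkflowIDReusePolicyTerminateIfRunning: starting a
// workflow with an ID that is already running terminates the old run and
// begins a new one, so the most recent deployment's start call wins.
type starter struct {
	running map[string]int // workflow ID -> current run number
	nextRun int
}

func newStarter() *starter {
	return &starter{running: make(map[string]int), nextRun: 1}
}

// start returns the new run number and whether a prior run was terminated.
func (s *starter) start(workflowID string) (run int, terminatedOld bool) {
	_, terminatedOld = s.running[workflowID]
	run = s.nextRun
	s.nextRun++
	s.running[workflowID] = run
	return run, terminatedOld
}

func main() {
	s := newStarter()
	run1, term1 := s.start("watchdog-wf-id") // illustrative ID, not the real WatchdogWFID value
	run2, term2 := s.start("watchdog-wf-id") // e.g. after a redeploy
	fmt.Println(run1, term1)
	fmt.Println(run2, term2)
}
```

This matches the author's stated goal: every deployment's start call replaces whatever run is active, at the cost of terminating a healthy run when two workers race at startup.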

@demirkayaender demirkayaender force-pushed the watchdog branch 2 times, most recently from 670e24d to 142a049 Compare March 2, 2022 18:45
6 participants