Add Watchdog Workflow with Corrupt Workflow Fix #4713
Conversation
Looks like my repo is actually contaminated with my previous change. I will update this PR when I clean it up.
Force-pushed from 0697a05 to 5b79e88.
This is now ready for an early review.
Force-pushed from ac375fa to 0eeee8c.
service/history/task/task.go (outdated diff)
@@ -224,6 +242,7 @@ func (t *taskImpl) HandleErr(
 	t.scope.RecordTimer(metrics.TaskAttemptTimerPerDomain, time.Duration(t.attempt))
 	t.logger.Error("Critical error processing task, retrying.",
 		tag.Error(err), tag.OperationCritical, tag.TaskType(t.GetTaskType()))
+	t.ReportCorruptWorkflowToWatchDog()
If the watchdog isn't able to repair the workflow within a couple of retries, its chances of succeeding later are too slim, so we should back off from retrying too many times in order to free up the workflow and the DB. Calling this on every attempt after criticalRetryCount() is too much. Perhaps run it only at every Nth multiple of criticalRetryCount, so it retries more than once but stays under a reasonable limit.
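A minimal sketch of that suggestion, for illustration only; the helper name, the cap, and how the values are plumbed in are assumptions, not what the PR does:

// shouldReportToWatchdog sketches the reviewer's suggestion: report a
// corrupt workflow only when the attempt count is an exact multiple of
// the critical retry count, and only up to a small cap, instead of on
// every attempt past the threshold.
func shouldReportToWatchdog(attempt, criticalRetryCount, maxReports int) bool {
	if criticalRetryCount <= 0 || attempt < criticalRetryCount {
		return false
	}
	// Fires at 1x, 2x, ... maxReports x criticalRetryCount, then stops,
	// so an unrepairable workflow does not keep the DB busy forever.
	return attempt%criticalRetryCount == 0 && attempt/criticalRetryCount <= maxReports
}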
I have an LRU cache in the watchdog: if a fix was already attempted for a workflow, it won't be tried again, so we effectively have 0 retries.
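For context, a rough sketch of that dedup behaviour; the LRU library (hashicorp/golang-lru), the key format, and all names here are assumptions, not the watchdog's actual implementation:

package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// watchdogDeduper remembers which workflows were already handed to the
// watchdog, so each one is reported at most once (i.e. zero retries).
type watchdogDeduper struct {
	handled *lru.Cache
}

func newWatchdogDeduper(size int) (*watchdogDeduper, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &watchdogDeduper{handled: c}, nil
}

// shouldReport returns true only the first time a workflow is seen;
// later calls for the same workflow are dropped.
func (d *watchdogDeduper) shouldReport(domainID, workflowID, runID string) bool {
	key := domainID + "/" + workflowID + "/" + runID
	if d.handled.Contains(key) {
		return false
	}
	d.handled.Add(key, struct{}{})
	return true
}

func main() {
	d, _ := newWatchdogDeduper(1000)
	fmt.Println(d.shouldReport("dom", "wf", "run")) // true: first report
	fmt.Println(d.shouldReport("dom", "wf", "run")) // false: already handled
}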
Thanks for the review. I replied to some of the comments and will address the rest.
wfOptions = cclient.StartWorkflowOptions{
	ID:                    WatchdogWFID,
	TaskList:              taskListName,
	WorkflowIDReusePolicy: cclient.WorkflowIDReusePolicyTerminateIfRunning,
With this policy, will the last deployed worker win?
My main goal was to restart the workflow at each deployment. Not sure if there's a better way to do it.
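For illustration, a hedged sketch of how a worker could (re)start the watchdog on each deployment with that policy; the function, IDs, and timeout are assumptions, and only cclient.StartWorkflowOptions and WorkflowIDReusePolicyTerminateIfRunning come from the snippet above:

package watchdog

import (
	"context"
	"time"

	cclient "go.uber.org/cadence/client"
)

const (
	WatchdogWFID = "cadence-watchdog"    // illustrative workflow ID
	taskListName = "cadence-watchdog-tl" // illustrative task list
)

// startWatchdog is called when a newly deployed worker boots.
// TerminateIfRunning stops the run started by the previously deployed
// worker and starts a fresh one, so the last deployment wins.
func startWatchdog(ctx context.Context, c cclient.Client, watchdogWorkflow interface{}) error {
	wfOptions := cclient.StartWorkflowOptions{
		ID:                           WatchdogWFID,
		TaskList:                     taskListName,
		ExecutionStartToCloseTimeout: 24 * time.Hour,
		WorkflowIDReusePolicy:        cclient.WorkflowIDReusePolicyTerminateIfRunning,
	}
	_, err := c.StartWorkflow(ctx, wfOptions, watchdogWorkflow)
	return err
}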
Force-pushed from 99e8285 to af7b58f.
Force-pushed from 670e24d to 142a049.
What changed?
Added a watchdog workflow to automatically fix known issues without requiring oncall intervention.
Depends on uber/cadence-idl#99
Why?
There are some known issues outside of Cadence's codebase. For example, some deleted records may resurrect when using Cassandra as storage. This causes inconsistency in the server, creates infinite task retries, and eventually triggers oncall alerts. The more workflows we run, the more likely this is to happen, so we should automate the solution to avoid oncall disruptions.
How did you test it?
Also tested with local C*, ESv6, and ESv7.
Potential risks
Release notes
Documentation Changes