
Add Watchdog Workflow with Corrupt Workflow Fix #4713

Merged (17 commits) on Mar 2, 2022

Conversation

@demirkayaender (Contributor) commented Jan 26, 2022

What changed?
Added a watchdog workflow to automatically fix known issues without requiring oncall intervention.

Depends on uber/cadence-idl#99

Why?
There are some known issues outside of Cadence's codebase. For example, some deleted records may resurrect when using Cassandra as storage. This causes inconsistencies in the server, creates infinite task retries, and eventually triggers oncall alerts. The more workflows we run, the more likely this is to happen, so we should automate the fix to avoid oncall disruptions.

How did you test it?

go test -v github.com/uber/cadence/service/frontend -run TestAdminHandlerSuite -testify.m TestMaintainCorruptWorkflow*

Also with local C*, ESv6 and ESv7

Potential risks

Release notes

Documentation Changes

@demirkayaender demirkayaender requested a review from a team January 26, 2022 17:54
@demirkayaender (Contributor Author) commented:

Looks like my repo is actually contaminated with my previous change. I will update this PR when I clean it up.

@demirkayaender (Contributor Author) commented:

This is now ready for early review.

@demirkayaender demirkayaender changed the title [WIP] Add Watchdog Workflow with Corrupt Workflow Fix Add Watchdog Workflow with Corrupt Workflow Fix Feb 1, 2022
@coveralls commented Feb 2, 2022

Pull Request Test Coverage Report for Build 8f0132ad-e89d-4515-b48e-ad761a375288

  • 218 of 847 (25.74%) changed or added relevant lines in 28 files are covered.
  • 52 unchanged lines in 14 files lost coverage.
  • Overall coverage decreased (-0.1%) to 56.776%

Changes missing coverage:

| File | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/historyManager.go | 0 | 1 | 0.0% |
| common/elasticsearch/client_v6.go | 1 | 3 | 33.33% |
| common/elasticsearch/client_v7.go | 0 | 3 | 0.0% |
| common/elasticsearch/common.go | 9 | 12 | 75.0% |
| tools/cli/admin.go | 0 | 3 | 0.0% |
| tools/cli/adminElasticSearchCommands.go | 0 | 5 | 0.0% |
| client/admin/grpcClient.go | 0 | 8 | 0.0% |
| client/admin/thriftClient.go | 0 | 8 | 0.0% |
| service/frontend/adminGrpcHandler.go | 0 | 8 | 0.0% |
| service/frontend/adminThriftHandler.go | 0 | 8 | 0.0% |
Files with coverage reduction:

| File | New Missed Lines | % |
|---|---|---|
| common/types/shared.go | 1 | 26.8% |
| service/history/execution/mutable_state_task_refresher.go | 1 | 73.82% |
| service/history/task/task.go | 1 | 78.65% |
| common/membership/hashring.go | 2 | 83.54% |
| service/history/execution/mutable_state_builder.go | 2 | 69.69% |
| service/history/task/transfer_active_task_executor.go | 2 | 71.93% |
| common/cache/lru.go | 3 | 90.73% |
| common/persistence/nosql/nosqlplugin/cassandra/workflow.go | 3 | 55.49% |
| common/task/fifoTaskScheduler.go | 3 | 84.54% |
| common/types/mapper/thrift/shared.go | 4 | 63.25% |
Totals Coverage Status
Change from base Build 6fd8dcb9-d6cb-4b69-8112-7c825968d989: -0.1%
Covered Lines: 83359
Relevant Lines: 146820

💛 - Coveralls

Resolved review threads on:
- proto/internal/uber/cadence/admin/v1/service.proto (2)
- common/types/admin.go
- service/frontend/adminHandler.go (3)
- client/admin/metricClient.go (3)
- service/history/task/task.go
- service/worker/watchdog/client.go
@@ -224,6 +242,7 @@ func (t *taskImpl) HandleErr(
t.scope.RecordTimer(metrics.TaskAttemptTimerPerDomain, time.Duration(t.attempt))
t.logger.Error("Critical error processing task, retrying.",
tag.Error(err), tag.OperationCritical, tag.TaskType(t.GetTaskType()))
t.ReportCorruptWorkflowToWatchDog()
A Contributor commented:
If the watchdog isn't able to repair in a couple of retries, its chances of succeeding later are too slim, so we should back off from retrying this too many times in order to free up the workflow and the DB. Calling this for each attempt after the criticalRetryCount() is too much. Perhaps run it for every N multiples of the criticalRetryCount to retry more than once but less than a reasonable limit.
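The throttling the reviewer suggests could be sketched like this (a minimal illustration with made-up constants; the real `criticalRetryCount()` is a Cadence dynamic-config knob, not shown here):

```go
package main

import "fmt"

// criticalRetryCount and reportInterval are illustrative values, not the
// actual Cadence configuration. The idea: report on every Nth multiple of
// the critical threshold, so the watchdog is invoked more than once but
// far less often than on every retry.
const criticalRetryCount = 10
const reportInterval = 3

// shouldReportToWatchdog returns true only on selected attempts past the
// critical threshold.
func shouldReportToWatchdog(attempt int) bool {
	if attempt < criticalRetryCount {
		return false
	}
	return attempt%(criticalRetryCount*reportInterval) == 0
}

func main() {
	for _, attempt := range []int{5, 10, 30, 45, 60} {
		fmt.Printf("attempt=%d report=%v\n", attempt, shouldReportToWatchdog(attempt))
	}
}
```

With these constants the report fires on attempts 30, 60, 90, ..., bounding the load on the workflow and the DB.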

The Contributor Author replied:
I have an LRU cache in the watchdog: if removal of a workflow has already been attempted, it won't be tried again, so we effectively have 0 retries.
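The dedup behavior described above could look roughly like this (a stdlib stand-in with simple FIFO eviction; the real watchdog uses Cadence's `common/cache` LRU, and the type and method names here are hypothetical):

```go
package main

import "fmt"

// reportedOnce is a minimal stand-in for the watchdog's cache of
// already-reported workflows. Once a workflow ID is recorded, later
// reports for it are dropped, giving effectively zero retries.
type reportedOnce struct {
	capacity int
	order    []string // insertion order, for simple FIFO eviction
	seen     map[string]bool
}

func newReportedOnce(capacity int) *reportedOnce {
	return &reportedOnce{capacity: capacity, seen: make(map[string]bool)}
}

// shouldReport returns true the first time a workflow ID is seen and
// false afterwards, evicting the oldest entry when at capacity.
func (r *reportedOnce) shouldReport(workflowID string) bool {
	if r.seen[workflowID] {
		return false
	}
	if len(r.order) == r.capacity {
		oldest := r.order[0]
		r.order = r.order[1:]
		delete(r.seen, oldest)
	}
	r.order = append(r.order, workflowID)
	r.seen[workflowID] = true
	return true
}

func main() {
	gate := newReportedOnce(2)
	fmt.Println(gate.shouldReport("wf-1")) // true: first report
	fmt.Println(gate.shouldReport("wf-1")) // false: already reported
}
```

A bounded cache means an evicted workflow could be reported again much later, which is a reasonable trade-off against unbounded memory growth.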

service/worker/watchdog/workflow.go (2 resolved review threads)
@demirkayaender (Contributor Author) left a comment:

Thanks for the review. I replied to some of the comments and will address the rest.

service/frontend/adminHandler.go (resolved review thread)

service/worker/watchdog/client.go (resolved review thread)
service/worker/watchdog/workflow.go (resolved review thread)
wfOptions = cclient.StartWorkflowOptions{
ID: WatchdogWFID,
TaskList: taskListName,
WorkflowIDReusePolicy: cclient.WorkflowIDReusePolicyTerminateIfRunning,
@mantas-sidlauskas (Contributor) commented Feb 14, 2022:
With this policy, will the last deployed worker win?

The Contributor Author replied:
My main goal was to restart the workflow at each deployment. Not sure if there's a better way to do it.
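The semantics of `WorkflowIDReusePolicyTerminateIfRunning` that make this work could be simulated like this (a stdlib sketch of the policy's observable behavior, not the Cadence client itself; `starter` and its methods are hypothetical):

```go
package main

import "fmt"

// starter simulates WorkflowIDReusePolicyTerminateIfRunning: starting a
// workflow with an ID that is already running terminates the old run and
// begins a new one, so the most recent deployment's start call wins.
type starter struct {
	running map[string]int // workflow ID -> current run number
	nextRun int
}

func newStarter() *starter {
	return &starter{running: make(map[string]int), nextRun: 1}
}

// start returns the new run number and whether a prior run was terminated.
func (s *starter) start(workflowID string) (run int, terminatedOld bool) {
	_, terminatedOld = s.running[workflowID]
	run = s.nextRun
	s.nextRun++
	s.running[workflowID] = run
	return run, terminatedOld
}

func main() {
	s := newStarter()
	run1, term1 := s.start("watchdog-wf-id") // illustrative ID, not the real WatchdogWFID value
	run2, term2 := s.start("watchdog-wf-id") // e.g. after a redeploy
	fmt.Println(run1, term1)
	fmt.Println(run2, term2)
}
```

This matches the author's stated goal: every deployment's start call replaces whatever run is active, at the cost of terminating a healthy run when two workers race at startup.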

@demirkayaender demirkayaender force-pushed the watchdog branch 2 times, most recently from 670e24d to 142a049 Compare March 2, 2022 18:45
6 participants