Skip to content

Commit

Permalink
Add a "stale workflow" scanner/fixer impl, to remove data beyond reas…
Browse files Browse the repository at this point in the history
…onable retention (uber#5361)

For a variety of reasons, we have workflows sitting in some of our databases far beyond any reasonable time range.
E.g. a workflow completed months ago in a domain with 7 day retention -> unfortunately still retained.

We've been gradually finding and fixing problems that cause this, but the old data needs to be cleaned up
at some point.  They can sometimes cause problems, particularly if they appear to be "running".

So this is a *conservative* scanner + fixer for workflows that meet these criteria.
It does not handle all states, nor does it support domains configured with >200 day retention (hardcoded but
trivially changeable, as a sanity check), but it's catching a large amount internally and seems safe to share
and run in more locations.

---

Since building and verifying this required figuring out the scanner/fixer code in detail,
and that proved to be a significant effort, this also includes three major additions on top
of the stale scanner:
- there have been a variety of renames and public -> private changes to make naming consistent
  - this helped discover a significant issue with the current execution scanner, so I'm keeping it
- concrete execution fixer invariants can now be individually enabled and disabled via dynamic config
- a new readme.md file that attempts to summarize the learnings, and how to config and verify scanner/fixer.
  • Loading branch information
Groxx authored Oct 3, 2023
1 parent c93d6af commit 4a26eea
Show file tree
Hide file tree
Showing 29 changed files with 2,338 additions and 162 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

You can find a list of previous releases on the [github releases](https://github.com/uber/cadence/releases) page.

## [Unreleased]
### Added
- Scanner / Fixer changes (#5361)
- Stale-workflow detection and cleanup added to shardscanner, disabled by default.
- New dynamic config to better control scanner and fixer, particularly for concrete executions.
- Documentation about how scanner/fixer work and how to control them, see [the scanner readme.md](service/worker/scanner/README.md)
- This also includes example config to enable the new fixer.

### Upgrade notes
- Any concrete execution fixer run on upgrade may be missing the new config and those activities will have no invariants for a single run. Later runs will work normally. (#5361)

## [1.2.4] - 2023-09-27
### Added
Implement config store for MySQL (#5403)
Expand Down
44 changes: 44 additions & 0 deletions common/dynamicconfig/constants.go
Original file line number Diff line number Diff line change
Expand Up @@ -1709,6 +1709,30 @@ const (
// Default value: true
// Allowed filters: N/A
ConcreteExecutionsScannerInvariantCollectionHistory
// ConcreteExecutionsFixerInvariantCollectionStale indicates if the stale workflow invariant should be run
// KeyName: worker.executionsFixerInvariantCollectionStale
// Value type: Bool
// Default value: false
// Allowed filters: N/A
ConcreteExecutionsFixerInvariantCollectionStale
// ConcreteExecutionsFixerInvariantCollectionMutableState is indicates if mutable state invariant checks should be run
// KeyName: worker.executionsFixerInvariantCollectionMutableState
// Value type: Bool
// Default value: true
// Allowed filters: N/A
ConcreteExecutionsFixerInvariantCollectionMutableState
// ConcreteExecutionsFixerInvariantCollectionHistory is indicates if history invariant checks should be run
// KeyName: worker.executionsFixerInvariantCollectionHistory
// Value type: Bool
// Default value: true
// Allowed filters: N/A
ConcreteExecutionsFixerInvariantCollectionHistory
// ConcreteExecutionsScannerInvariantCollectionStale indicates if the stale workflow invariant should be run
// KeyName: worker.executionsScannerInvariantCollectionStale
// Value type: Bool
// Default value: false
// Allowed filters: N/A
ConcreteExecutionsScannerInvariantCollectionStale
// CurrentExecutionsScannerEnabled is indicates if current executions scanner should be started as part of worker.Scanner
// KeyName: worker.currentExecutionsScannerEnabled
// Value type: Bool
Expand Down Expand Up @@ -3884,6 +3908,26 @@ var BoolKeys = map[BoolKey]DynamicBool{
Description: "ConcreteExecutionsScannerInvariantCollectionHistory is indicates if history invariant checks should be run",
DefaultValue: true,
},
ConcreteExecutionsScannerInvariantCollectionStale: DynamicBool{
KeyName: "worker.executionsScannerInvariantCollectionStale",
Description: "ConcreteExecutionsScannerInvariantCollectionStale indicates if the stale-workflow invariant should be run",
DefaultValue: false, // may be enabled after further verification, but for now it's a bit too risky to enable by default
},
ConcreteExecutionsFixerInvariantCollectionMutableState: DynamicBool{
KeyName: "worker.executionsFixerInvariantCollectionMutableState",
Description: "ConcreteExecutionsFixerInvariantCollectionMutableState is indicates if mutable state invariant checks should be run",
DefaultValue: true,
},
ConcreteExecutionsFixerInvariantCollectionHistory: DynamicBool{
KeyName: "worker.executionsFixerInvariantCollectionHistory",
Description: "ConcreteExecutionsFixerInvariantCollectionHistory is indicates if history invariant checks should be run",
DefaultValue: true,
},
ConcreteExecutionsFixerInvariantCollectionStale: DynamicBool{
KeyName: "worker.executionsFixerInvariantCollectionStale",
Description: "ConcreteExecutionsFixerInvariantCollectionStale indicates if the stale-workflow invariant should be run",
DefaultValue: false, // may be enabled after further verification, but for now it's a bit too risky to enable by default
},
CurrentExecutionsScannerEnabled: DynamicBool{
KeyName: "worker.currentExecutionsScannerEnabled",
Description: "CurrentExecutionsScannerEnabled is indicates if current executions scanner should be started as part of worker.Scanner",
Expand Down
2 changes: 2 additions & 0 deletions common/persistence/retryer.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.

//go:generate mockgen -package $GOPACKAGE -source $GOFILE -destination retryer_mock.go -package persistence github.com/uber/cadence/common/persistence Retryer

package persistence

import (
Expand Down
218 changes: 218 additions & 0 deletions common/persistence/retryer_mock.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 8 additions & 4 deletions common/reconciliation/invariant/collection_enumer.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit 4a26eea

Please sign in to comment.