Add documentation to canary and improvements (uber#4447)
longquanzheng authored Sep 10, 2021
1 parent 1cc94d5 commit ef7d049
Showing 24 changed files with 316 additions and 47 deletions.
4 changes: 2 additions & 2 deletions bench/README.md
@@ -7,7 +7,7 @@ Setup
-----------
### Cadence server

Bench suite is running against a Cadence server/cluster.
Bench suite is running against a Cadence server/cluster. See [documentation](https://cadenceworkflow.io/docs/operation-guide/setup/) for Cadence server cluster setup.

Note that only the Basic bench test doesn't require Advanced Visibility.

@@ -20,7 +20,7 @@ For local env you can run it through:
See more [documentation here](https://cadenceworkflow.io/docs/concepts/search-workflows/).

### Bench Workers
:warning: NOTE: unlike canary, starting bench worker will not automatically start a bench test. Next two sections will cover how to start and configure it.
:warning: NOTE: Starting this bench worker will not automatically start a bench test. The next two sections cover how to start and configure it.

Different ways to start the bench workers:

175 changes: 175 additions & 0 deletions canary/README.md
@@ -0,0 +1,175 @@
# Periodic feature health check workflow tools (aka Canary)

This README describes how to set up Cadence canary, different types of canary test cases, and how to start the canary.

Setup
-----------
### Cadence server

The canary test suite runs against a Cadence server/cluster. See [documentation](https://cadenceworkflow.io/docs/operation-guide/setup/) for Cadence server cluster setup.

Note that some tests require features like [Advanced Visibility](https://cadenceworkflow.io/docs/concepts/search-workflows/) and [History Archival](https://cadenceworkflow.io/docs/concepts/archival/).

For a local server environment, you can run it through:
- Docker: Instructions for running Cadence server through docker can be found in `docker/README.md`. Either `docker-compose-es-v7.yml` or `docker-compose-es.yml` can be used to start the server.
- Build from source: Please check [CONTRIBUTING](/CONTRIBUTING.md) for how to build and run Cadence server from source. Please also make sure Kafka and ElasticSearch are running before starting the server with `./cadence-server --zone es start`. If ElasticSearch v7 is used, change the value for `--zone` flag to `es_v7`.

### Start canary

:warning: NOTE: By default, starting this canary worker will not automatically start a canary test. The next two sections cover how to start and configure it.

Different ways to start the canary workers:

#### 1. Use docker image `ubercadence/cadence-canary:master`

For now, this image has no release versions, to simplify the release process. Always use the `master` tag for the image.

Similar to the server/CLI images, the canary image is built and published automatically by GitHub on every commit to the `master` branch.

You can use the [pre-built docker-compose file](../docker/docker-compose-canary.yml) to run against a local server.
In the `docker/` directory, run:
```
docker-compose -f docker-compose-canary.yml up
```
You can modify [the canary worker config](../docker/config/canary/development.yaml) to run against a prod server cluster.


#### 2. Build & Run the worker/canary

In the project root, build cadence canary binary:
```
make cadence-canary
```

Then start canary worker:
```
./cadence-canary start
```
This is essentially the same as
```
./cadence-canary start -mode worker
```

By default, it loads [the configuration in `config/canary/development.yaml`](../config/canary/development.yaml).
Run `./cadence-canary -h` for the start options, including how to change the configuration directory if needed.
This command only starts the workers.

Configurations
----------------------
The canary worker configuration contains the following parts:
- **Canary**: this part controls which domains the canary workers are responsible for and which tests the sanity workflow will exclude.
```yaml
canary:
  domains: ["cadence-canary"] # it will start workers on all those domains (and try to register them if they do not exist)
  excludes: ["workflow.searchAttributes", "workflow.batch", "workflow.archival.visibility"] # it will exclude the three test cases
  cron:
    cronSchedule: # the schedule of the cron canary, defaults to "@every 30s"
    cronExecutionTimeout: # the timeout of each run of the cron execution, defaults to 18 minutes
    startJobTimeout: # the timeout of each run of the sanity test suite, defaults to 9 minutes
```
An exception here is that the `HistoryArchival` and `VisibilityArchival` test cases always use the `canary-archival-domain` domain.

- **Cadence**: this part controls how the canary worker talks to the Cadence server, including the server's service name and address.
```yaml
cadence:
  service: "cadence-frontend" # frontend service name
  host: "127.0.0.1:7933" # frontend address
```
- **Metrics**: metrics configuration. Similar to the server metric emitter; only M3, Statsd, and Prometheus are supported.
- **Log**: logging configuration. Similar to the server logging configuration.

Canary Test Cases
----------------------

#### Cron Canary: periodically running Sanity test suite

The Cron workflow is not a test case. It's a top-level workflow that periodically kicks off the Sanity suite (described below).
To start the cron canary:
```
./cadence-canary start -mode cronCanary
```

For local development, you can also start the cron canary workflows along with the worker:
```
./cadence-canary start -m all
```
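
The three start modes (`worker`, `cronCanary`, `all`) can be summarized with a small Go sketch. The `plan` type and `planFor` function are hypothetical names used only for illustration; the mode strings are the ones shown in the commands above.

```go
package main

import "fmt"

// Mode values as used on the command line in this README.
const (
	modeWorker     = "worker"
	modeCronCanary = "cronCanary"
	modeAll        = "all"
)

// plan records which components a given mode starts.
// It is a hypothetical type for illustration only.
type plan struct {
	startCron   bool
	startWorker bool
}

// planFor maps a mode string to the components it starts and
// rejects any mode that is not one of the three known values.
func planFor(mode string) (plan, error) {
	switch mode {
	case modeWorker:
		return plan{startWorker: true}, nil
	case modeCronCanary:
		return plan{startCron: true}, nil
	case modeAll:
		return plan{startCron: true, startWorker: true}, nil
	default:
		return plan{}, fmt.Errorf("unknown canary mode %q", mode)
	}
}

func main() {
	for _, m := range []string{modeWorker, modeCronCanary, modeAll, "bogus"} {
		p, err := planFor(m)
		fmt.Println(m, p, err)
	}
}
```

In other words, `all` is a convenience for local development, while a production setup would typically run workers and the cron starter separately.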

The cron schedule comes from the configuration.
However, for a schedule change to take effect, you must manually terminate the existing cron workflow.
This can be [improved](https://github.com/uber/cadence/issues/4469) in the future.

The workflowID is fixed: `"cadence.canary.cron"`

#### Test case starter & Sanity suite
The sanity workflow is a test suite workflow. It kicks off a child workflow for each test to verify that the Cadence server is operating correctly.

An error result from the sanity workflow indicates that at least one of the test cases failed.

You can start the sanity workflow as a one-off run:
```
cadence --do <the domain you configured> workflow start --tl canary-task-queue --et 1200 --wt workflow.sanity -i 0
```
Note:
* the task list (`--tl`) is fixed to `canary-task-queue`
* the recommended execution timeout (`--et`) is 20 minutes (`1200` seconds), but you can adjust it
* the only required input is the scheduled Unix timestamp, and `0` uses the workflow start time

Or use a cron job (e.g. every minute):
```
cadence --do <the domain you configured> workflow start --tl canary-task-queue --et 1200 --wt workflow.sanity -i 0 --cron "* * * * *"
```

By default, if no excludes are configured, the sanity workflow starts all supported test cases in [this list](./sanity.go).
You can find [the workflow names of the test cases in this file](./const.go) if you want to manually start certain test cases.
For example, to manually start an `Echo` test case:
```
cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.echo
```

Once you start the test cases, you can observe the progress:
```
cadence --do cadence-canary workflow ob -w <...workflowID from the start command output>
```

#### Echo
Echo workflow tests the very basic workflow functionality. It executes an activity to return some output and verifies it as the workflow result.

#### Signal
Signal workflow tests the signal feature.

#### Visibility
Visibility workflow tests the basic visibility feature. Advanced visibility is not required, but a server with advanced visibility should also pass it.

#### SearchAttributes
SearchAttributes workflow tests the advanced visibility feature. Make sure the advanced visibility feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.

#### ConcurrentExec
ConcurrentExec workflow tests executing activities concurrently.

#### Query
Query workflow tests the Query feature.

#### Timeout
Timeout workflow makes sure the activity timeout is enforced.

#### LocalActivity
LocalActivity workflow tests the local activity feature.

#### Cancellation
Cancellation workflow tests the cancellation feature.

#### Retry
Retry workflow tests the activity retry policy.

#### Reset
Reset workflow tests the reset feature.

#### HistoryArchival
HistoryArchival tests the history archival feature. Make sure the history archival feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
This test case always uses the `canary-archival-domain` domain.

#### VisibilityArchival
VisibilityArchival tests the visibility archival feature. Make sure the visibility archival feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.

#### Batch
Batch workflow tests the batch job feature. Make sure the advanced visibility feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
5 changes: 4 additions & 1 deletion canary/batch.go
@@ -63,7 +63,10 @@ type (
}
)

func batchWorkflow(ctx workflow.Context, scheduledTimeNanos int64, domain string) error {
func batchWorkflow(ctx workflow.Context, inputScheduledTimeNanos int64) error {
scheduledTimeNanos := getScheduledTimeFromInputIfNonZero(ctx, inputScheduledTimeNanos)
domain := workflow.GetInfo(ctx).Domain

profile, err := beginWorkflow(ctx, wfTypeBatch, scheduledTimeNanos)
if err != nil {
return err
44 changes: 32 additions & 12 deletions canary/canary.go
@@ -22,6 +22,7 @@ package canary

import (
"context"
"fmt"

"github.com/opentracing/opentracing-go"
"go.uber.org/cadence/.gen/go/shared"
@@ -32,7 +33,7 @@ import (
type (
// Runnable is an interface for anything that exposes a Run method
Runnable interface {
Run() error
Run(mode string) error
}

canaryImpl struct {
@@ -41,6 +42,7 @@ type (
archivalClient cadenceClient
systemClient cadenceClient
runtime *RuntimeContext
canaryConfig *Canary
}

activityContext struct {
@@ -57,7 +59,7 @@ const (
)

// new returns a new instance of Canary runnable
func newCanary(domain string, rc *RuntimeContext) Runnable {
func newCanary(domain string, rc *RuntimeContext, canaryConfig *Canary) Runnable {
canaryClient := newCadenceClient(domain, rc)
archivalClient := newCadenceClient(archivalDomain, rc)
systemClient := newCadenceClient(systemDomain, rc)
@@ -67,11 +69,15 @@ func newCanary(domain string, rc *RuntimeContext) Runnable {
archivalClient: archivalClient,
systemClient: systemClient,
runtime: rc,
canaryConfig: canaryConfig,
}
}

// Run runs the canary
func (c *canaryImpl) Run() error {
func (c *canaryImpl) Run(mode string) error {
if mode != ModeCronCanary && mode != ModeAll && mode != ModeWorker {
return fmt.Errorf("wrong mode to start canary")
}
var err error
log := c.runtime.logger

@@ -85,18 +91,25 @@ func (c *canaryImpl) Run() error {
return err
}

// start the initial cron workflow
c.startCronWorkflow()
if mode == ModeAll || mode == ModeCronCanary {
// start the initial cron workflow
c.startCronWorkflow()
}

err = c.startWorker()
if err != nil {
log.Error("start worker failed", zap.Error(err))
return err
if mode == ModeAll || mode == ModeWorker {
err = c.startWorker()
if err != nil {
log.Error("start worker failed", zap.Error(err))
return err
}
}

return nil
}

func (c *canaryImpl) startWorker() error {
c.runtime.logger.Info("starting canary worker...")

options := worker.Options{
Logger: c.runtime.logger,
MetricsScope: c.runtime.metrics,
@@ -115,18 +128,24 @@ func (c *canaryImpl) startWorker() error {
}

func (c *canaryImpl) startCronWorkflow() {
c.runtime.logger.Info("starting canary cron workflow...")
wfID := "cadence.canary.cron"
opts := newWorkflowOptions(wfID, cronWFExecutionTimeout)
opts.CronSchedule = "@every 30s" // run every 30s
opts := newWorkflowOptions(wfID, c.canaryConfig.Cron.CronExecutionTimeout)
opts.CronSchedule = c.canaryConfig.Cron.CronSchedule

// create the cron workflow span
ctx := context.Background()
span := opentracing.StartSpan("start-cron-workflow-span")
defer span.Finish()
ctx = opentracing.ContextWithSpan(ctx, span)
_, err := c.canaryClient.StartWorkflow(ctx, opts, cronWorkflow, c.canaryDomain, wfTypeSanity)
_, err := c.canaryClient.StartWorkflow(ctx, opts, cronWorkflow, wfTypeSanity)
if err != nil {
// TODO: improvement: compare the cron schedule to decide whether or not terminating the current one
// https://github.com/uber/cadence/issues/4469
if _, ok := err.(*shared.WorkflowExecutionAlreadyStartedError); !ok {
c.runtime.logger.Error("error starting cron workflow", zap.Error(err))
} else {
c.runtime.logger.Info("cron workflow already started, you may need to terminate and restart if cron schedule is changed...")
}
}
}
@@ -137,6 +156,7 @@ func (c *canaryImpl) newActivityContext() context.Context {
ctx := context.WithValue(context.Background(), ctxKeyActivityRuntime, &activityContext{cadence: c.canaryClient})
ctx = context.WithValue(ctx, ctxKeyActivityArchivalRuntime, &activityContext{cadence: c.archivalClient})
ctx = context.WithValue(ctx, ctxKeyActivitySystemClient, &activityContext{cadence: c.systemClient})
ctx = context.WithValue(ctx, ctxKeyConfig, c.canaryConfig)
return overrideWorkerOptions(ctx)
}

5 changes: 4 additions & 1 deletion canary/cancellation.go
@@ -43,7 +43,10 @@ func init() {
}

// cancellationWorkflow is the workflow implementation to test for cancellation of workflows
func cancellationWorkflow(ctx workflow.Context, scheduledTimeNanos int64, domain string) error {
func cancellationWorkflow(ctx workflow.Context, inputScheduledTimeNanos int64) error {
scheduledTimeNanos := getScheduledTimeFromInputIfNonZero(ctx, inputScheduledTimeNanos)
domain := workflow.GetInfo(ctx).Domain

profile, err := beginWorkflow(ctx, wfTypeCancellation, scheduledTimeNanos)
if err != nil {
return err
7 changes: 7 additions & 0 deletions canary/common.go
@@ -105,3 +105,10 @@ func beginWorkflow(ctx workflow.Context, wfType string, scheduledTimeNanos int64
func concat(first string, second string) string {
	return first + "/" + second
}

func getScheduledTimeFromInputIfNonZero(ctx workflow.Context, nanos int64) int64 {
	if nanos == 0 {
		return workflow.Now(ctx).UnixNano()
	}
	return nanos
}
4 changes: 3 additions & 1 deletion canary/concurrentExec.go
@@ -43,7 +43,9 @@ func init() {
// concurrentExecWorkflow is the workflow implementation to test
// 1. client side events pagination when reconstructing workflow state
// 2. concurrent execution of activities
func concurrentExecWorkflow(ctx workflow.Context, scheduledTimeNanos int64, domain string) error {
func concurrentExecWorkflow(ctx workflow.Context, inputScheduledTimeNanos int64) error {
scheduledTimeNanos := getScheduledTimeFromInputIfNonZero(ctx, inputScheduledTimeNanos)

profile, err := beginWorkflow(ctx, wfTypeConcurrentExec, scheduledTimeNanos)
if err != nil {
return err
9 changes: 9 additions & 0 deletions canary/config.go
@@ -22,6 +22,7 @@ package canary

import (
"errors"
"time"

"github.com/uber-go/tally"
"go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"
@@ -64,6 +65,14 @@ type (
	Canary struct {
		Domains  []string `yaml:"domains"`
		Excludes []string `yaml:"excludes"`
		Cron     Cron     `yaml:"cron"`
	}

	// Cron contains configuration for the cron workflow for canary
	Cron struct {
		CronSchedule         string        `yaml:"cronSchedule"`         // default to "@every 30s"
		CronExecutionTimeout time.Duration `yaml:"cronExecutionTimeout"` // default to 18 minutes
		StartJobTimeout      time.Duration `yaml:"startJobTimeout"`      // default to 9 minutes
	}

	// Cadence contains the configuration for cadence service