Rewrite/improve basic load test (#4399)
longquanzheng authored Aug 27, 2021
1 parent cde0f41 commit 0b98055
Showing 9 changed files with 365 additions and 246 deletions.
156 changes: 109 additions & 47 deletions bench/README.md
@@ -7,62 +7,94 @@ Setup
-----------
### Cadence server

The Basic bench test doesn't require Advanced Visibility.
The bench suite runs against a Cadence server/cluster.

Note that only the Basic bench test doesn't require Advanced Visibility.

Other advanced bench tests require Cadence server with Advanced Visibility. You can run it through:
Other advanced bench tests require Cadence server with Advanced Visibility.

For local env you can run it through:
- Docker: Instructions for running Cadence server through docker can be found in `docker/README.md`. Either `docker-compose-es-v7.yml` or `docker-compose-es.yml` can be used to start the server.
- Build from source: Please check [CONTRIBUTING](/CONTRIBUTING.md) for how to build and run Cadence server from source. Please also make sure Kafka and ElasticSearch are running before starting the server with `./cadence-server --zone es start`. If ElasticSearch v7 is used, change the value for `--zone` flag to `es_v7`.

### Search Attributes
One of the bench tests (called `Cron`), which is responsible for running other tests as a cron job and tracking the results, requires a search attribute named `Passed`.
See more [documentation here](https://cadenceworkflow.io/docs/concepts/search-workflows/).

For local development environment, this search attribute has already been added to the ES index template and the list of valid search attributes.
### Bench Workers
:warning: NOTE: unlike canary, starting bench worker will not automatically start a bench test. Next two sections will cover how to start and configure it.

However, if you already have a running ES cluster, you will need to add this search attribute to your ES cluster through the following steps:
There are different ways to start the bench workers:

1. Update ES cluster index template using the following Cadence CLI command
```
cadence adm cluster asa --search_attr_key Passed --search_attr_type 4
```
2. Add `Passed: 4` to the dynamic config value of valid search attributes (`frontend.validSearchAttributes`), so that Cadence server can recognize it.
3. Validate it has been successfully added with
```
cadence cluster get-search-attr
```
#### 1. Use docker image `ubercadence/cadence-bench:master`

### Bench Workers
For now there's no docker image for bench workers. The only way to run bench workers is:
1. Build cadence bench binary:
For now, this image has no release versions, in order to simplify the release process. Always use the `master` tag for the image.

Similar to the server/CLI images, the bench image is built and published automatically by GitHub on every commit to the `master` branch.

You can use the [pre-built docker-compose file](../docker/docker-compose-bench.yml) to run against a local server.
In the `docker/` directory, run:
```
docker-compose -f docker-compose-bench.yml up
```
You can modify [the bench worker config](../docker/config/bench/development.yaml) to run against a prod server cluster.

Or you may run it with Kubernetes, for [example](https://github.com/longquanzheng/cadence-lab/blob/master/eks/bench-deployment.yaml).



#### 2. Build & Run the binary

In the project root, build cadence bench binary:
```
make cadence-bench
```
2. Start bench workers:

Then start bench worker:
```
./cadence-bench start
```
By default, it will load the configuration in `config/bench/development.yaml`. Please run `./cadence-bench -h` for details on how to change the configuration directory and file used.
3. Note that, unlike canary, starting bench worker will not automatically start a bench test. Next two sections will cover how to start and configure it.
By default, it will load [the configuration in `config/bench/development.yaml`](../config/bench/development.yaml).
Run `./cadence-bench -h` for details on the start options, including how to change the configuration directory if needed.

Worker Configurations
----------------------
Bench workers configuration contains two parts:
- **Bench**: this part controls the client side, including the bench service name, which domains bench workers are responsible for and how many taskLists each domain should use.
```yaml
bench:
name: "cadence-bench" # bench name
domains: ["cadence-bench", "cadence-bench-sync", "cadence-bench-batch"] # it will start workers on all those domains(also try to register if not exists)
numTaskLists: 3 # it will start workers listening on cadence-bench-tl-0, cadence-bench-tl-1, cadence-bench-tl-2
```
1. Bench workers will only poll from task lists whose names start with `cadence-bench-tl-`. If `numTaskLists` is set to 2 in the configuration, workers will only listen to `cadence-bench-tl-0` and `cadence-bench-tl-1`. So make sure you use a valid task list name when starting the bench load.
2. When bench workers start, they will try to register a **local domain with archival feature disabled** for each domain name listed in the configuration, if it does not already exist. If you want to test the performance of global domains and/or the archival feature, please register the domains first before starting the worker.

- **Cadence**: this controls how the bench worker should talk to Cadence server, which includes the server's service name and address.
```yaml
cadence:
service: "cadence-frontend" # frontend service name
host: "127.0.0.1:7933" # frontend address
```
- **Metrics**: metrics configuration. Similar to server metric emitter, only M3/Statsd/Prometheus is supported.
- **Log**: logging configuration. Similar to server logging configuration.

Note:
1. When bench workers start, they will try to register a **local domain with archival feature disabled** for each domain name listed in the configuration, if it does not already exist. If you want to test the performance of global domains and/or the archival feature, please register the domains first before starting the worker.
2. Bench workers will only poll from task lists whose names start with `cadence-bench-tl-`. If `numTaskLists` is set to 2 in the configuration, workers will only listen to `cadence-bench-tl-0` and `cadence-bench-tl-1`. So make sure you use a valid task list name when starting the bench load.
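
The task list naming rule in the note above can be sketched in Go as follows. This is only an illustration: `taskListNames` is a hypothetical helper written for this README, not a function from the bench code, and it assumes names are simply the prefix followed by an index starting at 0.

```go
package main

import "fmt"

// taskListPrefix is the prefix the README says bench workers poll from.
const taskListPrefix = "cadence-bench-tl-"

// taskListNames returns the task list names a worker would listen on for a
// given numTaskLists value. Hypothetical helper, for illustration only.
func taskListNames(numTaskLists int) []string {
	if numTaskLists <= 0 {
		return nil
	}
	names := make([]string, 0, numTaskLists)
	for i := 0; i < numTaskLists; i++ {
		names = append(names, fmt.Sprintf("%s%d", taskListPrefix, i))
	}
	return names
}

func main() {
	// With numTaskLists: 2, only indexes 0 and 1 are valid for --tl.
	fmt.Println(taskListNames(2))
}
```

Any `--tl` value passed when starting a load should be one of the names this produces for your configured `numTaskLists`.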

Bench Loads
Bench Load Types
-----------
This section briefly describes the purpose of each bench load and provides a sample command for running the load. Detailed descriptions for each test's configuration can be found in `bench/lib/config.go`

Please note that all load configurations in `config/bench` are for local development and illustration purposes only; they do not reflect the actual capability of Cadence server.

### Basic
This is the only bench test that doesn't require advanced visibility.
:warning: NOTE: This is the only bench test that doesn't require the advanced visibility feature on the server. Make sure you set `useBasicVisibilityValidation` to true if you run it with basic (db) visibility.
Also, basic visibility validation requires that only one test load runs in the same domain. This is because of a limitation of basic visibility: it does not allow using workflowType and status filters at the same time.

As the name suggests, this load tests the basic case of load testing.
You start a `launchWorkflow`, which executes some `launchActivities` to start `stressWorkflows`. The stressWorkflows then run activities sequentially or in parallel.
Once all stressWorkflows are started, the launchWorkflow waits for the stressWorkflow timeout plus a buffer time (default 5 minutes) before checking the status of all test workflows.

As the name suggests, this load tests the basic case of starting workflows and running activities sequentially or in parallel. Once all test workflows are started, it will wait for the test workflow timeout + 5 mins before checking the status of all test workflows. If the failure rate is too high, or if any open workflows are found, the test will fail.
Two criteria must be met to pass the verification:
1. No open workflows (an open workflow means the server may have lost some tasks and was not able to close the stressWorkflow)
2. Failed/timed-out workflows <= threshold (totalLaunchCount * failureThreshold)
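
The two criteria can be expressed as a small check. The sketch below is an assumption-laden illustration, not the verification code from the bench suite: the type `verificationCounts` and the function `passes` are hypothetical, and the comparison is done in floating point because the README does not say how the threshold is rounded.

```go
package main

import "fmt"

// verificationCounts holds the numbers the launch workflow would collect.
// Hypothetical type, for illustration only.
type verificationCounts struct {
	TotalLaunchCount int     // totalLaunchCount from the load config
	FailureThreshold float64 // failureThreshold from the load config
	OpenCount        int     // stressWorkflows still open at verification
	FailedCount      int     // stressWorkflows that failed
	TimeoutCount     int     // stressWorkflows that timed out
}

// passes applies the two criteria: no open workflows, and
// failed + timed-out <= totalLaunchCount * failureThreshold.
func passes(c verificationCounts) bool {
	if c.OpenCount > 0 {
		return false
	}
	maxThreshold := float64(c.TotalLaunchCount) * c.FailureThreshold
	return float64(c.FailedCount+c.TimeoutCount) <= maxThreshold
}

func main() {
	c := verificationCounts{TotalLaunchCount: 100, FailureThreshold: 0.01}
	fmt.Println(passes(c))
}
```

With 100 launched workflows and a `failureThreshold` of 0.01, at most one failed or timed-out workflow is tolerated, and a single open workflow fails the run regardless of the threshold.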

The basic load can also be run in "panic" mode by setting `"panicStressWorkflow": true,` to test whether the server can handle a large number of panicking workflows (which can be caused by a bad worker deployment).

@@ -86,31 +118,35 @@ Progress:
22, 2021-08-20T11:59:24-07:00, WorkflowExecutionCompleted

Result:
Run Time: 526 seconds
Run Time: 26 seconds
Status: COMPLETED
Output: "SuccessCount: 100, FailedCount: 0"
```
The test will return error if the test doesn't pass. There are two cases:
* The stress workflow couldn't finish within the timeout
* There are more failed workflows than expected (configured by `failureThreshold`)

### Cron
`Cron` itself is not a test. It is responsible for running multiple other tests in parallel or sequentially according to a cron schedule.

Tests in `Cron` are divided into multiple test suites. Tests in different test suites will be run in parallel, while tests within a test suite will be run in a random sequential order. Different test suites can also be run in different domains, which provides a way for testing the multi-tenant performance of Cadence server.
Output: "TEST PASSED. Details report: timeoutCount: 0, failedCount: 0, openCount:0, launchCount: 100, maxThreshold:1"

On the completion of each test, `Cron` will be signaled with the result of the test, which can be queried through:
```
cadence --do <domain> wf query --wid <workflowID of the Cron workflow> --qt test-results
```
This command will show the result of all completed tests.
When all tests complete, `Cron` will update the value of the `Passed` search attribute accordingly. `Passed` will be set to `true` only when all tests have passed, and `false` otherwise. Since the last event for cron workflow is always WorkflowContinuedAsNew, this search attribute can be used to tell whether one run of `Cron` is successful or not. You can see the search attribute value by adding `--psa` flag to workflow list commands when listing `Cron` runs.
The output/error result shows whether the test passed, along with a detailed report.
A sample cron configuration is in `config/bench/cron.json`, and it can be started with
```
cadence --do <domain> wf start --tl cadence-bench-tl-0 --wt cron-test-workflow --dt 30 --et 7200 --if config/bench/cron.json
```
Configuration of the basic load type. The config is passed to the launch workflow as its input parameter, using a JSON file.
```yaml
# configuration for launch workflow
useBasicVisibilityValidation: use basic(db based) visibility to verify the stress workflows, default false which requires advanced visibility on the server
totalLaunchCount : total number of stressWorkflows that started by the launchWorkflow
waitTimeBufferInSeconds : buffer time in addition of ExecutionStartToCloseTimeoutInSeconds to wait for stressWorkflows before verification, default 300(5 minutes)
routineCount : number of in-parallel launch activities that started by launchWorkflow, to start the stressWorkflows
failureThreshold : the threshold of failed stressWorkflow for deciding whether or not the whole testSuite failed.
maxLauncherActivityRetryCount : the max retry on launcher activity to start stress workflows, default: 5
contextTimeoutInSeconds : RPC timeout inside activities(e.g. starting a stressWorkflow) default 3s
# configuration for stress workflow
executionStartToCloseTimeoutInSeconds : StartToCloseTimeout of stressWorkflow, default 5m
chainSequence : number of steps in the stressWorkflow
concurrentCount : number of in-parallel activity(dummy activity only echo data) in a step of the stressWorkflow
payloadSizeBytes : payloadSize of echo data in the dummy activity
minCadenceSleepInSeconds : control sleep time between two steps in the stressWorkflow, actual sleep time = random(min,max), default: 0
maxCadenceSleepInSeconds : control sleep time between two steps in the stressWorkflow, actual sleep time = random(min,max), default: 0
panicStressWorkflow : if true, stressWorkflow will always panic, default false
```

### Cancellation
The load tests the StartWorkflowExecution and CancelWorkflowExecution sync API, and validates the number of cancelled workflows and if there's any open workflow.
@@ -147,4 +183,30 @@ Typical usage is the same as the concurrent execution load above. Run it in para
Sample configuration can be found in `config/bench/timer.json` and it can be started with
```
cadence --do <domain> wf start --tl cadence-bench-tl-0 --wt timer-load-test-workflow --dt 30 --et 3600 --if config/bench/timer.json
```

### Cron: Run all the workloads as a TestSuite

:warning: NOTE: This requires a search attribute named `Passed` of boolean type. This search attribute should already have been added to the [ES schema](/schema/elasticsearch).
Make sure the dynamic config also has [this search attribute (`frontend.validSearchAttributes`)](/config/dynamicconfig/development_es.yaml), so that Cadence server can recognize it.
* Validate `Passed` has been successfully added in the dynamic config:
```
cadence cluster get-search-attr
```

`Cron` itself is not a test. It is responsible for running all other tests in parallel or sequentially according to a cron schedule.

Tests in `Cron` are divided into multiple test suites. Tests in different test suites will be run in parallel, while tests within a test suite will be run in a random sequential order. Different test suites can also be run in different domains, which provides a way for testing the multi-tenant performance of Cadence server.

On the completion of each test, `Cron` will be signaled with the result of the test, which can be queried through:
```
cadence --do <domain> wf query --wid <workflowID of the Cron workflow> --qt test-results
```
This command will show the result of all completed tests.

When all tests complete, `Cron` will update the value of the `Passed` search attribute accordingly. `Passed` will be set to `true` only when all tests have passed, and `false` otherwise. Since the last event for cron workflow is always WorkflowContinuedAsNew, this search attribute can be used to tell whether one run of `Cron` is successful or not. You can see the search attribute value by adding `--psa` flag to workflow list commands when listing `Cron` runs.
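
The scheduling and result rules described above (suites in parallel, tests inside a suite in random sequential order, `Passed` true only if every test passed) can be sketched in plain Go. This is not the `Cron` workflow implementation: `benchTest`, `testSuite` and `runSuites` are hypothetical names, and goroutines stand in for whatever mechanism the workflow actually uses.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// benchTest is a named test that reports whether it passed (hypothetical).
type benchTest struct {
	Name string
	Run  func() bool
}

// testSuite groups tests that run one after another in one domain (hypothetical).
type testSuite struct {
	Domain string
	Tests  []benchTest
}

// runSuites runs suites in parallel and the tests of each suite in a random
// sequential order. It returns true only when every test passed.
func runSuites(suites []testSuite) bool {
	results := make([]bool, len(suites)) // one slot per suite, no shared writes
	var wg sync.WaitGroup
	for i, s := range suites {
		wg.Add(1)
		go func(i int, s testSuite) {
			defer wg.Done()
			ok := true
			for _, idx := range rand.Perm(len(s.Tests)) {
				if !s.Tests[idx].Run() {
					ok = false
				}
			}
			results[i] = ok
		}(i, s)
	}
	wg.Wait()
	for _, ok := range results {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	pass := func() bool { return true }
	suites := []testSuite{
		{Domain: "cadence-bench", Tests: []benchTest{{"basic", pass}, {"timer", pass}}},
		{Domain: "cadence-bench-sync", Tests: []benchTest{{"cancellation", pass}}},
	}
	fmt.Println("Passed:", runSuites(suites))
}
```

A failing test does not stop the remaining tests in this sketch; whether `Cron` itself continues after a failure is not stated here, so treat that as an assumption.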

A sample cron configuration is in `config/bench/cron.json`, and it can be started with
```
cadence --do <domain> wf start --tl cadence-bench-tl-0 --wt cron-test-workflow --dt 30 --et 7200 --if config/bench/cron.json
```
27 changes: 16 additions & 11 deletions bench/lib/config.go
@@ -89,17 +89,22 @@ type (

// BasicTestConfig contains the configuration for running the Basic test scenario
BasicTestConfig struct {
TotalLaunchCount int `yaml:"totalLaunchCount"`
RoutineCount int `yaml:"routineCount"`
ChainSequence int `yaml:"chainSequence"`
ConcurrentCount int `yaml:"concurrentCount"`
PayloadSizeBytes int `yaml:"payloadSizeBytes"`
MinCadenceSleepInSeconds int `yaml:"minCadenceSleepInSeconds"`
MaxCadenceSleepInSeconds int `yaml:"maxCadenceSleepInSeconds"`
ExecutionStartToCloseTimeoutInSeconds int `yaml:"executionStartToCloseTimeoutInSeconds"` // default 5m
ContextTimeoutInSeconds int `yaml:"contextTimeoutInSeconds"` // default 3s
PanicStressWorkflow bool `yaml:"panicStressWorkflow"` // default false
FailureThreshold float64 `yaml:"failureThreshold"`
// Launch workflow config
UseBasicVisibilityValidation bool `yaml:"useBasicVisibilityValidation"` // use basic(db based) visibility to verify the stress workflows, default false which requires advanced visibility on the server
TotalLaunchCount int `yaml:"totalLaunchCount"` // total number of stressWorkflows that started by the launchWorkflow
RoutineCount int `yaml:"routineCount"` // number of in-parallel launch activities that started by launchWorkflow, to start the stressWorkflows
FailureThreshold float64 `yaml:"failureThreshold"` // the threshold of failed stressWorkflow for deciding whether or not the whole testSuite failed.
MaxLauncherActivityRetryCount int `yaml:"maxLauncherActivityRetryCount"` // the max retry on launcher activity to start stress workflows, default: 5
ContextTimeoutInSeconds int `yaml:"contextTimeoutInSeconds"` // RPC timeout inside activities(e.g. starting a stressWorkflow) default 3s
WaitTimeBufferInSeconds int `yaml:"waitTimeBufferInSeconds"` // buffer time in addition of ExecutionStartToCloseTimeoutInSeconds to wait for stressWorkflows before verification, default 300(5 minutes)
// Stress workflow config
ExecutionStartToCloseTimeoutInSeconds int `yaml:"executionStartToCloseTimeoutInSeconds"` // StartToCloseTimeout of stressWorkflow, default 5m
ChainSequence int `yaml:"chainSequence"` // number of steps in the stressWorkflow
ConcurrentCount int `yaml:"concurrentCount"` // number of in-parallel activity(dummy activity only echo data) in a step of the stressWorkflow
PayloadSizeBytes int `yaml:"payloadSizeBytes"` // payloadSize of echo data in the dummy activity
MinCadenceSleepInSeconds int `yaml:"minCadenceSleepInSeconds"` // control sleep time between two steps in the stressWorkflow, actual sleep time = random(min,max), default: 0
MaxCadenceSleepInSeconds int `yaml:"maxCadenceSleepInSeconds"` // control sleep time between two steps in the stressWorkflow, actual sleep time = random(min,max), default: 0
PanicStressWorkflow bool `yaml:"panicStressWorkflow"` // if true, stressWorkflow will always panic, default false
}

// SignalTestConfig is the parameters for signalLoadTestWorkflow