Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JitterDelay option when creating workflows. #4886

Merged
merged 9 commits into from
Jul 16, 2022
Merged

Add JitterDelay option when creating workflows. #4886

merged 9 commits into from
Jul 16, 2022

Conversation

ZackLK
Copy link
Contributor

@ZackLK ZackLK commented Jun 29, 2022

What changed?
Added a JitterStart parameter for the customer to specify a # of seconds to jitter the start of their workflow.

Why?
When customers create a large # of workflows and specify they should all start at the same time, it can cause a thundering herd problem. This parameter allows the start time to be jittered so that the large # of workflows can be started at slightly different times.

How did you test it?
Unit test and tested manually via CLI:
./cadence --domain samples-domain workflow start --tl helloWorldGroup --wt main.helloWorldWorkflow --et 30 --input '"Jittered"' --jitter_start_seconds 5

Potential risks

Release notes

Documentation Changes

@ZackLK ZackLK requested review from demirkayaender and a team and removed request for a team June 29, 2022 21:42
// Add a random jitter to start time, if requested.
jitterStartSeconds := startRequest.GetJitterStartSeconds()
if jitterStartSeconds > 0 {
firstDecisionTaskBackoffSeconds += rand.Int31n(jitterStartSeconds + 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rand has a seed value somewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, not sure that it does. Be careful to call rand.Seed(time.Now().UnixNano()) someplace lest you get a surprisingly deterministic result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call - turns out we already seed rand in host/service.go for every service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect

Copy link
Member

@davidporter-id-au davidporter-id-au left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries with this diff as it's self-evident what's going on, but I'd normally suggest pulling the IDL changes into a separate diff/PR so as to not obscure any changes

common/util.go Outdated
@@ -530,6 +530,12 @@ func CreateHistoryStartWorkflowRequest(
firstDecisionTaskBackoffSeconds += delayStartSeconds
}

// Add a random jitter to start time, if requested.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jitter should also apply to each occurrence of a workflow when there is a cron schedule. Otherwise, only the first run of a cron workflow will benefit from the jitter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is now working - see my latest commit.

@ZackLK
Copy link
Contributor Author

ZackLK commented Jul 11, 2022

Tested new cron changes via:
./cadence --domain samples-domain workflow start --tl helloWorldGroup --wt main.helloWorldWorkflow --et 30 --input '"Jittered"' --jitter_start_seconds 5 --cron '@every 10s'

(in cadence-samples) bin/helloworld -m worker

And noting that the workflow doesn't run every 10 seconds, but rather runs between 10-15 seconds every 10 seconds.

@ZackLK ZackLK requested a review from emrahs July 11, 2022 21:17
@coveralls
Copy link

coveralls commented Jul 11, 2022

Pull Request Test Coverage Report for Build 01820429-aee4-4cb0-bd69-9e7b43d311ab

  • 48 of 71 (67.61%) changed or added relevant lines in 14 files are covered.
  • 69 unchanged lines in 13 files lost coverage.
  • Overall coverage decreased (-0.02%) to 57.727%

Changes Missing Coverage Covered Lines Changed/Added Lines %
tools/cli/workflowCommands.go 1 4 25.0%
common/types/shared.go 6 15 40.0%
service/frontend/workflowHandler.go 4 15 26.67%
Files with Coverage Reduction New Missed Lines %
common/types/shared.go 1 41.75%
service/history/execution/mutable_state_builder.go 2 69.05%
service/history/execution/mutable_state_util.go 2 36.14%
service/history/task/transfer_active_task_executor.go 2 71.43%
common/persistence/dataManagerInterfaces.go 3 58.9%
common/persistence/serialization/parser.go 4 62.41%
common/persistence/serialization/thrift_decoder.go 4 53.06%
common/types/mapper/thrift/shared.go 4 63.77%
common/task/fifoTaskScheduler.go 5 80.41%
common/persistence/sql/workflowStateMaps.go 8 81.25%
Totals Coverage Status
Change from base Build 0181d5e0-475a-4e8c-abb3-eb4952984e85: -0.02%
Covered Lines: 83710
Relevant Lines: 145010

💛 - Coveralls

Copy link
Contributor

@demirkayaender demirkayaender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check if we have a test for next cron start? Based on the code it looks like next cron schedule should consider jitter again (expected behavior) as opposed to delay_start not being considered (expected behavior).

If this is easy to test, we should test; it can simply be breakable with a code change later.


If you make changes in the idls submodule and want to test them locally, you can easily do that by using go mod to use the local idls directory instead of github.com/uber/cadence-idl. Temporarily add the following to the bottom of go.mod:

```replace github.com/uber/cadence-idl => ./idls```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return roundedInterval

var jitter time.Duration
if jitterStartSeconds > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be good to validate jitterStartSeconds somewhere to make sure its max value is lower than the time till next schedule

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, we should return an error when it's negative.

{"* * * * *", "2018-12-17T08:00:00+00:00", 10, time.Second * 60},
{"* * * * *", "2018-12-17T08:00:10+00:00", 30, time.Second * 50},
{"* * * * *", "2018-12-17T08:00:25+00:00", 5, time.Second * 35},
{"* * * * *", "2018-12-17T08:00:59+00:00", 45, time.Second * 1},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try 60+ here and expect error

startTime time.Time,
closeTime time.Time,
jitterStartSeconds int32,
) time.Duration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are moving the jitter to this function, you might want to change the return type to (error, time.Duration)

@@ -2084,6 +2085,10 @@ func (wh *WorkflowHandler) StartWorkflowExecution(
return nil, wh.error(errInvalidDelayStartSeconds, scope, tags...)
}

if startRequest.GetJitterStartSeconds() < 0 {
return nil, wh.error(errInvalidJitterStartSeconds, scope, tags...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if startRequest.GetJitterStartSeconds() < 0 {
return nil, wh.error(errInvalidJitterStartSeconds, scope, tags...)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add the jitter < cron_interval check below

@ZackLK ZackLK merged commit 8c449b3 into master Jul 16, 2022
@ZackLK ZackLK deleted the jitter branch July 16, 2022 02:45
abhishekj720 pushed a commit to abhishekj720/cadence that referenced this pull request Jul 21, 2022
* Add JitterDelay option when creating workflows.

* Update idls

* getting jitter to work with cron

* Update IDL

Co-authored-by: David Porter <david.porter@uber.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants