large workflow hot shard detection #5166
Merged
33 commits
50cb929 large workflow hot shard detection (allenchen2244)
d4c8cd4 fix lint (allenchen2244)
5f2d7e7 Merge branch 'master' into large-history-shards (allenchen2244)
e38bb2a fix bad keyname (allenchen2244)
7b021c6 fix test (allenchen2244)
f0b5ada fix test (allenchen2244)
55fccb5 turn off to test (allenchen2244)
1c684fb change to overall shard (allenchen2244)
737d830 add operation too (allenchen2244)
7cb379e test (allenchen2244)
d72cff0 more logs (allenchen2244)
2399e31 try again and refactor (allenchen2244)
63e1fbf fix ssome shit (allenchen2244)
94df73e fix blob size (allenchen2244)
316ace6 test more (allenchen2244)
1d64236 remove testing code (allenchen2244)
15550d5 Merge branch 'master' into large-history-shards (allenchen2244)
3dfcab0 Merge branch 'master' into large-history-shards (Groxx)
cb02506 move stuff to context and change default value (allenchen2244)
1c87d11 add defaults (allenchen2244)
f3ae86c add check again domain value (allenchen2244)
52ce735 Update common/dynamicconfig/constants.go (allenchen2244)
5c4671d Update common/dynamicconfig/constants.go (allenchen2244)
189be4d call flipr only once (allenchen2244)
1f1d2a3 move out of persistence obj to fn arg (allenchen2244)
2b86428 Update service/history/execution/context.go (allenchen2244)
c5ae571 found the common minint64 (allenchen2244)
4f58e05 fix lint (allenchen2244)
280903a testing something (allenchen2244)
13522c6 add tag (allenchen2244)
8b9c5bf add 1 more field (allenchen2244)
edd9766 remove testing code (allenchen2244)
5dec082 Merge branch 'master' into large-history-shards (allenchen2244)
@@ -21,15 +21,49 @@
package execution

import (
	"strconv"
	"time"

	"github.com/uber/cadence/common"

	"github.com/uber/cadence/common/log"
	"github.com/uber/cadence/common/log/tag"
	"github.com/uber/cadence/common/metrics"
	"github.com/uber/cadence/common/persistence"
	"github.com/uber/cadence/common/types"
)

func (c *contextImpl) emitLargeWorkflowShardIDStats(blobSize int64, oldHistoryCount int64, oldHistorySize int64) {
	if c.shard.GetConfig().EnableShardIDMetrics() {
		shardIDStr := strconv.Itoa(c.shard.GetShardID())

		blobSizeWarn := common.MinInt64(int64(c.shard.GetConfig().LargeShardHistoryBlobMetricThreshold()), int64(c.shard.GetConfig().BlobSizeLimitWarn(c.GetDomainName())))
		// check if blob size is larger than threshold in Dynamic config if so alert on it every time
		if blobSize > blobSizeWarn {
			c.logger.SampleInfo("Workflow writing a large blob", c.shard.GetConfig().SampleLoggingRate(), tag.WorkflowDomainName(c.GetDomainName()),
				tag.WorkflowID(c.workflowExecution.GetWorkflowID()), tag.ShardID(c.shard.GetShardID()))
			c.metricsClient.Scope(metrics.LargeExecutionBlobShardScope, metrics.ShardIDTag(shardIDStr)).IncCounter(metrics.LargeHistoryBlobCount)

Review comment: kinda odd at a glance that this takes a string, but meh. maybe there's a reason for it.

		}

		historyCountWarn := common.MinInt64(int64(c.shard.GetConfig().LargeShardHistoryEventMetricThreshold()), int64(c.shard.GetConfig().HistoryCountLimitWarn(c.GetDomainName())))
		// check if the new history count is greater than our threshold and only count/log it once when it passes it
		// this might sometimes double count if the workflow is extremely fast but should be ok to get a rough idea and identify bad actors
		if oldHistoryCount < historyCountWarn && (c.mutableState.GetNextEventID()-1) >= historyCountWarn {
			c.logger.Warn("Workflow history event count is reaching dangerous levels", tag.WorkflowDomainName(c.GetDomainName()),
				tag.WorkflowID(c.workflowExecution.GetWorkflowID()), tag.ShardID(c.shard.GetShardID()))
			c.metricsClient.Scope(metrics.LargeExecutionCountShardScope, metrics.ShardIDTag(shardIDStr)).IncCounter(metrics.LargeHistoryEventCount)
		}

		historySizeWarn := common.MinInt64(int64(c.shard.GetConfig().LargeShardHistorySizeMetricThreshold()), int64(c.shard.GetConfig().HistorySizeLimitWarn(c.GetDomainName())))
		// check if the new history size is greater than our threshold and only count/log it once when it passes it
		if oldHistorySize < historySizeWarn && c.stats.HistorySize >= historySizeWarn {
			c.logger.Warn("Workflow history event size is reaching dangerous levels", tag.WorkflowDomainName(c.GetDomainName()),
				tag.WorkflowID(c.workflowExecution.GetWorkflowID()), tag.ShardID(c.shard.GetShardID()))
			c.metricsClient.Scope(metrics.LargeExecutionSizeShardScope, metrics.ShardIDTag(shardIDStr)).IncCounter(metrics.LargeHistorySizeCount)
		}
	}
}

func emitWorkflowHistoryStats(
	metricsClient metrics.Client,
	domainName string,

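For readers skimming the hunk, the two kinds of check in `emitLargeWorkflowShardIDStats` can be reduced to a small standalone sketch. It is only an illustration under assumptions: `minInt64`, `exceedsThreshold`, and `crossedThreshold` are hypothetical helpers written for this example (the PR itself calls `common.MinInt64` and inlines the comparisons), and the threshold numbers are made up. The blob-size check fires on every write above the limit, while the history count and size checks fire only on the update that moves the value from below the limit to at or above it.

```go
package main

import "fmt"

// minInt64 stands in for common.MinInt64 so the sketch is self-contained.
func minInt64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

// exceedsThreshold is the "alert every time" style used for blob size.
func exceedsThreshold(value, threshold int64) bool {
	return value > threshold
}

// crossedThreshold is the "alert once" style used for history count and size:
// true only when the value was below the threshold before the update and is
// at or above it afterwards.
func crossedThreshold(before, after, threshold int64) bool {
	return before < threshold && after >= threshold
}

func main() {
	// Hypothetical limits: the effective warn level is the smaller of the
	// shard-wide metric threshold and the per-domain warn limit.
	shardThreshold := int64(5000)
	domainWarn := int64(8000)
	warn := minInt64(shardThreshold, domainWarn)

	updates := []struct{ before, after int64 }{
		{4000, 4900}, // still below the limit
		{4900, 5100}, // crosses the limit
		{5100, 6000}, // already above the limit
	}
	for _, u := range updates {
		fmt.Printf("before=%d after=%d exceeds=%t crossed=%t\n",
			u.before, u.after,
			exceedsThreshold(u.after, warn),
			crossedThreshold(u.before, u.after, warn))
	}
}
```

Only the second update reports `crossed=true`, while both the second and third report `exceeds=true`. The code comment in the diff notes that the real check may still occasionally double count for very fast workflows.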
while I'm exploring the code in here: would any of these be more-correct if they came from `newWorkflow`? and/or would using `newWorkflow` mean we wouldn't need to maintain as much separately?
I'm not quite sure what the difference is tbh. I'd have to read more carefully. I don't think it'll be dangerous to use the wrong one, just possibly misleading (due to subtly-wrong values, or due to unnecessary duplicate calculations that could drift)
I don't think it matters. From what I can tell new workflow kind of derives the numbers the same way I am. Maybe it would be less duplicated code but at most it's saving 5 lines and makes it a bit more confusing to read imo.
the main reason to prefer using the new-workflow's data is that it'd make it clear what the source of truth is, and get rid of any risk of the calculations drifting away from that source of truth due to future changes.
which is an "if". tbh I'm not sure if it's more-correct or not.
I could try testing it later? The other 2 metric emission fns are kinda calculated the same way, so I wanted to keep it consistent. I don't wanna change just part of it, since it's super confusing if we never change the rest.
which ones are those?
but yea, later's fine (if ever)
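
The drift concern in this thread can be sketched in a few lines. This is a hypothetical illustration, not code from the PR: `historySnapshot`, `snapshotFrom`, `describeCount`, and `describeSize` are invented names, and whether `newWorkflow` would actually serve as such a source is left open above. The idea is that every emitter reads from one value computed in one place, so a future change to how the count is derived cannot leave one emitter behind.

```go
package main

import "fmt"

// historySnapshot is a hypothetical single source of truth for the numbers
// that several metric emitters need.
type historySnapshot struct {
	EventCount int64
	SizeBytes  int64
}

// snapshotFrom derives the values once. The event count uses nextEventID-1,
// the same expression the diff uses for the current history count.
func snapshotFrom(nextEventID, historySize int64) historySnapshot {
	return historySnapshot{EventCount: nextEventID - 1, SizeBytes: historySize}
}

// describeCount and describeSize stand in for separate emission functions.
// Neither recomputes anything; both read the shared snapshot.
func describeCount(s historySnapshot) string {
	return fmt.Sprintf("events=%d", s.EventCount)
}

func describeSize(s historySnapshot) string {
	return fmt.Sprintf("bytes=%d", s.SizeBytes)
}

func main() {
	s := snapshotFrom(101, 2048)
	fmt.Println(describeCount(s))
	fmt.Println(describeSize(s))
}
```

The trade-off raised in the replies still applies: a shared value removes duplicated arithmetic, but it only helps if all emission functions are switched over together.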