
Fix Prometheus frontend label inconsistencies #5087

Conversation

charlese-instaclustr (Contributor)

What changed?

  • Added dummy domain labels to the frontend metrics emitted by the admin frontend, so they conform to the label schema of normal frontend metrics.
  • Introduced a dedicated metrics scope for the DescribeCluster handler, namely AdminDescribeClusterScope.

Why?

Our team identified a bug whereby Prometheus would fail to report some frontend metrics because they shared a common fully-qualified name (e.g. cadence_requests and cadence_latency) but were inconsistent in whether they included a domain label. Prometheus cannot handle these inconsistent schemas: it only reports metrics that fit the first schema it encounters after startup, and attempting to report metrics that don't fit that dominant schema triggers the following warning:

{"level":"warn","ts":"2023-02-09T15:55:18.934Z","msg":"error in prometheus reporter","service":"cadence-frontend","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cadence_latency\", help: \"cadence_latency summary\", constLabels: {}, variableLabels: [transport cadence_service operation caller]} has different label names or a different help string","logging-call-at":"metrics.go:151"}
{"level":"warn","ts":"2023-02-09T15:55:18.934Z","msg":"error in prometheus reporter","service":"cadence-frontend","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cadence_requests\", help: \"cadence_requests counter\", constLabels: {}, variableLabels: [transport cadence_service operation caller]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

We identified this error happening in two different places:

  • Metrics reported from adminHandler, which previously did not tag metrics with domain labels
  • Metrics reported from accessControlledHandler#ListDomains specifically, because ListDomains, unlike the other functions in that handler, lacks a domain context

We also took the opportunity to introduce the AdminDescribeClusterScope metrics scope, because DescribeCluster currently reports metrics to AdminGetWorkflowExecutionRawHistoryV2Scope, which appears to be a copy-paste error.
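The fix amounts to making every emitter supply a domain label. Below is a minimal sketch of that pattern using uber-go/tally (the metrics library Cadence builds on); emitRequest and its signature are illustrative helpers for this sketch, not actual Cadence APIs.

```go
package metricsdemo

import "github.com/uber-go/tally"

// emitRequest always supplies a "domain" label, so every series behind
// the cadence_requests name shares a single schema. Handlers that have
// no domain context (the admin handlers, ListDomains) pass an empty
// domain and get the dummy "unknown" value instead.
func emitRequest(scope tally.Scope, operation, domain string) {
	if domain == "" {
		domain = "unknown" // dummy label keeps the schema consistent
	}
	scope.Tagged(map[string]string{
		"operation": operation,
		"domain":    domain,
	}).Counter("cadence_requests").Inc(1)
}
```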

How did you test it?

Test steps conducted:

  • Built Cadence off my branch
  • Ran Cadence with Cassandra
  • Ran cadence --do testDomain domain register
  • Ran cadence admin cluster describe
  • Ran cadence admin domain list
  • Collected the following metrics from Prometheus:

curl http://localhost:7739/metrics | grep cadence_requests
[Screenshot 2023-02-09 at 4:46:30 pm: cadence_requests grep output]
(Note that we can see an admin operation (DescribeCluster) being counted, as well as non-admin metrics carrying both dummy (e.g. RegisterDomain) and actual (e.g. PollForDecisionTask) domain labels. Also note that this shows our new metrics scope for DescribeCluster working successfully.)
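Since the screenshot does not survive in text form, the output had roughly the following shape (label values here are invented for illustration, not copied from the capture):

```
cadence_requests{caller="cli",cadence_service="cadence-frontend",domain="unknown",operation="DescribeCluster",transport="grpc"} 1
cadence_requests{caller="cli",cadence_service="cadence-frontend",domain="unknown",operation="RegisterDomain",transport="grpc"} 1
cadence_requests{caller="cli",cadence_service="cadence-frontend",domain="testDomain",operation="PollForDecisionTask",transport="grpc"} 3
```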

curl http://localhost:7739/metrics | grep cadence_latency_sum
[Screenshot 2023-02-09 at 4:47:38 pm: cadence_latency_sum grep output]
(Same as above, but here we are looking at latency metrics.)

curl http://localhost:7739/metrics | grep cadence_authorization_latency_sum
[Screenshot 2023-02-09 at 4:48:48 pm: cadence_authorization_latency_sum grep output]
(Note specifically the presence of the ListDomains operation, which previously would not have been reported at all.)

Potential risks

In the worst case, an error in this change could affect the availability of Cadence frontend metrics. However, the approach replicates the existing pattern used by workflowHandler for non-domain-specific functionality (we reuse its convention of labelling the domain as "unknown"), so it is expected to be robust. There are also no substantial changes to any of the metric reporting logic.

Release notes

Not notable for release.

Documentation Changes

None

@allenchen2244 (Contributor) left a comment:

Can't say no if Mantas say yes

@coveralls commented Feb 28, 2023

Pull Request Test Coverage Report for Build 0186c782-a1c3-4121-8d86-cff2e3e77af4

  • 1 of 3 (33.33%) changed or added relevant lines in 2 files are covered.
  • 57 unchanged lines in 8 files lost coverage.
  • Overall coverage increased (+0.003%) to 57.035%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| service/frontend/accessControlledHandler.go | 0 | 1 | 0.0% |
| service/frontend/adminHandler.go | 1 | 2 | 50.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| common/task/weightedRoundRobinTaskScheduler.go | 1 | 89.64% |
| common/cache/lru.go | 2 | 92.2% |
| common/task/fifoTaskScheduler.go | 2 | 85.57% |
| service/matching/matcher.go | 2 | 91.46% |
| service/matching/taskListManager.go | 3 | 81.15% |
| service/history/decision/task_handler.go | 5 | 72.33% |
| service/history/task/task_util.go | 20 | 70.57% |
| service/history/execution/mutable_state_task_refresher.go | 22 | 57.59% |

Totals:

  • Change from base Build 0186c3e4-3ee2-434a-a2db-e560b75c5442: 0.003%
  • Covered Lines: 85152
  • Relevant Lines: 149299

💛 - Coveralls

@allenchen2244 allenchen2244 enabled auto-merge (squash) March 9, 2023 17:55
@allenchen2244 allenchen2244 merged commit 9643242 into uber:master Mar 9, 2023