
Fix Prometheus frontend label inconsistencies #5087

Conversation

charlese-instaclustr (Contributor)

What changed?

  • Added dummy domain labels to the frontend metrics emitted by the admin frontend, so they conform to the label schema of normal frontend metrics.
  • Introduced a dedicated metrics scope for the DescribeCluster handler, namely AdminDescribeClusterScope.

Why?

Our team identified a bug whereby Prometheus would fail to report some frontend metrics because they shared a common fully-qualified name (e.g. cadence_requests and cadence_latency) but were inconsistent in whether they included a domain label. Prometheus cannot handle these inconsistent schemas: it only reports metrics that fit the first schema it encounters after startup, and attempting to report metrics that don't fit that dominant schema triggers the following warning:

{"level":"warn","ts":"2023-02-09T15:55:18.934Z","msg":"error in prometheus reporter","service":"cadence-frontend","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cadence_latency\", help: \"cadence_latency summary\", constLabels: {}, variableLabels: [transport cadence_service operation caller]} has different label names or a different help string","logging-call-at":"metrics.go:151"}
{"level":"warn","ts":"2023-02-09T15:55:18.934Z","msg":"error in prometheus reporter","service":"cadence-frontend","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cadence_requests\", help: \"cadence_requests counter\", constLabels: {}, variableLabels: [transport cadence_service operation caller]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

We identified this error happening in two different places:

  • Metrics reported from adminHandler, which previously did not tag metrics with domain labels
  • Metrics reported from accessControlledHandler#ListDomains specifically, because ListDomains, unlike the other functions in that handler, lacks a domain context

We also took the opportunity to introduce the AdminDescribeClusterScope metrics scope, because DescribeCluster currently reports metrics to AdminGetWorkflowExecutionRawHistoryV2Scope, which appears to be a copy-paste error.
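The fix amounts to making every emitter supply a domain label. Below is a minimal sketch of that pattern using uber-go/tally (the metrics library Cadence builds on); emitRequest and its signature are illustrative helpers for this sketch, not actual Cadence APIs.

```go
package metricsdemo

import "github.com/uber-go/tally"

// emitRequest always supplies a "domain" label, so every series behind
// the cadence_requests name shares a single schema. Handlers that have
// no domain context (the admin handlers, ListDomains) pass an empty
// domain and get the dummy "unknown" value instead.
func emitRequest(scope tally.Scope, operation, domain string) {
	if domain == "" {
		domain = "unknown" // dummy label keeps the schema consistent
	}
	scope.Tagged(map[string]string{
		"operation": operation,
		"domain":    domain,
	}).Counter("cadence_requests").Inc(1)
}
```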

How did you test it?

Test steps conducted:

  • Built Cadence off my branch
  • Ran Cadence with Cassandra
  • Ran cadence --do testDomain domain register
  • Ran cadence admin cluster describe
  • Ran cadence admin domain list
  • Collected the following metrics from Prometheus:

curl http://localhost:7739/metrics | grep cadence_requests
[Screenshot 2023-02-09 at 4:46:30 pm: cadence_requests grep output]
(Note that we can see an admin operation (DescribeCluster) being counted, as well as non-admin metrics carrying both dummy (e.g. RegisterDomain) and actual (e.g. PollForDecisionTask) domain labels. Also note that this shows our new metrics scope for DescribeCluster working successfully.)
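Since the screenshot does not survive in text form, the output had roughly the following shape (label values here are invented for illustration, not copied from the capture):

```
cadence_requests{caller="cli",cadence_service="cadence-frontend",domain="unknown",operation="DescribeCluster",transport="grpc"} 1
cadence_requests{caller="cli",cadence_service="cadence-frontend",domain="unknown",operation="RegisterDomain",transport="grpc"} 1
cadence_requests{caller="cli",cadence_service="cadence-frontend",domain="testDomain",operation="PollForDecisionTask",transport="grpc"} 3
```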

curl http://localhost:7739/metrics | grep cadence_latency_sum
[Screenshot 2023-02-09 at 4:47:38 pm: cadence_latency_sum grep output]
(Same as above, but here we are looking at latency metrics.)

curl http://localhost:7739/metrics | grep cadence_authorization_latency_sum
[Screenshot 2023-02-09 at 4:48:48 pm: cadence_authorization_latency_sum grep output]
(Note specifically the presence of the ListDomains operation, which previously would not have been reported at all.)

Potential risks

In the worst case, an error in this change could affect the availability of Cadence frontend metrics. However, the approach replicates the existing pattern used by workflowHandler for non-domain-specific functionality (we reuse its convention of labelling the domain as "unknown"), so it is expected to be robust. There are also no substantial changes to any of the metric reporting logic.

Release notes

Not notable for release.

Documentation Changes

None

@allenchen2244 (Contributor) left a comment:

Can't say no if Mantas say yes

@coveralls commented Feb 28, 2023

Pull Request Test Coverage Report for Build 0186c782-a1c3-4121-8d86-cff2e3e77af4

  • 1 of 3 (33.33%) changed or added relevant lines in 2 files are covered.
  • 57 unchanged lines in 8 files lost coverage.
  • Overall coverage increased (+0.003%) to 57.035%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| service/frontend/accessControlledHandler.go | 0 | 1 | 0.0% |
| service/frontend/adminHandler.go | 1 | 2 | 50.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| common/task/weightedRoundRobinTaskScheduler.go | 1 | 89.64% |
| common/cache/lru.go | 2 | 92.2% |
| common/task/fifoTaskScheduler.go | 2 | 85.57% |
| service/matching/matcher.go | 2 | 91.46% |
| service/matching/taskListManager.go | 3 | 81.15% |
| service/history/decision/task_handler.go | 5 | 72.33% |
| service/history/task/task_util.go | 20 | 70.57% |
| service/history/execution/mutable_state_task_refresher.go | 22 | 57.59% |

Totals:

  • Change from base Build 0186c3e4-3ee2-434a-a2db-e560b75c5442: 0.003%
  • Covered Lines: 85152
  • Relevant Lines: 149299

💛 - Coveralls

@allenchen2244 allenchen2244 enabled auto-merge (squash) March 9, 2023 17:55
@allenchen2244 allenchen2244 merged commit 9643242 into uber:master Mar 9, 2023