Fix prometheus frontend label inconsistencies #5087
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changed?
domain
labels to frontend metrics emitted by admin frontend, to conform with the formatting of normal frontend metrics.DescribeCluster
handler, namelyAdminDescribeClusterScope
Why?
Our team identified a bug whereby prometheus would fail to report some frontend metrics due to the fact that they shared a common fully-qualified name (e.g. cadence_requests and cadence_latency) but had inconsistencies in whether or not they included a
domain
label. For context, Prometheus cannot handle these inconsistent schemas, and will only report metrics which fit the first schema it encounters after startup. Attempting to report metrics which don't fit the dominant schema will trigger the following warning:We identified this error happening in two different places:
adminHandler
, which previously did not tag metrics with domain labelsaccessControlledHandler#ListDomains
specifically, because the ListDomains function lacks a domain context unlike other functions in that handlerWe also opportunistically introduced the
AdminDescribeClusterScope
metrics scope because DescribeCluster currently reports metrics to theAdminGetWorkflowExecutionRawHistoryV2Scope
which appears to be a copy-paste error.How did you test it?
Test steps conducted:
cadence --do testDomain domain register
cadence admin cluster describe
cadence admin domain list
curl http://localhost:7739/metrics | grep cadence_requests
(note that we can see an admin operation being counted (DescribeCluster) as well non-admin metrics that have both dummy e.g. RegisterDomain and actual e.g. PollForDecisionTask domain labels. Also note that this shows our new metric scope for DescribeCluster working successfully).
curl http://localhost:7739/metrics | grep cadence_latency_sum
(note same as above, but here we are looking at latency metrics)
curl http://localhost:7739/metrics | grep cadence_authorization_latency_sum
(note specifically the presence of the ListDomain operation, which would previously not have been possible)
Potential risks
In the worst case, an error in the change could affect availability of Cadence frontend metrics. However, the approach used replicates existing patterns from how the
workflowHandler
reports on non domain-specific functionality (we reuse the pattern of labelling domain as "unknown" from there), so is expected to be robust. There are also no substantial changes to any of the metric reporting logic.Release notes
Not notable for release.
Documentation Changes
None