Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial addition of Kafka metrics. #2485

Merged
merged 18 commits into from
May 9, 2022

Conversation

carlosalberto
Copy link
Contributor

Initial addition of metrics Kafka metrics as reported by the JMX gatherer (leaving out the Consumer specific ones for now).

The present PR has what the JMX Gatherer has today, but I'd like to get feedback regarding some potential changes:

  • Some of these are pre-cooked metrics (directly from Kafka), e.g. counts in nanoseconds, some of them in milliseconds instead, and more importantly 50th/99th percentiles. Do we 'massage' them?
  • All the gauges reported here should maybe be converted from Gauge to UpDownCounter.
  • Name prefix for broker-specific metrics - mostly, some of the general Kafka metrics could have a broker component IMO, e.g. kafka.message.count maybe should become kafka.broker.message.count?
  • A few of the metrics use a count sufix. The metric guidelines seem to suggest use of plurals instead? e.g. rename kafka.messages.count to kafka.messages (or kafka.total-messages).
  • Is there any preference between usage of - and _ in names?

cc @jmacd

@carlosalberto carlosalberto requested review from a team April 11, 2022 16:13
@pyohannes
Copy link
Contributor

It would be worth collaborating with Kafka folks on this. There are ideas on Kafka side to natively expose OTel Metrics. I'm not sure about the state of this effort, but there's a recent proposal here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability#KIP714:Clientmetricsandobservability-Metricsnamingandformat

The metric names proposed in the above document seem to follow a different convention, neither _ nor - are used.

@carlosalberto
Copy link
Contributor Author

@pyohannes That looks interesting, thanks for posting that!

I think the plan is to keep the current JMX metrics around, but we should unify the used metric conventions first (then, when Kafka goes OTLP we can stop relying on JMX, etc). Will drop them a note ;)

…ka.md

Co-authored-by: Reiley Yang <reyang@microsoft.com>
@reyang reyang added area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory labels Apr 13, 2022
Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @carlosalberto for the initial draft. I've put a lot of potentially debatable suggestions here to help the community make decisions. Generally I would like us to specify the instrument with the best semantic fit, even though the standard for Kafka is to reduce the cost of the instrument. An example is request timing: the tradition is to use a Summary with individual quantile series, we'd prefer a Histogram, but really either is semantically meaningful for the same convention.

@jmacd
Copy link
Contributor

jmacd commented Apr 22, 2022

I understand now, the reason this PR and these conventions are problematic is that we have metrics data that is semantically produced from a Histogram or Summary instrument, but it has been precomputed by an external metrics system and is made available as individual Sum, Count, and quantile timeseries. As an OpenTelemetry API user, we have no way to capture a precomputed summary into OTLP as a single stream via the API, so there is a temptation to bypass the semantic definition and perpetuate the use of pre-calculated metrics. To me, it is not a good outcome if we specify individual precomputed timeseries in situations where semantically a Histogram or Summary instrument would be used, unless we also specify semantic-naming conventions for translating between OTLP Summary data points and individual timeseries.

I'm imagining optional, dedicated OTLP Exporter support for recognizing that a set of metrics are named like Summary timeseries (e.g., X_sum, X_total, X_p50, X_p90 and so on) that the Exporter SHOULD recombine those series into individual Summary data points. That is, if we have both a set of naming conventions for precomputed Summary timeseries as well as a directive for Exporters to rewrite them as Summary points on the wire, then I think we have satisfied our mission; these semantic conventions will refer to Histogram semantics, receivers will pass through precomputed timeseries, and OTLP exporters will correct the problem.

@carlosalberto
Copy link
Contributor Author

@jmacd Updated the PR.

Summary is that the metrics that can be reported as Summary/Histogram will be added in a follow up PR, so I will be removing these for now.

@carlosalberto
Copy link
Contributor Author

An additional note: all the counter metrics seem to be rates (e.g. kafka.requests.failed comes from FailedProduceRequestsPerSec, and kafka.request.count comes from TotalProduceRequestsPerSec).

Existing instrumentation (e.g. DataDog) seems to use these values as rate, rather than counter; and this IBM documentation hints that the granularity is 1 second: https://www.ibm.com/docs/en/obi/current?topic=technologies-monitoring-kafka

Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the topic of precomputed rate metrics, see #2485 (comment).

Suggest using -ratio for compression ratio, which is different than all the other rates in this PR. I've suggested changing some of the UpDownCounters to Gauge when the units are unitless (as -ratio) or {things}/s (as rates derived from counters).

You see there are three .rate suggestions here. I would support a generic semantics convention stating that a metric named X.rate is the rate derived from a Counter or UpDownCounter named X; this would allow us to specify the conventions as Counters even though current instrumentation reports the rate.

@carlosalberto
Copy link
Contributor Author

@jmacd @reyang Please review ;)

carlosalberto and others added 2 commits May 6, 2022 11:28
…ka.md

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
carlosalberto and others added 2 commits May 6, 2022 17:35
…ka.md

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@reyang reyang mentioned this pull request May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants