Prometheus Unknown MUST preserve its notion #3058

Closed
khanhntd opened this issue Dec 21, 2022 · 21 comments · Fixed by #3868
Labels
spec:metrics Related to the specification/metrics directory

Comments

@khanhntd
Contributor

What are you trying to achieve?

Preserve the notion of the Prometheus Unknown type, since it is currently converted to Gauge, as discussed in the following two issues: open-telemetry/wg-prometheus#69 and open-telemetry/opentelemetry-collector-contrib#16768

Additional context.

khanhntd added the spec:metrics (Related to the specification/metrics directory) label Dec 21, 2022
@damemi
Contributor

damemi commented Jun 15, 2023

Bump, was there any progress on this?

@dashpole
Contributor

dashpole commented Jul 1, 2023

Copying over the feedback from the last spec meeting in which this was discussed (2/7/2023):

  • Consider adding an Unknown metric type, potentially as a "subtype" of gauge. It should be collector/protocol/transport-only, and should not impact the API/SDK.
  • Please look at other common metric types which may have a notion of unknown metrics (statsd? other protocols supported by the collector?). What values can other unknown metrics have? String, or all float?

@dashpole
Contributor

dashpole commented Jul 1, 2023

Looking at collector receivers for other protocols, and the types they support:

| Protocol | Collector receiver | Metric types |
|---|---|---|
| OTLP MetricsData | otlp receiver | gauge, sum, histogram, exponentialhistogram, summary |
| OpenMetrics protocol | prometheus receiver | **unknown**, gauge, counter, **stateset**, **info**, histogram, **gaugehistogram** (no mapping to OTel), summary |
| statsd protocol | statsd receiver | counter, **sampled counter**, timer (histogram), gauge, **set** |
| influxdb line protocol | influxdb receiver | gauge, sum, histogram, summary |
| carbon plaintext protocol | carbon receiver | no types (becomes gauge or counter in receiver config) |
| collectd data types | collectd receiver | derive (sum), gauge, **counter** (overflow), **absolute** |
| signalfx proto format | signalfx receiver | gauge, (delta) counter, **enum**, cumulative counter |

Bolded types are types which I believe don't have a 1:1 mapping to OTel types. For example, prometheus info metrics can be mapped to gauge, but the mapping isn't 1:1, as prometheus already has a gauge type. Those types would benefit from a notion of a "subtype".

@dashpole
Contributor

dashpole commented Jul 1, 2023

Potential subtypes:

Gauge: Prometheus unknown, collectd absolute, signalfx enum
Sum: sampled counter, overflow counter, Prometheus stateset, Prometheus info, statsd set

The only issue with statsd sampled and collectd "overflow" counters is that we would need a way to represent additional numerical information to be useful, or distinguishable from a regular counter/sum. Sampled counters would need to record the sampling rate (e.g. 0.1 for a 10% sampling rate). Overflow counters would need to record the max possible value.

Because of that, I would propose that the "subtype" field be only present on gauges to start. I would structure it similarly to datapoint flags, so that it only allows a fixed set of types, rather than allowing a free-form set of values.
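
To make the fixed-set idea concrete, here is a hypothetical sketch of what such a gauge-only subtype enum could look like, written as Go constants in the style protoc generates. The names are illustrative only and are not part of the actual OTLP proto; the real proposal lives in open-telemetry/opentelemetry-proto#496.

```go
// Hypothetical sketch only: illustrative names, not the real OTLP proto.
package otlpsketch

// GaugeSubtype is a closed set of "subtypes" a gauge could carry, analogous to
// how DataPointFlags is a fixed enum rather than a free-form string.
type GaugeSubtype int32

const (
	// GaugeSubtypeUnspecified is the default: an ordinary OTLP gauge.
	GaugeSubtypeUnspecified GaugeSubtype = 0
	// GaugeSubtypePrometheusUnknown marks a gauge that came from a
	// Prometheus/OpenMetrics series with no TYPE information.
	GaugeSubtypePrometheusUnknown GaugeSubtype = 1
	// GaugeSubtypeCollectdAbsolute marks a collectd "absolute" value.
	GaugeSubtypeCollectdAbsolute GaugeSubtype = 2
	// GaugeSubtypeSignalfxEnum marks a signalfx "enum" value.
	GaugeSubtypeSignalfxEnum GaugeSubtype = 3
)
```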

@dashpole
Contributor

dashpole commented Jul 5, 2023

One issue I've just discovered is that we currently specify that Info and StateSet metrics should become non-monotonic sums, rather than gauges. In this case we would need to make the subtype apply to at least gauges and sums.

@dashpole
Contributor

dashpole commented Jul 6, 2023

Prior design document

Motivation and Goals

Prometheus and other protocols have metric types which are not explicitly supported in OTLP as their own type. The impact of this is that we either have to choose to entirely drop some metric types (e.g. gaugehistogram), or represent multiple types in the protocol as the same type in OTLP. The implication of representing multiple types as a single type in OTLP is that any distinction between the "colliding" types is lost, and isn't possible to reconstruct later if/when the distinction is important.

The motivating use-case is Prometheus' unknown metric type. An unknown-typed metric is a metric which has a value, but no type information. It could be a gauge, a sum, a bucket in a histogram, or the sum of a summary -- you can't tell at scrape time. Right now, both Prometheus gauge and unknown metrics types are converted to OTLP gauges. However, it isn't possible to later determine if an OTLP gauge started out as a Prometheus gauge or unknown. This matters when trying to convert back to Prometheus, or if you are sending metrics to a backend that handles gauges differently from other metric types (e.g. doesn't allow rates on gauges).
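
To illustrate the information loss described above, here is a minimal sketch using the collector's pdata API (assuming a recent go.opentelemetry.io/collector/pdata version; the metric names are made up): a series scraped with "# TYPE ... gauge" and a series scraped with no TYPE line both become plain OTLP gauges, and nothing is left to tell them apart.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// appendGauge builds an OTLP gauge the way the prometheus receiver conceptually
// does today for both gauge-typed and unknown-typed series.
func appendGauge(ms pmetric.MetricSlice, name string, value float64) pmetric.Metric {
	m := ms.AppendEmpty()
	m.SetName(name)
	m.SetEmptyGauge().DataPoints().AppendEmpty().SetDoubleValue(value)
	return m
}

func main() {
	ms := pmetric.NewMetricSlice()
	fromGauge := appendGauge(ms, "node_temperature_celsius", 21.5) // had TYPE gauge upstream
	fromUnknown := appendGauge(ms, "legacy_app_value", 21.5)       // had no TYPE upstream

	// Both print "Gauge": the unknown origin of the second metric is unrecoverable.
	fmt.Println(fromGauge.Type(), fromUnknown.Type())
}
```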

Goal: Preserve type information for prometheus unknown, info, and stateset metric types

Requirements

  • Backwards-compatible changes to OTLP
  • No changes to OTel API or SDK. Only changes to the protocol and collector.
  • Collector receivers should not be required to set any new fields when constructing OTel metrics.
  • Collector exporters should not be required to handle the new fields to have correct behavior.

Proposal: Subtype Enums for Gauge and Sum

Prototype: open-telemetry/opentelemetry-proto#496

These additional "types" from other protocols can be thought of as "subtypes" of existing OTLP types, since they generally meet the requirements of our types, but carry additional information or stricter requirements. Since all unsupported metric types from other collector-supported protocols are gauges or sums, we only need to support gauge and sum subtypes to begin with.

Since this is a new field, using it would be opt-in. Existing collector exporters would not see a change in behavior unless they wanted to specifically handle these types differently from their OTel type. Using separate enums for gauges and sums ensures that subtypes are always properly associated with the correct OTel type (e.g. unknown with gauges). Enums also ensure that there are no naming conflicts between different possible subtypes, as the list of subtypes is part of the proto.

Alternative: Prometheus type as an additional metric attribute

Attempted in #3056

The prometheus receiver would add a prometheus.type=unknown attribute to metrics with unknown types. This would meet the goal of preserving additional type information, but it adds extra labels to metrics: downstream consumers would see "garbage" labels on their metrics unless they filter them out with collector processors.
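
As a rough sketch of what this alternative would mean in practice (again assuming the collector's pdata API; the metric name is made up), the receiver would stamp a prometheus.type attribute onto each data point, and that attribute then surfaces as an ordinary label downstream:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	m := pmetric.NewMetricSlice().AppendEmpty()
	m.SetName("legacy_app_value")

	dp := m.SetEmptyGauge().DataPoints().AppendEmpty()
	dp.SetDoubleValue(42)
	// The attribute-based alternative: encode the original Prometheus type as a
	// data point attribute, i.e. an extra label that every consumer will see.
	dp.Attributes().PutStr("prometheus.type", "unknown")

	fmt.Println(dp.Attributes().AsRaw()) // map[prometheus.type:unknown]
}
```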

Alternative: Subtype String for Gauge and Sum

Instead of an Enum, we could use a string. That would allow any additional subtypes to be defined without changes to OTLP. However, that would possibly result in collisions, and we would want to define semantic conventions for the subtypes. That is likely more work, and is not worth the added flexibility.

@dashpole
Contributor

dashpole commented Jul 6, 2023

See #3582 for how these subtypes would be used in the prometheus specification.

@dashpole
Contributor

dashpole commented Jul 6, 2023

cc @open-telemetry/wg-prometheus @jmacd @bogdandrutu

I've written up a proposal for how to support "subtypes" for metrics to solve this case. Feedback on the above is appreciated.

@dashpole
Contributor

dashpole commented Jul 6, 2023

cc @jsuereth as well

@jsuereth
Contributor

jsuereth commented Jul 7, 2023

I'm not sure I like representing Info + Stateset metrics as Sums. Sums imply the ability to add, either spatially or temporally, and that this is the default aggregation.

The definition you list in the PROTO file shows Stateset/Info metrics requiring the SUM to have a value of 1. Is that a temporal-only restriction? Do you still spatially aggregate by default?

My "litmus test" for OTLP metric types is whether or not the default aggregation makes sense. Unknown + Gauge, I feel, are in pretty similar boats. It seems to me, though, that Info + Stateset may actually be closer to Gauge than Sum, but I'd like to better understand what I'm seeing first.

@dashpole
Contributor

dashpole commented Jul 7, 2023

That decision was originally made in #2380 (comment). This preserves that decision, although placing it in the proto would make it permanent.

I do think you would want to sum by default when aggregating info and stateset metrics. They both act as a "count of things with properties" or a "count of things in a state". Since the definition is in the data model, you are right that we can't require the SUM to have a value of 1, since it could have been aggregated at that point. We would have to define them as a count of things in a state instead. If we ever added an API/SDK for these metrics, we would want to require that the measurements are always 1, but we can't require that in the data model.

@dashpole
Contributor

dashpole commented Jul 11, 2023

Notes from the Spec meeting today:

  • @tigrannajaryan: Makes sense to use this method to preserve prometheus types. Left feedback on the OTLP proto changes. @dashpole needs to consider and document how to handle version skew between old and new versions. Done here
  • @jsuereth : Make sure the default aggregation matches what we would expect for info and stateset.
  • @jmacd: Will review, and will reconsider whether info and statesets should be sums.

@dashpole
Contributor

I've removed stateset and info from the proto PR. For now, let's just address unknown-typed metrics, as that is not contentious, and is the most important of the types to support.

@pirgeo
Member

pirgeo commented Jul 21, 2023

Excuse my ignorance, but what is the point of this Prometheus unknown in the first place? Even in the OpenMetrics spec it says

Unknown SHOULD NOT be used.

It states that it should be used when it's impossible to determine what type the data is because it comes from some 3rd party integration. That means to me that we simply don't know what it is, and that is true for at least two hops (third party -> Prometheus -> OTLP). If we haven't managed (or cared) to fix the data at that point, does it really make sense to complicate the OTLP proto just to forward the information that we actually don't know what we are dealing with? Is it not reasonable to ask whoever is sending this data to make a decision before sending it off? I don't want to seem negative here, I would simply like to understand the importance/use-cases of this change.

@dashpole
Contributor

There are many preexisting Prometheus applications that don't send type information. This is primarily because prometheus' data model and the promql query language don't require type information to function. Today, it is strongly recommended to include type information in Prometheus endpoints, and most client libraries will include it automatically. But it is still useful to be able to support older exporters that don't include it.

That means to me that we simply don't know what it is, and that is true for at least two hops (third party -> Prometheus -> OTLP)

I think it would just be a single hop: Older Prometheus application -> OTel collector Prometheus receiver.

If we haven't managed (or cared) to fix the data at that point, does it really make sense to complicate the OTLP proto just to forward the information that we actually don't know what we are dealing with?

Fixing types isn't straightforward today, at least in the collector. We don't have processors for changing gauges to counters, or vice-versa. Additionally, we would still need this change to make it possible for a processor to identify metrics that were previously unknown. Users also generally expect to be able to treat prometheus unknown-typed metrics as any type of metric when querying with promql, so backends that want to offer promql querying may need to treat them in special ways.

@pirgeo
Member

pirgeo commented Jul 21, 2023

I see, thanks for clarifying. I guess there are many legacy things in Prometheus land that I don't know about. If we had these counter-to-gauge or gauge-to-counter converters we would not really need this notion anymore though, right? Instead of doing the special handling at query time, it would encourage users to do the special handling earlier in the pipeline. At the same time, the OTLP format would not have to include everything that Prometheus ever did/will do, including things that are mostly that way for legacy reasons.
As an example, when metrics from an app with Prometheus get scraped by the collector, one could convert all metrics that start with count (because all counters in the app are already named count_xyz) to counters and leave everything else as gauges.

@dashpole
Contributor

If we had these counter to gauge or gauge to counter converters we would not really need this notion anymore though, right? ... one could convert all metrics that start with count (because all counters in the app are already named count_xyz) to counters and leave everything else as gauges.

It would need to exist upstream of OTel (e.g. in the Prometheus server, or in the prometheus receiver). The collector's data model is essentially OTLP, so we would need this change if we wanted to offer that functionality in processors. It could exist in the receiver, but that potentially requires per-metric configuration at that point.

It is fairly common with prometheus to use a single configuration to scrape everything in a kubernetes cluster (example), with pod annotations used to opt into scraping. Being able to still collect all of the metrics, even if I don't know the types upfront, and potentially add them later, is still quite useful.

@pirgeo
Member

pirgeo commented Jul 22, 2023

I am just wondering: If you have unknowns today, would you do something like "transform everything that has datatype unknown to count" when you read it with PromQL? Or would you do "give me the metric xyz, and interpret it as count"? If that is the case, why not move that step to before transforming into OTLP? I get that it might be helpful to simply get all metrics and then deal with the problems on read, but users already have to configure their collectors one way or another. Why not add the conversion where prometheus is transformed to OTLP instead of putting the additional information on the OTLP messages?

I don't want to seem stubborn here. I probably just haven't spent enough time in big Prometheus deployments to see the impossibility of what I am proposing, but from the "outside" it seems a bit weird to extend OTLP (with all its well-defined data types) to say "Well, actually let's add this flag to say we don't know what that is. Let's just transport it and deal with that later.".

I also wonder if this could be a point that would encourage instrumentation authors to move away from unknown. If users were forced to set it up manually, they would put pressure on the authors of the libraries, and in turn everybody's data quality goes up (although this may be wishful thinking).

@dashpole
Contributor

would you do something like "transform everything that has datatype unknown to count" when you read it with PromQL? Or would you do "give me the metric xyz, and interpret it as count"?

With PromQL today, Prometheus users can say, "give me the metric xyz, and interpret it as count" (e.g. perform a rate on it).

users already have to configure their collectors one way or another. Why not add the conversion where prometheus is transformed to OTLP instead of putting the additional information on the OTLP messages?

It is common for a single, static prometheus scrape configuration to be used for the entire cluster, with limited, per-application configuration done using pod annotations or similar. It is also common for the team that manages ops workloads to be different from those managing the application being scraped, and they may not know anything about the metrics they are collecting at scrape time.

@pirgeo
Member

pirgeo commented Jul 25, 2023

With PromQL today, Prometheus users can say, "give me the metric xyz, and interpret it as count" (e.g. perform a rate on it).

I assume users with dashboards already set up can do that independent of how the data was ingested (i.e., it does not matter if unknown was propagated all the way to the backend). Is that correct?

It is also common for the team that manages ops workloads to be different from those managing the application being scraped, and they may not know anything about the metrics they are collecting at scrape time.

Gotcha. Would that not be an even stronger case for making the person who creates and actually knows which metrics they produce say "This is a gauge" or "This is a counter"?

I do think that this issue is a prototype of future OTel <> Prometheus interoperability questions. I actually don't know what the plan for that is. Will OTel support everything that has ever come or will ever come out of Prometheus land? Is there a cutoff point where OTel says "We will not implement a legacy Prometheus feature in the name of OTLP data quality"? This was the point I was trying to make here. If we go the "we will do anything for Prom compatibility" route, what happens if Prometheus introduces a feature that is incompatible with OTel? Will we break OTLP and release a 2.0 to deal with that? How will that affect users?

I don't think this is for us to decide here, but something for the TC (@open-telemetry/technical-committee) to discuss and decide. Personally, I think it would be fair to ask users who move their data from Prometheus to OTLP to fix their integrations on the client side (or put pressure on the authors of instrumentations to update, or even switch to OTel SDKs). Otherwise, it's a "garbage in, garbage out" situation in my eyes. I understand that it would probably mean a lot of work in a lot of places in the short term, but ultimately striving for high data quality should be a goal for all telemetry producers and should provide value for everyone.

carlosalberto pushed a commit that referenced this issue Apr 8, 2024
…nown-typed metrics in OTLP (#3868)

Fixes #3058
Fixes #1712
Fixes #2409

## Changes

Use metric.metadata (added in open-telemetry/opentelemetry-proto#514) to support representing prometheus metric types in OTLP.
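
For reference, a minimal sketch of what the adopted approach looks like from the collector side, assuming a pdata version that exposes the Metric.Metadata() map backing the new proto field (the metric name is made up): the original Prometheus type rides along as metric-level metadata rather than as a data point attribute, so it does not add labels to the series.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	m := pmetric.NewMetricSlice().AppendEmpty()
	m.SetName("legacy_app_value")
	m.SetEmptyGauge().DataPoints().AppendEmpty().SetDoubleValue(42)

	// Record the original Prometheus type as metric-level metadata
	// (backed by metric.metadata in the proto) instead of as a label.
	m.Metadata().PutStr("prometheus.type", "unknown")

	// A consumer that cares (e.g. a Prometheus exporter) can read it back and
	// re-emit the series as unknown-typed; everyone else can ignore it.
	if v, ok := m.Metadata().Get("prometheus.type"); ok {
		fmt.Println("original type:", v.Str())
	}
}
```
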
@mx-psi
Member

mx-psi commented Aug 20, 2024

Should metric metadata be described in the Metrics data model? The situation seems similar to #4053

carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this issue Oct 31, 2024
…nown-typed metrics in OTLP (open-telemetry#3868)

Fixes open-telemetry#3058
Fixes open-telemetry#1712
Fixes open-telemetry#2409

## Changes

Use metric.metadata (added in open-telemetry/opentelemetry-proto#514) to support representing prometheus metric types in OTLP.