Prometheus Unknown MUST preserve its notion #3058
Bump, was there any progress on this?
Copying over the feedback from the last spec meeting in which this was discussed (2/7/2023):
Looking at collector receivers for other protocols, and the types they support:
Bolded types are types which I believe don't have a 1:1 mapping to OTel types. For example, prometheus
Potential subtypes:

- Gauge: Prometheus unknown, collectd absolute, signalfx enum

The only issue with statsd sampled and collectd "overflow" counters is that we would need a way to represent additional numerical information for them to be useful, or distinguishable from a regular counter/sum. Sampled counters would need to record the sampling rate (e.g. 0.1 for a 10% sampling rate). Overflow counters would need to record the maximum possible value. Because of that, I would propose that the "subtype" field only be present on gauges to start. I would structure it similarly to datapoint flags, so that it only allows a fixed set of types rather than a free-form set of values.
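To make the point about extra numerical information concrete, here is a minimal, self-contained Go sketch (the types and the statsd line are illustrative, not part of OTLP or the collector) showing why a bare "sampled" subtype flag would not be enough for statsd sampled counters:

```go
// Hypothetical sketch: the sampling rate must travel with the point for a
// consumer to recover an estimate of the true count; a subtype alone loses it.
package main

import "fmt"

// sampledCounterPoint models a statsd counter reported with a sample rate,
// e.g. "hits:1|c|@0.1" means each reported increment stands for ~10 real events.
type sampledCounterPoint struct {
	reported   float64 // value as received
	sampleRate float64 // e.g. 0.1 for 10% sampling
}

// estimatedTrue scales the reported value back up by the sampling rate.
func (p sampledCounterPoint) estimatedTrue() float64 {
	return p.reported / p.sampleRate
}

func main() {
	p := sampledCounterPoint{reported: 42, sampleRate: 0.1}
	fmt.Println(p.estimatedTrue()) // 420 — unrecoverable if only a "sampled" marker were kept
}
```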
One issue I've just discovered is that we currently specify that Info and StateSet metrics should become non-monotonic sums, rather than gauges. In that case, we would need to make the subtype apply to at least gauges and sums.
**Prior design document**

**Motivation and Goals**

Prometheus and other protocols have metric types which are not explicitly supported in OTLP as their own type. The impact of this is that we either have to choose to entirely drop some metric types, or convert them lossily to an existing OTLP type. The motivating use-case is Prometheus' unknown type, which is currently converted to a Gauge.

Goal: Preserve type information for the prometheus unknown, info, and stateset metric types.

**Requirements**
**Proposal: Subtype Enums for Gauge and Sum**

Prototype: open-telemetry/opentelemetry-proto#496

These additional "types" from other protocols can be thought of as "subtypes" of existing OTLP types, since they generally meet the requirements of our types but carry additional information or stricter requirements. Since all unsupported metric types from other collector-supported protocols are gauges or sums, we only need to support gauge and sum subtypes to begin with.

Since this is a new field, using it would be opt-in. Existing collector exporters would not see a change in behavior unless they wanted to specifically handle these types differently from their OTel type. Using a different enum for gauges and sums ensures that subtypes are always properly associated with the correct OTel type (e.g. unknown with gauges). Enums also ensure that there are no naming conflicts between different possible subtypes, since the list of subtypes is part of the proto.

**Alternative: Prometheus type as an additional metric attribute**

Attempted in #3056. The prometheus receiver would add a metric attribute recording the original Prometheus type.

**Alternative: Subtype String for Gauge and Sum**

Instead of an enum, we could use a string. That would allow additional subtypes to be defined without changes to OTLP. However, it could result in collisions, and we would want to define semantic conventions for the subtypes. That is likely more work and is not worth the added flexibility.
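As a rough, hedged illustration of the proposed shape (this field does not exist in OTLP; the actual prototype is in open-telemetry/opentelemetry-proto#496), a closed gauge-subtype enum could surface in generated Go code roughly like this:

```go
// Hypothetical sketch of how a fixed gauge-subtype enum (as proposed above)
// might look once generated into Go — names and values are illustrative only.
package pmetricsketch

// GaugeSubtype enumerates a closed set of known gauge subtypes, similar in
// spirit to how data point flags are a fixed enum in the proto.
type GaugeSubtype int32

const (
	GaugeSubtypeUnspecified       GaugeSubtype = 0 // plain OTLP gauge
	GaugeSubtypePrometheusUnknown GaugeSubtype = 1 // Prometheus "unknown"-typed metric
	GaugeSubtypeCollectdAbsolute  GaugeSubtype = 2 // collectd ABSOLUTE data source
	GaugeSubtypeSignalFxEnum      GaugeSubtype = 3 // SignalFx enum
)
```

Because the set of values is fixed in the proto, new subtypes require a proto change, which is exactly the trade-off discussed against the string alternative above.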
See #3582 for how these subtypes would be used in the prometheus specification.
cc @open-telemetry/wg-prometheus @jmacd @bogdandrutu

I've written up a proposal for how to support "subtypes" for metrics to solve this case. Feedback on the above is appreciated.
cc @jsuereth as well
I'm not sure I like representing Info + Stateset metrics as Sums. Sums imply the ability to add, either spatially or temporally, and that this is the default aggregation. The definition you list in the proto file shows Stateset/Info metrics requiring the Sum to have a value of 1. Is that a temporal-only restriction? Do you still spatially aggregate by default?

My "litmus test" for OTLP metric types is whether or not the default aggregation makes sense. Unknown + Gauge, I feel, are in pretty similar boats. It seems to me, though, that Info + Stateset may actually be closer to Gauge than Sum, but I'd like to better understand what I'm seeing first.
That decision was originally made in #2380 (comment). This preserves that decision, although placing it in the proto would make it permanent. I do think you would want to sum by default when aggregating info and stateset metrics. They both behave as a "count of things with properties" or a "count of things in a state".

Since the definition is in the data model, you are right that we can't require the Sum to have a value of 1, since it could have been aggregated by that point. We would have to define them as a count of things in a state instead. If we ever added an API/SDK for these metrics, we would want to require that the measurements are always 1, but we can't require that in the data model.
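A small, self-contained Go sketch (illustrative only, not collector code) of that "count of things in a state" reading: each producer reports 1 for its current state, and summing after dropping the identifying attribute — the default spatial aggregation for a non-monotonic Sum — yields a sensible per-state count:

```go
package main

import "fmt"

// statesetPoint models one producer's StateSet sample: value 1 for the state
// it is currently in (attributes identify the producer and the state).
type statesetPoint struct {
	producer string
	state    string
	value    int64 // always 1 at the point of measurement
}

func main() {
	points := []statesetPoint{
		{"pod-a", "ready", 1},
		{"pod-b", "ready", 1},
		{"pod-c", "degraded", 1},
	}

	// Spatial aggregation: drop the producer attribute and sum. The result is
	// a count of producers in each state, so summing by default is meaningful.
	perState := map[string]int64{}
	for _, p := range points {
		perState[p.state] += p.value
	}
	fmt.Println(perState) // map[degraded:1 ready:2]
}
```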
Notes from the Spec meeting today:
I've removed stateset and info from the proto PR. For now, let's just address unknown-typed metrics, as that is not contentious and is the most important of the types to support.
Excuse my ignorance, but what is the point of this Prometheus unknown type in the first place? Even the OpenMetrics spec says it should be used when it's impossible to determine what type the data is, because it comes from some 3rd-party integration. That means to me that we simply don't know what it is, and that is true for at least two hops (third party -> Prometheus -> OTLP). If we haven't managed (or cared) to fix the data by that point, does it really make sense to complicate the OTLP proto just to forward the information that we actually don't know what we are dealing with? Is it not reasonable to ask whoever is sending this data to make a decision before sending it off?

I don't want to seem negative here; I would simply like to understand the importance and use-cases of this change.
There are many preexisting Prometheus applications that don't send type information. This is primarily because prometheus' data model and the promql query language don't require type information to function. Today, it is strongly recommended to include type information in Prometheus endpoints, and most client libraries will include it automatically. But it is still useful to be able to support older exporters that don't include it.
I think it would just be a single hop: Older Prometheus application -> OTel collector Prometheus receiver.
Fixing types isn't straightforward today, at least in the collector. We don't have processors for changing gauges to counters, or vice versa. Additionally, we would still need this change to make it possible for a processor to identify metrics that were previously unknown. Users also generally expect to be able to treat prometheus unknown-typed metrics as any type of metric when querying with PromQL, so backends that want to offer PromQL querying may need to treat them in special ways.
I see, thanks for clarifying. I guess there are many legacy things in Prometheus land that I don't know about. If we had these counter-to-gauge or gauge-to-counter converters, we would not really need this notion anymore though, right? Instead of doing the special handling at query time, it would encourage users to do the special handling earlier in the pipeline. At the same time, the OTLP format would not have to include everything that Prometheus ever did or will do, including things that are mostly that way for legacy reasons.
It would need to exist upstream of OTel (e.g. in the Prometheus server, or in the prometheus receiver). The collector's data model is essentially OTLP, so we would need this change if we wanted to offer that functionality in processors. It could exist in the receiver, but that requires potentially per-metric configuration at that point. It is fairly common with prometheus to use a single configuration to scrape everything in a kubernetes cluster (example), with pod annotations added to opt in to scraping. Being able to still collect all of the metrics, even if I don't know the types upfront, and potentially add them later is still quite useful.
I am just wondering: if you have unknowns today, would you do something like "transform everything that has datatype unknown to count" when you read it with PromQL? Or would you do "give me the metric xyz, and interpret it as count"? If that is the case, why not move that step to before transforming into OTLP? I get that it might be helpful to simply get all metrics and then deal with the problems on read, but users already have to configure their collectors one way or another. Why not add the conversion where prometheus is transformed to OTLP, instead of putting the additional information on the OTLP messages?

I don't want to seem stubborn here. I probably just haven't spent enough time in big Prometheus deployments to see the impossibility of what I am proposing, but from the "outside" it seems a bit weird to extend OTLP (with all its well-defined data types) to say "Well, actually, let's add this flag to say we don't know what that is. Let's just transport it and deal with that later." I also wonder if this could be a point that would encourage instrumentation authors to move away from the unknown type.
With PromQL today, Prometheus users can say, "give me the metric xyz, and interpret it as count" (e.g. perform a rate on it).
It is common for a single, static prometheus scrape configuration to be used for the entire cluster, with limited per-application configuration done using pod annotations or similar. It is also common for the team that manages ops workloads to be different from those managing the application being scraped, and they may not know anything about the metrics they are collecting at scrape time.
I assume users with dashboards already set up can do that independent of how the data was ingested (i.e., it does not matter if unknown was propagated all the way to the backend). Is that correct?
Gotcha. Would that not be an even stronger case for making the person who creates the metrics, and actually knows what they produce, say "This is a gauge" or "This is a counter"?

I do think that this issue is a prototype of future OTel <> Prometheus interoperability questions. I actually don't know what the plan for that is. Will OTel support everything that has ever come, or will ever come, out of Prometheus land? Is there a cutoff point where OTel says "We will not implement a legacy Prometheus feature in the name of OTLP data quality"? This was the point I was trying to make here. If we go the "we will do anything for Prom compatibility" route, what happens if Prometheus introduces a feature that is incompatible with OTel? Will we break OTLP and release a 2.0 to deal with that? How will that affect users? I don't think this is for us to decide here, but something for the TC (@open-telemetry/technical-committee) to discuss and decide.

Personally, I think it would be fair to ask users who move their data from Prometheus to OTLP to fix their integrations on the client side (or put pressure on the authors of instrumentations to update, or even switch to OTel SDKs). Otherwise, it's a "garbage in, garbage out" situation in my eyes. I understand that it would probably mean a lot of work in a lot of places in the short term, but ultimately striving for high data quality should be a goal for all telemetry producers and should provide value for everyone.
…nown-typed metrics in OTLP (#3868)

Fixes #3058
Fixes #1712
Fixes #2409

## Changes

Use metric.metadata (added in open-telemetry/opentelemetry-proto#514) to support representing prometheus metric types in OTLP.
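As a minimal sketch of what that looks like in practice (assuming a pdata version that exposes `Metric.Metadata()`; the `prometheus.type` key and metric name are illustrative, not taken from this thread), an unknown-typed series is still emitted as a Gauge while the original Prometheus type rides along in metric-level metadata:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	metrics := pmetric.NewMetrics()
	m := metrics.ResourceMetrics().AppendEmpty().
		ScopeMetrics().AppendEmpty().
		Metrics().AppendEmpty()

	// An unknown-typed Prometheus series is still converted to a Gauge...
	m.SetName("some_legacy_series")
	m.SetEmptyGauge().DataPoints().AppendEmpty().SetDoubleValue(42)

	// ...but the original Prometheus type is preserved in metric metadata
	// (key name is illustrative; see the Prometheus compatibility spec).
	m.Metadata().PutStr("prometheus.type", "unknown")

	fmt.Println(m.Metadata().AsRaw()) // map[prometheus.type:unknown]
}
```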
Should metric metadata be described in the Metrics data model? The situation seems similar to #4053
**What are you trying to achieve?**

Preserve the Prometheus Unknown type's notion, since it would otherwise be converted to a Gauge, as discussed in the following two issues: open-telemetry/wg-prometheus#69 and open-telemetry/opentelemetry-collector-contrib#16768

**Additional context.**