
Proposal: OTEL delta temporality support #48


Open · wants to merge 11 commits into main from the fionaliao/delta-proposal branch

Conversation

@fionaliao fionaliao commented Mar 27, 2025

@fionaliao fionaliao force-pushed the fionaliao/delta-proposal branch from 1c3370c to 3c9ea52 Compare March 27, 2025 16:34
@fionaliao fionaliao marked this pull request as ready for review March 27, 2025 16:35
@fionaliao fionaliao force-pushed the fionaliao/delta-proposal branch from 3c9ea52 to 8094034 Compare March 27, 2025 16:36
@fionaliao
Contributor Author

I'm planning to make some updates to the proposal after the dev summit (the consensus was that we want to support push-based use cases and continue to explore delta support) and some additional discussion with @beorn7.

Based on these discussions, I also have a PR up for basic native delta support (behind a feature flag) in Prometheus - this just stores the sample value at TimeUnixNano without additional labels: prometheus/prometheus#16360. For querying, we're advising people to use sum_over_time (/ interval) for now. The idea is that we'd first have this simple case without changing any PromQL functions and get some feedback, which could help figure out how to go forward in terms of temporality-aware functions. I'll update the proposal with this.

Additional updates to make:

  • Flesh out pros/cons of __temporality__ label vs delta_ types (I am actually leaning more towards the temporality label now)
  • Add an example of query interval < collection interval which could mess up rate calculations
  • Add some stuff about the serverless/ephemeral jobs use case. This is not specific to OTEL per se, but this was the use case that kept coming up when talking about deltas with various people/users during KubeCon.

I'm out for the next week, but will apply the updates after that.

@fionaliao
Contributor Author

Updates:

  • Simplified proposal - moved CT-per-sample to possible future extension instead of embedding within proposal
  • Changed proposal to have a new __temporality__ label instead of extending __type__ - probably better to keep metric type concept distinct from metric temporality. This also aligns with how OTEL models it.
  • Updated remote-write section - delta ingestion will actually be fully supported via remote write (since CT-per-sample is moved out of main proposal for now)
  • Moved temporary delta_rate() and delta_increase() functions suggestion to discarded alternative - not sure this is actually necessary if we have feature flag for temporality-aware functions anyway
  • Fleshed out implementation plan


#### rate() calculation

In general: `sum of the values from the second sample to the last / (last sample ts - first sample ts) * range`. We skip the value of the first sample as we do not know its interval.
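As a rough illustration, here is a minimal Go sketch of the calculation described above (the `Sample` type and `deltaRate` helper are hypothetical, not the actual Prometheus implementation):

```go
package main

import "fmt"

// Sample is a hypothetical, simplified representation of a delta sample.
type Sample struct {
	TS    float64 // timestamp in seconds
	Value float64
}

// deltaRate sketches the calculation described above: sum the values of all
// samples except the first, divide by the time between the first and last
// sample, and extrapolate to the full range to get the increase. The first
// sample's value is skipped because its collection interval is unknown.
func deltaRate(samples []Sample, rangeSeconds float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false // at least two samples are needed to estimate an interval
	}
	var sum float64
	for _, s := range samples[1:] {
		sum += s.Value
	}
	covered := samples[len(samples)-1].TS - samples[0].TS
	increase := sum / covered * rangeSeconds
	return increase / rangeSeconds, true // rate = increase / range
}

func main() {
	samples := []Sample{{TS: 0, Value: 5}, {TS: 60, Value: 3}, {TS: 120, Value: 4}}
	rate, ok := deltaRate(samples, 300)
	fmt.Println(rate, ok) // (3+4)/120 ≈ 0.058 per second
}
```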
@enisoc enisoc Apr 17, 2025

We skip the value of the first sample as we do not know its interval.

Perhaps we could get utility out of the first sample's value by guessing that each sample's interval is equal to the average time between samples in the window. One motivation for this is a case that we see often with non-sparse deltas produced by an intermediate processor.

Suppose the actual instrumented app is sending delta samples at a regular 60s interval. We'll assume for simplicity that these deltas are always greater than zero. Then there is an intermediate processor that's configured to accumulate data and flush every 30s. To avoid sparseness, it's configured to flush a 0 value if nothing was seen.

The data stream will look like this, with a sample every 30s:

5 0 2 0 10 0 8 0

Note that every other value is 0 because of the mismatch between the flush intervals of the instrumented app and the intermediate processor.

If we then do a rate(...[1m]) on this timeseries, with the current proposal, we might end up with the 1m windows broken up like this:

5 0 | 2 0 | 10 0 | 8 0

If we can't make use of the value from the first sample in each window, we will end up computing a rate of 0 for all of these windows. That feels like it fails to make use of all the available information, since as humans we can clearly see that the rate was not zero.

If instead we guess that each sample represents the delta for a 30s interval, because that's the average distance between the two datapoints in the window, then we will compute the correct rates. Of course it was only a guess and you could contrive a scenario that would fool the guess, but the idea would be to assume that the kind of scenario described here is more likely than those weird ones.
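A minimal Go sketch of what that guess could look like (hypothetical types and function name, just to make the idea concrete): each sample is credited as if it covered the average spacing between samples in the window, so neither value in a `5 0` window is thrown away.

```go
package main

import "fmt"

// Sample is the same hypothetical simplified sample type as in the sketch above.
type Sample struct {
	TS    float64 // timestamp in seconds
	Value float64
}

// guessedDeltaRate assumes each sample's collection interval equals the average
// spacing between samples in the window, so the first sample's value is counted
// instead of being dropped.
func guessedDeltaRate(samples []Sample) (float64, bool) {
	n := len(samples)
	if n < 2 {
		return 0, false // a single sample gives no spacing to estimate from
	}
	var sum float64
	for _, s := range samples {
		sum += s.Value
	}
	avgSpacing := (samples[n-1].TS - samples[0].TS) / float64(n-1)
	// n samples, each assumed to cover avgSpacing seconds of increments.
	return sum / (float64(n) * avgSpacing), true
}

func main() {
	// One of the 1m windows from the example: values 5 and 0, 30s apart.
	rate, _ := guessedDeltaRate([]Sample{{TS: 0, Value: 5}, {TS: 30, Value: 0}})
	fmt.Println(rate) // 5/60 ≈ 0.083/s, i.e. an increase of 5 over the minute
}
```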

Contributor Author

A cumulative series could generate the same values (with the zeroes being where the series resets). And in that case rate() would return no results. So though this doesn't accurately capture the rate, the behaviour would be consistent for cumulative and delta metrics.

However, the first point in a window is more meaningful in the delta case - you know it's a delta from the preceding sample, while in the cumulative case you have to look outside the window to get the same information, so maybe we should do better because of that. That example is pushing me more towards "just do sum_over_time() / range for delta rate()" - in this case that would probably give more useful information. Or at least do that before CT-per-sample is available, at which point we'd have more accurate interval data.
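For comparison, the "sum_over_time() / range" idea amounts to something like this minimal sketch (hypothetical helper name): add up every delta sample in the range and divide by the range length, with no extrapolation.

```go
package main

import "fmt"

// anchoredDeltaRate sketches the "sum_over_time(metric[range]) / range" idea:
// every delta sample in the range is counted and nothing is extrapolated.
func anchoredDeltaRate(values []float64, rangeSeconds float64) float64 {
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum / rangeSeconds
}

func main() {
	// The "5 0" window from the example above, over a 1m range.
	fmt.Println(anchoredDeltaRate([]float64{5, 0}, 60)) // ≈ 0.083/s
}
```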

Member

I guess the root of evil here is that we are essentially reconstructing the "problem" of the current rate calculation, which is that we do not take into account samples from outside the range (but still extrapolate the calculated rate to match the range). I have made arguments why this is actually a good thing in the case of the classic rate calculation, and those arguments partially carry over to the delta case. But not entirely. If we had the accurate interval data, we could reason about how far outside of the range the seen increments are. We could weigh them (but then we should probably also take into account the delta sample "from the future", i.e. after the "right" end of the range), or we could accept them if the interval is small enough.
Given that we do not want to take into account the collection interval in this first iteration, we could argue that a delta sample usually implies that the increments it represents are "recent", so we could actually take into account all delta samples in the range. This would ensure "complete coverage" if we graph something with 1m spacing of points and a [1m] range for the rate calculation. That's essentially what "xrate" does for classic rate calculation, but with the tweak that it is unlikely to include increments from the "distant past" because delta samples are supposed to represent "recent" increments. (If you miss a few scrapes with a cumulative counter, you don't miss increments, but now the increment you see is over a multiple of the usual scrape interval, which an "xrate" like approach will happily count as still within the given range.)
From a high level perspective, I'm a bit concerned that we are throwing away one of the advantages that delta temporality has if we ignore the first sample in the range.

Member

Another perspective on this subject (partially discussed with @fionaliao in person):

One reason to do all the extrapolation and estimation magic with the current rate calculation is that the Prometheus collection model deliberately gives you "unaligned" sampling, i.e. targets with the same scrape interval are still scraped at different phases (not all at the full minute, but hashed over the minute). PromQL has to deal with this in a suitable manner.

While delta samples may be unaligned as well, the usual use case is to collect increments over the collection interval (let's say again 1m), and then send out the collected increments at the full minute. So all samples are at the full minute. If we now run a query like rate(delta_request_counter[5m]), and we run this query at an aligned "full minute" timestamp, we get the perfect match: All the delta samples in the range perfectly cover the 5m range. The sample at the very left end of the range is excluded (thanks to the new "left open" behavior in Prometheus v3). So it would be a clear loss in this case to exclude the earliest sample in the range. (The caveat here is that you do not have to run the query at the full minute. In fact, if you create recording rules in Prometheus, the evaluation time is again deliberately hashed around the rule evaluation interval to avoid the "thundering herd". The latter could be avoided, though, if we accept delayed rule evaluation, i.e. evaluate in a staggered fashion, but use a timestamp "in the past" that matches the full minute.)

There is a use case where delta samples are not aligned at all, and that's the classic statsd use case where you send increments of one immediately upon each counter increment. However, in this case, the collection interval is effectively zero, and we should definitely not remove the earliest sample from the range.


Downsides:

* This will not work if there is only a single sample in the range, which is more likely with delta metrics (due to sparseness, or being used in short-lived jobs).

For cumulative samples, it makes sense that with a single sample in the window you can't guess anything at all about how much increase happened in that window. With a single delta sample, even if we don't know the start time, we should be able to make a better guess than "no increase happened".

For example, we could guess that the interval is equal to the window size -- in other words return the single delta value as is with no extrapolation. The argument would be that you picked an arbitrary window of size T and found 1 sample, so the best guess for the frequency of the samples is 1/T. This seems like it would be more useful on average than returning no value in the case of a single sample.

Contributor Author

My concern is with mixing extrapolation and non-extrapolation logic because that might end up surprising users.

If we do decide to generally extrapolate to fill the whole window, but have this special case for a single datapoint, someone might rely on the non-extrapolation behaviour and be surprised when there are two points and the behaviour changes.

Member

Yeah, another point why extrapolation (while not completely without merit) probably has another trade-off in the delta case and might just not be worth it.

@fionaliao fionaliao force-pushed the fionaliao/delta-proposal branch from d0474da to f2433c8 Compare April 18, 2025 16:08
@fionaliao
Contributor Author

Next steps for this proposal:

  • Wait for the type and unit metadata proposal to be finalised, which might result in updates to how exactly the temporality label will be implemented
  • Get the primitive OTEL delta support PR merged - hopefully having that out will help get some feedback on how querying should be done
  • Write some code to experiment with delta rate implementations, see what edge cases there are for each option

@fionaliao
Contributor Author

Write some code to experiment with delta rate implementations

Started rough implementation for rate functions here:

prometheus/prometheus@fionaliao/basic-delta-support...fionaliao/delta-rate

Including some example queries: https://github.com/prometheus/prometheus/blob/4c72cba2e76ac55c77c46af7b2b9348e8cf67b59/promql/promqltest/testdata/delta.test

Member

@beorn7 beorn7 left a comment

Thanks for this design doc.

I realize that my comments are a bit all over the place, and often they discuss things that are already taken into account in this document, just maybe not in the order or emphasis I would prefer.

An attempt to summarize my thoughts and concerns:

I wholeheartedly agree with "Milestone 1". However, I do think we should let the experience gained from it inform our further steps. The design doc explains most of the possible approaches quite well, but it essentially already proposes a preferred solution along the following lines:

  1. Introduce a temporality label.
  2. Make rate/increase/irate behave differently depending on that label.
  3. Embrace an extrapolation approach in that rate/increase calculation.

I have concerns about each of these points. I wouldn't go as far as to say that they are prohibitive, but I would have trouble approving a design doc that frames them as the preferred way to go forward, while the alternatives that I find more likely to be viable are already "demoted" to "alternatives that we have dismissed for now".

My concerns summarized:

  1. I would only introduce a temporality label once we have established we need one. I would go for "treat deltas as gauges" until we hit a wall where we clearly see that this is not enough. In the form of the outcome of recording rules, Prometheus had "delta samples" from the beginning, and never did we consider marking them as such.
  2. I have a very bad feeling about "overloading", i.e. having certain functions behave fundamentally differently depending on the type of the argument (and that even more so as we are just establishing this strong typing of metrics as we go). (We kind-of do that already for some histogram functions, but there the difference in type is firmly established, plus it's not really fundamentally different what we are doing, we are doing "the same" on different representations of histograms (classic vs. native), plus we will just get rid of the "classic" part eventually.) Additionally, I don't think it makes sense to claim that we are calculating an "increase" based on something that is already an increase (a delta sample). The "rate'ing" is then just the normalization step, which is just one part of the "actual" rate calculation. Even though it might be called that way in other metrics systems, I don't think that should inform Prometheus naming. I do understand the migration point, but I see it more as a "lure" into something that looks convenient at first glance but has the potential of causing trouble precisely because it is implicit (or "automagic"). What might convince me would be a handling of ranges that contain "mixed samples" (both cumulative and delta samples) because that would actually allow a seamless migration, but that would open up a whole different can of worms.
  3. Extrapolation caused a whole lot of confusion and controversy for the existing rate calculation. I believe that it was necessary, but I see a different trade-off for delta samples. Given that we have active work on non-extrapolation (anchored in the PoC) and "a different kind of interpolation" (smoothed in the PoC) for rate calculation, we should hold back on introducing a questionable extrapolation mechanism in delta handling. With the tracking of CT (aka StartTimeUnixNano), we are well set up to do something like smoothed for deltas (which is something to be fleshed out maybe here or maybe in a separate design doc), and in many cases, the naive non-extrapolation approach might just be the best for deltas. (An "aligned" rule evaluation feature might be easier to implement and more helpful for the use case of aligned delta samples.)[1]

To summarize the summary: I would pivot this design doc more as a list of alternatives we have to explore, and only state the first step as "already decided", namely to ingest the delta samples "as is", which puts us into a position to explore the alternatives in practice.

Footnotes

  1. If you feel that aligned rule evaluation and "smoothed" increase calculation from deltas should be included in this doc, I'm willing to flesh them out in more detail.


For the initial implementation, reuse existing chunk encodings.

Currently the counter reset behaviour for cumulative native histograms is to cut a new chunk if a counter reset is detected. If a value in a bucket drops, that counts as a counter reset. As delta samples don’t build on top of each other, many false counter resets could be detected, causing unnecessary chunks to be cut. Therefore a new counter reset hint/header is required, to indicate that the cumulative counter reset behaviour for chunk cutting should not apply.
Member

There is more to that than just creating a new counter reset hint. Counter histogram chunks have the invariant that no (bucket or total) count ever goes down baked into their implementation (e.g. to store numbers more efficiently).

The histogram chunks storing delta histogram samples should use the current gauge histogram chunks. Whether we really need a different counter reset hint then (rather than just using the existing "gauge histogram" hint) is a more subtle question. (I still tend to just view deltas as gauges, but if we want to mark them differently, the counter reset hint could be one way. However, simple float samples do not have that way, so we need some other way to mark a sample as "delta" anyway. If we use the same way for histogram samples, then we can just keep using the "gauge histogram" counter reset hint combined with that new way to mark delta samples.)


No scraped metrics should have delta temporality as there is no additional benefit over cumulative in this case. To produce delta samples from scrapes, the application being scraped has to keep track of when a scrape is done and reset the counter. If the scraped value fails to be written to storage, the application will not know about it and therefore cannot correctly calculate the delta for the next scrape.

Delta metrics will be filtered out from metrics being federated. If the current value of the delta series is exposed directly, data can be incorrectly collected if the ingestion interval is not the same as the scrape interval for the federate endpoint. The alternative is to convert the delta metric to a cumulative one, which has issues detailed above.
Member

As delta temporality is essentially the same as the outcome of a rate(...) recording rule (provided the delta metric does not wildly change its collection interval), I wouldn't rule out federation completely. It is very common to federate the outcome of a rate(...) recording rule, so why not federate delta metrics in the same way?
If the delta metric has e.g. a constant collection interval of 1m, and we do a federation scrape at least as often (or better more often, like 15s), we can still work with the resulting federated metrics. Prerequisite is essentially a (mostly) constant and known collection interval.
In contrast, a delta metric that has samples at irregular intervals (most extreme: classic statsd approach with deltas of one whenever an event happens) would not work via federation.


#### rate() calculation

In general: `sum of the values from the second sample to the last / (last sample ts - first sample ts) * range`. We skip the value of the first sample as we do not know its interval.
Member

sum of the values from the second sample to the last / (last sample ts - first sample ts) * range

Technical note: This formula calculates the extrapolated increase. You have to leave out the * range to get the extrapolated rate:

sum of the values from the second sample to the last / (last sample ts - first sample ts)
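One way to write the two formulas from this note in symbols (with v_i and t_i the values and timestamps of the n samples in the range):

$$
\text{increase} = \frac{\sum_{i=2}^{n} v_i}{t_n - t_1} \cdot \text{range},
\qquad
\text{rate} = \frac{\text{increase}}{\text{range}} = \frac{\sum_{i=2}^{n} v_i}{t_n - t_1}
$$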


CT-per-sample is not a blocker for deltas - before this is ready, `StartTimeUnixNano` will just be ignored.

Having CT-per-sample can improve the `rate()` calculation - the ingestion interval for each sample will be directly available, rather than having to guess the interval based on gaps. It also means a single sample in the range can result in a result from `rate()` as the range will effectively have an additional point at `StartTimeUnixNano`.
Member

A similar effect could be created from a separate and explicit tracking of counter resets (rather than relying on the detection via "value has gone down"). If we were able to mark every sample as having a counter reset, rate'ing a delta counter would implicitly give the "correct" result as described in this paragraph.

Or in other words: CT gives us tracking counter resets explicitly as a byproduct. And maybe it should just become the way. (NH can track counter resets explicitly, but need a new chunk for that. It would not be efficient if it happened on every sample. Counter resets could be tracked in metadata, but again, it would be expensive to track frequent counter resets that way.)

(This is more an academic comment to improve our collective understanding, not necessarily something to include in the design doc. Maybe just mention that CT allows precise counter-reset tracking so that the reader is made aware that those topics are related.)


To work out the increase more accurately, they would also have to look at the sample before and the sample after the range to see if there are samples that partially overlap with the range - in that case the partial overlaps should be added to the increase.

This could be a new function, or changing the `rate()` function (it could be dangerous to adjust `rate()`/`increase()` though as they’re so widely used that users may be dependent on their current behaviour even if they are “less accurate”).
Member

Maybe crosslink to the upcoming design doc about "anchored" and "smoothed" rate/increase calculation. (This here is "smoothed", and the "sum_over_time" approach to delta samples is "anchored".)

#### Treat as gauge
To avoid introducing a new type, deltas could be represented as gauges instead and the start time ignored.

This could be confusing as gauges are usually used for sampled data (for example, in OTEL: "Gauges do not provide an aggregation semantic, instead 'last sample value' is used when performing operations like temporal alignment or adjusting resolution.") rather than data that should be summed/rated over time.
Member

I would like to note that gauges in Prometheus are in fact the metric type that is aggregatable. "First the rate, then aggregate!" Rate'ing a counter creates a gauge. The gauge is then what you can aggregate. Delta samples are already aggregatable. They are, for all Prometheus intents and purposes, gauges.

If we end up with a new metric type "delta-counter" that is treated in exactly the same way as gauges, then we are arguably creating a greater confusion than having a gauge in Prometheus that has a slightly different semantics from gauges in other metrics systems.

In other words, I think it is a good idea that each (Prometheus) metric type is actually handled differently within Prometheus. A type should not just be "informational".

Maybe there are cases where we want to treat "real" gauges differently from deltas, but that has to be seen.


This also does not work for samples missing StartTimeUnixNano.

#### Convert to rate on ingest
Member

Just as a note as the person who came up with this idea: I have come to the conclusion that this approach has a bad trade-off. Being able to store as much as possible of the original sample (value, and ideally the CT AKA StartTimeUnixNano) and then process that at query time is better than doing some calculation at ingest time and losing the original data.


`sum_over_time()` between T0 and T5 will get 10. Divided by 5 for the rate results in 2.

However, if you only query between T4 and T5, the rate would be 10/1 = 10, and queries between earlier times (T0-T1, T1-T2 etc.) will have a rate of zero. These results may be misleading.
Member

But will the result be so much different with the rate approach described above? In fact, we won't get any rate with that because there is only one sample in the range.

I do think there is a way to estimate rates/increases if the samples do not align with the range boundaries and we have the StartTimeUnixNano AKA CT. Then we could do some weighing according to the proportion the increase is expected to happen inside the range (including for this particular example where the range is just a fraction of the collection interval, and we could say the collection interval is 5x the range, so we only take into account 1/5th of the increase). But this approach isn't described anywhere in this design doc (is it?). It would be similar to the upcoming "smoothed" rate modeling (aka "mrate" in my rate braindump). It would also be the key to a "proper" integral function, see prometheus/prometheus#14440 – to connect all the dots... ;)
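A minimal sketch of the kind of weighting described here (hypothetical function name, assuming StartTimeUnixNano is available as a start timestamp; not an existing PromQL feature): a delta sample's increase is credited proportionally to how much of its collection interval overlaps the query range.

```go
package main

import (
	"fmt"
	"math"
)

// weightedIncrease credits a delta sample's value in proportion to how much of
// its collection interval [startTS, ts] overlaps the query range
// [rangeStart, rangeEnd].
func weightedIncrease(value, startTS, ts, rangeStart, rangeEnd float64) float64 {
	overlap := math.Min(ts, rangeEnd) - math.Max(startTS, rangeStart)
	if overlap <= 0 {
		return 0
	}
	return value * overlap / (ts - startTS)
}

func main() {
	// The example above: a delta of 10 collected over T0..T5, queried over T4..T5.
	// The range covers 1/5 of the collection interval, so 1/5 of the increase counts.
	fmt.Println(weightedIncrease(10, 0, 5, 4, 5)) // 2
}
```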

@fionaliao
Contributor Author

@beorn7 Thanks for your comments :) I need more time to go through all of them, but as a start:

To summarize the summary: I would pivot this design doc more as a list of alternatives we have to explore, and only state the first step as "already decided", namely to ingest the delta samples "as is", which puts us into a position to explore the alternatives in practice.

It makes sense to separate the first step. Not a strong opinion, but I was thinking of making a separate proposal PR for the first step, and then put this proposal back into draft to indicate it's still being figured out - that way we can have a merged PR for the first step, and there can be discussion on the possible future steps within this open PR. WDYT?


I would only introduce a temporality label once we have established we need one. I would go for "treat deltas as gauges" until we hit a wall where we clearly see that this is not enough.

I agree with doing this delta as gauge approach first. I do think eventually we will want to treat deltas separately from gauges, but we should get more user feedback to confirm this is the case.

Chronosphere have already gained insights into this, as they've implemented their own version of delta support, and @enisoc wrote up this document and noted: "Users don't like having to think about the temporality of their metrics and learn to use different functions (e.g. increase vs. sum_over_time). They want to have one recommended way to query counters that just works regardless of temporality."

One problem with treating deltas as gauges is that gauge means different things in Prometheus and OTEL - in Prometheus, it's just a value that can go up and down, while in OTEL it's the "last-sampled event for a given time window". While it technically makes sense to represent an OTEL delta counter as a Prometheus gauge, this could be a point of confusion for OTEL users who see their counter being mapped to a Prometheus gauge, rather than a Prometheus counter. There could also be uncertainty for the user on whether the metric was accidentally instrumented as a gauge or whether it was converted from a delta counter to a gauge.

Another problem is that the temporality of a metric might not be completely under the control of the user instrumenting the metric - it could change in the metric ingestion pipeline (e.g. with the cumulativetodelta or deltatocumulative processors), so it can be hard to determine at query time what function to use. If we properly mark deltas as gauges - i.e. with the metric type gauge - and have warnings when using rate() on Prometheus gauges and sum_over_time() on Prometheus counters, this is alleviated. (However, alerts don't integrate with warnings so may end up being incorrect without detection).

We kind-of do that already for some histogram functions, but there the difference in type is firmly established

How is type being firmly established in the native histogram case vs not being firmly established in the delta and cumulative case if there's a "temporality" label?

Additionally, I don't think it makes sense to claim that we are calculating an "increase" based on something that is already an increase (a delta sample).

increase() could be considered the increase in the underlying thing being measured, which makes sense for applying increase() on a delta metric.

Also, deltas could be seen as cumulative with resets between each sample. (On the other hand, as discussed, delta metrics have different characteristics, so while they could be seen as cumulative or converted to cumulative, that might not be the best representation.)

I do understand the migration point, but I see it more as a "lure" into something that looks convenient at first glance but has the potential of causing trouble precisely because it is implicit (or "automagic"). What might convince me would be a handling of ranges that contain "mixed samples" (both cumulative and delta samples) because that would actually allow a seamless migration, but that would open up a whole different can of worms.

As well as one-off migrations where you might just have to update queries once, a case which might cause more constant frustration is when there is a mix of sources with different temporalities. So a single series might have the same temporality over time, but different series have different temporalities. If you want a single query to combine the results and we didn't do function overloading, you'd need something like rate(cumulative metrics only) + sum_over_time(delta metrics only). (Is this what you were referring to when you said mixed samples, or did you just mean the case where a single series had different temporality over time?)

@beorn7
Member

beorn7 commented Apr 29, 2025

I was thinking of making a separate proposal PR for the first step, and then put this proposal back into draft to indicate it's still being figured out

I don't think it would help with clarity to have multiple design docs. Personally, I don't believe a design doc has to make a call all the way through. I would think it's fine if a design doc says "We want to do this first step, and then we want to do research and want to make a call between options X, Y, and Z based on these following rationales."

About the general argument about "overloading" increase and rate for delta temporality: I think the arguments are already well made in the design doc. I'm just not sure we can make a call right now without practical experience. We can repeat and refine both sides of the argument, but I think it will be much easier and much more convincing once we have played with it.

How is type being firmly established in the native histogram case vs not being firmly established in the delta and cumulative case if there's a "temporality" label?

First of all, that label does not exist yet. So currently, it is not established at all. Secondly, a histogram sample looks completely different from a float sample in the low-level TSDB data. There is no way the code can confuse one for the other. But a label is just a label. It could accidentally get removed, or added (or maybe even on purpose, "hard-casting" the metric type, if you want), so a relatively lightweight thing like a label will change how a function processes something that is just a float in the TSDB in either case.

Is this what you were referring to when you said mixed samples, or did you just mean the case where a single series had different temporality over time?

I was thinking mostly about one and the same series that changes from cumulative to delta over time. (Your point about mixed vectors is also valid, but that would be solved by the proposed "overloaded" functions just fine.)

@fionaliao
Contributor Author

@beorn7 I'll update this doc as you suggested (with the first step + laying out the options for future steps without committing to any), and incorporate your and @enisoc's comments.

@beorn7
Member

beorn7 commented Apr 30, 2025

Thank you. Feel free to express a preference (like putting the "most likely" outcome first). As said, I just would have a problem making the call at this time.

@fionaliao
Contributor Author

fionaliao commented May 28, 2025

As an update - I am still working on updating this proposal, but progress has been slow due to other work priorities.

@fionaliao
Contributor Author

@beorn7 Would you be open to having deltas ingested as gauges by default, with an option to ingest them as counters with a __temporality__="delta" label? Documentation would make it clear that all of this is experimental and could be removed. This won't include implementing any function overloading; it just adds the label so users can distinguish between delta counters and gauges if they want to.

I think that to explore whether it's worth pursuing the __temporality__ label, we should offer it as an option and see how users interact with it.
