Metric cardinality limits "overflow attribute set" causes inconsistent attribute keys #3578

Closed
aabmass opened this issue Jun 29, 2023 · 11 comments
Labels
spec:metrics Related to the specification/metrics directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted

Comments

@aabmass
Member

aabmass commented Jun 29, 2023

The Cardinality Limits section of the Metrics SDK spec says:

An overflow attribute set is defined, containing a single attribute otel.metric.overflow having (boolean) value true, which is used to report a synthetic aggregation of the metric events that could not be independently aggregated because of the limit.

This will cause every metric from the SDK to have an inconsistent set of attribute keys across its streams:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true"} 100

Specifically for Prometheus/OpenMetrics this is a bad practice:

Metrics with the same name for a given MetricFamily SHOULD have the same set of label names in their LabelSet.

One solution is to add the overflow label to every metric:

mycounter{a="hello" b="world" otel_metric_overflow="false"} 2
mycounter{a="bar" b="foo" otel_metric_overflow="false"} 3
mycounter{a="" b="foo" otel_metric_overflow="false"} 3
mycounter{a="" b="" otel_metric_overflow="true"} 100

IMO this is pretty clunky. The user also has to be careful when grouping by a in this example to explicitly filter on otel.metric.overflow=false, or the a="" stream will be grouped together with the overflow stream, which carries measurements from other values of a.
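
To make that pitfall concrete, here is a minimal sketch (Python, purely illustrative; it assumes the per-stream exposition above and a naive client-side group-by, not any particular backend's query engine):

from collections import defaultdict

# The four streams from the "add the overflow label to every metric" example.
series = [
    ({"a": "hello", "b": "world", "otel_metric_overflow": "false"}, 2),
    ({"a": "bar", "b": "foo", "otel_metric_overflow": "false"}, 3),
    ({"a": "", "b": "foo", "otel_metric_overflow": "false"}, 3),
    ({"a": "", "b": "", "otel_metric_overflow": "true"}, 100),
]

# Naive "group by a": the overflow stream is folded into the a="" group,
# even though its 100 came from arbitrary values of a.
naive = defaultdict(int)
for labels, value in series:
    naive[labels["a"]] += value
print(dict(naive))  # {'hello': 2, 'bar': 3, '': 103}

# Filtering otel_metric_overflow="false" first gives the intended per-a view,
# at the cost of silently excluding the overflowed measurements.
filtered = defaultdict(int)
for labels, value in series:
    if labels["otel_metric_overflow"] == "false":
        filtered[labels["a"]] += value
print(dict(filtered))  # {'hello': 2, 'bar': 3, '': 3}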

@aabmass added the spec:metrics label on Jun 29, 2023
@jack-berg
Member

This will cause every metric from the SDK to have an inconsistent set of attribute keys across its streams:

The otel.metric.overflow series should only show up once an instrument exceeds the cardinality limit, right? Is there an implementation that behaves differently? So you only see this behavior when the system is in a bit of an error state.

@aabmass
Member Author

aabmass commented Jun 30, 2023

Not sure about existing implementations. This part of the spec could be interpreted otherwise:

The SDK MUST create an Aggregator with the overflow attribute set prior to reaching the cardinality limit and use it to aggregate events for which the correct Aggregator could not be created.

@jack-berg
Copy link
Member

Yeah that is a bit suspect. I wonder if that aggregator is expected to be included in collection results before any measurements are received?

@MrAlias
Contributor

MrAlias commented Jun 30, 2023

Yeah that is a bit suspect. I wonder if that aggregator is expected to be included in collection results before any measurements are received?

+1

@pirgeo
Member

pirgeo commented Jul 4, 2023

For my understanding: It seems like it should mean

"If the cardinality limit is 1000, create it when you have 999 streams to make sure you can fit the otel.metric.overflow in still."

But it could also be interpreted (and I think that is what @aabmass shows above) as:

"Create the otel.metric.overflow as the first attribute to make sure it is always considered."

Is that right? Or is there another interpretation?
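
To make the two readings concrete, here is a rough sketch (illustrative only; the Storage class, its fields, and the eager_overflow flag are invented here, not taken from any SDK):

OVERFLOW_ATTRS = frozenset({("otel.metric.overflow", True)})

class Storage:
    # Toy per-instrument storage with a cardinality limit (hypothetical).
    def __init__(self, limit, eager_overflow=False):
        self.limit = limit
        self._aggregators = {}  # attribute set -> running sum
        if eager_overflow:
            # Reading 2: create the overflow aggregator up front, so it could
            # appear in collection results even before any overflow happens.
            self._aggregators[OVERFLOW_ATTRS] = 0

    def record(self, attrs, value):
        key = frozenset(attrs.items())
        if key not in self._aggregators and len(self._aggregators) >= self.limit - 1:
            # Reading 1: once limit - 1 distinct attribute sets exist, the
            # overflow aggregator is created and absorbs every measurement
            # whose attribute set no longer fits.
            key = OVERFLOW_ATTRS
        self._aggregators[key] = self._aggregators.get(key, 0) + value

Either way the total across all aggregators stays correct; the two readings only differ in whether the overflow series can show up before the limit is actually hit.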

@aabmass
Member Author

aabmass commented Jul 5, 2023

In any event, the issue of inconsistent attribute keys would still come up once an overflow is reached. Do we think that

  • the OTel metrics data model should recommend keeping a consistent set of keys across all streams of a metric (similar to OpenMetrics)?
  • the overflow stream should follow this guidance?

@pirgeo
Member

pirgeo commented Jul 6, 2023

I think the point of this overflow attribute was to keep you from losing data once the limit is reached. You will lose data either way, but with the overflow attribute, at least the overall sum would be right. From my reading of the section you mentioned, the overflow should be taken into account on a per-metric-key basis. Additionally, it says:

An overflow attribute set is defined, containing a single attribute otel.metric.overflow

So, from how I understand the wording today, that would mean that this is the right way of doing it:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true"} 100

I don't know if recommending sending "empty" attributes, especially when not sending to Prometheus, makes sense.

@aabmass
Member Author

aabmass commented Jul 6, 2023

@pirgeo if mycounter is a function of its dimensions, shouldn't it always include all of those dimensions? If you sent those metrics to a backend and queried grouping by a, which stream would it be grouped with?

I suppose the problem is that the overflow stream has "floating" values for all of the missing labels. It can only really be used if you DON'T groupby anything. I'm not sure how many metric backends would understand this.

@pirgeo
Member

pirgeo commented Jul 7, 2023

It can only really be used if you DON'T groupby anything.

That was the whole point of the overflow attribute, I think. If you don't have an overflow attribute but want to cap the used memory at some point, you have to start throwing away streams that would still influence the grand total. The point of the overflow label is to say: Actually, we've hit the limit of what we want to keep in memory, but instead of dropping your data, we are collecting everything that's not part of one of the existing streams here, so you can at least know the grand total of everything that has come in. You will lose data either way, but at least the total (when not grouping) is right. Seeing something like this should probably be followed up by either reducing the number of splits you introduce in your data (i.e., removing some attributes) or increasing the cardinality limit. I don't think this is a state that you want to be in for extended periods of time.

As for the Prometheus implementation: I am not really sure what the "right" way to send it to Prometheus is; maybe it is to set all attributes to empty values. Generally, I don't think we need to send all dimensions with empty values regardless (in OTLP, for example, there is no requirement to do it that way). I guess it becomes a backend topic how to deal with such data.
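
To put numbers on the grand-total argument above (hypothetical values, reusing the example streams from earlier in the thread):

# Three retained streams plus the synthetic overflow stream.
retained = {("hello", "world"): 2, ("bar", "foo"): 3, ("", "foo"): 3}
overflow = 100  # everything recorded after the cardinality limit was hit

# With the overflow series exported, the ungrouped total is still correct:
total_with_overflow = sum(retained.values()) + overflow  # 108
# If the SDK simply dropped new streams at the limit instead, the ungrouped
# total would silently under-count:
total_if_dropped = sum(retained.values())  # 8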

@dashpole
Contributor

From the meeting: The goal is to make a lot of noise when a metric overflows. Something has gone very wrong.

One possible alternative to an overflow label on an existing metric would be an overflow metric with a label for the metric name. This would still provide the "loud noise" to warn users that something is badly wrong. It would also be easy to write an alert to that effect. E.g. otel.cardinality.overflows would become:

#TYPE otel_cardinality_overflows_total counter
#HELP otel_cardinality_overflows_total a count of the number of measurements which overflowed cardinality limits
otel_cardinality_overflows_total{metric_name="mycounter"} 1

However, this would remove the property of the total still being correct, which was one of the goals of the original proposal.

@austinlparker added the triage:deciding:community-feedback label on May 14, 2024
@reyang
Member

reyang commented Sep 26, 2024

From the meeting: The goal is to make a lot of noise when a metric overflows. Something has gone very wrong.

One possible alternative to an overflow label on an existing metric would be an overflow metric with a label for the metric name. This would still provide the "loud noise" to warn users that something is badly wrong. It would also be easy to write an alert to that effect. E.g. otel.cardinality.overflows would become:

#TYPE otel_cardinality_overflows_total counter
#HELP otel_cardinality_overflows_total a count of the number of measurements which overflowed cardinality limits
otel_cardinality_overflows_total{metric_name="mycounter"} 1

However, this would remove the property of the total still being correct, which was one of the goals of the original proposal.

Correct, the alternative solution here would not be able to provide the correct total/sum, which is one critical requirement that we want to meet.

I plan to move forward with the current approach (having the overflow attribute on the existing metric) after discussing with the other TC members during the Sep. 25, 2024 TC Meeting. Please either ping me on Slack or comment here before the end of Sep. 2024.
