Metric cardinality limits "overflow attribute set" causes inconsistent attribute keys #3578

Closed
aabmass opened this issue Jun 29, 2023 · 11 comments
Labels
spec:metrics Related to the specification/metrics directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted

Comments

@aabmass
Member

aabmass commented Jun 29, 2023

The Cardinality Limits section of the Metrics SDK spec says:

An overflow attribute set is defined, containing a single attribute otel.metric.overflow having (boolean) value true, which is used to report a synthetic aggregation of the metric events that could not be independently aggregated because of the limit.

This will cause every metric from the SDK to have an inconsistent set of attribute keys across its streams:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true"} 100

Specifically for Prometheus/OpenMetrics this is a bad practice:

Metrics with the same name for a given MetricFamily SHOULD have the same set of label names in their LabelSet.

One solution is to add the overflow label to every metric:

mycounter{a="hello" b="world" otel_metric_overflow="false"} 2
mycounter{a="bar" b="foo" otel_metric_overflow="false"} 3
mycounter{a="" b="foo" otel_metric_overflow="false"} 3
mycounter{a="" b="" otel_metric_overflow="true"} 100

IMO this is pretty clunky. The user also has to be careful when grouping by a in this example to explicitly filter on otel.metric.overflow=false, or the a="" stream will be grouped together with the overflow stream, which carries measurements from other values of a.
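
To make that pitfall concrete, here is a minimal sketch (Python, purely illustrative; it assumes the per-stream exposition above and a naive client-side group-by, not any particular backend's query engine):

from collections import defaultdict

# The four streams from the "add the overflow label to every metric" example.
series = [
    ({"a": "hello", "b": "world", "otel_metric_overflow": "false"}, 2),
    ({"a": "bar", "b": "foo", "otel_metric_overflow": "false"}, 3),
    ({"a": "", "b": "foo", "otel_metric_overflow": "false"}, 3),
    ({"a": "", "b": "", "otel_metric_overflow": "true"}, 100),
]

# Naive "group by a": the overflow stream is folded into the a="" group,
# even though its 100 came from arbitrary values of a.
naive = defaultdict(int)
for labels, value in series:
    naive[labels["a"]] += value
print(dict(naive))  # {'hello': 2, 'bar': 3, '': 103}

# Filtering otel_metric_overflow="false" first gives the intended per-a view,
# at the cost of silently excluding the overflowed measurements.
filtered = defaultdict(int)
for labels, value in series:
    if labels["otel_metric_overflow"] == "false":
        filtered[labels["a"]] += value
print(dict(filtered))  # {'hello': 2, 'bar': 3, '': 3}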

@aabmass added the spec:metrics label on Jun 29, 2023
@jack-berg
Member

This will cause every metric from the SDK to have an inconsistent set of attribute keys across its streams:

The otel.metric.overflow series should only show up once an instrument exceeds the cardinality limit, right? Is there an implementation that behaves differently? So you only see this behavior when the system is in a bit of an error state.

@aabmass
Member Author

aabmass commented Jun 30, 2023

Not sure about existing implementations. This part of the spec could be interpreted otherwise:

The SDK MUST create an Aggregator with the overflow attribute set prior to reaching the cardinality limit and use it to aggregate events for which the correct Aggregator could not be created.

@jack-berg
Copy link
Member

Yeah that is a bit suspect. I wonder if that aggregator is expected to be included in collection results before any measurements are received?

@MrAlias
Contributor

MrAlias commented Jun 30, 2023

Yeah that is a bit suspect. I wonder if that aggregator is expected to be included in collection results before any measurements are received?

+1

@pirgeo
Member

pirgeo commented Jul 4, 2023

For my understanding: It seems like it should mean

"If the cardinality limit is 1000, create it when you have 999 streams to make sure you can fit the otel.metric.overflow in still."

But it could also be interpreted (and I think that is what @aabmass shows above) as:

"Create the otel.metric.overflow as the first attribute to make sure it is always considered."

Is that right? Or is there another interpretation?
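
To make the two readings concrete, here is a rough sketch (illustrative only; the Storage class, its fields, and the eager_overflow flag are invented here, not taken from any SDK):

OVERFLOW_ATTRS = frozenset({("otel.metric.overflow", True)})

class Storage:
    # Toy per-instrument storage with a cardinality limit (hypothetical).
    def __init__(self, limit, eager_overflow=False):
        self.limit = limit
        self._aggregators = {}  # attribute set -> running sum
        if eager_overflow:
            # Reading 2: create the overflow aggregator up front, so it could
            # appear in collection results even before any overflow happens.
            self._aggregators[OVERFLOW_ATTRS] = 0

    def record(self, attrs, value):
        key = frozenset(attrs.items())
        if key not in self._aggregators and len(self._aggregators) >= self.limit - 1:
            # Reading 1: once limit - 1 distinct attribute sets exist, the
            # overflow aggregator is created and absorbs every measurement
            # whose attribute set no longer fits.
            key = OVERFLOW_ATTRS
        self._aggregators[key] = self._aggregators.get(key, 0) + value

Either way the total across all aggregators stays correct; the two readings only differ in whether the overflow series can show up before the limit is actually hit.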

@aabmass
Member Author

aabmass commented Jul 5, 2023

In any event, the issue of inconsistent attribute keys would still come up once an overflow is reached. Do we think that

  • the OTel metrics data model should recommend keeping a consistent set of keys across all streams of a metric (similar to OpenMetrics)?
  • the overflow stream should follow this guidance?

@pirgeo
Member

pirgeo commented Jul 6, 2023

I think the point of this overflow attribute was to keep you from losing data once the limit is reached. You will lose data either way, but with the overflow attribute, at least the overall sum would be right. From my reading of the section you mentioned, the overflow should be taken into account on a per-metric-key basis. Additionally, it says:

An overflow attribute set is defined, containing a single attribute otel.metric.overflow

So, from how I understand the wording today, that would mean that this is the right way of doing it:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true"} 100

I don't know if recommending sending "empty" attributes, especially when not sending to Prometheus, makes sense.

@aabmass
Member Author

aabmass commented Jul 6, 2023

@pirgeo if mycounter is a function of its dimensions, shouldn't it always include all of those dimensions? If you sent those metrics to a backend and queried grouping by a, which stream would it be grouped with?

I suppose the problem is that the overflow stream has "floating" values for all of the missing labels. It can only really be used if you DON'T groupby anything. I'm not sure how many metric backends would understand this.

@pirgeo
Member

pirgeo commented Jul 7, 2023

It can only really be used if you DON'T groupby anything.

That was the whole point of the overflow attribute, I think. If you don't have an overflow attribute but want to cap the used memory at some point, you have to start throwing away streams that would still influence the grand total. The point of the overflow label is to say: Actually, we've hit the limit of what we want to keep in memory, but instead of dropping your data, we are collecting everything that's not part of one of the existing streams here, so you can at least know the grand total of everything that has come in. You will lose data either way, but at least the total (when not grouping) is right. Seeing something like this should probably be followed up by either reducing the number of splits you introduce in your data (i.e., removing some attributes) or increasing the cardinality limit. I don't think this is a state that you want to be in for extended periods of time.

As for the Prometheus implementation: I am not really sure what the "right" way to send it to Prometheus is; maybe it is to set all attributes to empty values. Generally, I don't think we need to send all dimensions with empty values regardless (in OTLP, for example, there is no requirement to do it that way). I guess it becomes a backend topic how to deal with such data.
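
To put numbers on the grand-total argument above (hypothetical values, reusing the example streams from earlier in the thread):

# Three retained streams plus the synthetic overflow stream.
retained = {("hello", "world"): 2, ("bar", "foo"): 3, ("", "foo"): 3}
overflow = 100  # everything recorded after the cardinality limit was hit

# With the overflow series exported, the ungrouped total is still correct:
total_with_overflow = sum(retained.values()) + overflow  # 108
# If the SDK simply dropped new streams at the limit instead, the ungrouped
# total would silently under-count:
total_if_dropped = sum(retained.values())  # 8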

@dashpole
Contributor

From the meeting: The goal is to make a lot of noise when a metric overflows. Something has gone very wrong.

One possible alternative to an overflow label on an existing metric would be an overflow metric with a label for the metric name. This would still provide the "loud noise" to warn users that something is badly wrong. It would also be easy to write an alert to that effect. E.g. otel.cardinality.overflows would become:

#TYPE otel_cardinality_overflows_total counter
#HELP otel_cardinality_overflows_total a count of the number of measurements which overflowed cardinality limits
otel_cardinality_overflows_total{metric_name="mycounter"} 1

However, this would remove the property of the total still being correct, which was one of the goals of the original proposal.

@austinlparker added the triage:deciding:community-feedback label on May 14, 2024
@reyang
Member

reyang commented Sep 26, 2024

From the meeting: The goal is to make a lot of noise when a metric overflows. Something has gone very wrong.

One possible alternative to an overflow label on an existing metric would be an overflow metric with a label for the metric name. This would still provide the "loud noise" to warn users that something is badly wrong. It would also be easy to write an alert to that effect. E.g. otel.cardinality.overflows would become:

#TYPE otel_cardinality_overflows_total counter
#HELP otel_cardinality_overflows_total a count of the number of measurements which overflowed cardinality limits
otel_cardinality_overflows_total{metric_name="mycounter"} 1

However, this would remove the property of the total still being correct, which was one of the goals of the original proposal.

Correct, the alternative solution here would not be able to provide the correct total/sum, which is one critical requirement that we want to meet.

I plan to move forward with the current approach (having the overflow attribute on the existing metric) after discussing with the other TC members during the Sep. 25, 2024 TC Meeting. Please either ping me on Slack or comment here before the end of Sep. 2024.
