Metric cardinality limits "overflow attribute set" causes inconsistent attribute keys #3578
Comments
Not sure about existing implementations. This part of the spec could be interpreted otherwise.
Yeah that is a bit suspect. I wonder if that aggregator is expected to be included in collection results before any measurements are received?
+1
For my understanding, it seems like it should mean one thing, but it could also be interpreted differently (and I think that is what @aabmass shows above). Is that right? Or is there another interpretation?
In any event, the issue of inconsistent attribute keys would still come up if an overflow is reached.
I think the point of this overflow attribute was to keep you from losing data once the limit is reached. You will lose data either way, but with the overflow attribute, at least the overall sum would be right. From my reading of the section you mentioned, the overflow should be taken into account on a per-metric-key basis. So, from how I understand the wording today, this would be the right way of doing it:
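A minimal sketch of that reading (the metric names, attribute values, and numbers below are made up for illustration): each metric keeps its own streams, and once its limit is hit, everything else is folded into a single per-metric overflow data point that carries only the otel.metric.overflow attribute.

```python
# Hypothetical collection result -- every name and value here is illustrative.
collection_result = {
    "http.server.request.count": [
        {"attributes": {"http.route": "/users"}, "value": 120},
        {"attributes": {"http.route": "/orders"}, "value": 95},
        # all routes beyond this metric's own limit, folded into one data point:
        {"attributes": {"otel.metric.overflow": "true"}, "value": 4312},
    ],
    "queue.depth": [
        # this metric never hit its limit, so it exposes no overflow point
        {"attributes": {"queue.name": "billing"}, "value": 7},
    ],
}
```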
I don't know if recommending sending "empty" attributes, especially when not sending to Prometheus, makes sense.
@pirgeo I suppose the problem is that the overflow stream has "floating" values for all of the missing labels. It can only really be used if you DON'T group by anything. I'm not sure how many metric backends would understand this.
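A small sketch of that concern (attribute names and numbers are made up): grouping by an attribute the overflow stream does not carry silently lumps all of the overflowed data into an empty group.

```python
from collections import defaultdict

# Made-up streams for one metric: two regular streams plus the overflow stream,
# which has no value at all for the attribute "a".
streams = [
    ({"a": "x"}, 10.0),
    ({"a": "y"}, 7.0),
    ({"otel.metric.overflow": "true"}, 55.0),
]

grouped = defaultdict(float)
for attributes, value in streams:
    grouped[attributes.get("a", "")] += value  # the overflow data falls into the "" group

print(dict(grouped))  # {'x': 10.0, 'y': 7.0, '': 55.0} -- only the ungrouped total is trustworthy
```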
That was the whole point of the overflow attribute, I think. If you don't have an overflow attribute but want to cap the used memory at some point, you have to start throwing away streams that would still influence the grand total. The point of the overflow label is to say: actually, we've hit the limit of what we want to keep in memory, but instead of dropping your data, we are collecting everything that's not part of one of the existing streams here, so you can at least know the grand total of everything that has come in. You will lose data either way, but at least the total (when not grouping) is right. Seeing something like this should probably be followed up by either reducing the number of splits you introduce in your data (i.e., removing some attributes) or increasing the cardinality limit. I don't think this is a state you want to be in for extended periods of time.

As for the Prometheus implementation: I am not really sure what the "right" way to send it to Prometheus is; maybe it is to set all attributes to empty values. Generally, I don't think we need to send all dimensions with empty values regardless (in OTLP, for example, there is no requirement to do it that way). I guess how to deal with such data becomes a backend topic.
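A toy sketch of that behavior (not the SDK's actual implementation): once the stream limit is reached, new attribute sets are folded into the overflow series, so the sum over all series still accounts for every recorded measurement.

```python
# Toy cardinality-limited sum aggregation; names and the limit are illustrative.
OVERFLOW_KEY = (("otel.metric.overflow", "true"),)

class BoundedSum:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.streams = {}  # attribute-set key -> running sum

    def record(self, value: float, attributes: dict) -> None:
        key = tuple(sorted(attributes.items()))
        if key not in self.streams and len(self.streams) >= self.limit:
            key = OVERFLOW_KEY  # new attribute sets beyond the limit land here
        self.streams[key] = self.streams.get(key, 0.0) + value

agg = BoundedSum(limit=2)
for route in ["/users", "/orders", "/search", "/admin"]:
    agg.record(1.0, {"http.route": route})

# Per-route detail is lost for /search and /admin, but the grand total is intact:
assert sum(agg.streams.values()) == 4.0
```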
From the meeting: the goal is to make a lot of noise when a metric overflows, because something has gone very wrong. One possible alternative to an overflow label on an existing metric would be an overflow metric with a label for the metric name. This would still provide the "loud noise" to warn users that something is badly wrong, and it would also be easy to write an alert to that effect, e.g.:
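A rough sketch of that alternative (the metric and attribute names here are hypothetical, not from the spec): a dedicated overflow metric whose data points name the affected metric, so an alert can simply fire on any non-zero value.

```python
# Hypothetical shape of a dedicated overflow metric; names are illustrative only.
overflow_points = [
    {
        "metric": "otel.sdk.metric.overflow.count",
        "attributes": {"metric.name": "http.server.duration"},
        "value": 42,
    },
]

# The "loud noise": alert whenever any metric has overflowed at all.
alerts = [p["attributes"]["metric.name"] for p in overflow_points if p["value"] > 0]
print(alerts)  # ['http.server.duration']
```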
However, this would remove the property of the total still being correct, which was one of the goals of the original proposal.
Correct, the alternative solution here would not be able to provide the correct total/sum, which is one critical requirement that we want to meet. I plan to move forward with the current approach (having the overflow attribute on the existing metric) after discussing with the other TC members during the Sep. 25, 2024 TC Meeting. Please either ping me on Slack or comment here before the end of Sep. 2024.
The Cardinality Limits section of the Metrics SDK spec says that once a stream's cardinality limit is reached, measurements for any new attribute set are aggregated into an overflow data point whose attribute set contains only otel.metric.overflow set to true.
This will cause every metric from the SDK to have an inconsistent set of attribute keys across its streams:
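For illustration (the attribute a and all values are placeholders, shown here as plain data rather than Prometheus exposition text): the regular streams carry a, while the overflow stream carries only otel.metric.overflow.

```python
# Placeholder streams for a single counter once its cardinality limit is hit.
streams = [
    {"attributes": {"a": "1"}, "value": 10},
    {"attributes": {"a": "2"}, "value": 3},
    {"attributes": {"otel.metric.overflow": "true"}, "value": 27},  # no "a" key at all
]
```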
Specifically for Prometheus/OpenMetrics, exposing an inconsistent set of label names across the series of a single metric family is considered bad practice.
One solution is to add the overflow label to every metric:
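Again a made-up sketch: every stream carries otel.metric.overflow, and the overflow stream carries a with an empty value, so all streams of the metric share the same key set.

```python
# Placeholder streams where every series exposes the same attribute keys.
streams = [
    {"attributes": {"a": "1", "otel.metric.overflow": "false"}, "value": 10},
    {"attributes": {"a": "2", "otel.metric.overflow": "false"}, "value": 3},
    {"attributes": {"a": "", "otel.metric.overflow": "true"}, "value": 27},
]
```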
IMO this is pretty clunky. The user also has to be careful when grouping by `a` in this example to explicitly filter `otel.metric.overflow=false`, or the `a=""` stream will be grouped with the overflow stream containing other values for `a`.