Clarification needed on how values for attributes filtered by a view are used #2905
Comments
I would be interested to hear how Java, .NET, and other stable implementations behave here. Unfortunately, my ability to write examples there is non-existent.
The Python implementation sounds buggy to me. For additive instruments (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter), removing a spatial dimension should result in a sum of the values.
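A minimal sketch of what "removing a spatial dimension should result in a sum" could look like, written in plain Python rather than against any SDK. The function name `remove_dimensions` and the `(attributes, value)` pair representation are assumptions made for illustration; the input values are the ones from the issue description (`1` for `version=1`, `2` for `version=2`, view keys `{"foo"}`).

```python
from collections import defaultdict


def remove_dimensions(points, keep_keys):
    """Re-aggregate additive data points after dropping attribute keys.

    points: iterable of (attributes_dict, value) pairs (hypothetical shape).
    keep_keys: attribute keys the view retains; all other keys are dropped.
    """
    totals = defaultdict(int)
    for attributes, value in points:
        # Keep only the attributes named by the view, in a hashable form.
        reduced = frozenset(
            (key, val) for key, val in attributes.items() if key in keep_keys
        )
        # Additive values that collapse onto the same attribute set are summed.
        totals[reduced] += value
    return dict(totals)


observations = [({"version": "1"}, 1), ({"version": "2"}, 2)]
result = remove_dimensions(observations, keep_keys={"foo"})
print(result)  # {frozenset(): 3}
```

Neither observation carries a `foo` attribute, so both collapse onto the empty attribute set and the additive merge yields 3.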
@reyang can you comment on how .NET handles this? Does that project add them together?
#1874 is an old issue discussing this. I can confirm that .NET does not handle this correctly, and has the same bug as Python. (I'll create an issue to track it.)
https://github.com/open-telemetry/opentelemetry-dotnet/blob/73f8d3cb0160f633661854f41866dcdb70b81069/test/OpenTelemetry.Tests/Metrics/MetricAPITest.cs#L800-L803 is a unit test showing the same bug in .NET. (I looked at it at the time and it was not trivial to fix, so I just added the test to come back to this later.)
Gotcha, thanks for confirming 🙏
@jack-berg can you confirm the behavior of async counters in Java in the presence of an "attribute reducing" view? I'm wondering if there is any implementation that sums these values.
This is a good question. The Lightstep metrics SDK and the v0.31 and earlier OTel-Go metrics SDK did sum these instruments. A case where this potentially matters is the Go runtime/metrics package, which outputs cumulative Counter and UpDownCounter values in a form that is natural for use with asynchronous instruments. For example, runtime/metrics outputs three metrics, one of which is a total of the other two:
In the instrumentation package I wrote (forked from the contrib repo), https://github.com/lightstep/otel-launcher-go/tree/main/lightstep/instrumentation/runtime, the pattern is used to discard the total, since the SDK or a downstream consumer can easily recompute the total using attribute removal and the natural merge function for the data point. So in the example above, a cumulative GC count for
A good way to think about this problem: whether the dimension is removed in the SDK (e.g. by configuring a View), by the Collector (by reaggregation), or at the backend/consumption (via PromQL), the results should be the same.
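One way to illustrate this consistency requirement, as a sketch only: with a sum as the merge function, dropping attributes in stages (first at one hop, then at another) gives the same total regardless of order. The attribute names `host` and `version`, the sample values, and the helpers `attrs` and `drop_key` are all hypothetical.

```python
from collections import Counter


def attrs(**kwargs):
    """Build a hashable attribute set from keyword arguments."""
    return frozenset(kwargs.items())


def drop_key(points, key):
    """Sum additive points after removing a single attribute key."""
    merged = Counter()
    for attributes, value in points.items():
        reduced = frozenset(pair for pair in attributes if pair[0] != key)
        merged[reduced] += value
    return dict(merged)


points = {
    attrs(host="a", version="1"): 1,
    attrs(host="a", version="2"): 2,
    attrs(host="b", version="1"): 4,
}

# Path 1: one hop drops "version", a later hop drops "host".
staged = drop_key(drop_key(points, "version"), "host")
# Path 2: the same removals applied in the opposite order.
reordered = drop_key(drop_key(points, "host"), "version")

print(staged, reordered)  # {frozenset(): 7} {frozenset(): 7}
```

A last-value merge would not have this property, because the surviving value would depend on which point happened to be processed last at each hop.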
@reyang should the Also, should the API specification be updated to normatively require these instrument types be treated as additive?
The otel-cpp implementation sums the values for sync counters matching the "attribute reducing" view, but selects the last value for async counters. This needs to be fixed; I will create an issue to track this.
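The difference between the two behaviors described here comes down to the merge function applied when several observations land on the same attribute set. The sketch below is plain Python, not otel-cpp code; `collapse` is a hypothetical helper, and the observations are the ones from the issue description.

```python
def collapse(observations, merge):
    """Collapse all observations onto one attribute set using `merge`."""
    result = None
    for _attributes, value in observations:
        # The first value seeds the result; later values are merged into it.
        result = value if result is None else merge(result, value)
    return result


observations = [({"version": "1"}, 1), ({"version": "2"}, 2)]

# Last-value merge: the newest observation overwrites the previous one.
last_value = collapse(observations, merge=lambda old, new: new)
# Sum merge: what the thread expects for additive instruments.
summed = collapse(observations, merge=lambda old, new: old + new)

print(last_value, summed)  # 2 3
```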
Which list?
I think no. The additive property applies to Sums, and is covered by the SDK specification https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#default-aggregation. It is perfectly fine for the SDK user to configure an UpDownCounter as something different (e.g. gauge) and not apply the additive property.
The list included in the sentence at the end of the linked section:
Where is it covered? I don't see anywhere in that linked section defining sums as additive. It also seems like the term additive is applied to instruments, not aggregations. This would make sense as that is where attributes are also defined. i.e.
I think yes it can, not necessarily as it is only providing examples (without promising that it will be a complete list).
Yeah, I can see the challenge here. The data model specification covers "additive" in the temporality section, and temporality is referenced by the sums section. I wrote the supplementary guideline after discussing with @jmacd, hoping it would provide some help for folks who struggle to understand the data model and API specification. Maybe we should improve the data model spec to bring more clarity by making it more streamlined.
Gotcha. Any help clarifying this in the specification would be appreciated 🙏
I think the spec does require it, although it is not expressed in a normative way.
Given all these steps, there is a 90% chance folks will run into a trap unless they ask someone who has been deeply engaged in all three specs...
Java doesn't do this correctly either. Opened open-telemetry/opentelemetry-java#4901 to track.
I want to try to clarify the workings of the async instruments and their default aggregation. In the case that was raised here, there is a filter and two observations record to the same filtered counter; we expect the result to be the sum of both observations. What about other similar cases:
Would it be a good idea to put together a set of edge cases, and the expected output?
@MadVikingGod could you give some concrete examples? I found that the cases you've described can be interpreted differently. Ideally something like this would help to provide clarity https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example. If folks feel this area needs more examples to facilitate the understanding, I can add a section under https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#additive-property to explain the spatial aggregation.
That is the gist of what I was asking about, just more tightly focused on edge cases we have historically found to be non-obvious, like the behavior in this issue. If we were to ask "What should be the result in these 4 (+ the original) cases?" I think we would see different interpretations. While I think the supplementary guidelines are useful, they are verbose enough that they can only cover a small set of behaviors. I was thinking more along the lines of a document that can easily be translated into tests to help with the validation of these interactions of features. This issue might look like this: https://gist.github.com/MadVikingGod/e3e65d6217d321d5de8ded159f4abc7e
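As a rough idea of how such a document could translate into tests, here is a hedged sketch of a table-driven check in plain Python. It does not reproduce the linked gist; the case format, `apply_view`, and `run_cases` are hypothetical, and only two cases are listed: the one from this issue (expected sum per the discussion above) and a view that keeps the `version` key.

```python
from collections import defaultdict


def apply_view(observations, attribute_keys):
    """Reference behavior: filter attribute keys, then sum colliding values."""
    totals = defaultdict(int)
    for attributes, value in observations:
        kept = tuple(
            sorted((k, v) for k, v in attributes.items() if k in attribute_keys)
        )
        totals[kept] += value
    return dict(totals)


OBSERVATIONS = [({"version": "1"}, 1), ({"version": "2"}, 2)]

# Each case is plain data, so it could live in a shared, language-neutral file.
CASES = [
    {
        "name": "view drops every observed key",
        "attribute_keys": {"foo"},
        "expected": {(): 3},
    },
    {
        "name": "view keeps the observed key",
        "attribute_keys": {"version"},
        "expected": {(("version", "1"),): 1, (("version", "2"),): 2},
    },
]


def run_cases(cases):
    """Return the names of the cases whose output differs from `expected`."""
    return [
        case["name"]
        for case in cases
        if apply_view(OBSERVATIONS, case["attribute_keys"]) != case["expected"]
    ]


failures = run_cases(CASES)
print(failures)  # []
```

In a real conformance suite, `apply_view` would be replaced by a call into the SDK under test, with the case table left unchanged.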
The gist reminded me of https://github.com/w3c/trace-context/tree/main/test, which is a big commitment to introduce and maintain. I feel it is probably a separate topic on its own.
Yes, I've wanted something like the tracecontext interop tests for tracing and metrics so badly. I even started on it at one point... We probably need a SIG for it :)
Currently the specification states:
For asynchronous counter instruments (counter and up-down-counter), how should their sums be reported when this "filter" is defined?
For example, if there is an asynchronous counter that observes 1 for the attribute set version=1 and then, in the same callback, 2 for the attribute set version=2, and a list of attribute keys equaling {"foo"} is used in a view, both attribute sets (version=1 and version=2) should be ignored according to the specification. Therefore, they both become observations of the empty attribute set. How should their sums be combined? Should the SDK report 2, the last recorded value for the empty attribute set? Or should it report 3, the combination?
I had assumed the latter would be expected, but when looking at the Python implementation they use the former. This is based on the limit_num_of_attrs.py example:
(Caveat: my Python understanding is limited, so I could have missed something here.)
When running this:
The value reported is 2.
cc @jmacd @bogdandrutu @reyang @open-telemetry/python-approvers