-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support explicitly removing attributes from an instrument API #3062
Comments
I found out now it was also discussed in #2747, but I wanted to:
|
I read the use case and I'm a bit confused: There's this part:
Makes sense to me. New broker starts reporting metrics for the topic. But then there's this:
How is it reported twice? If B1 does not emit metrics once it's moved to another broker? And for the case of Sorry if I missed something obvious, I'm just trying to understand the use case. |
Today it is not the case. But say I'm switching over all Apache Pulsar metrics to OTel, without that feature of removing attributes ("If you can't remove the attributes associated with that topic from that instrument once the topic has been unloaded from B1), then the following will happen: "once it is loaded into B2, that metric will be reported twice". I was making the case why having the feature of removing attributes from an instrument explicitly is crucial to using the epic of moving to OTel in Apache Pulsar.
That means I did poorly explaining, so I'll add an example to explain better: Say topic This means the following will be exported (using Prometheus format here). I included 2 additional topics before and after, too so we won't think this broker has only 1 topic.
Once
In
and one of the topic class variables would be
|
I see your point. I'm not much familiar with prometheus scrapes but I was thinking the timestamp would stay the same, or the value would be reset, so something to indicate that this is 'stale' and didn't change from the last scrape so = ignore it. |
To confirm I got it right: Basically, you want to stop exporting the metric with the label for one broker, and instead want to start exporting it with the label for another broker. You need the first broker to not export that metric again since the work is now done elsewhere (on the other broker) and if you reported it once for the "old" (or no longer active) broker, and once for the "new" (the now active broker) you would have a duplication. Coming from this:
The thing you are trying to avoid is:
Because that would be double counting. For a Gauge it would be possible to model this by setting the "old" gauge to 0, and setting the "new" gauge to the "old" gauges previous value. So something like: After the move:
But:you would want to get rid of the one showing 0 as well to only have
right? |
Correct @pirgeo |
@joaopgrassi Well, the timestamp must advance. In fact, Prometheus exporters usually don't set the timestamp on the exported text, but it is the Prometheus server adding the timestamp of the scrape during the scrape process it self, before persisting it to memory and WAL. From my knowledge there isn't a mechanism to indicate this value is not used anymore, hence Prometheus java client has the ability to remove an attributes set (called labels in Prometheus) from a given Collector: public void remove(String... labelValues) {
children.remove(Arrays.asList(labelValues));
initializeNoLabelsChild();
} |
@pirgeo and @joaopgrassi - What is required to move this issue forward? I would to discuss other concerns, and see we can progress into making it into the specifications, I just don't know the process yet. |
I think this is also related to #2711 (maybe even a possible duplicate?). To me, it seems like this is not a controversial topic, just one that doesn't affect a lot of folks, and therefore the responses are a bit slow. As for pushing the issue: You could try bringing it up in the next spec meeting (https://github.com/open-telemetry/community#specification-sigs, see "Specification: General") on Tuesday. Everyone is invited to add items to the agenda. If you do, I would also recommend joining the meeting (it is a public meeting) and stating your case there. It might be easier to get your point across there and to get a spotlight on this issue. Usually, there is a moderator making sure that all issues on the agenda get some air time, and are discussed! |
I think they are "sibling" issues :) That issues mentioned refers to SDK automatically detecting a certain attribute set is not being reported for x collection intervals, hence the SDK can decide to stop reporting it - implicit. Thanks for the spec meeting - I visited once in August to ask a question, and @pirgeo, you are correct that I should join the next one. Thanks for the tip! |
One possible solution, which would not require changes to the current instrument API, would be to use the disappearance of an attribute set from an asynchronous instrument callback as a signal that that attribute set no longer exists. See #1891 (comment) and open-telemetry/opentelemetry-go#3605 for details. We could add Delete() to synchronous instruments at a later point if there is still a need. |
@dashpole Using asynchronous attributes to achieve it has 2 disadvantages: It's not explicit, and less robust (more code): |
@asafm Totally agree it isn't the ideal solution for your use-case. I'm no java expert, but it should be possible to implement your desired synchronous API (an "instrument" with Add() and Delete()) using an async instrument. That would at least keep your code readable. In Go, I would do something like: type myInt64Counter struct {
values map[[]attribute.KeyValue]int64
}
// Callback is what I can use when creating my async instrument
func (m *myInt64Counter) Callback(ctx context.Context, observer instrument.Int64Observer) error {
// observe all of the values
for attrs, value := range m.values {
observer.Observe(value, attrs...)
}
return nil
}
func (m *myInt64Counter) Add(ctx context.Context, incr int64, attrs ...attribute.KeyValue) {
// increment values[attrs] by incr
}
func (m *myInt64Counter) Delete(ctx context.Context, attrs ...attribute.KeyValue) {
delete(m.values, attrs)
} My hope is that it would unblock you without needing an addition to the current instrumentation API, even if it isn't ideal. |
@dashpole Thanks. This is a good solution as a work-around. |
@dashpole Do you any reason not to approve this proposal? |
I'm in favor of being able to remove attribute sets from synchronous instruments. In terms of the proposal, you've talked about what delete() will do, but a proposal also needs to cover what the behavior is if the deleted attribute set is re-observed with Add() later on. I've put forward suggestions for how I would like to see that handled in other linked issues. One challenge is that SDKs currently have a lot of flexibility WRT how they deal with start times (e.g. the async cumulative guidelines) and resets. We would presumably be requiring SDKs to implement a reset mechanism by introducing delete(), as I can delete() and then add() the same attribute set to produce a reset. Another challenge is that most APIs/SDKs have had stable releases already, and (depending on the language) might not be able to introduce an experimental function (i.e. they would have to wait for this new spec to reach stability). Finally, spec changes need TC + spec approvers (which I am not currently a member of), and involvement from multiple SIGs to implement and prototype changes. TC members in particular are balancing a lot of priorities within the community, and may prefer to tackle other issues first. |
I really like this proposal. It means that if we clarify an under-defined area of the OTel specification we could benefit from not only reducing overhead for aysnc instruments, but allow the building of an extension that supports the requested feature. |
@dashpole Thanks a lot for the detailed response. I have several follow-up questions :)
You said you put suggestion on other linked issues - any chance you can paste those links?
So you mean that even if this gets clarified, approved and merged, other SDKs can't introduce an experimental function because they don't want to disrupt users? Doesn't mark that as experimental, helps in that case?
So basically this means - "this will take time", right?
I think this text is missing some context, or maybe the problem is with me, as I can't understand it all the way. Is there anyone you recommend I can get some help from on this? So to move forward with this:
|
It really depends on the language.
More or less. I'm also pointing out that I'm not the one you need to convince ;).
Your comment at the sig meeting "can't we just use the current time as the start time when it reappears" probably solves this. For an example of what I was trying to get at, it is currently valid to use the time at which the instrument was created (basically the process start time) as the start time of all cumulative metrics. After the addition of Delete(), that behavior would no longer be valid for SDKs. I would hope (and we will eventually need prototypes) that switching from that behavior to the one you suggest won't be hard. |
I read the section of resets and gaps in the spec and also the supplementary guidelines examples. I quote here the section in the example:
Let's assume we have two conditions met:
Also, both the reset and gaps and supplementary guidelines, actually write quite explicitly what to do when the attributes set re-appears: if you use per-attributes start time, then use the last collection time as the start-time. If you use process start-time, you can use that. If that is the case, what do you see as "blocker" for adding Delete() to the API? I have one issue I need some help with to resolve. So in memory, I have the current number of in-flight requests per region. When the user will do @jack-berg @dashpole @jmacd what do you think? |
I would love for that to be the case. Where do you see that?
If using delta, I think delete() would be a no-op, as we are basically just aggregating Add() calls from each collect. If using cumulative temporality, the only thing I can think of is that the counter would have to be reset in that case with a new start time and counter value. |
Ok, so after reading the reset and gaps 6 - 8 times (Either it's written in a super complicated way, or I simply can't get it), you are correct. It is only written explicitly in the supplementary guidelines, and I quote:
Reset means you place 0 for the value, but the software doesn't know that, it thinks the value is the current storage value as it has attributes per region. Say you delete That's quite a problematic issue. |
Just getting back from leave.
You should only delete an attribute when you know for sure that no more observations will be made with it, right? Any subsequent observations would be treated as "brand new" after a delete, so that behavior seems intended. |
@dashpole I'll start by re-pasting my problematic scenario from above: I have one issue I need some help with to resolve. So in memory, I have the current number of in-flight requests per region. When the user calls I know for sure I won't report any observations for |
Hi, this is a fairly common usecase and the Prometheus SDK has a very elegant API for this imo. One use-case: In Mimir, we expose a number of metrics per tenant (active series, samples received, queries run, rules executed, etc.). New tenants are created and old tenants are cleaned up and this is a continuous process. If we don't clean up the stale tenant metrics, the cardinality exposed will continuously keep growing and this is undesireable. In Prometheus SDK we achieve it by using a custom For example, you can have a |
@gouthamve I'm not an expert on this (yet). I can tell you from what I know:
I still want this a lot, since Histogram is not solvable - no async histogram. I personally haven't gotten around this yet. |
What are you trying to achieve?
I would like to have the ability to remove previously reported Attributes to an instrument using the OTel API
What did you expect to see?
The API specification should be changed by adding an operation named Remove to each instrument listed in the API specifications.
Let's take Java API as an example. Today we have:
I would like to have
This operation/method will remove any accumulated measurements in memory and any reference or usage of the instance of the supplied
Attributes
from the in-memory storage.Additional context.
The motivation would be best demonstrated in a real-world use case. I work on Apache Pulsar, which is a messaging platform (It's like Kafka + RabbitMQ). It is a distributed system composed of Brokers. Each broker is responsible for a set of topics out of the topics in the cluster (this is how the data is sharded). Topics have many instruments, meaning each topic has many metrics (i.e., Attributes in several instruments). A topic can move between brokers as part of automatic load balancing or a human-triggered move. Once a topic has moved from broker B1 to B2, we don't expect B1 to emit any metrics related to the moved topic, and we expect B2 to emit those metrics. For example, a topic can have a metric "broker.messaging.topic.storage_size", which counts the total size of messages stored for this topic in the persistent storage. If you can't remove the attributes associated with that topic from that instrument once the topic has been unloaded from B1, then once it is loaded into B2, that metric will be reported twice.
(Just for clarification, Apache Pulsar is yet to use OTel as we now explore it and map existing gaps).
Existing metric libraries in the Java ecosystem do support this type of removal:
This operation also reduces memory footprint since the software developer knows those attributes will never receive further measurements.
The text was updated successfully, but these errors were encountered: