-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: [processor/interval] Implement the main logic #30827
Conversation
@djaglowski This isn't ready to merge yet, but I wanted to get what I had posted so you all could give initial feedback. cc @sh0rez and @0x006EA1E5. I'd love your feedback as well, if you have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RichieSams, I apologize but I'm finding this difficult to review while we have another processor under development which has to solve so many of the same concerns. Caching rules, identity, emit interval. Since the other processor appears to have a more detailed design proposal, I'd like to wait on this and see if we can't follow along by default, diverge only where necessary.
That's fine. I understand |
e099f5c
to
13b6e8a
Compare
Adds a new internal, _experimental_ package `metrics/identity` which implements identity types for resource, scope, metric and stream. This is closely related to work being done in #30707 and #30827. The package is specifically experimental, as it shall be treated as an internal component to above processors which may change at any moment as long as those are under active initial development. /cc @jpkrohling @djaglowski @RichieSams
… staleness It's a glorified wrapper over a Map type, which allows values to be expired based on a pre-supplied interval.
So Staleness isn't so tightly tied to it
So users are forced to use the correct methods. Also adds lots of documentation
e30edf3
to
303390e
Compare
|
||
sum := m.Sum() | ||
numDP := sum.DataPoints().AppendEmpty() | ||
dp.DataPoint.CopyTo(numDP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update the timestamp to now()
. Now() should be calculated once at the top and the same value used for everything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should set any timestamps to "now". We are aggregating data points so should use the timestamps they carry in a meaningful way. The appropriate behavior may depend on the type of metric, but for example if aggregating a sum, I would expect the start time to be the earliest from all data points being aggregated, and the end time to be the latest from those data points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't update the timestamp, then why are we continuing to re-export them? Depending on the final destination, we're either creating duplicate data, or requiring the destination to implement data deduplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why we would continue to re-export data points when there is no new data to report. That said, I haven't had time to focus on the nuances of this processor and likely won't for the foreseeable future so I'll leave it to others to decide. It just seems to me that we should be reporting a direct representation of the data points we actually receive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. What you're describing is then similar to the "second" approach I describe here: #29461 (comment)
I think both approaches are "valid". But we just need to decide which this processor is doing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out. The second approach sounds better to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For us, the principal use case of this set of features is to be able to send count connector metrics to a Prometheus remote write endpoint.
The count connector sends a stream of increment data points (deltas with value one).
Prometheus expects a steady stream of cumulative data points, published at the "scrape interval".
For a Prometheus scrape, consecutive reads of a counter which hasn't incremented will read the same value each time, so we see the behavior of a steady stream of repeating, unchanging values, every interval.
It is my understanding that downstream metrics systems based on Prometheus expects to see this steady stream of unchanging cumulatives in this case, and if there is a gap they will see this as a break in the series and may not behave as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the behaviour of the collector's own metrics, when the service's telemetry is configured to export metrics via a periodic reader as follows:
telemetry:
metrics:
level: detailed
readers:
- periodic:
interval: 15000
exporter:
otlp:
endpoint: localhost:4321
protocol: grpc/protobuf
we see Sum metrics exported at every interval, even when there is no change in value. For example, I see a series of 0
values for metric otelcol_exporter/send_failed_metric_points
.
Therefore, I think we somehow need to be able to get this behaviour in the count connector scenario described above (or any case where source metric datapoints are received only on every change in value).
To be clear, I mean here to transform a stream of datapoints received at a variable frequency into a stream of datapoints at a fixed frequency, continuing to publish new datapoints even when there is no change in value, as we see in both the Prometheus scrape case, and the collector's own metrics case.
I guess we could rescope this to only publish datapoints on change, but if we do that, we would then need another process to meet the need described here, and it's not quite clear to me how that other processor could work if it wasn't also doing what this processor would do.
I'm not clear what the concrete use case is for not publishing new datapoints when there has not been a change in value, but if there is a use case for both modes of operation, then maybe this could be selected by a config option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the general idea with not publishing when there is no new information is that, in theory at least, it's not useful.
If we are sure that backends expect the no-new-info points, then we should probably include them in this processor's output. However, I am surprised to here this because I have not seen this in my experience (though I have barely touched prometheus) and because @sh0rez's comment here indicates that prometheus does not need these data points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prometheus will consider a timeseries "stale" if it doesn't see any data points for some time (default 5 minutes). See here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.
Imagine you have an alert configured in Prometheus Alert Manager based on the ratio of the collector's metrics otelcol_exporter_send_failed_metric_points
and otelcol_exporter_sent_metric_points
, say to trigger an alert when the ratio of failed is greater than 10 percent.
Prometheus can only calculate the ratio where it has values for both series, so following the staleness rule above, only up to five minutes since the last datapoint recorded for both series.
We can see how we could easily get in trouble if we didn't publish new data points for a series that didn't change in this case.
If otelcol_exporter_sent_metric_points
stopped incrementing, and otelcol_exporter_send_failed_metric_points
did start to increment, then depending on the time window used to calculate the ratio etc, it could take more than five minutes for the ratio to exceed 10 percent. So, if we don't continue to have new datapoints for the unchanging otelcol_exporter_sent_metric_points
series, we wouldn't actually trigger the alert in this case.
Of course this is a synthetic example, but I think it illustrates how the convention of Prometheus datasources continuing to periodically produce datapoints even when a series' value hasn't changed, is sometimes relied upon.
Adds a new internal, _experimental_ package `metrics/identity` which implements identity types for resource, scope, metric and stream. This is closely related to work being done in open-telemetry#30707 and open-telemetry#30827. The package is specifically experimental, as it shall be treated as an internal component to above processors which may change at any moment as long as those are under active initial development. /cc @jpkrohling @djaglowski @RichieSams
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Description: This PR implements the main logic of this new processor. Link to tracking Issue: #29461 This is a re-opening of [30827](#30827) Testing: I added a test for the main aggregation / export behavior, but I need to add more to test state expiry Documentation: I updated the README with the updated configs, but I should add some examples as well.
Description:
This PR implements the main logic of this new processor.
Link to tracking Issue:
#29461
Testing:
I added a test for the main aggregation / export behavior, but many more need to be added to test the various edge cases.
Documentation:
I updated the README with the updated configs, but I should add some examples as well.