Proposal: Metric Aggregation Processor #4968
Comments
We have the metricstransform processor, which does some aggregation. Moving this to after GA in case there is more work that we need to do in this area. |
I'm happy to help with implementing that functionality. Is someone already working on it? Should such a processor belong in the core repository or the contrib repository? |
Hi @nabam We (@alolita @amanbrar1999 @JasonXZLiu) are working on this. You're welcome to code review and provide feedback when we submit a PR. |
@alolita that's great! It's going to be very useful for streaming telemetry from AWS Lambdas. Do you have any time estimates for that? I want to understand whether we have to build a temporary custom solution while you are working on it. |
I believe that the use cases for this issue have since been solved by changes to the metrics proto, such as the addition of cumulative aggregation temporality. I could be wrong; does anyone know of any specific use cases for this processor? For example, I believe this is a use case that has since been resolved: open-telemetry/opentelemetry-collector#1541. In the last comment of that linked issue it is mentioned that temporality translation exists as of OTLP v0.5, which I believe is what this processor is intended for. |
@amanbrar1999 one of the use cases would be collecting metrics from short-lived jobs such as AWS Lambdas. Exporting uniquely tagged cumulative metrics for every Lambda instance causes high cardinality in the time series they produce. |
Having something like this will make open-telemetry/opentelemetry-python#93 possible. Without this, Python WSGI applications using … |
Howdy! @huyan0 @alolita, was there any progress towards implementing this proposal? I'm currently validating a standalone (multiple workloads) OTel Collector for … I have taken a look at:
Currently investigating this approach open-telemetry/opentelemetry-js#2118 although that will get affected by cardinality. |
Thanks for this amazing proposal! @alolita was there any progress on the implementation? We are considering contributing to this. Don't want to reinvent the wheel of course. |
@huyan0 @alolita @amanbrar1999 @JasonXZLiu Any progress on this issue?
One use case we are having is using the statsd receiver and the prometheusremotewrite exporter together. Given that the statsd receiver produces delta metrics, we will need this processor to convert them into cumulative metrics before handing them over to the prometheusremotewrite exporter, which only accepts cumulative metrics. |
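A minimal sketch of the pipeline this use case implies, assuming a `deltatocumulative` placeholder name for the proposed conversion step (that processor name, and the endpoints shown, are illustrative assumptions, not settings from this thread):

```yaml
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"        # the statsd receiver emits delta metrics

processors:
  # Placeholder for the proposed aggregation processor that would convert
  # delta data points into cumulative ones before export.
  deltatocumulative: {}

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.com/api/v1/write"   # hypothetical endpoint

service:
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [deltatocumulative]
      exporters: [prometheusremotewrite]
```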
Looking through some of the Metrics Data Model use cases in the specification, I understand any mentions of … We (Skyscanner) are currently PoC'ing a solution to generate … Having something like this would make our pipelines much simpler. Is somebody already working on this? I've also read there may be plans on the … |
A little bump for an update here. Something like this would reduce the complexity of the OTel config for cron jobs and Lambda functions. |
@alolita @amanbrar1999 @JasonXZLiu any update on this? |
Bump. Folks, is there any guidance and/or a timeline on this proposal, whether it ends up accepted or rejected? It has been more than two years now. Kindly advise what operators should expect in the future. |
+1 |
I can't speak for the maintainers; however, I see the benefit of a dedicated processor to perform aggregations, as a means of reducing the complexity of having other processors reimplement this. I don't think this is strictly related to Prometheus, but Prometheus users would greatly benefit from it. |
This feature would be highly beneficial to my team as well; we would be happy to lend a hand with the implementation. |
+1 <3 |
@kovrus Hi Ruslan. What does it mean that you are sponsoring? Are you doing all the implementation work, or are you responsible for finding someone who will work on it? |
Hi @bitomaxsp, please take a look at the first point of the Adding new components guidelines. It provides a good explanation of sponsorship. |
Any way that we can help get this started? It was brought up in the APAC end user group as something that was needed. |
Would the proposed solution to this issue allow aggregation of metric data points over user-defined intervals? We gather metrics using the OpenTelemetry Java Agent, batch them using the OpenTelemetry Collector, and export them to Grafana Cloud. Grafana Cloud plans include a maximum "data points per minute" (DPM, cf. https://grafana.com/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/#data-points-per-minute-dpm), which we would like not to exceed. If the OpenTelemetry Collector supported aggregating each time series in intervals of 60s, that would be a great help. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping |
There is also a bigger issue when it comes to accumulating intervals from "lambdas". That, I think, is not covered by the OTel SDK at all, IIRC. It would be nice to come up with some guidance on how to deal with that and include it in this processor as well. |
…#23790) **Description:** This continues the work done in the now closed [PR](#20530). I have addressed issues raised in the original PR by:
- Adding logic to handle timestamp misalignments
- Adding a fix for an out-of-bounds bug

In addition, I have performed end-to-end testing in a local setup and confirmed that accumulated histogram time series are correct.
**Link to tracking Issue:** #4968 #9006 #19153
**Testing:** Added tests for timestamp misalignment and the out-of-bounds bug discovered in the previous PR. End-to-end testing to ensure histogram bucket counts exported to Prometheus are correct.
---------
Signed-off-by: Loc Mai <locmai0201@gmail.com>
Signed-off-by: xchen <xchen@axon.com>
Signed-off-by: stephenchen <x.chen1016@gmail.com>
Co-authored-by: Lev Popov <nabam@nabam.net>
Co-authored-by: Lev Popov <leo@nabam.net>
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
Co-authored-by: Loc Mai <locmai0201@gmail.com>
Co-authored-by: Alex Boten <aboten@lightstep.com>
I've achieved my goal of metric aggregation by chaining several processors in the following order: … Although the process might seem extensive, it effectively gets the job done. |
Any chance you could expand on your settings? Did you use delta or cumulative counters? Our use case is multiple short-lived processes, all writing the same metrics and attributes but with different counts. Our expectation is to have OTel treat these as cumulative and be able to export to Prometheus, but we've hit a bit of a block, as others have stated. |
Agreed - @yuri-rs would you mind posting your config? |
Sure @diranged @axaxs. I did collector aggregation for delta metrics. |
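For readers looking for a workaround with components that already exist in contrib, a rough sketch along these lines can approximate delta aggregation today. This is illustrative only and not necessarily the configuration referenced above; the metric name `job_runs_total`, the `status` label, and the endpoints are made-up examples. Note that the batch processor is deliberately placed before the aggregation steps so that data points from many small exports are merged before being summed.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

processors:
  # Merge many small exports into larger batches before aggregating.
  batch:
    timeout: 10s
    send_batch_size: 8192
  # Compact identical resource/attribute sets into shared groups.
  groupbyattrs: {}
  # Sum delta data points across dropped labels to cut cardinality.
  metricstransform:
    transforms:
      - include: job_runs_total          # hypothetical metric name
        match_type: strict
        action: update
        operations:
          - action: aggregate_labels
            label_set: [status]          # labels to keep; all others are aggregated away
            aggregation_type: sum

exporters:
  # The Prometheus exporter accumulates delta sums into cumulative series on scrape.
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, groupbyattrs, metricstransform]
      exporters: [prometheus]
```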
Hey @yuri-rs, thanks for sharing, that is great. What kind of scale have you tested this at, e.g. active series cardinality and rate of data, and what were the memory/CPU requirements like? |
@yuri-rs Ideally you should place your batch processors last in your pipelines. Any reason for the current placement? |
@robincw-gr I'm running a similar config on a 100+ pod (2 CPU, 2 GB each) k8s deployment that provides an endpoint for all OTel telemetry. |
@azunna1 |
Thanks for the explanation @yuri-rs |
Metric Aggregation Processor Proposal
Objectives
The objective of the metric aggregation processor is to provide OpenTelemetry (OT) Collector users the ability to use exporters to send cumulative metrics to backend systems like Prometheus.
Background
Currently, the Prometheus exporter for the OT Collector is not functioning as expected. This is because the OTLP exporter in the SDK is a pass-through and the Collector currently doesn't have any aggregation functionality, so the Collector receives delta aggregations and instantaneous metric events and exports them directly without converting them to cumulative values (#1255). In the Go SDK, aggregation of delta values is performed by the Accumulator and the Integrator (SDK Processor), but there is no similar component in the Collector. Any cumulative exporter that may be implemented in the future will encounter the same problem (a proposed exporter: #1150). A processor that maintains the state of metrics (time series) and applies incoming deltas to cumulative values solves this problem.
Requirements
The processor should convert instantaneous and delta OTLP metrics to cumulative OTLP metrics. The following table proposes a mapping from instantaneous and delta metric kinds to cumulative metric kinds based on OTel-Proto PR open-telemetry/opentelemetry-collector#168. In the table, red entries are invalid type and kind combinations; white entries correspond to OTLP metrics that are already cumulative (or gauge) and should pass through the processor; and green entries correspond to OTLP metrics that the processor should aggregate.
*Grouping scalar indicates the most recent value to occur in a collection interval.
**Discussions around PR open-telemetry/opentelemetry-collector#168 are still ongoing. This table reflects the current status and will need to be updated when that status changes.
For green entries, the processor should maintain the cumulative value as state. Metric state should be identified by an OTLP metric descriptor and a label set. For each metric of a stateful kind, the processor should aggregate cumulative values for every data point by labels.
Design Idea
The metric aggregation processor should be similar to the SDK Controller; it should contain:
The metric aggregation processor should accept concurrent updates of the metric state by taking in OTLP metrics passed from the pipeline, and it should perform collection periodically to send cumulative metrics to exporters, acting as an SDK push Controller. The following diagram illustrates the workflow.
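To make the push-Controller analogy concrete, a hypothetical user-facing configuration might look like the sketch below; the processor name and the field shown are assumptions for illustration only, not part of any released component:

```yaml
processors:
  metricaggregation:            # hypothetical name for the proposed processor
    # How often accumulated cumulative state is collected and pushed to the
    # exporters, mirroring the SDK push Controller's collection interval.
    collection_interval: 60s
    # State identity: each cumulative value would be keyed by the OTLP metric
    # descriptor plus its label set, as described above.
```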
cc @alolita @danielbang907 @huyan0 @jmacd