Is your feature request related to a problem? Please describe.
The recent changes in the kind of metrics exposed increase the level of visibility for tracing how much energy is consumed by which elements of a process/Pod/container. These metrics are valuable for application owners looking to optimize the energy consumption of their application or services stack. At the same time, the granularity of these metrics makes it extremely difficult for functionalities that require a quick understanding of the system (e.g. schedulers, multi-cluster-level dashboards, etc.) and considerably increases the load on the Prometheus stack.
Describe the solution you'd like
I would like to propose a flag for Kepler that can be used to control the type of metrics it exposes on the metrics endpoint. Basically, enable Kepler to serve general use cases (e.g. reporting the top-5 energy consumers by namespace, Pod, or app; reporting the software+OS-view energy consumption by node, etc.) in one mode, and offer another mode for developers who want to do low-level tracing of the energy consumption of their apps.
The mode for the general use cases should provide a single, simple metric per container or Pod that can be used for user-facing reporting and executive-level reports.
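As a rough illustration, here is a minimal sketch of the kind of "general use case" query that the aggregated mode should keep cheap (top-5 energy consumers by namespace). The metric and label names (`kepler_container_joules_total`, `container_namespace`) and the Prometheus URL are assumptions and may differ from the actual Kepler metrics:

```python
# Hypothetical sketch: top-5 energy consumers by namespace via the Prometheus HTTP API.
# Metric/label names are assumptions and may not match the current Kepler exposition.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus endpoint

query = (
    "topk(5, sum by (container_namespace) "
    "(rate(kepler_container_joules_total[5m])))"
)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=30
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    namespace = series["metric"].get("container_namespace", "<unknown>")
    _, watts = series["value"]  # instant vector: [timestamp, value]
    print(f"{namespace}: {float(watts):.2f} W")
```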
Describe alternatives you've considered
With the current granular metrics:
- We have seen a significant impact on Prometheus utilization (far too many metrics in a very short time, even on micro environments)
- It has become very challenging to use the metrics for the general use cases, as the query expression grows very long in no time and the system takes several seconds to process it
- For comparison, when loading the metrics into DataFrames to run analytics in a Jupyter notebook for an 8-node cluster with ~500 Pods, loading a 2-hour dataset at 1-minute resolution takes 2min 17s with the current granular metrics (as reported by `%timeit`). With the previous aggregated metrics, the same cluster setup with the same dataset and resolution loaded in less than 1 second.
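For reference, a minimal sketch of the kind of loading code behind that comparison, assuming a plain Prometheus range query flattened into a pandas DataFrame (the metric name and endpoint are assumptions; the actual notebook may differ):

```python
# Minimal sketch: load a 2h / 1-minute-resolution range query into a pandas
# DataFrame. Timed with %timeit in a Jupyter notebook. The metric name and
# Prometheus URL are assumptions for illustration only.
import time

import pandas as pd
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus endpoint


def load_range(query: str, hours: int = 2, step: str = "1m") -> pd.DataFrame:
    """Run a range query and flatten every returned series into one DataFrame."""
    end = time.time()
    start = end - hours * 3600
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
        timeout=120,
    )
    resp.raise_for_status()

    frames = []
    for series in resp.json()["data"]["result"]:
        df = pd.DataFrame(series["values"], columns=["timestamp", "value"])
        df["value"] = df["value"].astype(float)
        for label, val in series["metric"].items():
            df[label] = val
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# In a notebook: %timeit load_range("kepler_container_joules_total")
# With the granular metrics the number of series per query is much larger,
# which is what drives the load time up from <1s to over 2 minutes.
```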
Additional context
The general use cases should also cover #301.