Description
Is your feature request related to a problem? Please describe.
Feature Request #1061
Feature Request #6533
PR - #8816
Describe the solution you'd like
Problem Statement
The current OpenSearch stats APIs offer valuable insights into the inner workings of each node and the cluster as a whole. However, they lack certain details such as percentiles and do not provide the semantics of richer metric types like histograms. Consequently, identifying outliers becomes challenging. OpenSearch needs comprehensive metric support to effectively monitor the cluster. Recent issues and RFCs have attempted to address this in a piecemeal fashion, and we are currently engaged in a significant effort to instrument OpenSearch code paths. This presents an opportune moment to introduce comprehensive metrics support.
Tenets
- Minimal Overhead – Metrics should impose minimal overhead on system resources such as CPU and memory.
- No performance impact – Metrics should not adversely affect cluster operations in terms of performance.
- Extensible – The metrics framework should be easily extendable to accommodate background tasks, plugins, and extensions.
- Safety – The framework must ensure that instrumentation code does not result in memory leaks.
- Well-defined Abstractions – The metrics framework should offer clear abstractions so that changes in the underlying implementation or its API contracts do not disrupt the classes that emit metrics.
- Flexible – There should be a mechanism for dynamically enabling/disabling metrics through configurable settings.
- Do not reinvent – We should prefer existing out-of-the-box solutions over building something from scratch.
Metrics Framework
It is widely recognized that observability components like tracing and metrics introduce overhead. Therefore, designing the Metrics Framework for OpenSearch requires careful consideration. This framework will provide abstractions, governance, and utilities that make it easy for developers and users to emit metrics. Let's look at each of these aspects:
- Abstractions - While metrics frameworks like OpenTelemetry can be leveraged, we will abstract the solution behind the OpenSearch APIs. This future-proofs the core OpenSearch code where metrics are added: the underlying implementation can change without requiring changes to instrumented code. Hooks for metrics have already been included in the Telemetry abstraction, following the same pattern as the TracingTelemetry implementation.
- Governance - Similar to tracing, we should define mechanisms for enabling and disabling metrics at multiple levels.
- Code Pollution - To mitigate code pollution, we should provide utilities, similar to SpanBuilder for tracing, that abstract away repetitive boilerplate code.
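The governance point above can be illustrated with a small sketch. It assumes a single on/off flag that stands in for a dynamic setting; `SketchCounter`, `InMemoryCounter`, and `ToggleableCounter` are hypothetical names used only for illustration and are not part of the OpenSearch telemetry API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical minimal counter abstraction; not the actual OpenSearch interface.
interface SketchCounter {
    void add(long value);
}

// Simple in-memory counter used only as a backing instrument for this sketch.
final class InMemoryCounter implements SketchCounter {
    private final LongAdder total = new LongAdder();

    @Override
    public void add(long value) {
        total.add(value);
    }

    long value() {
        return total.sum();
    }
}

// Wraps any counter and silently drops updates while metrics are disabled.
final class ToggleableCounter implements SketchCounter {
    private final SketchCounter delegate;
    private final AtomicBoolean enabled;

    ToggleableCounter(SketchCounter delegate, AtomicBoolean enabled) {
        this.delegate = delegate;
        this.enabled = enabled;
    }

    @Override
    public void add(long value) {
        // One volatile read on the hot path; no work is done when disabled.
        if (enabled.get()) {
            delegate.add(value);
        }
    }
}

final class MetricsToggleDemo {
    public static void main(String[] args) {
        // Assumption: this flag would be flipped by a dynamic setting update.
        AtomicBoolean metricsEnabled = new AtomicBoolean(true);
        InMemoryCounter backing = new InMemoryCounter();
        SketchCounter counter = new ToggleableCounter(backing, metricsEnabled);

        counter.add(2);
        metricsEnabled.set(false);
        counter.add(5); // dropped: metrics are disabled
        metricsEnabled.set(true);
        counter.add(1);

        System.out.println(backing.value()); // prints 3
    }
}
```

Because the instrumented code only sees the `SketchCounter` interface, the same wrapper idea could be applied at several levels (per instrument, per component, or globally) without touching call sites.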
HLD
- Metric APIs - The Metrics API will facilitate the creation and updating of metrics. It should handle most of the heavy lifting and remove the need for boilerplate code.
import java.io.Closeable;

public interface Meter extends Closeable {
    /**
     * Creates a counter. The value of this counter can only increase (monotonic).
     * @param name name of the counter.
     * @param description any description about the metric.
     * @param unit unit of the metric.
     * @return counter instrument.
     */
    Counter createCounter(String name, String description, String unit);

    /**
     * Creates an up/down counter. The value of this counter may go up and down, so it should be
     * used where negative, positive and zero values are possible.
     * @param name name of the counter.
     * @param description any description about the metric.
     * @param unit unit of the metric.
     * @return up/down counter instrument.
     */
    Counter createUpDownCounter(String name, String description, String unit);

    /**
     * Creates a histogram instrument, which is needed when values have to be recorded against
     * buckets, samples, etc.
     * @param name name of the histogram.
     * @param description any description about the metric.
     * @param unit unit of the metric.
     * @return histogram instrument.
     */
    Histogram createHistogram(String name, String description, String unit);

    /**
     * Creates a gauge, which records arbitrary/absolute values such as CPU time or memory usage.
     * @param name name of the gauge.
     * @param description any description about the metric.
     * @param unit unit of the metric.
     * @return gauge instrument.
     */
    Gauge createGauge(String name, String description, String unit);
}
- Storage - Metrics data need not be emitted with each event or request. Instead, they should be stored or buffered (like async loggers) and aggregated over a configurable time frame in an in-memory store before periodic emission.
- Sink - Periodically emitted metrics should be written to a configurable sink. Users should have the flexibility to define their preferred sink.
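The Storage and Sink points above can be sketched roughly as follows. This is only an illustration under stated assumptions: `MetricSink` and `BufferedCounterStore` are hypothetical names, only counter-style metrics are handled, and the flush interval is passed in directly rather than read from a setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sink contract: receives one aggregated value per metric name.
interface MetricSink {
    void emit(String name, long aggregatedValue);
}

// Hypothetical in-memory store that aggregates counter updates between flushes.
final class BufferedCounterStore implements AutoCloseable {
    private final Map<String, AtomicLong> buffer = new ConcurrentHashMap<>();
    private final MetricSink sink;
    private final ScheduledExecutorService scheduler;

    BufferedCounterStore(MetricSink sink, long flushIntervalMillis) {
        this.sink = sink;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "metrics-flush");
            t.setDaemon(true); // never keep the JVM alive just for metrics
            return t;
        });
        scheduler.scheduleAtFixedRate(
            this::flush, flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    // Hot path: a single in-memory update, no I/O per event or request.
    void add(String name, long value) {
        buffer.computeIfAbsent(name, k -> new AtomicLong()).addAndGet(value);
    }

    // Emits the delta accumulated since the previous flush, then starts over from zero.
    synchronized void flush() {
        buffer.forEach((name, total) -> {
            long delta = total.getAndSet(0);
            if (delta != 0) {
                sink.emit(name, delta);
            }
        });
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
        flush(); // emit whatever is still buffered
    }
}

final class BufferedStoreDemo {
    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        // A long interval keeps the demo deterministic; flush() is called by hand.
        try (BufferedCounterStore store =
                 new BufferedCounterStore((n, v) -> emitted.add(n + "=" + v), 60_000)) {
            store.add("search.requests", 1);
            store.add("search.requests", 2);
            store.flush();
        }
        System.out.println(emitted); // prints [search.requests=3]
    }
}
```

Keeping the sink behind a one-method interface is one way to let users plug in their preferred destination; a real implementation would also need to decide how to handle sink failures and histogram-style aggregation.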
Implementation
OpenTelemetry offers decent support for metrics, and we can leverage the existing telemetry-otel plugin to provide the implementation for metrics as well.