[RFC] Add metrics and tracing framework in OpenSearch #1061
Description
We have been looking into instrumenting the OpenSearch code recently. Stats provide a good mechanism, but they lose details such as percentiles, which makes it much harder to debug issues in production. Wouldn't it be great to add a metrics framework to OpenSearch that lets a developer add metrics easily anywhere in the code, without having to know which stats object the metric belongs to?
The framework can, for example, be integrated with RestActions and emit timing and error metrics per operation by default. Similarly, we could pass this metrics object down the executor chain via ThreadContext and correlate the timing metrics for a request in a single block. Metrics can have different levels, allowing us to skip or enable metric calculation on the fly, similar to the log levels we have in the logger.
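To make the idea concrete, here is a minimal sketch of what such a per-request metrics block could look like. `RequestMetrics` and `MetricLevel` are hypothetical names invented for this illustration, not existing OpenSearch classes, and the wiring into RestActions and ThreadContext is deliberately left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical verbosity levels, analogous to logger levels (assumption, not an OpenSearch type). */
enum MetricLevel { INFO, DEBUG, TRACE }

/** Hypothetical per-request metrics block that collects all timings for one request. */
final class RequestMetrics {
    private final String requestId;
    private final MetricLevel enabledLevel;
    private final Map<String, Long> timingsNanos = new LinkedHashMap<>();

    RequestMetrics(String requestId, MetricLevel enabledLevel) {
        this.requestId = requestId;
        this.enabledLevel = enabledLevel;
    }

    String requestId() {
        return requestId;
    }

    /** Records a timing only when its level is enabled; repeated names are summed. */
    synchronized void recordTiming(String name, long nanos, MetricLevel level) {
        if (level.ordinal() <= enabledLevel.ordinal()) {
            timingsNanos.merge(name, nanos, Long::sum);
        }
    }

    /** Returns a read-only snapshot of the timings recorded so far. */
    synchronized Map<String, Long> timings() {
        return Collections.unmodifiableMap(new LinkedHashMap<>(timingsNanos));
    }
}
```

With the enabled level set to `INFO`, a call such as `recordTiming("bulk.detail", 10L, MetricLevel.DEBUG)` is skipped, which is the "skip or enable on the fly" behaviour described above.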
Imagine that you added a new action within the bulk API chain and suddenly some requests start taking longer. One way to investigate this is to add a stat for the new operation within bulk stats. But because stats are always averaged or use precomputed percentiles, it is really tricky to confirm whether the new operation is the culprit. If a single metrics block let us correlate these metrics per request, it would be much simpler to determine causation.
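As an illustration of why averaged stats hide this kind of regression, the sketch below computes a mean and a nearest-rank percentile from raw latency samples. `LatencySummary` is a hypothetical helper written for this example, and the sample values in the note that follows are made up.

```java
import java.util.Arrays;

/** Hypothetical helper that summarises raw latency samples (illustration only). */
final class LatencySummary {
    private LatencySummary() {}

    /** Arithmetic mean of the samples. */
    static double mean(long[] samples) {
        long sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return (double) sum / samples.length;
    }

    /** Nearest-rank percentile, with p in the range (0, 100]. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For 100 samples where 95 requests take 10 ms and 5 take 500 ms, the mean is 34.5 ms and the median is 10 ms, while the 99th percentile is 500 ms. Only the per-request samples reveal that a small group of requests is slow.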
With metric generation covered, publishing can be implemented in a pluggable fashion for different sinks. We can provide a default implementation that writes a metric log file format, and other sinks can be plugged in via different metrics plugins.
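A rough sketch of the pluggable publishing side could look like the following. `MetricsSink`, `LogLineSink`, and `MetricsPublisher` are hypothetical names, and the `key=value` line format is only an assumed example of a metric log format; a real default sink would write to a file instead of keeping lines in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical extension point that a metrics plugin would implement. */
interface MetricsSink {
    void publish(String requestId, Map<String, Long> timingsNanos);
}

/** Hypothetical default sink: formats one log line per request (kept in memory for the sketch). */
final class LogLineSink implements MetricsSink {
    private final List<String> lines = new ArrayList<>();

    @Override
    public void publish(String requestId, Map<String, Long> timingsNanos) {
        StringBuilder sb = new StringBuilder("request=").append(requestId);
        // Sort by metric name so the line layout is deterministic.
        for (Map.Entry<String, Long> e : new TreeMap<>(timingsNanos).entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        lines.add(sb.toString());
    }

    List<String> lines() {
        return lines;
    }
}

/** Fans each request's metrics out to every registered sink. */
final class MetricsPublisher {
    private final List<MetricsSink> sinks;

    MetricsPublisher(List<MetricsSink> sinks) {
        this.sinks = new ArrayList<>(sinks);
    }

    void publish(String requestId, Map<String, Long> timingsNanos) {
        for (MetricsSink sink : sinks) {
            sink.publish(requestId, timingsNanos);
        }
    }
}
```

Keeping the sink interface this small means the code that generates metrics never needs to know where they end up.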