
Proposed updates to #184 #1

Conversation


@jmacd jmacd commented Oct 24, 2023

@carlosalberto

After reviewing the collector's equivalent metrics and auditing the code framework there, I came up with these recommendations in the form of a (large) change to yours.

This is also very speculative: I edited the generated YAML file to demonstrate the outcome I would like, and it will take a small improvement to the build tools to achieve the intended results.

Following the collector's example, I am proposing three levels of detail, which would be called "basic", "normal", and "detailed". In the current tooling we have "required" and "recommended", which I take as equivalent to "basic" and "normal"; to this I would add "detailed", so the tools need a small change. Please review the generated file as if I had generated it; if I understand correctly, the work then needs to be put back into otel.yaml.

leakage. Multi-level collector topologies should allow configuration of distinct domains (e.g., `agent` and `gateway`).

### Basic level of detail


What is the value of having this level? It saves a single binary attribute, but there are plenty of other attributes that are required (domain, name, signal, etc.), so the complexity of the spec doesn't seem warranted.

Author


Saving a boolean attribute means having half as many timeseries (i.e., one series per component instead of two). The information available in the attribute is almost redundant, so I think having a way to avoid that one additional timeseries matters.

When you have metrics on a pipeline, the information a success attribute provides (i.e., one additional series) can be inferred by comparing against the subsequent component's totals. This is admittedly a recursive definition: for the subsequent component to establish its own success/failure rate, it needs its own subsequent component's totals, so the final stage in a pipeline will likely not want to use basic-level metrics. If Total(X) is the sum of the single metric for a component X, the recursive rule for deriving Success/Failure of that component is:

Dropped(this) = Total(this) - Total(next)
Failed(this) = Dropped(this) + Failed(next)
Success(this) = Total(this) - Failed(this)
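
A minimal sketch of that recursion (not collector code; the stage names and numbers are made up), walking the pipeline from its last instrumented component backwards:

```go
// Sketch only: derive per-component Dropped/Failed/Success from one
// "total items" counter per stage, given the failure count observed at
// the final stage (which is why the final stage wants Normal-level detail).
package main

import "fmt"

type stage struct {
	name  string
	total int64 // value of the single basic-level metric for this component
}

// derive applies the recursive rule above, starting at the stage just
// before the last one and working toward the front of the pipeline.
func derive(stages []stage, finalFailed int64) {
	failedNext := finalFailed
	for i := len(stages) - 2; i >= 0; i-- {
		this, next := stages[i], stages[i+1]
		dropped := this.total - next.total // items lost between this stage and the next
		failed := dropped + failedNext     // plus everything that failed downstream
		success := this.total - failed
		fmt.Printf("%s: dropped=%d failed=%d success=%d\n", this.name, dropped, failed, success)
		failedNext = failed
	}
}

func main() {
	// Hypothetical totals: the SDK emitted 1000 items, the processor saw 1000,
	// the exporter saw 990, and the exporter reported 5 failures directly.
	pipeline := []stage{{"sdk", 1000}, {"processor", 1000}, {"exporter", 990}}
	derive(pipeline, 5)
}
```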

the exporter's `success=false` to determine the number of items
dropped by the processor, for example.

### Detailed metrics


Kinda similar comment to "basic": why have a separate level? What objective is worth complicating the spec this way?


@jmacd jmacd Oct 24, 2023


This is about letting users trade off cost against what they need or want to observe: more metrics may be useful, but they are just additional expense when they are not being used.

I mentioned a personal side-story that led me to this realization in today's Spec SIG: monitoring a water system is similar to monitoring a telemetry pipeline, and it is also a situation where each individual metric is a substantial expense. The minimum number of meters necessary to calculate total leakage in the system is 1 meter for (total) system production and 1 meter per user with a service connection. From total in and total out we can compute leakage, which is equivalent to the calculation for dropped items.
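
Spelled out with those two kinds of meters (notation mine):

Leakage = Production - Sum(consumption at each service connection)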

This leads to the conclusion that the minimum-cost configuration for a telemetry pipeline, capable of computing a global Dropped statistic, would use Basic-level detail in each SDK, disabled metrics for all intermediate collectors, and Normal-level detail for the final component of the final collector in the pipeline. If the user is in a situation where the metrics from the SDKs are not comparable with the metrics from subsequent stages in the pipeline for any reason, they should use Normal-level detail in the SDK.
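
In that configuration the global statistic follows the same arithmetic as above (notation mine):

Dropped(pipeline) = Sum(Total(sdk)) - Success(final)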

I'm also aware of tracing pipelines where rate limits are enforced at the destination. This is a scenario where, if the response code is `resource_exhausted`, I should turn up sampling; if it is `timeout`, I should complain to my backend team about an SLO violation; and if it is `queue_full`, I should reconfigure the SDK.
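
A sketch of what that enables operationally (the reason strings are the ones above; the function and its wiring are hypothetical, not an SDK or collector API):

```go
// Sketch: detailed-level failure reasons let automation (or a runbook)
// distinguish responses that a bare success/failure count cannot.
package main

import "log"

func reactToExportFailure(reason string) {
	switch reason {
	case "resource_exhausted":
		log.Println("destination rate limit: turn up sampling")
	case "timeout":
		log.Println("destination SLO violation: raise it with the backend team")
	case "queue_full":
		log.Println("local queue overflow: reconfigure the SDK")
	default:
		log.Println("unhandled failure reason:", reason)
	}
}

func main() {
	reactToExportFailure("resource_exhausted")
}
```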


But why shouldn't this be handled with just the pre-aggregation rules ("views"?) instead of making it a problem for the exporters / components to know about different levels?

Author


(This content will appear in a new location; I'm writing an OTEP.)
My assumption is that this would be implemented using views, and the text of a semantic convention would explain which views to configure at which level of detail.
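
For example, a sketch against the otel-go SDK's metric Views API (the instrument name `otelsdk.exporter.items` and the `success` attribute are hypothetical stand-ins for whatever the convention ends up defining): a basic-level configuration could be a view that strips the success attribute.

```go
// Sketch: configure "basic" detail by registering a view that removes the
// hypothetical success attribute, leaving one timeseries per component.
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func basicLevelProvider() *sdkmetric.MeterProvider {
	dropSuccess := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "otelsdk.exporter.items"}, // hypothetical instrument name
		sdkmetric.Stream{
			// Keep every attribute except the (hypothetical) success attribute.
			AttributeFilter: func(kv attribute.KeyValue) bool {
				return kv.Key != "success"
			},
		},
	)
	return sdkmetric.NewMeterProvider(sdkmetric.WithView(dropSuccess))
}

func main() {
	mp := basicLevelProvider()
	defer func() { _ = mp.Shutdown(context.Background()) }()
}
```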


@jmacd jmacd commented Oct 31, 2023

open-telemetry/oteps#238

@jmacd jmacd closed this Oct 31, 2023