
[telemetry] emit metrics with _ instead of / #9775

Merged (6 commits) Apr 3, 2024

Conversation

@codeboten (Contributor) commented Mar 15, 2024


Fixes #9774

This addresses an issue with the names of the metrics generated by the Collector for its internal telemetry. Note that this change only impacts users who emit telemetry using OTLP, which currently has experimental support. The Prometheus metrics already replace `/` with `_`, and they will do the same with `.`.

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
@codeboten codeboten requested review from a team and mx-psi March 15, 2024 17:16

codecov bot commented Mar 15, 2024

Codecov Report

Attention: Patch coverage is 93.10345%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.32%. Comparing base (295251b) to head (483643f).
Report is 3 commits behind head on main.

Files Patch % Lines
processor/processorhelper/obsreport.go 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9775   +/-   ##
=======================================
  Coverage   91.32%   91.32%           
=======================================
  Files         356      356           
  Lines       19163    19168    +5     
=======================================
+ Hits        17500    17505    +5     
  Misses       1335     1335           
  Partials      328      328           


Reviewed diff (obsreport constants; a dedicated `MetricSep` is introduced alongside the existing `NameSep`):

```go
Scope     = "go.opentelemetry.io/collector/obsreport"
NameSep   = "/"
MetricSep = "."
```
@dmitryax (Member) commented Mar 15, 2024

On the SIG call, I suggested using _ instead of . until we stabilize the metrics. The reason is that AFAIK Prometheus is working on supporting dots. So once it's ready, it'll be another breaking change for all the users using Prometheus to scrape the internal metrics. Suddenly all the metrics will be converted from otelcol_receiver_accepted_metric_points to otelcol.receiver.accepted.metric_points. If we adopt _ for now, I believe we can avoid that.

I understand that _ is not semantically correct from OTel's perspective, but we will have to make a big migration later anyway, so it might make sense to reduce the amount of breaking changes.

I'm happy to hear other opinions. And I'm not sure about the timeline for dots support in Prometheus. Maybe it's much further away than new stable Collector metrics.

cc @open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers
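A rough illustration of the Prometheus-side conversion being discussed, assuming the traditional Prometheus metric-name character set `[a-zA-Z0-9_:]` (a sketch, not the exporter's actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChars matches any character outside the traditional Prometheus
// metric-name character set.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_:]`)

// toPrometheusName roughly mimics what Prometheus exporters do today:
// invalid characters such as "." and "/" are replaced with "_".
func toPrometheusName(name string) string {
	return invalidChars.ReplaceAllString(name, "_")
}

func main() {
	// Either separator collapses to the same Prometheus name today, which is
	// why dot support in Prometheus would later surface as a rename for users.
	fmt.Println("otelcol_" + toPrometheusName("receiver/accepted/metric_points"))
	fmt.Println("otelcol_" + toPrometheusName("receiver.accepted.metric_points"))
}
```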

Member

Otherwise, once the dots are supported in Prometheus, we might need a config option to convert .->_ somewhere to preserve the behavior and avoid breaking the users. It can be even a Prometheus receiver config option. In that case, we should be good.

@dashpole, maybe you have some ideas about it?

Contributor

Note that use of dots vs underscores will be negotiated using headers, so users will be able to control it server-side: https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#query-time. If it is enabled by default at some point in the future, it might still be a surprise for users.

Contributor Author

@dmitryax the reason i went with . instead of _ is that other metrics (for example the grpc instrumentation) uses . already when emitting OTLP. Having both separators, depending on the source of the metric, seemed confusing.

Member

> Note that use of dots vs underscores will be negotiated using headers, so users will be able to control it server-side: https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#query-time. If it is enabled by default at some point in the future, it might still be a surprise for users.

Thanks for the clarification. It looks like, even if it is enabled by default, it won't happen any time soon. I assume the controlling headers will be exposed as configuration in the Go OTel exporter, as well as somewhere in the `service::telemetry::metrics` section of the Collector config.

Member

> @dmitryax the reason i went with `.` instead of `_` is that other metrics (for example the grpc instrumentation) use `.` already when emitting OTLP. Having both separators, depending on the source of the metric, seemed confusing.

Makes sense

@dmitryax (Member)

I'm worried that the emitted internal metrics are not covered by tests. It'll be great to see some metrics output changed in tests along with this change. @codeboten, do you know if we have a GH issue to introduce such coverage, by chance?

@codeboten (Contributor Author) commented Mar 15, 2024

@dmitryax not that I know of. I can spend some time trying to add tests here; I agree that it's not ideal not to have this tested with the specific metric names (hard-coded in the tests). I can also open a follow-up issue to tackle it separately 🤷🏻

@dmitryax (Member)

> I can also open a follow-up issue to tackle it separately 🤷🏻

That would be great! Thank you.

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Reviewed diff (obsreport constants; `NameSep` is split into separate span and metric separators):

```go
Scope         = "go.opentelemetry.io/collector/obsreport"
SpanNameSep   = "/"
MetricNameSep = "."
```
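A rough sketch of how the split separators would be applied (the constants mirror the reviewed diff at this point in the PR; the name composition below is a hypothetical illustration, not the real obsreport code):

```go
package main

import "fmt"

// At this point in the PR, spans kept "/" while metrics moved to ".".
const (
	SpanNameSep   = "/"
	MetricNameSep = "."
)

func main() {
	// A span name for a hypothetical receiver operation:
	fmt.Println("receiver" + SpanNameSep + "TraceDataReceived") // receiver/TraceDataReceived
	// A metric name for a collector-controlled metric:
	fmt.Println("processor" + MetricNameSep + "batch" + MetricNameSep + "metadata_cardinality") // processor.batch.metadata_cardinality
}
```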
Member

This won't be a breaking change for end users who scrape the metrics with Prometheus, right? But it will be a breaking change for anyone using the OTel Go SDK to emit metrics, right? I am worried about breaking end-user telemetry in a single release; should we add a feature gate?

Member

Isn't OTel Go SDK support under a feature gate already?

Member

Ya, but I'm always worried about changing telemetry names because of the impact on alerting. I believe our policy for alpha/beta feature gates is that we can change the feature without it counting as a breaking change. If we chose to do that, we could make the release notes super clear and discoverable about what we broke.

Contributor

+1 to Tyler's concern here. By making the separator here a `.` instead of a `_`, we would create three possibilities for users today:

```
otelcol.process_uptime  # process uptime after this PR merges, for users emitting OTLP
otelcol_process_uptime  # process uptime after this PR merges, for users scraping via Prometheus
process_uptime          # process uptime before this PR merges, for users emitting OTLP
```

Rather than creating a third potential name for this metric, if the separator were `_` we would simply have this state:

```
otelcol_process_uptime  # after this change, whether a user is sending OTLP or scraping via Prometheus
process_uptime          # before this change, for users emitting OTLP
```

Creating a third state disincentivizes users from sending OTLP (because it would break their dashboards/alerts) or upgrading their collector further for the same reason.

We already do not follow the conventions mentioned here, see naming for time and utilization. Prometheus currently rules the metric world for metric naming, independently of the recommendations made by our own semantic convention today. Given that, I think for legacy conversions we should always be opting for not breaking users by following the existing convention. We should reconsider this once semantic convention exists for not only metric naming recommendations (when to use a . or a _), but also for pipeline monitoring.

Contributor Author

Right, the problem I see with using `_` is that today, switching a collector from emitting metrics via Prometheus to OTLP already has a problem with metrics we don't control.

As an example, the metric for grpc instrumentation changes from

```
otelcol_rpc_server_requests_per_rpc
```

to:

```
rpc.server.requests_per_rpc
```

This is because the Prometheus export replaces `.` with `_` (as it does for the `/` in other collector metrics). Replacing the separator with `_` instead of `.` means users will still have to contend with broken metrics; it's just not clear to them which metrics will break, as they may not know what constitutes an instrumentation metric vs. a collector metric. I opted to go with `.` as it at least aligns with the OpenTelemetry semantic conventions.

Member

> Creating a third state disincentivizes users from sending OTLP (because it would break their dashboards/alerts) or upgrading their collector further for the same reason.

It is true that breaking the name is going to disincentivize users, but I don't think we want to hold ourselves to Prometheus naming standards when OTel has its own naming standard.

I can see an argument for waiting until more semantic conventions are stabilized before doing the breaking change, to require only one breaking change for our end users.

I am ok keeping this change behind `useOtelWithSDKConfigurationForInternalTelemetry` since it is still alpha.

Contributor

Yeah, my other concern is that by changing the convention for the Collector's own metrics, we actually invalidate any documentation for stable/beta components that mentions existing metrics. These documents list the important metrics for various components, none of them noting that these are "Prometheus convention" names, but rather that they are the important metrics for monitoring your Collector. By creating a dual state, all of these documents would need to be edited as part of this change and then edited again upon guidance from the OTEP.

Similarly, I would argue that by not keeping the metrics consistent we are introducing a bug to the Collector: we would not be able to make the `useOtelWithSDKConfigurationForInternalTelemetry` flag beta, because then anyone emitting metrics via OTLP would see incorrect metric names for components like the batch processor.

Member

@codeboten is there a world where we handle the switch to the OTel SDK in 2 steps:

  1. Get `useOtelWithSDKConfigurationForInternalTelemetry` to stable without breaking the emitted metric names:
    • this gets us off the prometheus receiver/OpenCensus dependency and aligned with the project the Collector is a part of.
  2. Redo the metric names to meet OTel stable metric conventions.

Contributor Author

So I see two problems, and maybe it's easier to reason toward a decision on moving forward if we think about them separately.

1. Metrics controlled by the collector

These are the metrics that come from instruments instantiated in the Collector itself. An example of such a metric is `processor/batch/metadata_cardinality`. In the Prometheus export, this metric is generated with the `otelcol_` prefix and all the `/` replaced with `_`, resulting in a metric named `otelcol_processor_batch_metadata_cardinality`.

These metrics are currently emitted one way via the Prometheus export, and another via the OTLP export:

```
otelcol_processor_batch_metadata_cardinality  # prometheus metric
processor/batch/metadata_cardinality          # otlp metric
```

After this change, it would be emitted as follows (which aligns more closely with OTel conventions):

```
processor.batch.metadata_cardinality
```

The alternative would be to emit this as

```
processor_batch_metadata_cardinality
```

Ignoring the missing prefix, as that's handled in a separate change, this would remain consistent with the Prometheus naming that exists today.

2. Metrics generated by instrumentation

These are metrics we don't control, as they're the result of instrumentation libraries over which we have limited control (outside of using views). These metrics currently look like this:

```
otelcol_rpc_server_requests_per_rpc  # prometheus metric
rpc.server.requests_per_rpc          # otlp metric
```

After this change, nothing changes in these metrics themselves, as there is no configuration for the separator or prefix in the OTLP exporter.

My proposal was to align both 1 & 2 more closely, to prevent a situation where some of a user's metrics continue to work but not all of them. I thought that putting this behind an experimental flag would let users make a change across all the metrics generated by the Collector at the same time, reducing the chance that only some of their metrics work.

That being said, I can also see the benefit of the metrics under the Collector's control not changing multiple times. But as an end user, it may not be clear why only instrumentation-library metrics changed and not all of them.

I'm happy to discuss this at tomorrow's SIG call if it makes sense @jaronoff97 @TylerHelmuth

Member

Ya let's discuss some more at the SIG

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
@codeboten codeboten changed the title [telemetry] emit metrics with . instead of / [telemetry] emit metrics with _ instead of / Apr 3, 2024
@codeboten (Contributor Author) left a comment

As discussed at the 3-Apr-2024 SIG call, we decided on `_` for the metrics we control in the Collector. Updated this PR to reflect this.

@jaronoff97 @TylerHelmuth PTAL

@TylerHelmuth (Member) left a comment

It hurts, but at least it isn't a breaking change for the collector's own metrics.

@codeboten codeboten merged commit bc87939 into open-telemetry:main Apr 3, 2024
34 of 49 checks passed
@codeboten codeboten deleted the codeboten/use-period branch April 3, 2024 21:42
@github-actions github-actions bot added this to the next release milestone Apr 3, 2024
@dmitryax (Member) commented Apr 4, 2024

This change removes the `otelcol_` prefix from the http/grpc metrics, right?

I believe we also agreed to have #9767 resolved first, so the rpc/http metrics are not exported by default. I think if we stop emitting them by default in the same release, it'll be a better experience for users (enable, then adapt to the name change) than (adapt to the name change, then enable once they are disabled in the future). The blast radius of the breaking change is higher in the latter case.

@codeboten (Contributor Author)

> This change removes the `otelcol_` prefix from the http/grpc metrics, right?

This change does not. That happens in #9759

codeboten added a commit that referenced this pull request Jul 16, 2024
This ensures the consistency for folks emitting metrics w/ OTLP until
the OTEP to specify pipeline telemetry is completed.

Waiting on
#9775
before moving this forward

Fixes #9315

---------

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Successfully merging this pull request may close these issues.

internal collector metrics should use _ instead of / in names
6 participants