[telemetry] emit metrics with _ instead of / #9775
Conversation
This is addressing an issue with the names of the metrics generated by the Collector for its internal metrics. Note that this change only impacts users that emit telemetry using OTLP, which is currently still experimental. The Prometheus metrics already replaced `/` with `_`, and they will do the same with `.`. Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
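For context, here is a minimal sketch (hypothetical names, not the Collector's actual code) of how a single separator constant determines the metric names the Collector emits over OTLP:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constants for illustration only; the real ones live in the
// Collector's obsreport package and may be named differently.
const (
	oldSep = "/" // what the Collector emitted over OTLP before this PR
	newSep = "_" // the separator this PR adopts
)

// metricName joins name components with the given separator.
func metricName(sep string, parts ...string) string {
	return strings.Join(parts, sep)
}

func main() {
	fmt.Println(metricName(oldSep, "processor", "batch", "metadata_cardinality"))
	// processor/batch/metadata_cardinality
	fmt.Println(metricName(newSep, "processor", "batch", "metadata_cardinality"))
	// processor_batch_metadata_cardinality
}
```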
Codecov Report
Additional details and impacted files:
@@           Coverage Diff           @@
##             main    #9775    +/-  ##
=======================================
  Coverage   91.32%   91.32%
=======================================
  Files         356      356
  Lines       19163    19168     +5
=======================================
+ Hits        17500    17505     +5
  Misses       1335     1335
  Partials      328      328
View full report in Codecov by Sentry.
NameSep = "/" | ||
Scope = "go.opentelemetry.io/collector/obsreport" | ||
NameSep = "/" | ||
MetricSep = "." |
On the SIG call, I suggested using `_` instead of `.` until we stabilize the metrics. The reason is that, AFAIK, Prometheus is working on supporting dots. Once that's ready, it'll be another breaking change for all the users using Prometheus to scrape the internal metrics: suddenly all the metrics will be converted from `otelcol_receiver_accepted_metric_points` to `otelcol.receiver.accepted.metric_points`. If we adopt `_` for now, I believe we can avoid that.
I understand that `_` is not semantically correct from OTel's perspective, but we will have to make a big migration later anyway, so it might make sense to reduce the number of breaking changes.
I'm happy to hear other opinions. And I'm not sure about the timeline for dot support in Prometheus; maybe it's much further away than new stable Collector metrics.
cc @open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers
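For reference, a rough sketch of the conversion described above; this is an approximation of typical OTel-to-Prometheus name sanitization, not the Collector's exact exporter code, and the real implementation has more rules:

```go
package main

import (
	"fmt"
	"strings"
)

// prometheusName approximates how a Prometheus exporter sanitizes an OTel
// metric name today: dots (and slashes) become underscores.
func prometheusName(otelName string) string {
	return strings.NewReplacer(".", "_", "/", "_").Replace(otelName)
}

func main() {
	fmt.Println(prometheusName("otelcol.receiver.accepted.metric_points"))
	// otelcol_receiver_accepted_metric_points
}
```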
Otherwise, once dots are supported in Prometheus, we might need a config option to convert `.` -> `_` somewhere to preserve the behavior and avoid breaking users. It could even be a Prometheus receiver config option; in that case, we should be good.
@dashpole, maybe you have some ideas about it?
Note that use of dots vs underscores will be negotiated using headers, so users will be able to control it server-side: https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#query-time. If it is enabled by default at some point in the future, it might still be a surprise for users.
@dmitryax the reason I went with `.` instead of `_` is that other metrics (for example, the gRPC instrumentation) already use `.` when emitting OTLP. Having both separators, depending on the source of the metric, seemed confusing.
> Note that use of dots vs underscores will be negotiated using headers, so users will be able to control it server-side: https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#query-time. If it is enabled by default at some point in the future, it might still be a surprise for users.

Thanks for the clarification. It looks like, even if it is enabled by default, it won't happen any time soon. I assume the controlling headers will be exposed as a configuration in the Go OTel exporter as well as somewhere in the `service::telemetry::metrics` section of the Collector config.
> @dmitryax the reason I went with `.` instead of `_` is that other metrics (for example, the gRPC instrumentation) already use `.` when emitting OTLP. Having both separators, depending on the source of the metric, seemed confusing.

Makes sense.
I'm worried that the emitted internal metrics are not covered by tests. It'll be great to see some metrics output changed in tests along with this change. @codeboten, do you know if we have a GH issue to introduce such coverage, by chance?
@dmitryax not that I know of. I can spend some time trying to add tests here; I agree that it's not ideal not to have this tested with the specific metric names (hard-coded in the tests). I can also open a follow-up issue to tackle it separately 🤷🏻
That would be great! Thank you.
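For what it's worth, a hedged sketch of what such a test could look like, using the OTel Go SDK's manual reader to assert a hard-coded metric name. The instrument name and scope below are illustrative, and the Collector's real test helpers may differ:

```go
package telemetrytest

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

// TestInternalMetricName collects metrics with a manual reader and asserts
// the exported metric name against a hard-coded expectation.
func TestInternalMetricName(t *testing.T) {
	reader := metric.NewManualReader()
	mp := metric.NewMeterProvider(metric.WithReader(reader))
	meter := mp.Meter("go.opentelemetry.io/collector/obsreport")

	// Hypothetical instrument using the separator under discussion.
	counter, err := meter.Int64Counter("receiver_accepted_metric_points")
	if err != nil {
		t.Fatal(err)
	}
	counter.Add(context.Background(), 1)

	var rm metricdata.ResourceMetrics
	if err := reader.Collect(context.Background(), &rm); err != nil {
		t.Fatal(err)
	}
	if len(rm.ScopeMetrics) == 0 || len(rm.ScopeMetrics[0].Metrics) == 0 {
		t.Fatal("no metrics collected")
	}
	if got := rm.ScopeMetrics[0].Metrics[0].Name; got != "receiver_accepted_metric_points" {
		t.Fatalf("unexpected metric name: %s", got)
	}
}
```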
Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
NameSep = "/" | ||
Scope = "go.opentelemetry.io/collector/obsreport" | ||
SpanNameSep = "/" | ||
MetricNameSep = "." |
This won't be a breaking change for end users who scrape the metrics with Prometheus, right? But it will be a breaking change for anyone using the OTel Go SDK to emit metrics, right? I am worried about breaking end-user telemetry in a single release; should we add a feature gate?
Isn't OTel Go SDK support under a feature gate already?
Ya, but I'm always worried about changing telemetry names because of the impact on alerting. I believe our policy for alpha/beta feature gates is that we can change the feature without it counting as a breaking change. If we chose to do that, we could make the release notes super clear and discoverable about what we broke.
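For reference, a hedged sketch of how a gate like this is registered with the Collector's featuregate package; the gate ID, stage, and description below are illustrative rather than copied from the Collector's source:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/featuregate"
)

// Illustrative registration only; the actual gate ID, stage, and description
// in the Collector may differ.
var useOtelForInternalTelemetryGate = featuregate.GlobalRegistry().MustRegister(
	"telemetry.useOtelWithSDKConfigurationForInternalTelemetry",
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("Uses the OTel SDK (OTLP) for the Collector's internal telemetry"),
)

func main() {
	// Code that changes emitted metric names could branch on the gate.
	fmt.Println("gate enabled:", useOtelForInternalTelemetryGate.IsEnabled())
}
```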
+1 to Tyler's concern here. By making the separator here a `.` instead of a `_`, we would then create three possibilities for users today:
otelcol.process_uptime  # the metric for process uptime after this PR merges, for users emitting OTLP
otelcol_process_uptime  # the metric for process uptime after this PR merges, for users scraping via Prometheus
process_uptime          # the metric for process uptime prior to this PR merging, for users emitting OTLP
Rather than creating a third potential name for this metric, if the separator were `_` we would simply have this state:
otelcol_process_uptime  # after this change, no matter whether a user is sending OTLP or Prometheus
process_uptime          # the metric for process uptime prior to this PR merging, for users emitting OTLP
Creating a third state disincentivizes users from sending OTLP (because it would break their dashboards/alerts) or upgrading their collector further for the same reason.
We already do not follow the conventions mentioned here; see the naming for time and utilization. Prometheus currently rules the world of metric naming, independent of the recommendations made by our own semantic conventions today. Given that, I think for legacy conversions we should always opt for not breaking users, by following the existing convention. We should reconsider this once semantic conventions exist not only for metric naming recommendations (when to use a `.` or a `_`) but also for pipeline monitoring.
Right, the problem I see with using `_` is that, today, changing a collector from emitting metrics via Prometheus to OTLP already has a problem with metrics we don't control.
As an example, the metric for gRPC instrumentation changes from
otelcol_rpc_server_requests_per_rpc
to:
rpc.server.requests_per_rpc
This is because the Prometheus exporter replaces `.` with `_` (as it does with `/` for other collector metrics). Replacing the separator with `_` instead of `.` means users will still have to contend with broken metrics; it's just not clear to them which metrics will be broken, as they may not know what constitutes an instrumentation metric vs. a collector metric. I opted to go with `.` as it at least aligns with the OpenTelemetry semantic conventions.
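One mitigation mentioned later in this thread is SDK views. As a hedged illustration (not something this PR does), a view in the OTel Go SDK could rename a dot-separated instrumentation metric back to an underscore-separated name; the metric names below are just the example from this comment:

```go
package main

import (
	"go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider builds a MeterProvider with a view that renames the
// dot-separated instrumentation metric at aggregation time.
func newMeterProvider() *metric.MeterProvider {
	renameView := metric.NewView(
		metric.Instrument{Name: "rpc.server.requests_per_rpc"},
		metric.Stream{Name: "rpc_server_requests_per_rpc"},
	)
	return metric.NewMeterProvider(metric.WithView(renameView))
}

func main() {
	_ = newMeterProvider()
}
```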
> Creating a third state disincentivizes users from sending OTLP (because it would break their dashboards/alerts) or upgrading their collector further for the same reason.

It is true that breaking the name is going to disincentivize users, but I don't think we want to hold ourselves to Prometheus naming standards when OTel has its own naming standard.
I can see an argument for waiting until more semantic conventions are stabilized before doing the breaking change, to require only one breaking change for our end users.
I am OK keeping this change behind the `useOtelWithSDKConfigurationForInternalTelemetry` gate since it is still alpha.
Yeah, my other concern is that by changing the convention for the collector's own metrics, we actually invalidate any documentation for stable/beta components that mentions existing metrics. These documents mention the important metrics for various components, none of them noting that these are "prometheus convention" names, but rather that they are the important metrics for monitoring your collector. By creating a dual state, all of these documents would need to be edited as part of this change and then edited again once guidance comes from the OTEP.
Similarly, I would argue that by not keeping the metrics consistent we are introducing a bug in the collector: we would not be able to make the `useOtelWithSDKConfigurationForInternalTelemetry` flag beta, because then anyone emitting metrics via OTLP would see incorrect metric names for components like the batch processor.
@codeboten is there a world where we handle the switch to the OTel SDK in 2 steps:
- Get the `useOtelWithSDKConfigurationForInternalTelemetry` gate to `Stable` without breaking the emitted metric names.
  - This gets us off the Prometheus receiver/OpenCensus dependency and aligned with the project the Collector is a part of.
- Redo the metric names to meet OTel stable metric conventions.
So I see two problems, and maybe it's easier to reason about a decision to move forward if we think about them separately.
1. Metrics controlled by the collector
These are the metrics that come from instruments instantiated in the collector itself. An example of such a metric is `processor/batch/metadata_cardinality`. In the Prometheus export, this metric is generated with the `otelcol_` prefix and all the `/` are replaced with `_`, resulting in a metric named `otelcol_processor_batch_metadata_cardinality`.
These metrics are currently being emitted one way via the Prometheus export, and another via the OTLP export:
otelcol_processor_batch_metadata_cardinality # prometheus metric
processor/batch/metadata_cardinality # otlp metric
After this change, it would be emitted as follows (which aligns more closely with OTel conventions):
processor.batch.metadata_cardinality
The alternative would be to emit this as
processor_batch_metadata_cardinality
Ignoring the missing prefix, as that's addressed in a separate change, this would remain consistent with the Prometheus naming that exists today.
2. Metrics generated by instrumentation
These are metrics we don't control, as they're the result of instrumentation libraries over which we have limited control (outside of using views). These metrics currently look like this:
otelcol_rpc_server_requests_per_rpc # prometheus metric
rpc.server.requests_per_rpc # otlp metric
After this change, nothing changes in these metrics themselves, as there is no configuration for the separator or prefix in the OTLP exporter.
My proposal was to align both 1 & 2 more closely, to prevent users from having some of their metrics continue to work but not all of them. I thought that putting this behind an experimental flag would allow users to make a change across all the metrics generated by the collector at the same time, reducing the chance that only some of their metrics keep working.
That being said, I can also see the benefit of having the metrics under the collector's control not change multiple times. But as an end user, it may not be clear why only instrumentation-library metrics changed and not all of them.
I'm happy to discuss this at tomorrow's SIG call if it makes sense @jaronoff97 @TylerHelmuth
Ya, let's discuss some more at the SIG.
Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
As discussed at the 3-Apr-2024 SIG call, we decided on `_` for the metrics we control in the Collector. I've updated this PR to reflect that.
@jaronoff97 @TylerHelmuth PTAL
It hurts, but at least it isn't a breaking change for the collector's own metrics.
This change removes the … I believe we also agreed to have #9767 resolved first, to not export the rpc/http metrics by default. I think if we stop emitting them by default in the same release, it'll be a better experience for users (enable -> adopt the name change) than (adopt the name change -> enable once they are disabled in the future). The blast radius of the breaking change is higher in the former case.
This change does not; that happens in #9759.
This is addressing an issue with the names of the metrics generated by the Collector for its internal metrics. Note that this change only impacts users that emit telemetry using OTLP, which is currently still experimental. The Prometheus metrics already replaced `/` with `_`.
Fixes #9774