
Prometheus Exporter - Gauge counter metrics dropped with error 'failed to translate metric' #6425

Closed
rahuls2 opened this issue Nov 24, 2021 · 4 comments
Labels: bug (Something isn't working), comp:prometheus (Prometheus related issues)


rahuls2 commented Nov 24, 2021

Describe the bug
I'm using an OpenTelemetry Collector setup to receive metrics from an app. From the collector, these metrics are exported to Prometheus. The data type of these metrics is gauge. Looking through the logs, I see a series of errors for a seemingly random subset of the metrics showing up periodically (presumably during export attempts for those specific metrics). Here is an example of one such error from the otelcollector logs:

2021-11-23T03:05:25.121Z error prometheusexporter@v0.39.0/accumulator.go:96 failed to translate metric {"kind": "exporter", "name": "prometheus", "data_type": "\u0000", "metric_name": "elastic_successful_bulk_write_count"}

In the above example, 'elastic_successful_bulk_write_count' is actually of data type gauge. However, the error appears to be logged for metrics that don't have any value recorded.

Looks like the log is produced by the addMetric function in accumulator.go.

It is hard to say why these logs are showing up for metrics that are of gauge data type. I have not been able to trace the issue back to the app -> otelcollector pipeline; however, the error seems to be produced when the instruments do not have any values recorded.
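To illustrate the code path I am referring to, here is a rough sketch of a dispatch of that shape: a switch over the metric's data type that accumulates the types it knows and logs an error for everything else, including a metric whose data type was never set. This is only a simplified illustration with made-up names, not the exporter's actual code.

package main

import "go.uber.org/zap"

// Hypothetical stand-ins for the collector's metric data types; the zero
// value represents a metric with no recorded data.
type metricDataType int

const (
	metricDataTypeNone metricDataType = iota
	metricDataTypeGauge
	metricDataTypeSum
	metricDataTypeHistogram
	metricDataTypeSummary
)

// addMetric sketches the dispatch described above: known data types are
// accumulated for the Prometheus endpoint, everything else is logged.
func addMetric(name string, dt metricDataType, logger *zap.Logger) {
	switch dt {
	case metricDataTypeGauge, metricDataTypeSum,
		metricDataTypeHistogram, metricDataTypeSummary:
		// accumulate the metric's data points ...
	default:
		// A metric that arrives without any recorded value keeps the
		// zero-valued data type and falls through to this branch.
		logger.Error("failed to translate metric",
			zap.String("metric_name", name))
	}
}

func main() {
	addMetric("elastic_successful_bulk_write_count", metricDataTypeNone, zap.NewExample())
}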

Steps to reproduce
Use an application to send opentelemetry gauge type metrics to otelcollector once every 30 seconds. Export these metrics to Prometheus. For an exact replica, some of the instruments must not have any values recorded every once in a while.

Here is an example explaining how and when the error is generated:

  1. We see a value for elastic_bulk_write_time recorded, and this value is sent to Prometheus successfully. There is no "failed to translate metric" log for this metric.
Metric #12
Descriptor:
     -> Name: elastic_bulk_write_time
     -> Description: Total bulk write time
     -> Unit: ms
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> service_name: STRING(eventcollector)
     -> component_name: STRING(MulticastOperWriter)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-11-24 07:16:01.209206314 +0000 UTC
Value: 0.006462
  2. At the next timestamp, this metric does not have a recorded value.
Metric #12
Descriptor:
     -> Name: elastic_bulk_write_time
     -> Description: Total bulk write time
     -> Unit: ms
     -> DataType: None

Subsequently, the error shows up in the logs for this metric, as below.
2021-11-24T07:16:49.461Z error prometheusexporter@v0.39.0/accumulator.go:96 failed to translate metric {"kind": "exporter", "name": "prometheus", "data_type": "\u0000", "metric_name": "elastic_bulk_write_time"}
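As a side note on the strange data_type value in these logs: it appears to be the zero-valued (None) data type converted byte-for-byte into a string, which a JSON log encoder escapes as "\u0000". A tiny sketch of that rendering, assuming the data type is an integer enum whose zero value means "not set" (an assumption based on the output above, not the exporter's code):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Assumed: an integer enum whose zero value means "no data type set",
	// matching the "DataType: None" output above.
	const metricDataTypeNone = 0

	// Converting that zero value to a string yields the NUL character,
	// which JSON-based loggers escape as \u0000 -- as in the error log above.
	field, _ := json.Marshal(string(rune(metricDataTypeNone)))
	fmt.Println(string(field)) // prints "\u0000"
}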

What did you expect to see?
The error log should not show up for unrecorded metrics. I am looking for confirmation that this is not the expected behavior.

What did you see instead?
Error logs reported for gauge metrics that had no recorded value.

What version did you use?
v0.38.0

rahuls2 added the bug label Nov 24, 2021
jpkrohling added the comp:prometheus label Nov 25, 2021

gouthamve (Member) commented

Hi, I am new to the project and I am trying to reproduce this. I have a few questions:

> Use an application to send opentelemetry gauge type metrics to otelcollector once every 30 seconds.

Does this mean setting the collector period to 30secs? Like this:

	// OTLP gRPC client and exporter pointed at the collector.
	metricClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint(otelAgentAddr))
	metricExp, err := otlpmetric.New(ctx, metricClient)
	if err != nil {
		log.Fatalf("failed to create the metric exporter: %v", err)
	}

	// Push controller that collects and exports every 30 seconds.
	pusher := controller.New(
		processor.NewFactory(
			simple.NewWithExactDistribution(),
			metricExp,
		),
		controller.WithExporter(metricExp),
		controller.WithCollectPeriod(30*time.Second),
	)

> For an exact replica, some of the instruments must not have any values recorded every once in a while.

What do you mean by not have any values recorded every once in a while? I am recording the value for the gauge every 120s, like this:

	// Instrument from the old (pre-1.0) OpenTelemetry Go metric API.
	newGauge := metric.Must(meter).
		NewInt64UpDownCounter(
			"repro_issue/elastic_stuff",
			metric.WithDescription("This is me desperately trying to reproduce the issue"),
		)

	// Record a value every 120 seconds, so most 30-second collection
	// cycles see no new measurement for this instrument.
	for {
		time.Sleep(120 * time.Second)
		meter.RecordBatch(
			ctx,
			commonLabels,
			newGauge.Measurement(10),
		)
	}

I am unable to reproduce this issue with this setup though.

rahuls2 (Author) commented Nov 26, 2021

Hey Goutham, thanks for checking.

I was able to identify the root cause of the log error, and it seems this is the expected behavior for the circumstance I described.

Answers to your questions:

"Does this mean setting the collector period to 30secs?": yes

"What do you mean by not have any values recorded every once in a while?": I should have been clearer about this in my description. Before exporting instruments, you must clear/delete the values recorded for the instrument, so that the instruments are still defined but no longer hold a value. Now, if in the next 30 seconds there are no calls to record for an instrument, and the code tries to export all instruments without any checks, then an attempt is made to export an instrument that has no data attached to it (its data type ends up set to None ("\u0000")).

The second answer above is the identified root cause of the log error. Adding a check in the exporter (so that export is only called on instruments that have seen a value) solved the problem, since the exporter no longer attempts to write such instruments to Prometheus. I believe this is by design and that the behavior (the log error I described) is expected. Feel free to reopen this issue if you believe it is a bug.
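For illustration, the check I am describing has roughly this shape (sketched here in Go for brevity; my actual change was made in the exporter of an adapted SDK, see below, so the names and types here are hypothetical):

package main

import "fmt"

// instrument is a stand-in for whatever the SDK tracks per instrument;
// hasValue is true only if a value was recorded since the last export.
type instrument struct {
	name     string
	hasValue bool
}

// export skips instruments with nothing recorded instead of sending a
// metric whose data type would end up as None ("\u0000").
func export(instruments []instrument) {
	for _, inst := range instruments {
		if !inst.hasValue {
			continue
		}
		fmt.Println("exporting", inst.name)
	}
}

func main() {
	export([]instrument{
		{name: "elastic_bulk_write_time", hasValue: true},
		{name: "elastic_successful_bulk_write_count", hasValue: false}, // skipped
	})
}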

rahuls2 closed this as completed Nov 26, 2021

gouthamve (Member) commented

> you must clear/delete the values recorded for the instrument

Which SDK are you using? I cannot see any method to do the same in the Golang SDK.

rahuls2 (Author) commented Nov 30, 2021

I used an adaptation of the Python SDK with an option to clear instruments after exporting to Prometheus.
