
Prometheus Exporter - Gauge counter metrics dropped with error 'failed to translate metric' #6425

Closed
rahuls2 opened this issue Nov 24, 2021 · 4 comments
Labels: bug (Something isn't working), comp:prometheus (Prometheus related issues)


rahuls2 commented Nov 24, 2021

Describe the bug
I'm using an OpenTelemetry Collector setup to receive metrics from an app. From the collector, these metrics are exported to Prometheus. The data type of these metrics is gauge. Looking through the logs, I see a series of errors for a seemingly random subset of the metrics showing up periodically (presumably during export attempts for those specific metrics). Here is an example of one such error from the otelcollector logs:

2021-11-23T03:05:25.121Z error prometheusexporter@v0.39.0/accumulator.go:96 failed to translate metric {"kind": "exporter", "name": "prometheus", "data_type": "\u0000", "metric_name": "elastic_successful_bulk_write_count"}

In the above example, 'elastic_successful_bulk_write_count' is actually of data type gauge. However, the error appears to be logged for metrics that don't have any value recorded.

Looks like the log is produced by the addMetric function in accumulator.go.

It is hard to say why these logs are showing up for metrics that are of gauge data type. I have not been able to trace the issue back to the app -> otelcollector pipeline; however, the error seems to be produced when the instruments do not have any values recorded.
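To illustrate the code path I am referring to, here is a rough sketch of a dispatch of that shape: a switch over the metric's data type that accumulates the types it knows and logs an error for everything else, including a metric whose data type was never set. This is only a simplified illustration with made-up names, not the exporter's actual code.

package main

import "go.uber.org/zap"

// Hypothetical stand-ins for the collector's metric data types; the zero
// value represents a metric with no recorded data.
type metricDataType int

const (
	metricDataTypeNone metricDataType = iota
	metricDataTypeGauge
	metricDataTypeSum
	metricDataTypeHistogram
	metricDataTypeSummary
)

// addMetric sketches the dispatch described above: known data types are
// accumulated for the Prometheus endpoint, everything else is logged.
func addMetric(name string, dt metricDataType, logger *zap.Logger) {
	switch dt {
	case metricDataTypeGauge, metricDataTypeSum,
		metricDataTypeHistogram, metricDataTypeSummary:
		// accumulate the metric's data points ...
	default:
		// A metric that arrives without any recorded value keeps the
		// zero-valued data type and falls through to this branch.
		logger.Error("failed to translate metric",
			zap.String("metric_name", name))
	}
}

func main() {
	addMetric("elastic_successful_bulk_write_count", metricDataTypeNone, zap.NewExample())
}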

Steps to reproduce
Use an application to send opentelemetry gauge type metrics to otelcollector once every 30 seconds. Export these metrics to Prometheus. For an exact replica, some of the instruments must not have any values recorded every once in a while.

Here is an example explaining how and when the error is generated:

  1. We see a value for elastic_bulk_write_time recorded, and this value is sent to Prometheus successfully. There is no "failed to translate metric" log for this metric.
Metric #12
Descriptor:
     -> Name: elastic_bulk_write_time
     -> Description: Total bulk write time
     -> Unit: ms
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> service_name: STRING(eventcollector)
     -> component_name: STRING(MulticastOperWriter)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-11-24 07:16:01.209206314 +0000 UTC
Value: 0.006462
  2. At the next timestamp, this metric does not have a recorded value.
Metric #12
Descriptor:
     -> Name: elastic_bulk_write_time
     -> Description: Total bulk write time
     -> Unit: ms
     -> DataType: None

Subsequently, the error shows up in the logs for this metric, as below.
2021-11-24T07:16:49.461Z error prometheusexporter@v0.39.0/accumulator.go:96 failed to translate metric {"kind": "exporter", "name": "prometheus", "data_type": "\u0000", "metric_name": "elastic_bulk_write_time"}
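As a side note on the strange data_type value in these logs: it appears to be the zero-valued (None) data type converted byte-for-byte into a string, which a JSON log encoder escapes as "\u0000". A tiny sketch of that rendering, assuming the data type is an integer enum whose zero value means "not set" (an assumption based on the output above, not the exporter's code):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Assumed: an integer enum whose zero value means "no data type set",
	// matching the "DataType: None" output above.
	const metricDataTypeNone = 0

	// Converting that zero value to a string yields the NUL character,
	// which JSON-based loggers escape as \u0000 -- as in the error log above.
	field, _ := json.Marshal(string(rune(metricDataTypeNone)))
	fmt.Println(string(field)) // prints "\u0000"
}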

What did you expect to see?
The error log should not show up for unrecorded metrics. I am looking for confirmation that this is not the expected behavior.

What did you see instead?
Error logs reported for gauge metrics that had no recorded value.

What version did you use?
v0.38.0

rahuls2 added the bug label Nov 24, 2021
jpkrohling added the comp:prometheus label Nov 25, 2021

gouthamve (Member) commented

Hi, I am new to the project and I am trying to reproduce this. I have a few questions:

> Use an application to send opentelemetry gauge type metrics to otelcollector once every 30 seconds.

Does this mean setting the collector period to 30secs? Like this:

	// OTLP gRPC client and exporter pointed at the collector.
	metricClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint(otelAgentAddr))
	metricExp, err := otlpmetric.New(ctx, metricClient)
	if err != nil {
		log.Fatalf("failed to create the metric exporter: %v", err)
	}

	// Push controller that collects and exports every 30 seconds.
	pusher := controller.New(
		processor.NewFactory(
			simple.NewWithExactDistribution(),
			metricExp,
		),
		controller.WithExporter(metricExp),
		controller.WithCollectPeriod(30*time.Second),
	)

> For an exact replica, some of the instruments must not have any values recorded every once in a while.

What do you mean by not have any values recorded every once in a while? I am recording the value for the gauge every 120s, like this:

	// Instrument from the old (pre-1.0) OpenTelemetry Go metric API.
	newGauge := metric.Must(meter).
		NewInt64UpDownCounter(
			"repro_issue/elastic_stuff",
			metric.WithDescription("This is me desperately trying to reproduce the issue"),
		)

	// Record a value every 120 seconds, so most 30-second collection
	// cycles see no new measurement for this instrument.
	for {
		time.Sleep(120 * time.Second)
		meter.RecordBatch(
			ctx,
			commonLabels,
			newGauge.Measurement(10),
		)
	}

I am unable to reproduce this issue with this setup though.

rahuls2 (Author) commented Nov 26, 2021

Hey Goutham, thanks for checking.

I was able to identify the root cause of the log error, and it seems this is the expected behavior for the circumstance I described.

Answers to your questions:

"Does this mean setting the collector period to 30secs?": yes

"What do you mean by not have any values recorded every once in a while?": I should have been clearer about this in my description. Before exporting instruments, you must clear/delete the values recorded for the instrument, so that the instruments are still defined but no longer hold a value. Now, if in the next 30 seconds there are no calls to record for an instrument, and the code tries to export all instruments without any checks, then an attempt is made to export an instrument that has no data attached to it (its data type ends up set to None ("\u0000")).

The second answer above is the identified root cause of the log error. Adding a check in the exporter (so that export is only called on instruments that have seen a value) solved the problem, since the exporter no longer attempts to write such instruments to Prometheus. I believe this is by design and that the behavior (the log error I described) is expected. Feel free to reopen this issue if you believe it is a bug.
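For illustration, the check I am describing has roughly this shape (sketched here in Go for brevity; my actual change was made in the exporter of an adapted SDK, see below, so the names and types here are hypothetical):

package main

import "fmt"

// instrument is a stand-in for whatever the SDK tracks per instrument;
// hasValue is true only if a value was recorded since the last export.
type instrument struct {
	name     string
	hasValue bool
}

// export skips instruments with nothing recorded instead of sending a
// metric whose data type would end up as None ("\u0000").
func export(instruments []instrument) {
	for _, inst := range instruments {
		if !inst.hasValue {
			continue
		}
		fmt.Println("exporting", inst.name)
	}
}

func main() {
	export([]instrument{
		{name: "elastic_bulk_write_time", hasValue: true},
		{name: "elastic_successful_bulk_write_count", hasValue: false}, // skipped
	})
}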

rahuls2 closed this as completed Nov 26, 2021

gouthamve (Member) commented

> you must clear/delete the values recorded for the instrument

Which SDK are you using? I cannot see any method to do the same in the Golang SDK.

rahuls2 (Author) commented Nov 30, 2021

I used an adaptation of the Python SDK with an option to clear instruments after exporting to Prometheus.
