Exporter time series errors do not include metric names #443

jsirianni · 2022-06-23T14:22:43Z

When working with the Google Exporter, it would be nice if time series errors returned the name of the metric(s) being rejected by the API. A system will sometimes have hundreds of metrics, with only a subset of them being rejected. This is very difficult to track down, because the user has to open Metrics Explorer and look at every single metric to find one that has spotty data.

Error

Jun 23 14:28:14 oiq-otelcollector-1 observiq-otel-collector[1673]: 2022-06-23T14:28:14.721Z error exporterhelper/queued_retry.go:149 Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: generic_node{location:global,node_id:oiq-otelcollector-1,namespace:oiq-otelcollector-1} timeSeries[0-26,28,30-138]: custom.googleapis.com/node_cpu_seconds_total{cpu0,appfluentbit2,modeidle,hostnamefluent-bit2}; Field timeSeries[27].points[0].interval.end_time had an invalid value of \"2022-05-23T03:12:21.352793-07:00\": Data points cannot be written more than 25h10s in the past.; Field timeSeries[29].points[0].interval.end_time had an invalid value of \"2022-06-06T02:45:37.375261-07:00\": Data points cannot be written more than 25h10s in the past.\nerror details: name = Unknown desc = total_point_count:139 success_point_count:117 errors:{status:{code:9} point_count:20} errors:{status:{code:3} point_count:2}"}

This error indicates a real problem, but does not include the name of the metrics being rejected.

jsirianni · 2022-06-23T14:46:23Z

We see this one frequently as well. The API reports it as duplicate time series, when really it is just multiple systems sending identical metrics without a uniquely identifying resource attribute such as host.name.

{"level":"error","ts":"2022-06-17T13:01:26.349-0400","caller":"exporterhelper/queued_retry.go:149","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","name":"googlecloud","error":"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[1] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[3] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[5] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.\nerror details: name = Unknown desc = total_point_count:6 success_point_count:3 errors:{status:{code:3} point_count:3}","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/collector@v0.52.0/exporter/exporterhelper/queued_retry.go:149\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/collector@v0.52.0/exporter/exporterhelper/metrics.go:132\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/collector@v0.52.0/exporter/exporterhelper/queued_retry_inmemory.go:119\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/collector@v0.52.0/exporter/exporterhelper/internal/bounded_memory_queue.go:82\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/collector@v0.52.0/exporter/exporterhelper/internal/bounded_memory_queue.go:69"}
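For anyone trying to find the colliding series before they reach the API, here is a minimal sketch of a pre-send check. It assumes that two series count as duplicates when their metric type and full label set (metric plus resource labels) match. The `series` type and the `identity` and `duplicates` helpers are hypothetical and are not part of the exporter or the Cloud Monitoring client.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// series is a hypothetical, simplified stand-in for one time series in a
// write request: a metric type plus its merged metric and resource labels.
type series struct {
	MetricType string
	Labels     map[string]string
}

// identity builds a canonical key from the metric type and the sorted labels,
// so two series with the same labels in a different order get the same key.
func identity(s series) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(s.MetricType)
	for _, k := range keys {
		fmt.Fprintf(&b, "|%q=%q", k, s.Labels[k])
	}
	return b.String()
}

// duplicates returns, for every identity that occurs more than once in the
// batch, the request indices that share it.
func duplicates(batch []series) map[string][]int {
	byID := map[string][]int{}
	for i, s := range batch {
		id := identity(s)
		byID[id] = append(byID[id], i)
	}
	for id, idx := range byID {
		if len(idx) < 2 {
			delete(byID, id)
		}
	}
	return byID
}

func main() {
	batch := []series{
		{MetricType: "custom.googleapis.com/cpu", Labels: map[string]string{"mode": "idle"}},
		{MetricType: "custom.googleapis.com/cpu", Labels: map[string]string{"mode": "idle"}},
		{MetricType: "custom.googleapis.com/cpu", Labels: map[string]string{"mode": "idle", "host.name": "b"}},
	}
	for id, idx := range duplicates(batch) {
		fmt.Println(id, "appears at request indices", idx)
	}
}
```

In this sample batch the first two series collide, and the third stays distinct only because it carries a host.name label.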

Dylan-M · 2022-06-28T20:37:38Z

Regarding that second one, and really all errors in general: it is extremely beneficial to know which metric is having the error, especially in environments with hundreds of metrics.

jsirianni · 2022-07-12T19:26:29Z

I am running into this again today. The issue is clearly in my configuration, but impossible to narrow down. The error I am getting exceeds 65k characters.

jsuereth · 2022-08-22T12:41:11Z

Unfortunately, this long error is an issue with the Cloud Monitoring Metrics API. The only way to solve it for this project would be to parse the error message and attempt to produce a better one. Instead we'll escalate this against the Cloud Monitoring API itself.
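As an illustration of what parsing the error message could look like, here is a rough sketch (not something this project implements). It pulls the `timeSeries[...]` index references out of a message shaped like the ones quoted above and maps them back to the metric types that were sent at those positions. The helper names and the `requestTypes` slice are hypothetical. The sketch assumes the caller still has the request's metric types in order, and it treats every index mentioned in the message as implicated, which may over-report for range forms such as `timeSeries[0-26,28,30-138]`.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// indexRef matches references such as timeSeries[27] or timeSeries[0-26,28].
var indexRef = regexp.MustCompile(`timeSeries\[([0-9,-]+)\]`)

// rejectedIndices returns the sorted, de-duplicated request indices named in
// msg. Indices at or beyond n (the number of series sent) are ignored.
func rejectedIndices(msg string, n int) []int {
	seen := map[int]bool{}
	for _, m := range indexRef.FindAllStringSubmatch(msg, -1) {
		for _, part := range strings.Split(m[1], ",") {
			lo, hi, isRange := strings.Cut(part, "-")
			start, err := strconv.Atoi(lo)
			if err != nil {
				continue // malformed piece, skip it
			}
			end := start
			if isRange {
				if end, err = strconv.Atoi(hi); err != nil {
					continue
				}
			}
			if end > n-1 {
				end = n - 1
			}
			for i := start; i <= end; i++ {
				seen[i] = true
			}
		}
	}
	out := make([]int, 0, len(seen))
	for i := range seen {
		out = append(out, i)
	}
	sort.Ints(out)
	return out
}

// rejectedMetricTypes maps the indices found in msg to the metric types that
// were sent at those positions, each listed once in index order.
func rejectedMetricTypes(msg string, requestTypes []string) []string {
	seen := map[string]bool{}
	var names []string
	for _, i := range rejectedIndices(msg, len(requestTypes)) {
		if t := requestTypes[i]; !seen[t] {
			seen[t] = true
			names = append(names, t)
		}
	}
	return names
}

func main() {
	msg := "Field timeSeries[1] had an invalid value: Duplicate TimeSeries encountered.; " +
		"Field timeSeries[3] had an invalid value: Duplicate TimeSeries encountered."
	// Hypothetical metric types, in the order they were sent.
	sent := []string{"custom/a", "custom/b", "custom/c", "custom/b"}
	fmt.Println(rejectedMetricTypes(msg, sent))
}
```

This only works while the API keeps this message format, which is why fixing the response in the API itself is the more robust route.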

Dylan-M · 2022-09-15T16:00:34Z

Any progress on this? I encountered it again yesterday.

dashpole · 2022-10-10T19:59:47Z

Sorry, still no updates. I'll check with the Cloud Monitoring API team again to see if they have any updates.

dashpole · 2023-02-06T20:36:14Z

Still no updates.

Dylan-M · 2023-02-06T21:49:15Z

@dashpole Thanks, we're still seeing this, so it is still an important issue.

dashpole · 2023-08-14T18:42:40Z

Still no updates. If others run into the same issue, feel free to thumbs-up the original comment.

Dylan-M · 2023-08-14T18:52:44Z

@dashpole We still see this regularly, and it actually causes a deeper issue. If the persistent queue is enabled, these failures are put into the queue and retried over and over again. This causes issues all over the place, such as repeated API failures, extreme log growth, and of course the persistent queue also growing on disk.

dashpole · 2023-08-14T19:00:53Z

If you are using the collector exporter, we do not recommend enabling the retry on failure setting (which we default to false). The exporter (well, really the cloud monitoring client library) has a (relatively) intelligent retry mechanism already built in, which should avoid spamming logs. This issue is just tracking making the error response more helpful.

If you are experiencing other issues, feel free to open a new issue in this repo.

Dylan-M · 2023-08-14T19:06:31Z

As you say, those other issues are preventable with settings tuning. However, I was not aware that the library had a separate retry mechanism. We should probably have an internal discussion on this. Thank you for the insight.

dashpole · 2023-08-14T19:08:58Z

Source for the retry settings built into the client: https://github.com/googleapis/google-cloud-go/blob/main/monitoring/apiv3/metric_client.go#L63. They differ per API call. CreateTimeSeries does not retry.

jsuereth added the enhancement (New feature or request) label Aug 22, 2022

dashpole added the priority: p2 label Aug 30, 2022

Dylan-M mentioned this issue Oct 10, 2022 in "Cloud Exporter Error Messages Do Not Include Metric Names" #507 (Closed)
