Googlemanagedprometheus exporter randomly falls into an infinite error state #31507
Comments
That error usually means you either have multiple collectors trying to write the same set of metrics, or that the metrics being sent contain duplicates. Are you able to reproduce this with a single collector?
I have only one instance of this collector running at a time. We are also using labels with a Namespace and a Pod name (Deployment), which are de facto unique. E.g. let's take this error: And as described in the topic, we are just starting this one Pod and sometimes it works correctly all the time, and sometimes it prints errors all the time. We are just building our own image with Maybe there is a problem in our configuration, but to me it looks like wrong behavior of the collector (exporter?), especially since a smaller timeout increases the probability of the problem on a Pod. Do you have any tips on how to look for the duplicated data point? The metrics in the error logs look fine on their own.
Can you turn off sampling in the logging exporter, and export to both GMP and logging?
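For reference, a minimal sketch of that kind of dual-export setup, assuming the standard `logging` exporter options in v0.95.0 (`sampling_thereafter: 1` effectively disables sampling) and hypothetical `prometheus`/`batch` component names:

```yaml
exporters:
  googlemanagedprometheus: {}
  logging:
    verbosity: detailed
    # log every batch instead of sampling, so duplicated points are visible
    sampling_initial: 1
    sampling_thereafter: 1

service:
  pipelines:
    metrics:
      receivers: [prometheus]   # hypothetical receiver name
      processors: [batch]
      exporters: [googlemanagedprometheus, logging]
```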
I assume you are using the downward API to set the pod name, then?
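For context, the usual downward-API pattern for exposing the pod name to the collector looks like the sketch below (the `POD_NAME` variable name is hypothetical); the collector config can then reference it as `${env:POD_NAME}`:

```yaml
# Pod spec excerpt: expose the pod's own name via the Kubernetes downward API
containers:
  - name: otel-collector
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
```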
Yes. In the original post I simplified my collector configuration - actually we are using 2 receivers and pipelines - one for internal otel metrics and one for metrics from our application. I didn't find it relevant, but now it looks like it may be important. In the original config we have an interval of 60s for app metrics and 30s for otel metrics - and we get the duplicate errors once a minute. So, analyzing those times, it looks like the problem appears when we scrape app and otel metrics at the same time. Maybe some pods didn't have the problem because there was a random offset between the app and otel metric scrapes? With the logging exporter there is a large number of logs, but I am adding a segment of it here: And here is the config for it:
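The actual configuration was attached as a collapsed block; purely as an illustration of the two-pipeline layout described above (job names, ports, and component names are hypothetical, not the reporter's real config), such a setup might look like:

```yaml
receivers:
  prometheus/app:
    config:
      scrape_configs:
        - job_name: app
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8080"]
  prometheus/otel:
    config:
      scrape_configs:
        - job_name: otel-internal
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:8888"]   # collector's own telemetry endpoint

processors:
  batch:
    timeout: 5s

exporters:
  googlemanagedprometheus: {}

service:
  pipelines:
    metrics/app:
      receivers: [prometheus/app]
      processors: [batch]
      exporters: [googlemanagedprometheus]
    metrics/otel:
      receivers: [prometheus/otel]
      processors: [batch]
      exporters: [googlemanagedprometheus]
```

With a 60s and a 30s interval, the two scrapes align once per minute and both pipelines flush into the exporter at nearly the same time, which matches the once-a-minute pattern of the errors.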
I guess some `scrape_offset` parameter for the `prometheusreceiver` would be a workaround for this problem, but I don't see anything like that.
Usually the error message gives you a particular metric + labels that failed. If you can find what the logging exporter prints just before the export for that metric, that might point to how it ended up duplicated. If that doesn't give you enough info, you can paste it here and I might be able to figure out why it resulted in the error. If that still doesn't work, and you are able to get the OTLP in JSON using the JSON exporter, I can actually replay it using our testing framework and figure out why it isn't working.
I don't see anything unusual there.
Just before that there is a different metric, `otel_internal_scrape_series_added`. And earlier there is `otel_internal_otelcol_process_runtime_total_sys_memory`, where we can see that there is only 1 data point. And as you can see in my description in the thread - each minute there is a different metric in the error log. How can I use the JSON exporter? I don't see it in the repo.
Ah, sorry. It's called the file exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/fileexporter I'll try to reproduce it with your config above.
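A minimal sketch of dumping the OTLP payload as JSON alongside the normal export, assuming the file exporter's `format: json` option (the path and receiver name are hypothetical):

```yaml
exporters:
  file:
    path: /tmp/otlp-metrics.json
    format: json   # write the OTLP data as JSON for offline replay

service:
  pipelines:
    metrics:
      receivers: [prometheus]   # hypothetical receiver name
      exporters: [file, googlemanagedprometheus]
```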
I'm sending 10 MB of metrics from the problematic pod:
I've run the first 22 batches of metrics through the replay mechanism, grouped by timestamp, which covers the first 90 seconds. I haven't been able to produce any errors: GoogleCloudPlatform/opentelemetry-operations-go#809 It looks like the errors occur every minute, so I should have found one by now. Do you have any of the error logs from during that run?
Here are the logs from this exact Pod since startup. At the beginning there is its configuration.
I see that the first timestamp in the metrics is 15:43:36, the next ones are 15:44:04 (app metrics) and 15:44:06 (otel internal metrics), and the first error log is at 15:44:08, so I think so. We are running the collector in a Pod in Kubernetes with CPU 100m-300m, so it runs on only 1 core. I'm not sure whether that influences the behavior of the 2 pipelines running at the same time, as the difference is 2 seconds and the batch processor waits 5 seconds.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners. See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Component(s)
exporter/googlemanagedprometheus
What happened?
Description
Sometimes, when a pod in GKE running the OpenTelemetry Collector starts up, it reports the error "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric." every subsequent minute (the scrape interval is 30s). After restarting the pod the problem disappears. After some more restarts, the problem happens again.
It looks like all the metrics are sent properly to Google Monitoring, but every minute additional duplicated data points are added to the batch, which causes the errors.
Steps to Reproduce
Create a pod in Google Kubernetes Engine running the OpenTelemetry Collector with a config similar to ours. If the problem does not occur, delete the pod and recreate it. Repeat until you see consistent error logs.
Expected Result
If there is a problem with saving a data point to Google Monitoring that causes a duplicated data point to be sent the next minute, it should not repeat indefinitely every minute.
Actual Result
The error with a duplicated data point puts the OpenTelemetry exporter into an infinite error state, which is fixed only when the pod is deleted.
Collector version
v0.95.0
Environment information
Environment
Google Kubernetes Engine
Base image: ubi9/ubi
Compiler (if manually compiled): go 1.21.7
OpenTelemetry Collector configuration
Log output
Additional context
I made some additional tests and it looks like the googlemanagedprometheus timeout could be related to the problem.
With a 10s timeout, I got 5 pods with errors out of 12 pods started.
With a 15s timeout, I got 1 pod with errors out of 20 pods started.
So maybe there is a problem with the export timeout, but this behavior with infinite errors does not look correct.
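For reference, the timeout varied in these tests is the exporter-level timeout; assuming the standard exporterhelper `timeout` option on the googlemanagedprometheus exporter, the two variants tested would be:

```yaml
exporters:
  googlemanagedprometheus:
    # 10s reproduced errors on 5 of 12 pods; 15s on 1 of 20
    timeout: 15s
```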
Histogram for 10s timeout:
Histogram for 15s timeout:
Almost 2 hours of errors later (the same pod):
A blue rectangle means a new Pod started. A red rectangle means the error described in this issue.
All pods are exactly the same, just with different names with random suffixes.