
Googlemanagedprometheus exporter randomly falls into an infinite error state #31507

Closed
rafal-dudek opened this issue Feb 29, 2024 · 14 comments
Labels: bug, closed as inactive, exporter/googlemanagedprometheus (Google Managed Prometheus exporter), Stale


rafal-dudek commented Feb 29, 2024

Component(s)

exporter/googlemanagedprometheus

What happened?

Description

Sometimes, when a pod in GKE running the OpenTelemetry Collector starts up, it reports the error "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric." every subsequent minute (the scrape interval is 30s). After restarting the pod the problem disappears; after some more restarts, it happens again.
It looks like all the metrics are sent to Google Monitoring properly, but every minute an additional duplicated data point is added to the batch, which causes the errors.

Steps to Reproduce

Create a pod in Google Kubernetes Engine running the OpenTelemetry Collector with a config similar to ours. If the problem does not occur, delete the pod and recreate it. Repeat until you see consistent error logs.

Expected Result

If a problem with writing a data point to Google Monitoring causes a duplicated data point to be sent the next minute, it should not repeat indefinitely every minute.

Actual Result

The duplicated-data-point error puts the OpenTelemetry exporter into an infinite error state, which is fixed only when the pod is deleted.

Collector version

v0.95.0

Environment information

Environment

Google Kubernetes Engine
Base image: ubi9/ubi
Compiler (if manually compiled): go 1.21.7

OpenTelemetry Collector configuration

receivers:  
  prometheus/otel-metrics:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 30s
          static_configs:
            - targets: ['127.0.0.1:8888']
          metrics_path: /metrics

processors:
  resource/metrics:
    attributes:
      - key: k8s.namespace.name
        value: namespace-name
        action: upsert
      - key: k8s.pod.name
        value: pod-name-tfx4k # The name of the POD - unique name after each recreation
        action: upsert
      - key: k8s.container.name
        value: otel-collector
        action: upsert
      - key: cloud.availability_zone
        value: us-central1-c
        action: upsert
      - key: service.name
        action: delete
      - key: service.version
        action: delete
      - key: service.instance.id
        action: delete
  metricstransform/gmp_otel:
    transforms:
    - include: ^(.*)$$
      match_type: regexp
      action: update
      new_name: otel_internal_$${1}
    - include: \.*
      match_type: regexp
      action: update
      operations:
        - action: add_label
          new_label: source_project_id
          new_value: gke-cluster-project
        - action: add_label
          new_label: pod_name
          new_value: pod-name-tfx4k # The name of the POD - unique name after each recreation
        - action: add_label
          new_label: container_name
          new_value: es-exporter
    - include: ^(.+)_(seconds|bytes)_(.+)$$
      match_type: regexp
      action: update
      new_name: $${1}_$${3}
    - include: ^(.+)_(bytes|total|seconds)$$
      match_type: regexp
      action: update
      new_name: $${1}
  resourcedetection/metrics:
    detectors: [env, gcp]
    timeout: 2s
    override: false
  batch/metrics:
    send_batch_size: 200
    timeout: 5s
    send_batch_max_size: 200
  memory_limiter:
    limit_mib: 297
    spike_limit_mib: 52
    check_interval: 1s
exporters:
  googlemanagedprometheus/otel-metrics:
    project: project-for-metrics
    timeout: 15s
    sending_queue:
      enabled: false
      num_consumers: 10
      queue_size: 5000
    metric:
      prefix: prometheus.googleapis.com
      add_metric_suffixes: False
  logging:
    loglevel: debug
    sampling_initial: 1
    sampling_thereafter: 500
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777
service:
  telemetry:
    logs:
      level: "info"
  extensions: [health_check]
  pipelines:
    metrics/otel:
      receivers: [prometheus/otel-metrics]
      processors: [batch/metrics, resourcedetection/metrics, metricstransform/gmp_otel, resource/metrics]
      exporters: [googlemanagedprometheus/otel-metrics]

Log output

First 11 error logs:
2024-02-29T08:22:00.043Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-c,job:,instance:,namespace:namespace-name,cluster:gke-cluster-name} timeSeries[0-12]: prometheus.googleapis.com/otel_internal_scrape_series_added/gauge{pod_name:pod-name-tfx4k,source_project_id:gke-cluster-project,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver}\\nerror details: name = Unknown  desc = total_point_count:13  success_point_count:12  errors:{status:{code:9}  point_count:1} "rejected_items": 28}

2024-02-29T08:23:00.169Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,cluster:gke-cluster-name,location:us-central1-c,instance:,namespace:namespace-name} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_batch_batch_size_trigger_send/counter{service_name:otel-collector-ngp-monitoring,otel_scope_name:otelcol/prometheusreceiver,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,processor:batch/metrics}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:24:00.330Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{namespace:namespace-name,job:,cluster:gke-cluster-name,location:us-central1-c,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_scrape_samples_scraped/gauge{otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:25:00.450Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,job:,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_receiver_accepted_metric_points/counter{source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,transport:http,service_instance_id:instance-id-c4ef0b0aaf35,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver,service_name:otel-collector-ngp-monitoring,receiver:prometheus/app-metrics,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:26:00.599Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,cluster:gke-cluster-name,namespace:namespace-name,job:,location:us-central1-c} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_send_failed_metric_points/counter{source_project_id:gke-cluster-project,container_name:es-exporter,exporter:googlemanagedprometheus/app-metrics,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,otel_scope_name:otelcol/prometheusreceiver,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,pod_name:pod-name-tfx4k,service_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:27:00.737Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-c,job:,cluster:gke-cluster-name,namespace:namespace-name,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_grpc_io_client_completed_rpcs/counter{service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,grpc_client_status:INVALID_ARGUMENT,source_project_id:gke-cluster-project,service_name:otel-collector-ngp-monitoring,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,otel_scope_name:otelcol/prometheusreceiver,pod_name:pod-name-tfx4k,grpc_client_method:google.monitoring.v3.MetricService/CreateTimeSeries}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:28:00.895Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,job:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_grpc_io_client_completed_rpcs/counter{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,grpc_client_method:google.monitoring.v3.MetricService/CreateTimeSeries,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,otel_scope_name:otelcol/prometheusreceiver,grpc_client_status:INVALID_ARGUMENT,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:29:01.042Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,location:us-central1-c,cluster:gke-cluster-name,namespace:namespace-name,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_queue_capacity/gauge{service_name:otel-collector-ngp-monitoring,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,otel_scope_name:otelcol/prometheusreceiver,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,exporter:googlecloud/app-traces}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:30:01.166Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,job:,location:us-central1-c,namespace:namespace-name,cluster:gke-cluster-name} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_accepted_metric_points/counter{service_name:otel-collector-ngp-monitoring,otel_scope_name:otelcol/prometheusreceiver,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,processor:memory_limiter,container_name:es-exporter,source_project_id:gke-cluster-project,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,service_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:30:56.284Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_sent_metric_points/counter{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,exporter:googlemanagedprometheus/app-metrics,source_project_id:gke-cluster-project,otel_scope_name:otelcol/prometheusreceiver,container_name:es-exporter,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,pod_name:pod-name-tfx4k,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:31:56.439Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{cluster:gke-cluster-name,instance:,job:,namespace:namespace-name,location:us-central1-c} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_batch_metadata_cardinality/gauge{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,service_name:otel-collector-ngp-monitoring,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,otel_scope_name:otelcol/prometheusreceiver,container_name:es-exporter}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

Additional context

I ran some additional tests and it looks like the googlemanagedprometheus timeout could be related to the problem.
With a 10s timeout, 5 out of 12 started pods showed the errors.
With a 15s timeout, 1 out of 20 started pods showed the errors.
So maybe there is a problem with the export timeout, but this infinite-error behavior still does not look correct.
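For completeness, this is roughly what enabling the sending_queue that the error message keeps suggesting would look like in our exporter config (just a sketch of the options we already have, with enabled flipped to true; at best it would mask the duplicate-point error rather than explain it):

exporters:
  googlemanagedprometheus/otel-metrics:
    project: project-for-metrics
    timeout: 15s
    sending_queue:
      enabled: true       # the exporterhelper log suggests enabling this to survive temporary failures
      num_consumers: 10
      queue_size: 5000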

Histogram for 10s timeout: [screenshot]

Histogram for 15s timeout: [screenshot]
Almost 2 hours of errors later (the same pod): [screenshot]

A blue rectangle means a new pod started. A red rectangle means the error described in this issue.
All pods are exactly the same, differing only in the random suffix of their names.

@rafal-dudek rafal-dudek added bug Something isn't working needs triage New item requiring triage labels Feb 29, 2024
@github-actions github-actions bot added the exporter/googlemanagedprometheus Google Managed Prometheus exporter label Feb 29, 2024
github-actions bot commented:

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dashpole commented:

That error usually means you either have multiple collectors trying to write the same set of metrics, or that the metrics being sent contain duplicates. Are you able to reproduce this with a single collector?

I see service_version:0.95.0-rc-9-g87a8be8-20240228-145123. Is this a fork of the collector, or built off of a different commit? If you can reproduce with a released version, that would be helpful.

@dashpole dashpole self-assigned this Feb 29, 2024
@dashpole dashpole removed the needs triage New item requiring triage label Feb 29, 2024
rafal-dudek (Author) commented:

I have only one instance of this collector running at a time. We are also using labels with the namespace and the pod name (from the Deployment), which is de facto unique.
And I believe we cannot have duplicates, as (in this example) we are using the otel-collector's internal metrics.

E.g. let's take this error:
Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,job:,location:us-central1-c,namespace:namespace-name,cluster:gke-cluster-name} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_accepted_metric_points/counter{service_name:otel-collector-ngp-monitoring,otel_scope_name:otelcol/prometheusreceiver,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,processor:memory_limiter,container_name:es-exporter,source_project_id:gke-cluster-project,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,service_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1} "rejected_items": 35}
otelcol_processor_accepted_metric_points is an internal metric of the collector and has one time series for each receiver, so there should be no duplicate within the metric itself. And we can see the label pod_name:pod-name-tfx4k, which is the unique pod name provided by Kubernetes.

And as described above, we are starting just this one pod; sometimes it works correctly the whole time and sometimes it prints errors the whole time.

We are building our own image with go.opentelemetry.io/collector/cmd/builder, using opentelemetry-collector-contrib and go.opentelemetry.io/collector at v0.95.0, on the ubi9/ubi base image with go1.21.7.

Maybe there is a problem in our configuration, but to me it looks like incorrect behavior of the collector (exporter?), especially given that a smaller timeout increases the probability of the problem occurring on a pod.

Do you have any tips on how to find the duplicated data point? The metrics in the error logs look fine on their own.

dashpole commented:

Can you turn off sampling in the logging exporter, and export to both GMP and logging?
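For example, something along these lines should effectively disable sampling (a sketch reusing the logging exporter options from your config; sampling_thereafter: 1 means every record is logged):

exporters:
  logging:
    loglevel: debug
    sampling_initial: 10000
    sampling_thereafter: 1   # 1 = no sampling after the initial burst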

dashpole commented:

I assume you are using the downward API to set the pod name, then?

rafal-dudek (Author) commented:

I assume you are using the downward API to set the pod name, then?

Yes.
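Roughly like this in the pod spec (a sketch; the env var name is just illustrative):

env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name   # downward API: the pod's own name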

In the original post I simplified my collector configuration; we are actually using two receivers and pipelines, one for internal otel metrics and one for metrics from our application. I didn't think it was relevant, but now it looks like it may be important.

In the original config the app-metrics interval is 60s and the otel-metrics interval is 30s, and we got the duplicate errors once a minute.
I have now tested different scenarios: with app metrics at 60s and otel metrics at 45s we got errors once every 3 minutes,
and with app metrics at 60s and otel metrics at 50s we got errors once every 5 minutes.

So, analyzing those times, it looks like the problem appears when the app and otel metrics are scraped at the same time. Maybe some pods didn't have the problem because there was a random offset between the app and otel metric scrapes?
I checked the logging exporter output from a pod without errors: there was a 14-second offset between the timestamps of the app and otel metrics.
On a pod with errors, the offset was only 1 second.
It could be related to them being batched together, but I don't understand why there are duplicate errors; there cannot be any duplicates between the app and otel metrics, because we add the otel_app and otel_internal prefixes to them respectively.
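Those periods line up with the least common multiple of the two scrape intervals, i.e. with the moments when both scrapes fire together (just my arithmetic, not a confirmed explanation):

lcm(60s, 30s) = 60s   -> errors every 1 minute
lcm(60s, 45s) = 180s  -> errors every 3 minutes
lcm(60s, 50s) = 300s  -> errors every 5 minutes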

The logging exporter produces a large number of logs, but I am attaching a segment of it here:
logs.txt

And here is the config for it:

Config
receivers:
  zipkin/app-traces:
  jaeger/app-traces:
    protocols:
      thrift_http:
  prometheus/app-metrics:
    config:
      scrape_configs:
        - job_name: 'app-collector'
          scrape_interval: 60s
          scrape_timeout: 10s
          static_configs:
            - targets: [127.0.0.1:9114]
          metrics_path: /metrics
  prometheus/otel-metrics:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 45s
          static_configs:
            - targets: ['127.0.0.1:8888']
          metrics_path: /metrics

processors:
  attributes/traces:
    actions:
      - key: source_project_id
        value: cluster-project
        action: upsert
      - key: namespace_name
        value: namespace-name
        action: upsert
      - key: pod_name
        value: pod-name-jv9n8
        action: upsert
      - key: container_name
        value: es-exporter
        action: upsert
      - key: location
        value: us-central1-a
        action: upsert
  resource/metrics:
    attributes:
      - key: k8s.namespace.name
        value: namespace-name
        action: upsert
      - key: k8s.pod.name
        value: pod-name-jv9n8
        action: upsert
      - key: k8s.container.name
        value: otel-collector
        action: upsert
      - key: cloud.availability_zone
        value: us-central1-a
        action: upsert
      - key: service.name
        action: delete
      - key: service.version
        action: delete
      - key: service.instance.id
        action: delete
  metricstransform/gmp_app:
    transforms:
    - include: ^(.*)$$
      match_type: regexp
      action: update
      new_name: otel_app_$${1}
    - include: \.*
      match_type: regexp
      action: update
      operations:
        - action: add_label
          new_label: source_project_id
          new_value: cluster-project
        - action: add_label
          new_label: pod_name
          new_value: pod-name-jv9n8
        - action: add_label
          new_label: container_name
          new_value: es-exporter
    - include: ^(.+)_(seconds|bytes)_(.+)$$
      match_type: regexp
      action: update
      new_name: $${1}_$${3}
    - include: ^(.+)_(bytes|total|seconds)$$
      match_type: regexp
      action: update
      new_name: $${1}
  metricstransform/gmp_otel:
    transforms:
    - include: ^(.*)$$
      match_type: regexp
      action: update
      new_name: otel_internal_$${1}
    - include: \.*
      match_type: regexp
      action: update
      operations:
        - action: add_label
          new_label: source_project_id
          new_value: cluster-project
        - action: add_label
          new_label: pod_name
          new_value: pod-name-jv9n8
        - action: add_label
          new_label: container_name
          new_value: es-exporter
    - include: ^(.+)_(seconds|bytes)_(.+)$$
      match_type: regexp
      action: update
      new_name: $${1}_$${3}
    - include: ^(.+)_(bytes|total|seconds)$$
      match_type: regexp
      action: update
      new_name: $${1}
  resourcedetection/metrics:
    detectors: [env, gcp]
    timeout: 2s
    override: false
  resourcedetection/traces:
    detectors: [env, gcp]
    timeout: 2s
    override: false
  batch/metrics:
    send_batch_size: 200
    timeout: 5s
    send_batch_max_size: 200
  batch/traces:
    send_batch_size: 10000
    timeout: 5s
    send_batch_max_size: 10000
  memory_limiter:
    limit_mib: 297
    spike_limit_mib: 52
    check_interval: 1s
exporters:
  googlecloud/app-traces:
    project: google-monitoring-project
    timeout: 10s
  googlemanagedprometheus/app-metrics:
    project: google-monitoring-project
    timeout: 15s
    sending_queue:
      enabled: false
      num_consumers: 10
      queue_size: 5000
    metric:
      prefix: prometheus.googleapis.com
      add_metric_suffixes: False
  googlemanagedprometheus/otel-metrics:
    project: google-monitoring-project
    timeout: 15s
    sending_queue:
      enabled: false
      num_consumers: 10
      queue_size: 5000
    metric:
      prefix: prometheus.googleapis.com
      add_metric_suffixes: False
  logging:
    loglevel: debug
    sampling_initial: 10000
    sampling_thereafter: 10000
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777
service:
  telemetry:
    logs:
      level: "info"
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [zipkin/app-traces, jaeger/app-traces]
      processors: [memory_limiter, batch/traces, resourcedetection/traces, attributes/traces]
      exporters: [googlecloud/app-traces, logging]
    metrics/otel:
      receivers: [prometheus/otel-metrics]
      processors: [batch/metrics, resourcedetection/metrics, metricstransform/gmp_otel, resource/metrics]
      exporters: [googlemanagedprometheus/otel-metrics, logging]
    metrics/app:
      receivers: [prometheus/app-metrics]
      processors: [memory_limiter, batch/metrics, resourcedetection/metrics, metricstransform/gmp_app, resource/metrics]
      exporters: [googlemanagedprometheus/app-metrics, logging]

I guess a "scrape_offset" parameter for the prometheusreceiver would be a workaround for this problem, but I don't see anything like that.


dashpole commented Mar 1, 2024

Usually the error message gives you a particular metric + labels that failed. If you can find what the logging exporter prints just before the export for that metric, that might point to how it ended up duplicated. If that doesn't give you enough info, you can paste it here, and I might be able to figure out why it resulted in the error.

If that still doesn't work, and you are able to get the OTLP in json using the json exporter, I can actually replay it using our testing framework and figure out why it isn't working.

rafal-dudek (Author) commented:

I don't see anything unusual there.
In logs.txt from the previous comment, there is this log:

Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures. {""kind"": ""exporter"", ""data_type"": ""metrics"", ""name"": ""googlemanagedprometheus/otel-metrics"", ""error"": ""rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,namespace:namespace-name,job:,cluster:cluster-name,location:us-central1-a} timeSeries[0-38]: prometheus.googleapis.com/otel_internal_otelcol_process_runtime_total_sys_memory/gauge{source_project_id:cluster-project,service_name:otel-collector-ngp-monitoring,pod_name:pod-name-jv9n8,otel_scope_name:otelcol/prometheusreceiver,otel_scope_version:0.95.0-rc-9-g9b2e7ee-20240301-074459,service_instance_id:instance-id--353de3ab330f,service_version:0.95.0-rc-9-g9b2e7ee-20240301-074459,container_name:es-exporter}\nerror details: name = Unknown desc = total_point_count:39 success_point_count:38 errors:{status:{code:9} point_count:1}"", ""rejected_items"": 37}"

Just before that, a different metric, "otel_internal_scrape_series_added", is logged. And earlier there is "otel_internal_otelcol_process_runtime_total_sys_memory", where we can see that there is only one data point.

And as you can see in my description in this thread, each minute a different metric appears in the error log.

How can I use the JSON exporter? I don't see it in the repo.


dashpole commented Mar 4, 2024

Ah, sorry. It's called the file exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/fileexporter

I'll try and reproduce it with your config above.
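A minimal sketch of wiring it into your otel-metrics pipeline (the path is just an example; by default the file exporter writes the data as OTLP JSON):

exporters:
  file/otlp-json:
    path: /tmp/otel-metrics.json
service:
  pipelines:
    metrics/otel:
      receivers: [prometheus/otel-metrics]
      processors: [batch/metrics, resourcedetection/metrics, metricstransform/gmp_otel, resource/metrics]
      exporters: [googlemanagedprometheus/otel-metrics, logging, file/otlp-json]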

rafal-dudek (Author) commented:

I'm sending 10MB of metrics from the pod with problems:
otel-metrics.json


dashpole commented Mar 5, 2024

I've run the first 22 batches of metrics through the replay mechanism, grouped by the timestamps, which covers the first 90 seconds. I haven't been able to produce any errors. GoogleCloudPlatform/opentelemetry-operations-go#809

It looks like the errors occur every minute, so I should have found one by now. Do you have any of the error logs from during that run?

rafal-dudek (Author) commented:

Here are the logs from this exact Pod since startup. At the beginning there is its configuration.
otel-logs.txt

I've run the first 22 batches of metrics through the replay mechanism, grouped by the timestamps, which covers the first 90 seconds.

I see that the first timestamp in the metrics is 15:43:36, the next are 15:44:04 (app metrics) and 15:44:06 (otel internal metrics), and the first error log is at 15:44:08, so I think so.

We are running the collector in a Kubernetes pod with CPU 100m-300m, so it runs on only one core. I'm not sure whether that influences the behavior of the two pipelines running at the same time, since the difference is 2 seconds and the batch processor waits 5 seconds.


github-actions bot commented May 6, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 6, 2024

github-actions bot commented Jul 5, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned Jul 5, 2024