
Improve transient errors in googlecloud trace exporter batch write spans. #34957

Open
AkselAllas opened this issue Sep 2, 2024 · 10 comments

@AkselAllas

Component(s)

No response

Describe the issue you're reporting

I am experiencing transient otel-collector failures when exporting trace batches, e.g.:

[screenshot]

I have:

    traces/2:
      receivers: [ otlp ]
      processors: [ tail_sampling, batch ]
      exporters: [ googlecloud ]

I have tried increasing the timeout to 45 sec, as described here, and I have tried decreasing the batch size from 200 to 100, as suggested here. Neither approach has produced a statistically meaningful improvement.
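
For reference, here is roughly what those two changes look like in my config; the 45s timeout and the batch size of 100 are just the values mentioned above, not recommendations:

    exporters:
      googlecloud:
        timeout: 45s              # raised exporter timeout
    processors:
      batch:
        send_batch_size: 100      # reduced from 200
        send_batch_max_size: 100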

Stacktrace:

"caller":"exporterhelper/queue_sender.go:101", "data_type":"traces", "dropped_items":200, "error":"context deadline exceeded", "kind":"exporter", "level":"error", "msg":"Exporting failed. Dropping data.", "name":"googlecloud", "stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1

Any ideas on what to do?

Seemingly, I don't have problems with the retry queue:

[screenshot]

@Frapschen
Contributor

Pinging code owners: @aabmass, @dashpole, @jsuereth, @punya, @damemi, @psx95

Contributor

github-actions bot commented Sep 3, 2024

Pinging code owners for exporter/googlecloud: @aabmass @dashpole @jsuereth @punya @damemi @psx95. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole dashpole self-assigned this Sep 3, 2024
@dashpole dashpole added the bug label and removed the needs triage label Sep 3, 2024
@dashpole
Contributor

dashpole commented Sep 3, 2024

What version of the collector are you on?

@AkselAllas
Author

Version 0.102

@dashpole
Contributor

dashpole commented Sep 3, 2024

Can you share the googlecloud exporter configuration as well?

@AkselAllas
Author

AkselAllas commented Sep 4, 2024

@dashpole

    traces:
      receivers: [ otlp ]
      processors: [ tail_sampling, batch ]
      exporters: [ googlecloud ]

and

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:8080"
processors:
  batch:
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 10s
  tail_sampling:
    policies:
      - name: drop_noisy_traces_url
        type: string_attribute
        string_attribute:
          key: http.user_agent
          values:
            - GoogleHC*
            - StackDriver*
            - kube-probe*
            - GoogleStackdriverMonitoring-UptimeChecks*
            - cloud-run-http-probe
          enabled_regex_matching: true
          invert_match: true
      - name: drop_noisy_traces_new_semantic_conv
        type: string_attribute
        string_attribute:
          key: user_agent.original
          values:
            - GoogleHC*
            - StackDriver*
            - GoogleStackdriverMonitoring-UptimeChecks*
            - cloud-run-http-probe
          enabled_regex_matching: true
          invert_match: true
  
exporters:
  googlecloud:
    metric:
      prefix: 'custom.googleapis.com'

@BinaryFissionGames
Contributor

Just adding another data point here: we have a user who is doing some outage testing with logs and is running into the same issue, where the exporter seems to drop logs when there is a transient error (it can't connect to GCP).

Logs are like this:

{"level":"warn","ts":"2024-09-06T12:02:16.765+0200","caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: \"logging.googleapis.com:443\", ServerName: \"logging.googleapis.com:443\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 142.250.74.138:443: i/o timeout\"","grpc_log":true}
{"level":"error","ts":"2024-09-06T12:03:07.799+0200","caller":"exporterhelper/queue_sender.go:101","msg":"Exporting failed. Dropping data.","kind":"exporter","data_type":"logs","name":"googlecloud/applogs_google","error":"context deadline exceeded","dropped_items":15,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.1/exporterhelper/queue_sender.go:101\ngo.opentelemetry.io/collector/exporter/internal/queue.(*persistentQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.1/internal/queue/persistent_queue.go:215\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.1/internal/queue/consumers.go:43"}

That first log line is repeated many times (the Google API servers are intentionally blocked for testing in this case).

They are using v0.102.1 of the collector. This case is a little more artificial, and we'll try adjusting the timeout settings, but I figured I'd add what we're seeing here.

@dbason

dbason commented Oct 11, 2024

We're also seeing this sporadically (we admittedly do have some very chatty trace exporters) on version 0.105.

Config is:

    # Copyright 2024 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    exporters:
      googlecloud:
        user_agent: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}
      googlemanagedprometheus:
        user_agent: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}

    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    processors:
      filter/self-metrics:
        metrics:
          include:
            match_type: strict
            metric_names:
            - otelcol_process_uptime
            - otelcol_process_memory_rss
            - otelcol_grpc_io_client_completed_rpcs
            - otelcol_googlecloudmonitoring_point_count
      batch:
        send_batch_max_size: 200
        send_batch_size: 200
        timeout: 5s

      k8sattributes:
        extract:
          metadata:
          - k8s.namespace.name
          - k8s.deployment.name
          - k8s.statefulset.name
          - k8s.daemonset.name
          - k8s.cronjob.name
          - k8s.job.name
          - k8s.node.name
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.pod.start_time
        passthrough: false
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.ip
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
        - sources:
          - from: connection
      memory_limiter:
        check_interval: 1s
        limit_percentage: 65
        spike_limit_percentage: 20

      metricstransform/self-metrics:
        transforms:
        - action: update
          include: otelcol_process_uptime
          operations:
          - action: add_label
            new_label: version
            new_value: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}

      # We need to add the pod IP as a resource label so the k8s attributes processor can find it.
      resource/self-metrics:
        attributes:
        - action: insert
          key: k8s.pod.ip
          value: ${env:MY_POD_IP}

      resourcedetection:
        detectors: [gcp]
        timeout: 10s

      transform/collision:
        metric_statements:
        - context: datapoint
          statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            cors:
              allowed_origins:
              - http://*
              - https://*
            endpoint: ${env:MY_POD_IP}:4318
      prometheus/self-metrics:
        config:
          scrape_configs:
          - job_name: otel-self-metrics
            scrape_interval: 1m
            static_configs:
            - targets:
              - ${env:MY_POD_IP}:8888

    service:
      extensions:
      - health_check
      pipelines:
        metrics/self-metrics:
          exporters:
          - googlemanagedprometheus
          processors:
          - filter/self-metrics
          - metricstransform/self-metrics
          - resource/self-metrics
          - k8sattributes
          - memory_limiter
          - resourcedetection
          - batch
          receivers:
          - prometheus/self-metrics
        traces:
          exporters:
          - googlecloud
          processors:
          - k8sattributes
          - memory_limiter
          - resourcedetection
          - batch
          receivers:
          - otlp
      telemetry:
        metrics:
          address: ${env:MY_POD_IP}:8888

@dashpole
Contributor

Thanks. We disable retries in the collector because the Google Cloud client library has retries built in.
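
In practice, that means the exporter-level timeout has to be long enough for those client-library retries to finish: if it is not, the context deadline fires first and the queue sender drops the batch, which matches the "context deadline exceeded" errors above. A rough sketch of that knob (the 45s value is just the one mentioned earlier in this thread, not a recommendation):

    exporters:
      googlecloud:
        # Give the client library's built-in retries room to complete before
        # the exporterhelper context deadline cancels the export.
        timeout: 45s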

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Dec 11, 2024