OTel Collector 0.104.0+ issues when using linkerd-proxy sidecar container #34565

Open
Tyrion85 opened this issue Aug 9, 2024 · 2 comments
Labels
bug (Something isn't working), waiting for author

Comments

Tyrion85 commented Aug 9, 2024

Component(s)

No response

What happened?

Description

When using OpenTelemetry Collector 0.104.0 and later (tested up to 0.106.1), linkerd-proxy logs an enormous number of "HTTP service in fail-fast" messages and its CPU usage is roughly 100x higher than normal.

(Screenshots: linkerd-proxy "fail-fast" log volume and CPU usage spike, captured 2024-08-09.)

This issue could arguably be filed with the linkerd community instead, but linkerd is a generic proxy, and OpenTelemetry Collector 0.103.1 does not trigger these problems.

Relevant collector configuration:

apiVersion: v1
data:
  relay: |
    connectors:
      spanmetrics:
        aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE
        dimensions:
        - default: GET
          name: http.method
        - name: http.status_code
        dimensions_cache_size: 50
        events:
          dimensions:
          - name: exception.type
          - name: exception.message
          enabled: true
        exclude_dimensions:
        - status.code
        exemplars:
          enabled: true
        histogram:
          explicit:
            buckets:
            - 10ms
            - 100ms
            - 250ms
            - 500ms
            - 750ms
            - 1s
            - 1500ms
            - 2s
            - 5s
        metrics_expiration: 0
        metrics_flush_interval: 15s
        resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    exporters:
      debug: {}
      otlp/quickwit:
        endpoint: ...
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://prometheus-operated.monitoring:9090/api/v1/write
        target_info:
          enabled: true
    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
      span/to_attributes:
        name:
          to_attributes:
            rules:
            .....
    receivers:
      jaeger:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:14250
          thrift_compact:
            endpoint: ${env:MY_POD_IP}:6831
          thrift_http:
            endpoint: ${env:MY_POD_IP}:14268
      opencensus: null
      otlp:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            cors:
              allowed_origins:
              - ....
            endpoint: ${env:MY_POD_IP}:4318
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${env:MY_POD_IP}:8888
      zipkin:
        endpoint: ${env:MY_POD_IP}:9411
    service:
      extensions:
      - health_check
      pipelines:
        logs:
          exporters:
          - debug
          processors:
          - memory_limiter
          - batch
          receivers:
          - otlp
        metrics:
          exporters:
          - prometheusremotewrite
          processors:
          - memory_limiter
          - batch
          receivers:
          - spanmetrics
        traces:
          exporters:
          - otlp/quickwit
          - spanmetrics
          processors:
          - batch
          - span/to_attributes
          receivers:
          - otlp
          - opencensus
          - zipkin
          - jaeger
      telemetry:
        metrics:
          address: ${env:MY_POD_IP}:8888
kind: ConfigMap
....

Steps to Reproduce

OpenTelemetry Collector 0.104.0 or 0.106.1 (see the manifest sketch below)
Linkerd 2.12.2 (though I suspect linkerd is merely surfacing some other issue; that is usually how it goes with this service mesh)
The storage backend does not appear to matter
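
A minimal sketch of this setup, assuming the stock otel/opentelemetry-collector-contrib image and standard linkerd sidecar injection via the linkerd.io/inject annotation; resource names and the namespace are illustrative, not taken from our actual deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector          # illustrative name
  namespace: observability      # illustrative namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
      annotations:
        linkerd.io/inject: enabled          # linkerd injects the linkerd-proxy sidecar
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.104.0   # 0.103.1 does not show the problem
        args:
        - --config=/conf/relay.yaml
        env:
        - name: MY_POD_IP                   # referenced as ${env:MY_POD_IP} in the config
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        ports:
        - containerPort: 4317               # OTLP gRPC
        - containerPort: 4318               # OTLP HTTP
        volumeMounts:
        - name: config
          mountPath: /conf
      volumes:
      - name: config
        configMap:
          name: otel-collector-config       # the ConfigMap shown above
          items:
          - key: relay
            path: relay.yaml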

Expected Result

No "fail-fast" log flood and normal linkerd-proxy CPU usage, as with 0.103.1. This looks like some sort of regression, since 0.103.1 works fine.
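
For what it is worth, pinning the collector image back to 0.103.1 avoids the problem in our setup. A minimal values sketch, assuming the upstream open-telemetry/opentelemetry-collector Helm chart is used for deployment (chart usage itself is an assumption here; image.repository and image.tag are values of that chart):

# values.yaml sketch for the opentelemetry-collector Helm chart
mode: deployment
image:
  repository: otel/opentelemetry-collector-contrib
  tag: 0.103.1   # last version observed working without the fail-fast log flood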

Actual Result

Collector version

0.104.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

Same as the relevant collector configuration shown in the description above (the relay key of the ConfigMap).

Log output

No response

Additional context

No response


github-actions bot commented Oct 9, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.


atoulme commented Oct 29, 2024

Please provide clear reproduction steps. Please try with the latest release as well.

atoulme added the waiting for author label and removed the needs triage (New item requiring triage) label on Oct 29, 2024
github-actions bot removed the Stale label on Oct 30, 2024