
[Spanmetrics Connector] sometimes counter-type metrics grow exponentially. #33136

Closed
pingping95 opened this issue May 21, 2024 · 11 comments
Labels
bug, connector/spanmetrics, needs triage

Comments

@pingping95

pingping95 commented May 21, 2024

Component(s)

connector/spanmetrics

What happened?

Description

Sometimes counter-type metrics grow exponentially.

This has been happening for about 2 months now.

I would take a heap dump if it were a memory leak, but it is the metric values themselves that are growing exponentially, so I don't know how to debug it.

This didn't happen when I was using Tempo's Metrics Generator.

However, generating RED metrics with Tempo's Metrics Generator meant I couldn't sample at the Collector, so I moved to the OpenTelemetry Collector.




Steps to Reproduce

It doesn't happen in the development environment.

The issue occurs on collectors with some traffic, such as in the production environment.

Because of the metrics_expiration config, the metrics return to normal after 5 minutes, but this is only a temporary measure and doesn't help much.



Expected Result

Counter-type metrics should not spike like this.

image



Actual Result

image

Collector version

v0.99

Environment information

Environment

OS: Amazon Linux 2 (AWS EKS)

Compiler (if manually compiled): I don't know

Architecture

2 Layer Collectors

LoadBalancing Collector -> Spanmetrics Collector
                        -> Tailsampling Collector

image

OpenTelemetry Collector configuration

#########################
  ### RECEIVER
  #########################
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318

    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888

  #########################
  ### PROCESSORS
  #########################
  processors:
    batch: {}
    memory_limiter:
      check_interval: 2s
      limit_percentage: 70
      spike_limit_percentage: 25

    filter:
      metrics:
        datapoint:
          - 'attributes["span.kind"] == "SPAN_KIND_CLIENT"'
          - 'attributes["span.kind"] == "SPAN_KIND_INTERNAL"'

  #########################
  ### EXPORTERS
  #########################
  exporters:

    prometheusremotewrite/spanmetrics:
      endpoint: https://mimir.xxxxx.xxx/api/v1/push

      target_info:
        enabled: true
      external_labels:
        cluster: xxxxxxxx

    prometheusremotewrite/monitormetrics:
      endpoint: https://mimir.xxxxx.xxx/api/v1/push

      external_labels:
        cluster: xxxxxxxx
        collector-type: spanmetrics

  # SPAN_KIND   : CLIENT, SERVER, INTERNAL, PRODUCER, CONSUMER
  # STATUS_CODE : UNSET, OK, ERROR

  #########################
  ### CONNECTORS
  #########################
  connectors:
    # 1. SERVER Metrics
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m

      dimensions_cache_size: 4000
      resource_metrics_cache_size: 4000

      exemplars:
        enabled: false
#        max_per_data_point: 5
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster

#        - name: messaging.kafka.client_id
        # PRODUCER
#        - name: messaging.destination.name
#        - name: messaging.kafka.destination.partition
        # CONSUMER
#        - name: messaging.kafka.consumer.group
#        - name: messaging.kafka.source.partition
#        - name: messaging.operation

      events:
        enabled: false
        dimensions:
          - name: exception.type

      resource_metrics_key_attributes:
        - service.name

  #########################
  ### EXTENSIONS
  #########################
  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
    memory_ballast:
      size_in_percentage: 40

  #########################
  ### SERVICES
  #########################
  service:
    extensions:
      - health_check
      - memory_ballast

    pipelines:
      ############################
      #### OTLP -> SPANMETRICS ###
      ############################
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
        exporters:
          - spanmetrics

      #############################################
      #### KIND: SERVER, CONSUMER, PRODUCER
      #############################################
      metrics/spanmetrics:
        receivers:
          - spanmetrics
        processors:
          - memory_limiter
          - batch
          - filter
        exporters:
          - prometheusremotewrite/spanmetrics

      #############################################
      #### OTEL Monitoring
      #############################################
      metrics/monitormetrics:
        receivers:
          - prometheus
        processors:
          - memory_limiter
          - batch
        exporters:
          - prometheusremotewrite/monitormetrics

Log output

No special logs.

Additional context

I'm disabling spanmetrics settings one by one to figure out what's causing the problem:

  1. disable events
  2. disable exemplars
  3. disable some dimensions

So far, nothing has worked.

pingping95 added the bug and needs triage labels on May 21, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@pingping95
Author

pingping95 commented May 22, 2024

I resolved this issue by removing the consumer- and producer-related dimensions.

I don't know why this issue happens,

so I decided to change the setup like this:

SERVER span metrics -> server-related dimensions only

everything else: drop it all (sketch below)
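A minimal sketch of that plan, using the filter processor to drop every non-server span before it reaches the spanmetrics connector (the kind.string conditions are OTTL; error_mode: ignore just logs and skips OTTL evaluation errors):

  processors:
    filter/server-span-only:
      error_mode: ignore
      traces:
        span:
          # drop every span whose kind is not SERVER
          - 'kind.string == "Unspecified"'
          - 'kind.string == "Internal"'
          - 'kind.string == "Client"'
          - 'kind.string == "Producer"'
          - 'kind.string == "Consumer"'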

@pingping95
Author

This issue is not resolved.

It still occurs.

@pingping95 pingping95 reopened this May 30, 2024
@pingping95
Author

I tried the following, but the issue still occurs.

  1. Apply a filter (span kind: Server only) before the spanmetrics connector.
    -> Only server spans pass through the spanmetrics connector.

  2. Add telemetry.sdk.language and telemetry.sdk.name to resource_metrics_key_attributes.

config:

  #########################
  ### RECEIVER
  #########################
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318

    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888

  #########################
  ### PROCESSORS
  #########################
  processors:
    batch: {}
    memory_limiter:
      check_interval: 2s
      limit_percentage: 70
      spike_limit_percentage: 25

    filter/server-span-only:
      error_mode: ignore
      traces:
        span:
          - 'kind.string == "Unspecified"'
          - 'kind.string == "Internal"'
          - 'kind.string == "Client"'
          - 'kind.string == "Producer"'
          - 'kind.string == "Consumer"'

#    filter/server:
#      metrics:
#        datapoint:
#          - 'attributes["span.kind"] == "SPAN_KIND_CLIENT"'
#          - 'attributes["span.kind"] == "SPAN_KIND_UNSPECIFIED"'
#          - 'attributes["span.kind"] == "SPAN_KIND_INTERNAL"'
#          - 'attributes["span.kind"] == "SPAN_KIND_CONSUMER"'
#          - 'attributes["span.kind"] == "SPAN_KIND_PRODUCER"'


  #########################
  ### EXPORTERS
  #########################
  exporters:

    prometheusremotewrite/spanmetrics-server:
      endpoint: https://xxxx.xxxxx.xxxxx/api/v1/push
      target_info:
        enabled: true
      external_labels:
        cluster: xx-xxxx-xxxx

    prometheusremotewrite/monitormetrics:
      endpoint: https://xxxxx.xxxxx.xxxxx/api/v1/push
      external_labels:
        cluster: xx-xxxx-xxxx
        collector-type: spanmetrics

  # SPAN_KIND   : CLIENT, SERVER, INTERNAL, PRODUCER, CONSUMER
  # STATUS_CODE : UNSET, OK, ERROR

  #########################
  ### CONNECTORS
  #########################
  connectors:
    # 1. SERVER Metrics
    spanmetrics/server:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster

      events:
        enabled: true
        dimensions:
          - name: exception.type

      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name

  #########################
  ### EXTENSIONS
  #########################
  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
    memory_ballast:
      size_in_percentage: 40

  #########################
  ### SERVICES
  #########################
  service:
    extensions:
      - health_check
      - memory_ballast

    pipelines:
      ############################
      #### OTLP -> SPANMETRICS ###
      ############################
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
          - filter/server-span-only
        exporters:
          - spanmetrics/server
#          - spanmetrics/messaging

      #############################################
      #### KIND: SERVER
      #############################################
      metrics/spanmetrics-server:
        receivers:
          - spanmetrics/server
        processors:
          - memory_limiter
          - batch
#          - filter/server
        exporters:
          - prometheusremotewrite/spanmetrics-server

      #############################################
      #### OTEL Monitoring
      #############################################
      metrics/monitormetrics:
        receivers:
          - prometheus
        processors:
          - memory_limiter
          - batch
        exporters:
          - prometheusremotewrite/monitormetrics

image

@pingping95
Author

pingping95 commented May 30, 2024

@portertech @Frapschen

Could you give me any advice?

If you have any additional questions or suspicions, please let me know.

Or is there something I'm setting up incorrectly?

@Frapschen
Contributor

Frapschen commented May 30, 2024

@pingping95 You added http.method to your spanmetrics label set; this dimension can sometimes be high cardinality.

Update your PromQL with sum(...) by (span_kind, cluster, service_name, http_method) to check whether your spanmetrics hit a high-cardinality case.

@pingping95
Author

pingping95 commented May 30, 2024

@Frapschen Thanks for replying!

I didn't add the cluster and service_name labels to by() for security reasons.

sum(rate(calls_total{cluster="xxx-xxxx-xxxx"}[$__rate_interval])) by (span_kind, http_method)

The series for http.method GET and PUT look strange:

(http.method = GET : up to 500k)
image

(http.method = PUT : up to 3k)
image

@ankitpatel96
Contributor

Hi,
Can you give any kind of sample metrics from these series? I understand that the number of unique metrics is growing very fast, but we simply don't have enough information to figure out why new labels keep being created.

If you could provide us a list of ~50 metrics from the normal case (before your spike) and 50ish during the spike, we could help you figure out what the origin of the spike is and whether the span metrics connector is involved in the problem. We would want the metric_name and full set of labels for each metric.

You could definitely anonymize this data however you see fit - but keeping the label keys the same would really help.

@pingping95
Author

pingping95 commented Jun 3, 2024

@ankitpatel96 Hi, thanks for the help!

Thanks to your suggestion, I found something new:

the counter-type metrics do not actually grow exponentially.

When I remove the rate() function, the calls_total metric looks like it drops all at once.

image

image

Is the spanmetrics connector a stateful component?

I use a load-balancing OTEL Collector in front of the spanmetrics connector,

and it is routed by serviceName.

# loadBalancing Collector  --> (serviceName) --> spanMetrics Collector
                           --> (traceID)          --> tailSampling Collector
opentelemetry-collector-spanmetrics-78cb9f7ff-6jrfw     1/1     Running   0          5d3h
opentelemetry-collector-spanmetrics-78cb9f7ff-hwvh9     1/1     Running   0       **17h**
opentelemetry-collector-spanmetrics-78cb9f7ff-rfdz7     1/1     Running   0          **8h**
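
For reference, a minimal sketch of the load-balancing layer described above, assuming the contrib loadbalancing exporter with a (hypothetical) headless Service for the spanmetrics collectors:

  exporters:
    loadbalancing/spanmetrics:
      # route by service name so all spans of one service land on the same replica
      routing_key: service
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          # hypothetical headless Service name
          hostname: opentelemetry-collector-spanmetrics-headless
          port: 4317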

The time when the metrics look strange matches exactly the time when the collectors running the spanmetrics connector were restarted

(8h and 17h ago).

Do I need to run the spanmetrics connector with a PVC?

Anyway, I found the cause thanks to your help. Thank you.

@pingping95
Author

pingping95 commented Jun 3, 2024

I'm going to add a new label that increases cardinality (for example, a pod_id label on the calls_total metric).

To check whether the issue still exists, I will restart the collector pod.

If the same phenomenon does not occur after restarting, I think I can conclude that the issue is not a problem with the spanmetrics connector, but rather the absence of a cardinality-increasing label in the exporter component (see the sketch below).

https://grafana.com/docs/grafana-cloud/monitor-applications/application-observability/setup/scaling/

image
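
For example, a minimal sketch of that idea, assuming the pod name is injected as MY_POD_NAME via the Kubernetes downward API, so each collector replica writes its own distinct series:

  exporters:
    prometheusremotewrite/spanmetrics:
      endpoint: https://mimir.xxxxx.xxx/api/v1/push
      target_info:
        enabled: true
      external_labels:
        cluster: xxxxxxxx
        # hypothetical per-replica label; keeps cumulative counters from
        # different collector pods from colliding after a restart
        collector_id: ${env:MY_POD_NAME}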

@pingping95
Author

I'll close this issue.

If it occurs again, I'll reopen it.

Thanks.
