
[Spanmetrics Connector] sometimes counter-type metrics grow exponentially. #33136

Closed
pingping95 opened this issue May 21, 2024 · 11 comments
Labels
bug, connector/spanmetrics, needs triage

Comments

@pingping95

pingping95 commented May 21, 2024

Component(s)

connector/spanmetrics

What happened?

Description

Sometimes counter-type metrics grow exponentially.

This has been happening for about 2 months now.

I would take a heap dump if it were a memory leak, but it is the metric values themselves that are growing exponentially, so I don't know how to debug it.

This didn't happen when I was using Tempo's Metrics Generator.

However, generating RED metrics with Tempo's Metrics Generator meant I couldn't sample at the Collector, so I moved to the OpenTelemetry Collector.




Steps to Reproduce

It doesn't happen in the development environment.

The issue occurs on collectors with some traffic, such as in the production environment.

Because of the metrics_expiration config, the metrics return to normal after 5 minutes, but this is only a temporary measure and doesn't help much.



Expected Result

Counter-type metrics should not spike like this.

image



Actual Result

image

Collector version

v0.99

Environment information

Environment

OS: Amazon Linux 2 (AWS EKS)

Compiler (if manually compiled): I don't know

Architecture

2 Layer Collectors

LoadBalancing Collector -> Spanmetrics Collector
                        -> Tailsampling Collector

image

OpenTelemetry Collector configuration

#########################
  ### RECEIVER
  #########################
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318

    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888

  #########################
  ### PROCESSORS
  #########################
  processors:
    batch: {}
    memory_limiter:
      check_interval: 2s
      limit_percentage: 70
      spike_limit_percentage: 25

    filter:
      metrics:
        datapoint:
          - 'attributes["span.kind"] == "SPAN_KIND_CLIENT"'
          - 'attributes["span.kind"] == "SPAN_KIND_INTERNAL"'

  #########################
  ### EXPORTERS
  #########################
  exporters:

    prometheusremotewrite/spanmetrics:
      endpoint: https://mimir.xxxxx.xxx/api/v1/push

      target_info:
        enabled: true
      external_labels:
        cluster: xxxxxxxx

    prometheusremotewrite/monitormetrics:
      endpoint: https://mimir.xxxxx.xxx/api/v1/push

      external_labels:
        cluster: xxxxxxxx
        collector-type: spanmetrics

  # SPAN_KIND   : CLIENT, SERVER, INTERNAL, PRODUCER, CONSUMER
  # STATUS_CODE : UNSET, OK, ERROR

  #########################
  ### CONNECTORS
  #########################
  connectors:
    # 1. SERVER Metrics
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m

      dimensions_cache_size: 4000
      resource_metrics_cache_size: 4000

      exemplars:
        enabled: false
#        max_per_data_point: 5
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster

#        - name: messaging.kafka.client_id
        # PRODUCER
#        - name: messaging.destination.name
#        - name: messaging.kafka.destination.partition
        # CONSUMER
#        - name: messaging.kafka.consumer.group
#        - name: messaging.kafka.source.partition
#        - name: messaging.operation

      events:
        enabled: false
        dimensions:
          - name: exception.type

      resource_metrics_key_attributes:
        - service.name

  #########################
  ### EXTENSIONS
  #########################
  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
    memory_ballast:
      size_in_percentage: 40

  #########################
  ### SERVICES
  #########################
  service:
    extensions:
      - health_check
      - memory_ballast

    pipelines:
      ############################
      #### OTLP -> SPANMETRICS ###
      ############################
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
        exporters:
          - spanmetrics

      #############################################
      #### KIND: SERVER, CONSUMER, PRODUCER
      #############################################
      metrics/spanmetrics:
        receivers:
          - spanmetrics
        processors:
          - memory_limiter
          - batch
          - filter
        exporters:
          - prometheusremotewrite/spanmetrics

      #############################################
      #### OTEL Monitoring
      #############################################
      metrics/monitormetrics:
        receivers:
          - prometheus
        processors:
          - memory_limiter
          - batch
        exporters:
          - prometheusremotewrite/monitormetrics

Log output

No special logs.

Additional context

I'm disabling spanmetrics settings one by one to figure out what's causing the problem:

  1. disable events
  2. disable exemplars
  3. disable some dimensions

So far, nothing has worked.

pingping95 added the bug and needs triage labels on May 21, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@pingping95
Author

pingping95 commented May 22, 2024

I resolved this issue by removing the consumer- and producer-related dimensions.

I don't know why this issue happens,

so I decided to change the setup like this:

SERVER span metrics -> server-related dimensions only

everything else: drop it all (sketch below)
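A minimal sketch of that plan, using the filter processor to drop every non-server span before it reaches the spanmetrics connector (the kind.string conditions are OTTL; error_mode: ignore just logs and skips OTTL evaluation errors):

  processors:
    filter/server-span-only:
      error_mode: ignore
      traces:
        span:
          # drop every span whose kind is not SERVER
          - 'kind.string == "Unspecified"'
          - 'kind.string == "Internal"'
          - 'kind.string == "Client"'
          - 'kind.string == "Producer"'
          - 'kind.string == "Consumer"'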

@pingping95
Author

This issue is not resolved.

It still occurs.

@pingping95 pingping95 reopened this May 30, 2024
@pingping95
Author

I tried the following, but the issue still occurs.

  1. Apply a filter (span kind: Server only) before the spanmetrics connector.
    -> Only server spans pass through the spanmetrics connector.

  2. Add telemetry.sdk.language and telemetry.sdk.name to resource_metrics_key_attributes.

config:

  #########################
  ### RECEIVER
  #########################
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318

    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888

  #########################
  ### PROCESSORS
  #########################
  processors:
    batch: {}
    memory_limiter:
      check_interval: 2s
      limit_percentage: 70
      spike_limit_percentage: 25

    filter/server-span-only:
      error_mode: ignore
      traces:
        span:
          - 'kind.string == "Unspecified"'
          - 'kind.string == "Internal"'
          - 'kind.string == "Client"'
          - 'kind.string == "Producer"'
          - 'kind.string == "Consumer"'

#    filter/server:
#      metrics:
#        datapoint:
#          - 'attributes["span.kind"] == "SPAN_KIND_CLIENT"'
#          - 'attributes["span.kind"] == "SPAN_KIND_UNSPECIFIED"'
#          - 'attributes["span.kind"] == "SPAN_KIND_INTERNAL"'
#          - 'attributes["span.kind"] == "SPAN_KIND_CONSUMER"'
#          - 'attributes["span.kind"] == "SPAN_KIND_PRODUCER"'


  #########################
  ### EXPORTERS
  #########################
  exporters:

    prometheusremotewrite/spanmetrics-server:
      endpoint: https://xxxx.xxxxx.xxxxx/api/v1/push
      target_info:
        enabled: true
      external_labels:
        cluster: xx-xxxx-xxxx

    prometheusremotewrite/monitormetrics:
      endpoint: https://xxxxx.xxxxx.xxxxx/api/v1/push
      external_labels:
        cluster: xx-xxxx-xxxx
        collector-type: spanmetrics

  # SPAN_KIND   : CLIENT, SERVER, INTERNAL, PRODUCER, CONSUMER
  # STATUS_CODE : UNSET, OK, ERROR

  #########################
  ### CONNECTORS
  #########################
  connectors:
    # 1. SERVER Metrics
    spanmetrics/server:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster

      events:
        enabled: true
        dimensions:
          - name: exception.type

      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name

  #########################
  ### EXTENSIONS
  #########################
  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
    memory_ballast:
      size_in_percentage: 40

  #########################
  ### SERVICES
  #########################
  service:
    extensions:
      - health_check
      - memory_ballast

    pipelines:
      ############################
      #### OTLP -> SPANMETRICS ###
      ############################
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
          - filter/server-span-only
        exporters:
          - spanmetrics/server
#          - spanmetrics/messaging

      #############################################
      #### KIND: SERVER
      #############################################
      metrics/spanmetrics-server:
        receivers:
          - spanmetrics/server
        processors:
          - memory_limiter
          - batch
#          - filter/server
        exporters:
          - prometheusremotewrite/spanmetrics-server

      #############################################
      #### OTEL Monitoring
      #############################################
      metrics/monitormetrics:
        receivers:
          - prometheus
        processors:
          - memory_limiter
          - batch
        exporters:
          - prometheusremotewrite/monitormetrics

image

@pingping95
Author

pingping95 commented May 30, 2024

@portertech @Frapschen

Could you give me any advice?

If you have any additional questions or suspicions, please let me know.

Or is there something I'm setting up incorrectly?

@Frapschen
Contributor

Frapschen commented May 30, 2024

@pingping95 You added http.method to your spanmetrics label set; this dimension can sometimes be high cardinality.

Update your PromQL with sum(...) by (span_kind, cluster, service_name, http_method) to check whether your spanmetrics hit a high-cardinality case.

@pingping95
Author

pingping95 commented May 30, 2024

@Frapschen Thanks for replying!

I didn't add the cluster and service_name labels to by() for security reasons.

sum(rate(calls_total{cluster="xxx-xxxx-xxxx"}[$__rate_interval])) by (span_kind, http_method)

The series for http.method GET and PUT look strange:

(http.method = GET : up to 500k)
image

(http.method = PUT : up to 3k)
image

@ankitpatel96
Contributor

Hi,
Can you give any kind of sample metrics from these series? I understand that the number of unique metrics is growing very fast, but we simply don't have enough information to figure out why new labels keep being created.

If you could provide us a list of ~50 metrics from the normal case (before your spike) and 50ish during the spike, we could help you figure out what the origin of the spike is and whether the span metrics connector is involved in the problem. We would want the metric_name and full set of labels for each metric.

You could definitely anonymize this data however you see fit - but keeping the label keys the same would really help.

@pingping95
Author

pingping95 commented Jun 3, 2024

@ankitpatel96 Hi, thanks for the help!

Thanks to your suggestion, I found something new:

the counter-type metrics do not actually grow exponentially.

When I remove the rate() function, the calls_total metric looks like it drops all at once.

image

image

Is the spanmetrics connector a stateful component?

I use a load-balancing OTEL Collector in front of the spanmetrics connector,

and it is routed by serviceName.

# loadBalancing Collector  --> (serviceName) --> spanMetrics Collector
                           --> (traceID)          --> tailSampling Collector
opentelemetry-collector-spanmetrics-78cb9f7ff-6jrfw     1/1     Running   0          5d3h
opentelemetry-collector-spanmetrics-78cb9f7ff-hwvh9     1/1     Running   0       **17h**
opentelemetry-collector-spanmetrics-78cb9f7ff-rfdz7     1/1     Running   0          **8h**
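
For reference, a minimal sketch of the load-balancing layer described above, assuming the contrib loadbalancing exporter with a (hypothetical) headless Service for the spanmetrics collectors:

  exporters:
    loadbalancing/spanmetrics:
      # route by service name so all spans of one service land on the same replica
      routing_key: service
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          # hypothetical headless Service name
          hostname: opentelemetry-collector-spanmetrics-headless
          port: 4317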

The time when the metrics look strange matches exactly the time when the collectors running the spanmetrics connector were restarted

(8h and 17h ago).

Do I need to run the spanmetrics connector with a PVC?

Anyway, I found the cause thanks to your help. Thank you.

@pingping95
Author

pingping95 commented Jun 3, 2024

I'm going to add a new label that increases cardinality (for example, a pod_id label on the calls_total metric).

To check whether the issue still exists, I will restart the collector pod.

If the same phenomenon does not occur after restarting, I think I can conclude that the issue is not a problem with the spanmetrics connector, but rather the absence of a cardinality-increasing label in the exporter component (see the sketch below).

https://grafana.com/docs/grafana-cloud/monitor-applications/application-observability/setup/scaling/

image
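
For example, a minimal sketch of that idea, assuming the pod name is injected as MY_POD_NAME via the Kubernetes downward API, so each collector replica writes its own distinct series:

  exporters:
    prometheusremotewrite/spanmetrics:
      endpoint: https://mimir.xxxxx.xxx/api/v1/push
      target_info:
        enabled: true
      external_labels:
        cluster: xxxxxxxx
        # hypothetical per-replica label; keeps cumulative counters from
        # different collector pods from colliding after a restart
        collector_id: ${env:MY_POD_NAME}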

@pingping95
Author

I'll close this issue.

If it occurs again, I'll reopen it.

Thanks.
