
Filter processor metrics are dropped randomly #36900

Closed
afreyermuth98 opened this issue Dec 19, 2024 · 8 comments
Labels
processor/filter Filter processor question Further information is requested

Comments

@afreyermuth98

Component(s)

internal/filter

What happened?

Description

I'm using the filter processor to drop some metrics in my OTel configuration, but some of them are not dropped, for no apparent reason.

Steps to Reproduce

Use this filter in your OTel config:

filter/drop_metrics:
  error_mode: ignore
  metrics:
    metric:
      - 'name == "METRIC1"'
      - 'name == "METRIC2"'
      - ...

Expected Result

METRIC1, METRIC2, and all the other listed metrics should no longer be queryable.

Actual Result

METRIC1, for example, is gone as expected, but METRICN is still present.

Collector version

0.111.0

Environment information

Environment

OS: Amazon Linux 2, ARM64
Container Runtime: containerd://1.7.23

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@afreyermuth98 afreyermuth98 added bug Something isn't working needs triage New item requiring triage labels Dec 19, 2024
Contributor

Pinging code owners:

  • internal/filter: @open-telemetry/collector-approvers

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bacherfl bacherfl added processor/filter Filter processor and removed internal/filter labels Dec 19, 2024
Contributor

Pinging code owners for processor/filter: @TylerHelmuth @boostchicken. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

@bacherfl
Contributor

Hi @afreyermuth98! Quick question to clarify: Is it consistently the same metric that fails to be filtered (e.g. METRIC2 is never filtered), or is the filtering happening randomly for each metric (e.g. some samples of METRIC2 are filtered, while others are not)?
To help troubleshoot this, can you please provide the full configuration of your collector (without any sensitive data, of course)?

@afreyermuth98
Author

Hey @bacherfl. Yes, it's always the same ones. I've reduced the config to two metrics: one that is filtered correctly and one that never is. It's really hard to understand.

    exporters:
      debug: {}
      prometheusremotewrite/org:
        endpoint: ENDPOINT
    processors:
      batch/common:
        send_batch_max_size: 8192
        send_batch_size: 8192
        timeout: 2s
      filter/drop_metrics:
        error_mode: ignore
        metrics:
          metric: 
            - 'name == "karpenter_cloudprovider_instance_type_offering_available"'
            - 'name == "cilium_agent_api_process_time_seconds_bucket"'
    receivers:
      prometheus/common:
        config:
          global:
            evaluation_interval: 10s
            scrape_interval: 15s
            scrape_timeout: 10s
          scrape_configs:
          - honor_labels: true
            honor_timestamps: true
            job_name: kubernetes-service-endpoints
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - field: spec.nodeName=${KUBE_NODE_NAME}
                role: pod
            metric_relabel_configs:
            metrics_path: /metrics
            relabel_configs:
            - action: keep
              regex: true|"true"
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: k8s_namespace_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: k8s_pod_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: kubernetes_node
            scheme: http
          - job_name: kubernetes-pods
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - action: keep
              regex: true|"true"
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: k8s_namespace_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: k8s_pod_name
            scrape_interval: 60s
            scrape_timeout: 50s
          
    service:
      extensions:
      - health_check
      pipelines:
        metrics:
          exporters:
          - prometheusremotewrite/org
          processors:
          - memory_limiter
          - filter/drop_metrics
          receivers:
          - prometheus/common

I kept only the metrics part of the config, and I kept the real names of the metrics that do and do not work.
Btw, I also tried to drop the metrics at the scraping step using:

metric_relabel_configs:
  - action: drop
    regex: "METRICNAME"
    source_labels:
      - __name__

And it failed in the same way, on the same metrics that didn't work with the filter processor ...
Nothing is shown in the logs, even in debug mode. The filter processor has the same behavior with the karpenter and the cilium metrics. I really need help here.
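
(Note: a drop at scrape time has to match the exact series names being scraped. For a histogram there is no series with the bare metric name; the scraped series carry the _bucket, _sum and _count suffixes. A hedged sketch only, reusing the METRICNAME placeholder:)

    metric_relabel_configs:
      - action: drop
        # suffix group is optional so plain counters/gauges also match
        regex: "METRICNAME(_bucket|_sum|_count)?"
        source_labels:
          - __name__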

@bacherfl
Contributor

Thanks for the additional information @afreyermuth98 - can you try to remove the _bucket suffix from the filter expression? As this metric seems to be a histogram, the actual name of the metric forwarded to the processor by the prometheus receiver should be just cilium_agent_api_process_time_seconds.

Therefore, please try to change the filter expressions to the following:

      filter/drop_metrics:
        error_mode: ignore
        metrics:
          metric: 
            - 'name == "karpenter_cloudprovider_instance_type_offering_available"'
            - 'name == "cilium_agent_api_process_time_seconds"' # removed the _bucket suffix
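
If the exact base name produced by the receiver is uncertain, a regex condition can be used instead. This is only a sketch using the OTTL IsMatch converter; the pattern below is illustrative and was not verified in this thread:

      filter/drop_metrics:
        error_mode: ignore
        metrics:
          metric:
            # drops any metric whose name starts with the given prefix
            - 'IsMatch(name, "^cilium_agent_api_process_time_seconds")'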

@afreyermuth98
Author

It worked, thanks a lot @bacherfl !!! 😍
How can we know if a metric is a histogram or not?

@bacherfl
Contributor

That's good news, glad to hear it worked!
Since the metrics were received by the prometheus receiver, the first place to check the type of a metric is the payload of the Prometheus endpoint the collector is scraping. Above each metric there should be a line in the format # TYPE <metric_name> <metric_type>, followed by the actual metric data, for example:

# TYPE my_histogram histogram
my_histogram_bucket{label="example",le="0.1"} 0
my_histogram_bucket{label="example",le="1"} 1
my_histogram_bucket{label="example",le="10"} 2
my_histogram_sum{label="example"} 3.14
my_histogram_count{label="example"} 3

On the collector side, if you would like to know the metric type, you can use the debug exporter with the verbosity set to detailed:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otelcol'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8080']

processors:

exporters:
  debug:
    verbosity: detailed

extensions:
  health_check:

service:
  extensions: [health_check]
  telemetry:
    logs:
      level: "info"
    metrics:
      address: "0.0.0.0:8888"
  pipelines:
    metrics:
      receivers: [otlp,prometheus]
      processors: []
      exporters: [debug]

This will print all received metrics to the console, where you can inspect a metric's type via the DataType field in its Descriptor, for example:

Metric #4
Descriptor:
     -> Name: my_histogram
     -> Description: 
     -> Unit: 
     -> DataType: Histogram 
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> label: Str(example)
StartTimestamp: 2024-12-20 06:01:57.389 +0000 UTC
Timestamp: 2024-12-20 06:02:17.389 +0000 UTC
Count: 3
Sum: 3.140000
ExplicitBounds #0: 0.100000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 10.000000
Buckets #0, Count: 0
Buckets #1, Count: 1
Buckets #2, Count: 1
Buckets #3, Count: 1

I will close this issue as the filter is working as expected now, but if you have any further questions, feel free to reach out!

@bacherfl bacherfl added question Further information is requested and removed bug Something isn't working needs triage New item requiring triage labels Dec 20, 2024
@afreyermuth98
Author

afreyermuth98 commented Dec 30, 2024

Hello @bacherfl !
I'm now dropping my metrics correctly, except for one job: the cadvisor one. It seems impossible to drop, for no apparent reason. Have you already run into that?
Here are the metrics: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md
