
Filter processor metrics are dropped randomly #36900

Closed
afreyermuth98 opened this issue Dec 19, 2024 · 8 comments
Labels
processor/filter Filter processor question Further information is requested

Comments

@afreyermuth98

Component(s)

internal/filter

What happened?

Description

I'm using the filter processor to drop some metrics in my OTel configuration, but some of them are not dropped, for no apparent reason.

Steps to Reproduce

Use this filter in your OTel config:

filter/drop_metrics:
  error_mode: ignore
  metrics:
    metric:
      - 'name == "METRIC1"'
      - 'name == "METRIC2"'
      - ...

Expected Result

METRIC1, METRIC2, and all the other listed metrics should no longer be queryable.

Actual Result

METRIC1, for example, is gone as expected, but METRICN is still present.

Collector version

0.111.0

Environment information

Environment

OS: Amazon Linux 2, ARM64
Container Runtime: containerd://1.7.23

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@afreyermuth98 afreyermuth98 added bug Something isn't working needs triage New item requiring triage labels Dec 19, 2024
Contributor

Pinging code owners:

  • internal/filter: @open-telemetry/collector-approvers

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bacherfl bacherfl added processor/filter Filter processor and removed internal/filter labels Dec 19, 2024
Contributor

Pinging code owners for processor/filter: @TylerHelmuth @boostchicken. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

@bacherfl
Contributor

Hi @afreyermuth98! Quick question to clarify: Is it consistently the same metric that fails to be filtered (e.g. METRIC2 is never filtered), or is the filtering happening randomly for each metric (e.g. some samples of METRIC2 are filtered, while others are not)?
To help troubleshoot this, can you please provide the full configuration of your collector (without any sensitive data, of course)?

@afreyermuth98
Author

Hey @bacherfl. Yes, it's always the same ones. I've reduced the config to two metrics: one that is filtered correctly and one that never is. It's really hard to understand.

    exporters:
      debug: {}
      prometheusremotewrite/org:
        endpoint: ENDPOINT
    processors:
      batch/common:
        send_batch_max_size: 8192
        send_batch_size: 8192
        timeout: 2s
      filter/drop_metrics:
        error_mode: ignore
        metrics:
          metric: 
            - 'name == "karpenter_cloudprovider_instance_type_offering_available"'
            - 'name == "cilium_agent_api_process_time_seconds_bucket"'
    receivers:
      prometheus/common:
        config:
          global:
            evaluation_interval: 10s
            scrape_interval: 15s
            scrape_timeout: 10s
          scrape_configs:
          - honor_labels: true
            honor_timestamps: true
            job_name: kubernetes-service-endpoints
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - field: spec.nodeName=${KUBE_NODE_NAME}
                role: pod
            metric_relabel_configs:
            metrics_path: /metrics
            relabel_configs:
            - action: keep
              regex: true|"true"
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: k8s_namespace_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: k8s_pod_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: kubernetes_node
            scheme: http
          - job_name: kubernetes-pods
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - action: keep
              regex: true|"true"
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: k8s_namespace_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: k8s_pod_name
            scrape_interval: 60s
            scrape_timeout: 50s
          
    service:
      extensions:
      - health_check
      pipelines:
        metrics:
          exporters:
          - prometheusremotewrite/org
          processors:
          - memory_limiter
          - filter/drop_metrics
          receivers:
          - prometheus/common

I kept only the metrics part of the config, and I kept the real names of the metrics that do and do not work.
Btw, I also tried to drop the metrics at the scraping step using:

metric_relabel_configs:
  - action: drop
    regex: "METRICNAME"
    source_labels:
      - __name__

And it failed in the same way, on the same metrics that didn't work with the filter processor ...
Nothing is shown in the logs, even in debug mode. The filter processor has the same behavior with the karpenter and the cilium metrics. I really need help here.
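
(Note: a drop at scrape time has to match the exact series names being scraped. For a histogram there is no series with the bare metric name; the scraped series carry the _bucket, _sum and _count suffixes. A hedged sketch only, reusing the METRICNAME placeholder:)

    metric_relabel_configs:
      - action: drop
        # suffix group is optional so plain counters/gauges also match
        regex: "METRICNAME(_bucket|_sum|_count)?"
        source_labels:
          - __name__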

@bacherfl
Contributor

Thanks for the additional information @afreyermuth98 - can you try to remove the _bucket suffix from the filter expression? As this metric seems to be a histogram, the actual name of the metric forwarded to the processor by the prometheus receiver should be just cilium_agent_api_process_time_seconds.

Therefore, please try to change the filter expressions to the following:

      filter/drop_metrics:
        error_mode: ignore
        metrics:
          metric: 
            - 'name == "karpenter_cloudprovider_instance_type_offering_available"'
            - 'name == "cilium_agent_api_process_time_seconds"' # removed the _bucket suffix
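
If the exact base name produced by the receiver is uncertain, a regex condition can be used instead. This is only a sketch using the OTTL IsMatch converter; the pattern below is illustrative and was not verified in this thread:

      filter/drop_metrics:
        error_mode: ignore
        metrics:
          metric:
            # drops any metric whose name starts with the given prefix
            - 'IsMatch(name, "^cilium_agent_api_process_time_seconds")'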

@afreyermuth98
Author

It worked, thanks a lot @bacherfl !!! 😍
How can we know if a metric is a histogram or not?

@bacherfl
Contributor

That's good news, glad to hear it worked!
Since the metrics were received by the prometheus receiver, the first place to check the type of a metric is the payload of the Prometheus endpoint the collector is scraping. Above each metric there should be a line in the format # TYPE <metric_name> <metric_type>, followed by the actual metric data, for example:

# TYPE my_histogram histogram
my_histogram_bucket{label="example",le="0.1"} 0
my_histogram_bucket{label="example",le="1"} 1
my_histogram_bucket{label="example",le="10"} 2
my_histogram_sum{label="example"} 3.14
my_histogram_count{label="example"} 3

On the collector side, if you would like to know the metric type, you can use the debug exporter with the verbosity set to detailed:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otelcol'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8080']

processors:

exporters:
  debug:
    verbosity: detailed

extensions:
  health_check:

service:
  extensions: [health_check]
  telemetry:
    logs:
      level: "info"
    metrics:
      address: "0.0.0.0:8888"
  pipelines:
    metrics:
      receivers: [otlp,prometheus]
      processors: []
      exporters: [debug]

This will print all received metrics to the console, where you can inspect a metric's type via the DataType field in its Descriptor, for example:

Metric #4
Descriptor:
     -> Name: my_histogram
     -> Description: 
     -> Unit: 
     -> DataType: Histogram 
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> label: Str(example)
StartTimestamp: 2024-12-20 06:01:57.389 +0000 UTC
Timestamp: 2024-12-20 06:02:17.389 +0000 UTC
Count: 3
Sum: 3.140000
ExplicitBounds #0: 0.100000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 10.000000
Buckets #0, Count: 0
Buckets #1, Count: 1
Buckets #2, Count: 1
Buckets #3, Count: 1

I will close this issue as the filter is working as expected now, but if you have any further questions, feel free to reach out!

@bacherfl bacherfl added question Further information is requested and removed bug Something isn't working needs triage New item requiring triage labels Dec 20, 2024
@afreyermuth98
Author

afreyermuth98 commented Dec 30, 2024

Hello @bacherfl !
I'm now dropping my metrics correctly, except for one job: the cadvisor one. It seems impossible to drop, for no apparent reason. Have you already run into that?
Here are the metrics: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md
