Some metric names cannot be matched with regex #34376

Closed
wilstdu opened this issue Aug 1, 2024 · 3 comments


wilstdu commented Aug 1, 2024

Component(s)

receiver/prometheus

What happened?

Description

The OpenTelemetry Collector is deployed as a DaemonSet on an AWS EKS cluster and uses the TargetAllocator to discover metrics endpoints from ServiceMonitors.

A shortened list of the metrics I'm trying to ingest:

# HELP tekton_pipelines_controller_pipelinerun_duration_seconds The pipelinerun execution time in seconds
# TYPE tekton_pipelines_controller_pipelinerun_duration_seconds histogram
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",le="43200"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",le="86400"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",le="+Inf"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_sum{namespace="tekton-verification",pipeline="tekton-verification",status="success"} 13.087762487
tekton_pipelines_controller_pipelinerun_duration_seconds_count{namespace="tekton-verification",pipeline="tekton-verification",status="success"} 1

# HELP tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds The pipelinerun's taskrun execution time in seconds
# TYPE tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds histogram
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous",le="43200"} 1
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous",le="86400"} 1
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous",le="+Inf"} 1
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous"} 13.06821713
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_count{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous"} 1

ServiceMonitor configuration used to whitelist specific metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-controller
spec:
  endpoints:
    - honorLabels: true
      metricRelabelings:
        - action: keep
          regex: (tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum|tekton_pipelines_controller_pipelinerun_duration_seconds_sum)
          sourceLabels:
            - __name__
        - action: replace
          replacement: 'true'
          targetLabel: cx_ingest
      path: /metrics
      port: http-metrics
      scheme: http
  namespaceSelector:
    matchNames:
      - tekton-pipelines
  selector:
    matchLabels:
      app: tekton-pipelines-controller

There are no additional filters in the OTel agent configuration; it's a direct passthrough of the scrape_configs created by the TargetAllocator.

Expected Result

Both tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum and tekton_pipelines_controller_pipelinerun_duration_seconds_sum are ingested, and all other metrics are discarded.

Actual Result

tekton_pipelines_controller_pipelinerun_duration_seconds_sum is ingested, but tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum is not.
If I change the regex to tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_(.*), then the _sum, _bucket, and _count series of that metric are all ingested.

Using the wildcard to keep a broader set of series and then additionally dropping the _bucket and _count series also doesn't work (see the sketch below).

It may be related to the length of the metric name, since adding a wildcard at the end allows the metric to be ingested. An additional observation: none of the metrics with names longer than tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum can be whitelisted without a wildcard at the end.
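
For reference, the wildcard keep plus drop variant I tried looked roughly like this (a simplified sketch, not the exact manifest):

metricRelabelings:
  - action: keep
    # keep every series of the taskrun histogram
    regex: tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_(.*)
    sourceLabels:
      - __name__
  - action: drop
    # then drop the _bucket and _count series again
    regex: .*_(bucket|count)
    sourceLabels:
      - __name__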

Collector version

otel/opentelemetry-collector-contrib:0.96.0

Environment information

Environment

Cloud: AWS EKS, DaemonSet

OpenTelemetry Collector configuration

!!! This is a shortened version of the actual config with only the relevant parts !!!

exporters:
  coralogix:
  debug: {}
  logging: {}
extensions:
processors:
  attributes/service_monitors:
    actions:
    - action: delete
      key: cx_ingest
  filter/reducer:
    metrics:
      include:
        expressions:
        - Label("cx_ingest") == "true"
        - MetricName == "system.cpu.time"
        - MetricName == "system.memory.usage"
        - MetricName == "system.disk.io"
        - MetricName == "system.network.io"
        - MetricName == "k8s.pod.cpu.time"
        - MetricName == "k8s.pod.cpu.utilization"
        - MetricName == "k8s.pod.network.io"
        - MetricName == "k8s.pod.memory.usage"
        - MetricName == "k8s.node.cpu.utilization"
        - MetricName == "container.cpu.utilization"
        - MetricName == "container.cpu.time"
        - MetricName == "k8s.node.network.io"
        - MetricName == "k8s.node.filesystem.available"
        - MetricName == "container.memory.usage"
        match_type: expr
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 30s
        static_configs:
        - targets:
          - ${MY_POD_IP}:8888
    target_allocator:
      collector_id: ${MY_POD_NAME}
      endpoint: http://coralogix-opentelemetry-targetallocator
      interval: 30s
service:
  extensions:
  pipelines:
    metrics:
      exporters:
      - coralogix
      processors:
      - filter/reducer
      - attributes/service_monitors
      receivers:
      - prometheus
  telemetry:
    logs:
      encoding: json
      level: 'warn'
    metrics:
      address: ${MY_POD_IP}:8888
    resource:
    - service.instance.id: null
    - service.name: null

Log output

No response

Additional context

No response

@wilstdu wilstdu added bug Something isn't working needs triage New item requiring triage labels Aug 1, 2024
@github-actions github-actions bot added the receiver/prometheus Prometheus receiver label Aug 1, 2024

github-actions bot commented Aug 1, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole dashpole self-assigned this Aug 1, 2024
@dashpole dashpole removed the needs triage New item requiring triage label Aug 1, 2024

dashpole commented Aug 1, 2024

First, note that you have "keep" for the action, so the other series should be discarded, and the ones that match the regex will be kept.

These are histogram metrics, so the resulting metric should be named tekton_pipelines_controller_pipelinerun_duration_seconds or tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds.

If you only keep the _sum series, the collector may drop your histogram entirely, as it won't be a valid Histogram. I haven't tested it, but you might get strange behavior doing this.

It shouldn't have anything to do with the length of the regex or the length of the metric name.
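
For illustration, a keep rule that preserves every series of both histogram families might look roughly like this (untested sketch):

metricRelabelings:
  - action: keep
    # matches the _bucket, _sum, and _count series of both histograms
    regex: tekton_pipelines_controller_pipelinerun(_taskrun)?_duration_seconds_(bucket|sum|count)
    sourceLabels:
      - __name__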


wilstdu commented Aug 2, 2024

@dashpole thank you for the explanation.

I figured out what was different in my regex between the two histogram metrics: for tekton_pipelines_controller_pipelinerun_duration_seconds I had whitelisted both _sum and _count. When I did the same for tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds, it worked.

So to sum it up: to ingest a histogram metric you need to whitelist both the _sum and _count series, otherwise the metric will be rejected.

This resolved the case I was trying to solve, because the _bucket series were the actual problem: they produced lots of data that wasn't used.
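
For completeness, a simplified sketch of the kind of keep rule that works for me (not my exact regex, which lists the series explicitly):

metricRelabelings:
  - action: keep
    # keep _sum and _count for both histograms, but not _bucket
    regex: tekton_pipelines_controller_pipelinerun(_taskrun)?_duration_seconds_(sum|count)
    sourceLabels:
      - __name__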
