spanmetricsprocessor doesn't prune histograms when metric cache is pruned #27080

Closed
nijave opened this issue Sep 22, 2023 · 6 comments
Labels: bug (Something isn't working), priority:p1 (High), processor/spanmetrics (Span Metrics processor)

nijave (Contributor) commented Sep 22, 2023

Component(s)

processor/spanmetrics

What happened?

Description

The span metrics processor doesn't drop old histograms.
Graphs are in grafana/agent#5271.

Steps to Reproduce

Leave the collector running for a while and watch the exported metric count grow indefinitely.

Expected Result

Metric series should be pruned if they haven't been updated in a while.

Actual Result

The metric series dimension cache is pruned, but the histograms are not.

Collector version

v0.80.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

Config is automatically generated by Grafana Agent. See https://github.com/grafana/agent/blob/main/pkg/traces/config.go#L647

Log output

N/A

Additional context

It looks like the histograms map should have been pruned/LRU'd in addition to metricsKeyToDimensions #2179

I think #17306 (comment) is the same or similar, but it's closed, so I figured I'd collect everything into a bug report.
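Not the actual collector code, but a minimal Go sketch of that idea: key the dimensions cache and the histograms map by the same metric key, and use an LRU eviction callback so that evicting a key also drops its histogram. The pruningStore, metricKey, and histogramData names are hypothetical, and github.com/hashicorp/golang-lru stands in for whatever cache the processor actually uses.

// Minimal, hypothetical sketch (not the collector's implementation):
// both structures are keyed by the same metric key, and the LRU's
// eviction callback keeps them in sync.
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

type metricKey string

type histogramData struct {
	count        uint64
	sum          float64
	bucketCounts []uint64
}

type pruningStore struct {
	keyToDimensions *lru.Cache                   // bounded; evicts least recently used keys
	histograms      map[metricKey]*histogramData // must shrink together with the cache
}

func newPruningStore(size int) (*pruningStore, error) {
	s := &pruningStore{histograms: map[metricKey]*histogramData{}}
	cache, err := lru.NewWithEvict(size, func(key, _ interface{}) {
		// Without this callback the histograms map grows without bound,
		// which is the behavior reported in this issue.
		delete(s.histograms, key.(metricKey))
	})
	if err != nil {
		return nil, err
	}
	s.keyToDimensions = cache
	return s, nil
}

func (s *pruningStore) record(k metricKey, latencyMs float64) {
	s.keyToDimensions.Add(k, struct{}{}) // may evict an old key and prune its histogram
	h, ok := s.histograms[k]
	if !ok {
		h = &histogramData{bucketCounts: make([]uint64, 4)}
		s.histograms[k] = h
	}
	h.count++
	h.sum += latencyMs
}

func main() {
	s, _ := newPruningStore(2)
	for _, k := range []metricKey{"a", "b", "c"} {
		s.record(k, 1.5)
	}
	// "a" was evicted from the dimensions cache, so its histogram is gone too.
	fmt.Println(len(s.histograms)) // 2
}

Without the eviction hook, the histograms map keeps every series ever seen, which matches the unbounded growth described above.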

nijave added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Sep 22, 2023
github-actions bot added the processor/spanmetrics (Span Metrics processor) label on Sep 22, 2023
github-actions bot commented:
Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

nijave (Contributor, Author) commented Sep 23, 2023

Took a crack at a PR #27083

Frapschen removed the needs triage (New item requiring triage) label on Sep 25, 2023
mfilipe commented Sep 25, 2023

Hello @nijave, I can confirm the issue in my environment:

[Screenshot 2023-09-25 at 17:48:46]

As you can see, the metrics exposed at /metrics only grow over time.

There are two details about your issue that don't match my environment: I'm using spanmetricsconnector and v0.85.0. Could you consider using those in your work? spanmetricsprocessor is deprecated.

mfilipe commented Sep 25, 2023

My current config:

receivers:
  otlp:
    protocols:
      grpc:
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    metric_expiration: 60s
connectors:
  spanmetrics:
    histogram:
      unit: "ms"
      explicit:
        buckets: []
    metrics_flush_interval: 15s
    dimensions:
      - name: build_name
      - name: build_number
    exclude_dimensions:
      - span.kind
    dimensions_cache_size: 100
processors:
  batch:
  attributes/spanmetrics:
    actions:
      - action: extract
        key: host.name
        pattern: ^(?P<kubernetes_cluster>.+)-jenkins-(?P<organization>tantofaz|whatever-org)-(?P<build_name>.+)-(?P<build_number>[0-9]+)(?P<build_id>(?:-[^-]+){2}|--.*?)$
  filter/spanmetrics:
    error_mode: ignore
    metrics:
      metric:
        - 'resource.attributes["service.name"] != "jenkins"'
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/spanmetrics, batch]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      processors: [filter/spanmetrics]
      exporters: [prometheus]

mfilipe commented Sep 29, 2023

This issue should be considered critical, since /metrics grows to the point that scraping it makes the metrics backend unstable, and there isn't a workaround. Basically, /metrics starts at a few kilobytes and, after a few days, reaches hundreds of megabytes, making the backend unstable. The environment where I have the problem is common: the spanmetrics connector saving metrics to Prometheus, with many repositories generating metrics.

Based on this article, Prometheus only supports cumulative metrics, so I cannot use delta metrics to avoid the issue.

If there is a workaround for this issue, please let me know. AFAIK no workaround exists, which makes this issue critical.

MovieStoreGuy added a commit that referenced this issue Oct 4, 2023
Prune histograms when dimension cache entries are evicted

**Description:**
Prunes histograms when the dimension cache is pruned. This prevents
metric series from growing indefinitely.

**Link to tracking Issue:**
 #27080

**Testing:**
I modified the existing test to check `histograms` length instead of
dimensions cache length. This required simulating ticks to hit the
exportMetrics function.

Co-authored-by: Sean Marciniak <30928402+MovieStoreGuy@users.noreply.github.com>
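For illustration only, here is a hedged sketch of that test idea using the hypothetical pruningStore from the earlier sketch rather than the real connector internals: record more distinct series than the cache can hold, as many flush ticks eventually would, and assert that the histograms map stays bounded.

package main

import "testing"

func TestHistogramsPrunedWithDimensionsCache(t *testing.T) {
	const cacheSize = 2
	s, err := newPruningStore(cacheSize)
	if err != nil {
		t.Fatal(err)
	}
	// Simulate many distinct series accumulating over time (the real test
	// drives ticks into exportMetrics; here we just record directly).
	for _, k := range []metricKey{"a", "b", "c", "d", "e"} {
		s.record(k, 1.0)
	}
	if got := len(s.histograms); got > cacheSize {
		t.Fatalf("histograms not pruned: got %d entries, want at most %d", got, cacheSize)
	}
}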
nijave added a commit to nijave/opentelemetry-collector-contrib that referenced this issue Oct 4, 2023
crobert-1 (Member) commented Oct 12, 2023

@nijave From #27083 and your results shared here, it looks like this has been fixed. Is that correct? If so, we can close this issue.

Thanks for your help here!

nijave closed this as completed Oct 13, 2023
jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023