prometheusremotewriteexporter logs spurious errors with WAL configured and no metrics to be sent #24399

Open
gburek-fastly opened this issue Jul 19, 2023 · 10 comments

Comments

@gburek-fastly

Component(s)

exporter/prometheusremotewrite

What happened?

Description

When the prometheusremotewriteexporter is configured with a WAL directory and is receiving no metrics to export, it emits "error processing WAL entries" error logs even though there are, in fact, no entries to process. The exporter should fall back to waiting on an fsnotify.NewWatcher() for new WAL writes, but it does not.
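
For illustration, the fallback I have in mind looks roughly like the sketch below. This is not the exporter's actual code: it assumes the WAL is a tidwall/wal log (names like readNextEntry are hypothetical) and uses fsnotify to wait for writes instead of treating an empty WAL as an error.

// Rough sketch only, not the exporter's actual code. Assumes the WAL is a
// github.com/tidwall/wal log and walDir is the directory the exporter writes to.
package walsketch

import (
	"errors"

	"github.com/fsnotify/fsnotify"
	"github.com/tidwall/wal"
)

// readNextEntry (hypothetical) reads the entry at index. If the WAL has nothing
// at that index yet (wal.ErrNotFound), it blocks on fsnotify events for the WAL
// directory instead of surfacing the empty WAL as an error.
func readNextEntry(w *wal.Log, walDir string, index uint64) ([]byte, error) {
	for {
		data, err := w.Read(index)
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, wal.ErrNotFound) {
			return nil, err // a real error: surface it
		}

		// Nothing written yet: wait for the WAL directory to change
		// rather than logging an error and spinning.
		watcher, werr := fsnotify.NewWatcher()
		if werr != nil {
			return nil, werr
		}
		if werr := watcher.Add(walDir); werr != nil {
			watcher.Close()
			return nil, werr
		}
		select {
		case <-watcher.Events:
			// something changed in the WAL directory; retry the read
		case werr := <-watcher.Errors:
			watcher.Close()
			return nil, werr
		}
		watcher.Close()
	}
}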

Steps to Reproduce

Run otelcol-contrib with the included config and send it no metrics.

Expected Result

no errors and no error logs

Actual Result

Error logs are emitted roughly every 4 seconds (see log output below).

Collector version

v0.81.0

Environment information

Environment

OS: osx, colima, docker
docker run --rm --mount type=bind,source=$(pwd),target=/etc/otelcol/ otel/opentelemetry-collector-contrib:latest --config /etc/otelcol/config.yaml

OpenTelemetry Collector configuration

exporters:
  prometheusremotewrite:
    endpoint: "http://host.docker.internal:9090/api/v1/push"
    wal:
      directory: /etc/otelcol/
extensions:
  health_check: {}
  memory_ballast:
    size_in_percentage: 40
processors:
  batch: {}
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
service:
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    metrics:
      exporters:
      - prometheusremotewrite
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp

Log output

2023-07-19T23:00:48.424Z        info    service/service.go:158  Everything is ready. Begin running and processing data.
2023-07-19T23:00:52.535Z        error   prw.wal prometheusremotewriteexporter@v0.77.0/wal.go:182        error processing WAL entries    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "not found"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).run.func1
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter@v0.77.0/wal.go:182
2023-07-19T23:00:56.659Z        error   prw.wal prometheusremotewriteexporter@v0.77.0/wal.go:182        error processing WAL entries    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "not found"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).run.func1
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter@v0.77.0/wal.go:182
2023-07-19T23:01:00.792Z        error   prw.wal prometheusremotewriteexporter@v0.77.0/wal.go:182        error processing WAL entries    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "not found"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).run.func1
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter@v0.77.0/wal.go:182

Additional context

It appears that the error check in *prweWAL.readPrompbFromWAL for wal.ErrNotFound is not catching it correctly, as the "error": "not found" text is bubbling up into the "error processing WAL entries" error log.

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/wal.go#L358
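
Either that inner check or the run loop that emits the log line could swallow this benign case. For illustration only (hypothetical names, not the actual source; the error value is assumed to come from github.com/tidwall/wal and logging uses zap, as in the collector), filtering it where the error is logged might look like:

// Sketch only: swallow the benign "not found" instead of logging it as an
// error. processWALOnce is a hypothetical stand-in for whatever the exporter's
// WAL loop actually runs between the log lines above.
package walsketch

import (
	"context"
	"errors"
	"time"

	"github.com/tidwall/wal"
	"go.uber.org/zap"
)

func runWALLoop(ctx context.Context, logger *zap.Logger, processWALOnce func(context.Context) error) {
	for ctx.Err() == nil {
		if err := processWALOnce(ctx); err != nil {
			if errors.Is(err, wal.ErrNotFound) {
				// Empty WAL: nothing to export yet, wait quietly instead
				// of emitting "error processing WAL entries".
				time.Sleep(time.Second)
				continue
			}
			logger.Error("error processing WAL entries", zap.Error(err))
		}
	}
}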

@gburek-fastly gburek-fastly added the bug (Something isn't working) and needs triage (New item requiring triage) labels Jul 19, 2023
@github-actions

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

  • exporter/prometheusremotewrite: @Aneurysm9 @rapphil
  • needs: Github issue template generation code needs this to generate the corresponding labels.

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1

crobert-1 commented Oct 24, 2023

I was able to reproduce and debug this locally with the shared config.

The error is actually being caught as expected; it's just that after 12 retries the read gives up and returns the err variable, which is simply the last error seen.

After every read attempt there is a sleep call whose duration increases with each failure. This is where the ~4 second delay comes from: once all of the retries are exhausted, the loop starts over from the beginning, disregarding the default (or configured) wal.TruncateFrequency.
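
To show the shape of what I mean (a rough sketch: the backoff constants below are made up for illustration and are not the exporter's actual values), twelve attempts with a backoff that grows by about 50ms per attempt add up to roughly 3.9 seconds per cycle, which lines up with the ~4 second gap between the log lines in the report:

// Rough sketch of the retry shape described above. The backoff constants are
// made up for illustration; only the overall behavior matters here.
package walsketch

import (
	"errors"
	"time"

	"github.com/tidwall/wal"
)

const maxRetries = 12

// readWithRetry keeps re-reading the same index. If the WAL stays empty, every
// attempt fails with wal.ErrNotFound, and after the final attempt that same
// last error is returned to the caller, which then logs
// "error processing WAL entries".
func readWithRetry(w *wal.Log, index uint64) ([]byte, error) {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		var data []byte
		data, err = w.Read(index)
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, wal.ErrNotFound) {
			return nil, err
		}
		// Illustrative backoff: 50ms, 100ms, ..., 600ms.
		// Sum over 12 attempts ≈ 3.9s, roughly the interval between the
		// error log lines above.
		time.Sleep(time.Duration(attempt) * 50 * time.Millisecond)
	}
	return nil, err // the last error seen, i.e. wal.ErrNotFound
}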

I agree this is annoying log spam, but I don't have enough expertise here to say confidently what to do about it. I think it's a good idea in most cases to log an error, since you'd theoretically expect to be getting data if you're running the collector. Four seconds without data in a regular use case seems like a problem, so I think logging is the right thing to do.

I'll defer to the code owners, but I think at most we would decrease the log level to WARN; I don't think we would want to ignore this kind of error entirely (unless I'm missing something).

Apologies for the delayed response!

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Oct 24, 2023
@github-actions github-actions bot removed the Stale label Oct 25, 2023
@github-actions

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Dec 25, 2023
@crobert-1 crobert-1 removed the Stale label Jan 2, 2024
@toughnoah

@crobert-1 However, when I configure the WAL, no metrics are sent at all, and all I see is this error log. I think this is not merely a logging problem.

@github-actions

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Mar 25, 2024
@crobert-1 crobert-1 removed the Stale label Mar 25, 2024
@blockloop

> @crobert-1 However, when I configure the WAL, no metrics are sent at all, and all I see is this error log. I think this is not merely a logging problem.

I can confirm the same problem. No metrics are sent and I see this "not found" log message.
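
To help separate "no data ever reaches the WAL" from "data is written but never read back", a quick look at the on-disk WAL index range can tell which case you are in. This is a diagnostic sketch under the assumption that the exporter's WAL is a tidwall/wal log stored in a subdirectory of the configured wal.directory; the exact path below is a guess, so point walPath at whatever actually gets created there:

// Diagnostic sketch: print the index range of the on-disk WAL to see whether
// the exporter ever persisted anything. walPath is an assumption; adjust it to
// the directory the exporter actually created under wal.directory.
package main

import (
	"fmt"
	"log"

	"github.com/tidwall/wal"
)

func main() {
	walPath := "/etc/otelcol/prom_remotewrite" // assumed location, adjust as needed

	w, err := wal.Open(walPath, nil)
	if err != nil {
		log.Fatalf("open WAL: %v", err)
	}
	defer w.Close()

	first, err := w.FirstIndex()
	if err != nil {
		log.Fatalf("first index: %v", err)
	}
	last, err := w.LastIndex()
	if err != nil {
		log.Fatalf("last index: %v", err)
	}
	// first == 0 && last == 0 means the log is empty: nothing was ever written.
	fmt.Printf("WAL %s: first=%d last=%d\n", walPath, first, last)
}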


github-actions bot commented Jul 1, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@iblancasa

@toughnoah @blockloop do you mean that if you enable the WAL, metrics are never sent?


github-actions bot commented Oct 7, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
