
Prometheus output does not remove metrics that are not published anymore #1334

Closed
rvrignaud opened this issue Jun 6, 2016 · 12 comments · Fixed by #1476
Labels
bug (unexpected problem or unintended behavior)

Comments

@rvrignaud

Bug report

System info:

Telegraf 0.13.1
Docker container: Ubuntu 16.04
RabbitMQ: 3.5.7

Steps to reproduce:

  1. Configure rabbitmq input on telegraf
  2. Configure prometheus output on telegraf
  3. Create an ephemeral queue on rabbitmq
  4. Publish a message
  5. Wait for telegraf to collect the metrics
  6. Delete ephemeral queue
  7. Wait a bit
  8. Go to telegraf's prometheus output
  9. See that the rabbitmq_queue_messages / rabbitmq_queue_messages_ready metrics are not 0

Expected behavior:

As the queue no longer exists, telegraf should report 0 as the message count for rabbitmq_queue_messages / rabbitmq_queue_messages_ready.

Actual behavior:

I still see non-zero values for the rabbitmq_queue_messages / rabbitmq_queue_messages_ready metrics of a queue that no longer exists.

Additional info:

With the influxdb output, telegraf seems to stop sending the metric with the corresponding tag.

The problem of the prometheus output not being refreshed / showing wrong metrics may be present for other inputs too.

Use case:

This is important because we are sending metrics that are wrong.

@sparrc
Contributor

sparrc commented Jun 6, 2016

sounds like the issue could generally be that the prometheus output is not getting refreshed properly?

@sparrc added the bug label on Jun 6, 2016
@rvrignaud
Author

@sparrc yes, it seems so, but I need to find a way / the time to prove it.

@rvrignaud
Author

@sparrc OK, I just confirmed the bug. It does not seem related to the rabbitmq input at all.

Here are the steps to reproduce:

telegraf configuration:

[agent]
 interval               = "15s"
 debug                  = true
 flush_buffer_when_full = true
 metric_buffer_limit    = 10000

[[outputs.prometheus_client]]
  listen = ":9126"

[[inputs.exec]]
    command = "bash test.sh"
    data_format = "influx"

test.sh

#!/bin/bash

echo "metric,foo=bar value=42"
echo "metric,foo=foo value=43"

  1. Start telegraf.
  2. Go to localhost:9126. You should see the 2 metrics.
  3. Edit test.sh: comment out the first metric line and change the value of the second line.
  4. Go to localhost:9126. The second metric value has been refreshed but the first metric is still present.

This means that when a metric is no longer published by an input, it is not removed from the prometheus output.

@rvrignaud changed the title from "Prometheus output still shows non-existing rabbitmq queues" to "Prometheus output does not remove metrics that are not published anymore" on Jun 8, 2016
@lswith
Contributor

lswith commented Jun 9, 2016

I can confirm this is happening.

@sparrc
Contributor

sparrc commented Jun 10, 2016

I'm actually not sure that this is possible with prometheus. Prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval. If you want to "unregister" a metric, it can only be done on a completely permanent basis.

This means that, in the above example, if we were to "unregister" metric,foo=bar value=42, then we can never have a metric with name/tag combo metric,foo=bar ever again.

Telegraf and influxdb, on the other hand, are quite a bit less stateful, and don't particularly care whether a metric is reported at one timestamp and then not at another.

It's not entirely clear to me what the best solution is here. I don't think it's a good solution for ephemeral metrics to stick around forever reporting "0" values from all plugins, so I'm not sure there is a generic solution that will work.

For an overview of the register/unregister model of Prometheus you can view their Golang client instrumentation example code here: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register
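
To make that behaviour concrete, here is a minimal client_golang sketch (not Telegraf code; the metric name and label are made up to mirror the rabbitmq example above): once a label combination has been set on a vector, the series keeps being exported with its last value on every scrape until it is explicitly deleted.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Stand-in for rabbitmq_queue_messages (name and label are illustrative only).
	queueMessages := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "rabbitmq_queue_messages",
		Help: "Number of messages in the queue.",
	}, []string{"queue"})
	reg.MustRegister(queueMessages)

	// One collection interval: the ephemeral queue exists and reports a value.
	queueMessages.WithLabelValues("ephemeral").Set(42)

	// Later intervals: even if the input stops reporting the queue, the series
	// keeps its last value and shows up in every scrape...
	mfs, _ := reg.Gather()
	fmt.Println("metric families exported:", len(mfs)) // 1

	// ...until the series is explicitly removed from the vector.
	queueMessages.DeleteLabelValues("ephemeral")
	mfs, _ = reg.Gather()
	fmt.Println("metric families exported:", len(mfs)) // 0
}

That matches the stale values seen in the exec reproduction above: nothing ever deletes the series, so it keeps reporting its last value forever.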

@lswith
Contributor

lswith commented Jun 14, 2016

I think more fine-grained control over what is being published by the prometheus output may be the solution here.

A couple of examples:

  • being able to rename the metric going into the prometheus output (similar to the statsd template stuff)
  • being able to add/divide/subtract/multiply/count and combine metrics going into the prometheus output

With this functionality, I could simply make sure that only the stat "metric" was set to 42 and 43 and that a counter was incremented whenever foo=bar.

Also, I did not know that a metric must be reported at every interval. This would be useful to document in the README.md for the Prometheus output.

@rvrignaud
Author

@sparrc it looks like the prometheus developers do not agree that "prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval": https://groups.google.com/forum/#!topic/prometheus-developers/iP2k68eUVrM

@sparrc
Contributor

sparrc commented Jun 14, 2016

It's not that it has to be reported, it's more that if it's removed it can't be re-reported again. AFAICT this is exactly the situation that they state as impossible in this example: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register

specifically, this part of the code:

// Try registering taskCounterVec again.
if err := prometheus.Register(taskCounterVec); err != nil {
    fmt.Println("taskCounterVec not registered:", err)
} else {
    fmt.Println("taskCounterVec registered.")
}
// Bummer! Still doesn't work.

// Prometheus will not allow you to ever export metrics with
// inconsistent help strings or label names. After unregistering, the
// unregistered metrics will cease to show up in the /metrics HTTP
// response, but the registry still remembers that those metrics had
// been exported before. For this example, we will now choose a
// different name. (In a real program, you would obviously not export
// the obsolete metric in the first place.)

@sparrc
Contributor

sparrc commented Jun 14, 2016

From what I read there, I understand that we would have to track every metric ever reported and unregistered through the prometheus endpoint, and then artificially rename metrics that might have "returned" before we can re-register them.

@sparrc
Contributor

sparrc commented Jun 14, 2016

I think there is in fact a way to do this. From that mailing list thread it looks like telegraf needs to become a prometheus "collector" rather than doing "direct instrumentation".

I'm still not 100% sure I understand the distinction, but implementing the prometheus.Collector interface will allow Telegraf to report metrics that it collected on an interval in an on-demand manner.

This being said, it would also go against one of the tenets of prometheus, which is that metrics are supposed to be collected at the time the http request is made. Telegraf will still be collecting on its interval, and the cached values will be presented when the http request is made.
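
For reference, here is a rough sketch of that collector approach (an illustration under assumptions only; the package, type, and method names are hypothetical, and this is not necessarily how the eventual fix was implemented): implement prometheus.Collector over a cache of samples that Telegraf refreshes on its own interval, so anything an input stops reporting simply disappears from the next scrape.

package promcache // hypothetical package, not part of Telegraf

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// sample is one cached value with its label values, refreshed on Telegraf's
// collection interval.
type sample struct {
	desc   *prometheus.Desc
	value  float64
	labels []string
}

// newSample builds a cached sample; label names go into the Desc, label
// values travel with the sample.
func newSample(name, help string, labelNames, labelValues []string, value float64) sample {
	return sample{
		desc:   prometheus.NewDesc(name, help, labelNames, nil),
		value:  value,
		labels: labelValues,
	}
}

// cachedCollector implements prometheus.Collector. Instead of registering
// long-lived metric objects (direct instrumentation), it exposes whatever is
// currently in the cache at scrape time.
type cachedCollector struct {
	mu      sync.Mutex
	samples map[string]sample // keyed by measurement name + label values
}

func (c *cachedCollector) Describe(ch chan<- *prometheus.Desc) {
	// Left empty because the set of metrics changes between intervals; newer
	// client_golang versions treat such a collector as "unchecked".
}

func (c *cachedCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, s := range c.samples {
		ch <- prometheus.MustNewConstMetric(s.desc, prometheus.UntypedValue, s.value, s.labels...)
	}
}

// SetSamples replaces the cache on each Telegraf flush; series that an input
// no longer reports are dropped and never appear in the next scrape.
func (c *cachedCollector) SetSamples(samples map[string]sample) {
	c.mu.Lock()
	c.samples = samples
	c.mu.Unlock()
}

The output would register a single cachedCollector on a registry (reg.MustRegister(&cachedCollector{})), serve that registry over HTTP, and call SetSamples with the freshly collected batch on each flush, which gives exactly the "collected on an interval, presented on demand" behaviour described above.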

@ghost

ghost commented Feb 13, 2018

What is the solution to this?
I am also experiencing the same behaviour: Prometheus is showing a metric that no longer exists and is no longer being published.

@danielnelson
Contributor

@sidsingla If this is still an issue in version 1.5, can you open a new issue?
