[[inputs.ping]] percent_packet_loss not written to InfluxDB when ping command fails #5499

Closed
dirkaholic opened this issue Feb 28, 2019 · 11 comments
Labels
feature request Requests for new plugin and for new features to existing plugins

Comments

@dirkaholic
Contributor

dirkaholic commented Feb 28, 2019

Relevant telegraf.conf:

[[inputs.ping]]
  urls = [
    "host-123",
    "host-124",
  ]

System info:

telegraf version: 1.9.5
operating system: Ubuntu 16.04.5 LTS

Steps to reproduce:

We used to monitor all the hosts in our infrastructure using the percent_packet_loss returned from the ping plugin. When the value exceeded a given threshold, an alert was triggered. Formerly it appeared to always write percent_packet_loss, even when the ping command failed. Since one of the latest updates (#4703 looks like a candidate) it seems to return early and write the result_code without filling in percent_packet_loss, so the respective alert is no longer triggered.

Expected behavior:

percent_packet_loss is always 100 when an error occurs during the ping command execution.

time                percent_packet_loss result_code
----                ------------------- -----------
1551250190000000000 100                 2
1551250200000000000 100                 2

Actual behavior:

percent_packet_loss is always empty when an error occurs during the ping command execution.

time                percent_packet_loss result_code
----                ------------------- -----------
1551255636000000000                     2
1551255646000000000                     2

Additional info:

Is it really expected that percent_packet_loss is empty in case of a ping error? I would expect percent_packet_loss to always be 100 when an error occurs during ping execution. If that assumption is correct (and since it behaved that way earlier), I would try to implement a fix.

@glinton
Contributor

glinton commented Feb 28, 2019

What was the error message that telegraf output with that result 2?
When trying to reproduce this, I get a result code of 2 when the ping fails to start and sends no packets.
We may have to hard-code

fields["percent_packet_loss"] = 100.0

wherever we set

fields["result_code"] = 2
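
For illustration, the suggestion above could be sketched roughly as follows. This is not the plugin's actual code: `failureFields` is a hypothetical helper, and the only details taken from this thread are the two field names and the values 2 and 100.0.

```go
package main

import "fmt"

// failureFields is a hypothetical helper, not part of Telegraf. It builds the
// field set for a ping that could not complete, pairing result_code 2 with a
// hard-coded 100% packet loss as suggested in the comment above.
func failureFields() map[string]interface{} {
	return map[string]interface{}{
		"result_code":         2,
		"percent_packet_loss": 100.0,
	}
}

func main() {
	fields := failureFields()
	fmt.Println(fields["result_code"], fields["percent_packet_loss"])
}
```

Building both fields in one place would keep the two values from drifting apart if more failure paths are added later.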

@glinton
Contributor

glinton commented Feb 28, 2019

Looking back through previous telegraf versions, I don't see where we set percent_packet_loss when the result code is 2. Can you confirm which version you were using when you saw 100 packet loss and a result code of 2?

@glinton added the "feature request" label Feb 28, 2019
@dirkaholic
Contributor Author

@glinton Unfortunately I can't say with which version it changed. I only noticed it now because my alert on packet loss no longer fired when a host went down.

@glinton
Contributor

glinton commented Mar 1, 2019

Are you able to paste the error message from the telegraf logs that's associated with that ping?

@dirkaholic
Contributor Author

These are the logs when a host is not available

2019-03-05T10:57:06Z E! [inputs.ping]: Error in plugin: host MyHostName: signal: killed
2019-03-05T10:57:16Z E! [inputs.ping]: Error in plugin: host MyHostName: signal: killed
2019-03-05T10:57:26Z E! [inputs.ping]: Error in plugin: host MyHostName: signal: killed

@danielnelson
Contributor

In the case that ping cannot be run there isn't any packet loss, so we shouldn't fill out the percent_packet_loss field because we have no way to set it correctly. I know it makes the alert a little trickier to set up, but I think you will want a deadman alert on percent_packet_loss, since you want an alert if it stops being reported.

Are you alerting with Kapacitor? It would be nice to add an example to the README.

@dirkaholic
Contributor Author

We are alerting with Grafana. Unfortunately it is currently not possible there to get an alert when no data is recorded unless you have a single series per graph, which is not practical. If it would be useful I could still add an example. Should that go in the README of the ping plugin?

As for percent_packet_loss: the logs I posted above are what we get when a host is not pingable, and I would really expect percent_packet_loss to be 100 in this case. I'm not sure why the ping command gets killed when a host is not available, but there is no general problem with running ping: it works as expected when the host is available (and for all the other hosts that we are pinging).

@danielnelson
Contributor

We usually only add Kapacitor scripts, and soon Flux queries, to the README.

This signal: killed error probably indicates a timeout, and the timeout that is used is based on several settings in the plugin: count, timeout, and ping_interval. It may make sense to add more control over this, but are you setting any of these values or just going with the defaults?

Would it be possible to set multiple alerts, one for percent packet loss and one for the result code?
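
The two-alert approach can be sketched as a single predicate. This is only an illustration: `Point` and `shouldAlert` are hypothetical names, and the sketch assumes a result_code of 0 means the ping succeeded.

```go
package main

import "fmt"

// Point is a hypothetical, simplified view of one ping measurement.
// PacketLoss is a pointer because the field may be absent when ping fails.
type Point struct {
	ResultCode int
	PacketLoss *float64
}

// shouldAlert fires when the ping itself failed (non-zero result code) or
// when the reported loss exceeds the threshold. A missing loss value never
// triggers the second rule on its own.
func shouldAlert(p Point, lossThreshold float64) bool {
	if p.ResultCode != 0 {
		return true
	}
	return p.PacketLoss != nil && *p.PacketLoss > lossThreshold
}

func main() {
	loss := 100.0
	// Ping failed and no loss value was written.
	fmt.Println(shouldAlert(Point{ResultCode: 2}, 50))
	// Ping ran and reported loss above the threshold.
	fmt.Println(shouldAlert(Point{ResultCode: 0, PacketLoss: &loss}, 50))
}
```

With this combination, a host that goes down still raises an alert even though percent_packet_loss is empty for that point.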

@dirkaholic
Contributor Author

Just using defaults. And yes, I have set up another alert based on the status now. Still, I think percent_packet_loss needs to be adjusted.

@zfouts

zfouts commented Aug 22, 2019

I too am curious as I am trying to actually log packet loss.

As soon as it times out, it just gives the error signal: killed, which is infuriating because packet loss is always set to 0%.

Edit:
Changing the defaults to the following seems to allow logging packet loss % now. Perhaps the default count is just too low to work?

[[inputs.ping]]
  count = 10
  urls = ["8.8.8.8", "8.8.4.4", "1.1.1.1"]

@glinton
Contributor

glinton commented Aug 22, 2019

This issue is resolved by #6267 (specifically the internal.WaitTimeout function changes). It wasn't an issue with method = "native" as nothing was exec'd.

Now that the proper error is being returned on a timeout, this will work as normal. Check the nightlies or wait for 1.12.

@glinton glinton closed this as completed Aug 22, 2019

4 participants