[[inputs.ping]] percent_packet_loss not written to InfluxDB when ping command fails #5499

dirkaholic · 2019-02-28T13:24:23Z

Relevant telegraf.conf:

[[inputs.ping]]
  urls = [
    "host-123",
    "host-124",
  ]

System info:

telegraf version: 1.9.5
operating system: Ubuntu 16.04.5 LTS

Steps to reproduce:

We used to monitor all the hosts in our infrastructure using the percent_packet_loss returned from the ping plugin. When the value exceeded a given threshold, an alert was triggered. Previously the plugin appeared to always write percent_packet_loss, even when the ping command failed. Since one of the recent updates (#4703 looks like a candidate), it seems to return early and write result_code without filling in percent_packet_loss, so the respective alert is no longer triggered.

Expected behavior:

percent_packet_loss is always 100 when an error occurs during the ping command execution.

time                percent_packet_loss result_code
----                ------------------- -----------
1551250190000000000 100                 2
1551250200000000000 100                 2

Actual behavior:

percent_packet_loss is always empty when an error occurs during the ping command execution.

time                percent_packet_loss result_code
----                ------------------- -----------
1551255636000000000                     2
1551255646000000000                     2

Additional info:

Is it really expected that percent_packet_loss is empty in case of a ping error? I would expect percent_packet_loss to always be 100 when an error occurs during ping execution. If that assumption is correct (and since it behaved that way earlier), I would try to implement a fix.


glinton · 2019-02-28T19:51:43Z

What was the error message that telegraf output with that result 2?
When trying to reproduce this, I get a result code of 2 when the ping fails to start and sends no packets.
We may have to hard-code

fields["percent_packet_loss"] = 100.0

wherever we set

fields["result_code"] = 2
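A minimal sketch of that idea, not the actual plugin code: the helper name `errorFields` is hypothetical, and a plain map stands in for whatever field set the plugin hands to the accumulator. It only illustrates setting both fields together on the failure path.

```go
package main

import "fmt"

// errorFields is a hypothetical helper, not part of Telegraf. It builds the
// field set for a ping that failed to run, reporting total packet loss
// alongside the result code mentioned above.
func errorFields() map[string]interface{} {
	return map[string]interface{}{
		"result_code":         2,
		"percent_packet_loss": 100.0,
	}
}

func main() {
	fields := errorFields()
	fmt.Println(fields["result_code"], fields["percent_packet_loss"])
}
```

Whether 100 is the right value to report when ping never ran is a separate design question.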

glinton · 2019-02-28T19:59:21Z

Looking back through previous telegraf versions, I don't see where we set percent_packet_loss when the result code is 2. Can you confirm which version you were using when you saw 100 packet loss and a result code of 2?

dirkaholic · 2019-03-01T10:06:46Z

@glinton Unfortunately I can't say in which version it changed. I only noticed it now because my packet loss alert no longer fired when a host went down.

glinton · 2019-03-01T19:28:11Z

Are you able to paste the error message from the telegraf logs that's associated with that ping?

dirkaholic · 2019-03-06T08:26:37Z

These are the logs when a host is not available:

2019-03-05T10:57:06Z E! [inputs.ping]: Error in plugin: host MyHostName: signal: killed
2019-03-05T10:57:16Z E! [inputs.ping]: Error in plugin: host MyHostName: signal: killed
2019-03-05T10:57:26Z E! [inputs.ping]: Error in plugin: host MyHostName: signal: killed

danielnelson · 2019-03-06T21:20:50Z

If ping cannot be run, there isn't any packet loss to measure. In this case we shouldn't fill out the percent_packet_loss field because we have no way to set it correctly. I know it makes the alert a little trickier to set up, but I think you will want a deadman alert on percent_packet_loss, since you want an alert if it stops being reported.

Are you alerting with Kapacitor? It would be nice to add an example to the README.

dirkaholic · 2019-03-07T08:09:57Z

We are alerting with Grafana. Unfortunately, it is currently not possible there to get an alert when no data is recorded unless you have a single series per graph, which is not practical. If it would be useful, I could still add an example. Should that go in the README of the ping plugin?

As for percent_packet_loss: the logs I posted above are what appears when a host is not pingable, and I would really expect percent_packet_loss to be 100 in this case. I'm not sure why the ping command gets killed when a host is not available, but there is no general problem with running the ping command: it works as expected when the host is available (and for all the other hosts that we are pinging).

danielnelson · 2019-03-07T23:26:06Z

We usually only add Kapacitor scripts, and soon Flux queries, to the README.

This signal: killed error probably indicates a timeout, and the timeout that is used is based on several settings in the plugin: count, timeout, and ping_interval. It may make sense to add more control over this, but are you setting any of these values or just going with the defaults?

Would it be possible to set multiple alerts, one for percent packet loss and one for the result code?

dirkaholic · 2019-03-08T13:52:20Z

Just using defaults. And yes, I have set up another alert based on the status now. Still, I think percent_packet_loss needs to be adjusted.

zfouts · 2019-08-22T18:44:07Z

I too am curious, as I am trying to actually log packet loss.

As soon as it times out, it just gives the error signal: killed, which is infuriating because packet loss is always set to 0%.

Edit:
Changing the defaults to the following seems to allow logging packet loss % now. Perhaps the default count is just too low to work?

[[inputs.ping]]
  count = 10
  urls = ["8.8.8.8", "8.8.4.4", "1.1.1.1"]

glinton · 2019-08-22T19:44:20Z

This issue is resolved by #6267 (specifically the internal.WaitTimeout function changes). It wasn't an issue with method = "native" as nothing was exec'd.

Now that the proper error is being returned on a timeout, this will work as normal. Check the nightlies or wait for 1.12.

glinton added the feature request label Feb 28, 2019

glinton mentioned this issue Mar 6, 2019

Add percent_packet_loss field to ping input on error #5542

Closed

glinton closed this as completed Aug 22, 2019
