
inputs.snmp fails all agents if one agent does not respond in time #3823

Closed
aurimasplu opened this issue Feb 23, 2018 · 19 comments
Labels: area/snmp, bug

Comments

@aurimasplu

aurimasplu commented Feb 23, 2018

# Bug report
When running the inputs.snmp plugin and polling several devices (agents) within the same instance, if at least one device does not send back all requested information, all devices in that inputs.snmp instance fail. Other instances running at the same time are not impacted.
My case as an example: I am polling several hundred network devices with several inputs.snmp instances every 1 minute. Each instance is dedicated to a single metric.
When polling interface metrics on a large network device (a few hundred interfaces) located remotely with high latency (~100ms), there is simply not enough time for all requests and responses. And that is my problem to solve. But due to this condition, all other several hundred devices in the same instance fail, even though they send back all their data in time.
This log message is generated:
E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (1m0s)
And from the log it is impossible to tell which device is failing.

# Relevant telegraf.conf:

[[inputs.snmp]]
  agents = ["test1:161", "test2:161", "test5", "test3:161"]
  interval = "60s"
  timeout = "5s"
  retries = 3
  version = 2
  community = "SNMPcommunity"
  max_repetitions = 100

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifOperStatus"
      oid = "IF-MIB::ifOperStatus"
      is_tag = true

# System info:
Telegraf: 1.5.1-1
OS: RHEL 7.4

# Expected behavior:
Telegraf should fail only the device that does not respond in time, and log which device failed.
# Actual behavior:
The whole instance fails and only a generic message is generated.

@toni-moreno
Contributor

You can move to snmpcollector to gather SNMP metrics. The polling time or state of one device does not affect the other devices, and you have a nice web interface to check device runtime statistics (state, polling time, number of metrics, errors, etc.).

https://github.com/toni-moreno/snmpcollector

Check the wiki for configuration examples

https://github.com/toni-moreno/snmpcollector/wiki

@aurimasplu
Author

@toni-moreno, thanks for the advice, I will check it out, but I don't believe a bug report is the place to advertise your products.

@danielnelson danielnelson added the bug and area/snmp labels Mar 6, 2018
@adambaumeister

+1

However, I think the issue is more serious than implied by your report. I believe the collection is failing because the entire input is exceeding the collection interval. That is, if you've got 100 routers and it takes a couple of seconds to poll each one, then the complete collection run exceeds the interval even if each box responds quickly.

I have confirmed this in my environment, as I have the same problem even with no devices timing out, doing basically just ifTable queries, with as few as about 150 devices.

I've tried to resolve the issue by breaking my configuration into chunks of [[inputs.snmp]], but they all seem to be executed as if they were one, so you run into the same problem:

Mar 21 15:38:00 act-collector01 telegraf[6933]: 2018-03-21T04:38:00Z E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (2m0s)

Unfortunately, this makes Telegraf with the snmp plugin unusable for collecting network metrics via SNMP on a reasonably sized network, without a considerable amount of scripting to stand up additional Telegraf instances and delegate appropriately sized configurations to each.

My environment details:

  • Telegraf 1.5.3
  • 6 core xeon E5-2697 (virtualized under ESX)
  • 16GB Memory
  • SSD storage

Example file (replace the agent string with 100+ hostnames...):

[[inputs.snmp]]
    interval = "120s"
    agents = [ "spaghetti.csiro.au" ]
    version = 2
    community = "fake"
    name = "switch_snmp"
    timeout = "2s"
    retries = 1

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.table]]
    name = "cisco_physical_cpu"
    inherit_tags = [ "hostname" ]
    oid = "CISCO-PROCESS-MIB::cpmCPUTotalTable"
    index_as_tag = true

  # Poll an entire table for all of its fields
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifTable"

    # Interface tag - used to identify interface in metrics database
    # Mark the OID IF-MIB::ifDescr as "ifDescr" in the snmp table
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

  # IF-MIB::ifXTable contains newer High Capacity (HC) counters that do not overflow as fast for a few of the ifTable counters
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifXTable"

    # Interface tag - used to identify interface in metrics database
    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

@danielnelson
Contributor

Plugins don't fail if they exceed the interval; this is really just a warning message that the plugin was scheduled to run again but still has not completed. The current collection will continue to run, and the next one will not start until the first completes.

E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (2m0s)

The timeout in the configuration is per network operation, so if you have a 5s timeout and multiple retries and multiple fields, it can add up pretty fast and it is difficult to know how long it can take. This definitely needs to be improved; I think we should have one timeout that applies to an entire agent's work.

However, if you split the agents into separate plugins, each one does run independently. The workaround of splitting the agents into multiple smaller plugins should work, though it is obviously hard to do until we improve the log messages.

@adambaumeister

@danielnelson I agree a per-agent timeout value would be much better: basically a hard limit on how long a collection can take on a single host. You could even make it automatic, maybe by dividing the interval by the number of hosts divided by the number of parallel SNMP sessions.

I tried splitting my snmp inputs into groups of 50 hosts each, and it almost works, but I still get some groups that take too long to collect, so you end up with weird-looking graphs for certain devices. I wrote some automation to do this (see the sketch below) which, while nice, is probably too much additional overhead for the relatively simple task of collecting network metrics.
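
A minimal sketch of that kind of automation, assuming a flat list of hostnames and a shared table definition (the template body, chunk size, and hostnames are illustrative, not the actual script):

#!/usr/bin/env python3
"""Split a flat agent list into chunks of [[inputs.snmp]] blocks."""

CHUNK_SIZE = 50  # agents per [[inputs.snmp]] block (assumption)

# Hypothetical block body; a real config would repeat the full
# set of tables/fields in every block.
BLOCK_TEMPLATE = """[[inputs.snmp]]
  interval = "120s"
  timeout = "2s"
  retries = 1
  version = 2
  community = "fake"
  agents = [{agents}]

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifTable"
"""

def render_blocks(hosts, chunk_size=CHUNK_SIZE):
    """Yield one [[inputs.snmp]] block per chunk of hosts."""
    for i in range(0, len(hosts), chunk_size):
        chunk = hosts[i:i + chunk_size]
        agents = ", ".join(f'"{h}:161"' for h in chunk)
        yield BLOCK_TEMPLATE.format(agents=agents)

if __name__ == "__main__":
    hosts = [f"switch{n:03d}.example.net" for n in range(150)]
    print("\n".join(render_blocks(hosts)))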

@aurimasplu
Author

The issue does exist, and the plugin fails if you use lots of agents in one instance.

But I have solved my problem by creating a separate configuration file per device and putting them all in the /etc/telegraf/telegraf.d/ directory. Each file contains several snmp plugin instances, usually one per SNMP table. I have created several templates per device type, came up with a naming convention for the config files, and wrote a script which takes a list of devices and creates the config files (a sketch follows at the end of this comment). I made it really easy for myself to provision devices monitored by Telegraf.
So I am running two 8-core VMs and monitoring more than 2700 devices, polling them every 1 minute, quite smoothly :)

This way I isolated all "took longer to collect" problems to a per-device basis, but the logging improvement is still needed, because when some device takes longer to collect I don't know which one.

Now I only run into problems with very large devices. For example, I have several switches with more than 1200 interfaces, and Telegraf is not able to poll all of ifTable and ifXTable within one minute. But that is a different topic.
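
A minimal sketch of that per-device provisioning approach (the template body, file naming convention, and host list are illustrative assumptions, not the actual script):

#!/usr/bin/env python3
"""Write one Telegraf config file per device into telegraf.d/."""
from pathlib import Path

CONF_DIR = Path("/etc/telegraf/telegraf.d")

# Hypothetical per-device template; real templates would vary by device type.
DEVICE_TEMPLATE = """[[inputs.snmp]]
  agents = ["{host}:161"]
  interval = "60s"
  timeout = "5s"
  retries = 3
  version = 2
  community = "SNMPcommunity"

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"
"""

def provision(hosts):
    """Render and write snmp-<host>.conf for each device."""
    CONF_DIR.mkdir(parents=True, exist_ok=True)
    for host in hosts:
        (CONF_DIR / f"snmp-{host}.conf").write_text(DEVICE_TEMPLATE.format(host=host))

if __name__ == "__main__":
    provision(["test1", "test2", "test3"])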

@danielnelson danielnelson added this to the 1.8.0 milestone Jun 8, 2018
@danielnelson danielnelson modified the milestones: 1.8.0, 1.9.0 Sep 7, 2018
@mcaulifn

mcaulifn commented Oct 9, 2018

Have there been any updates on this? I'm seeing the same thing when one of three routers is down.

@russorat russorat modified the milestones: 1.9.0, 1.10 Oct 29, 2018
@russorat russorat modified the milestones: 1.10.0, 1.11.0 Jan 14, 2019
@danielnelson danielnelson removed their assignment May 10, 2019
@danielnelson danielnelson removed this from the 1.11.0 milestone May 10, 2019
@n1nj4888

n1nj4888 commented Jul 24, 2019

Any update on this?

I'm seeing the same behaviour: when multiple hosts are configured and one is down, the SNMP queries appear to be sent sequentially to the hosts, so if one or more hosts are down, the snmp plugin can time out on the offline hosts before actually completing the list of online hosts?

Is it not possible to run the snmp queries to all configured hosts / agents in parallel so that one host being down wouldn't affect gathering metrics for the others?

Thanks!

@danielnelson
Contributor

In a nutshell, the workaround is to talk to a single remote agent per plugin:

[[inputs.snmp]]
  agents = ["host1"]
  # other options
[[inputs.snmp]]
  agents = ["host2"]
  # other options

@n1nj4888

Ok, thanks @danielnelson. I understand the workaround, but if "# other options" is an extensive list of OIDs, managing any changes to those options would be complicated across a number of hosts.

Are there plans to make the snmp calls in parallel to the listed agents or, if a response is not received from AgentA, continue with AgentB before retrying AgentA?

Thanks!

@danielnelson
Contributor

Yes, the other options would be whatever tables/fields you want to collect, and they would need to be repeated. If you are reading this issue you probably have lots of agents, and for managing that I recommend using a templating program to generate your configuration.

To be clear, the plugin does make SNMP calls in parallel to the agents, but all agents must complete before the next collection will begin. This means one agent can theoretically hold up all the other agents for as long as the total timeout is set. This behavior probably won't change anytime soon. Placing them in separate plugin definitions will allow them to be fully independent.
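
A conceptual sketch of that scheduling behavior (plain Python, purely illustrative; Telegraf itself is written in Go and this is not its actual code):

#!/usr/bin/env python3
"""Mimic 'parallel gathers, but a barrier before the next collection'."""
import time
from concurrent.futures import ThreadPoolExecutor

def gather(agent):
    """Pretend to poll one agent; a dead agent burns the full timeout."""
    time.sleep(2.0 if agent == "dead-host" else 0.1)
    return f"{agent}: ok"

def collection_loop(agents, interval=1.0, rounds=2):
    for _ in range(rounds):
        start = time.monotonic()
        with ThreadPoolExecutor() as pool:
            # All agents are polled in parallel...
            results = list(pool.map(gather, agents))
        # ...but the next round cannot start until every agent returns,
        # so one slow agent delays fresh data for all the others.
        elapsed = time.monotonic() - start
        print(results, f"(round took {elapsed:.1f}s)")
        time.sleep(max(0.0, interval - elapsed))

if __name__ == "__main__":
    collection_loop(["host1", "host2", "dead-host"])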

@Hipska
Contributor

Hipska commented May 7, 2021

I just tested with the current Telegraf version, and here the metrics for agentB come through even if agentA is not responding, so it seems this issue isn't relevant anymore.

Note that this still holds:

To be clear, the plugin does make SNMP calls in parallel to the agents, but all agents must complete before the next collection will begin. This means one agent can theoretically hold up all the other agents for as long as the total timeout is set. This behavior probably won't change anytime soon. Placing them in separate plugin definitions will allow them to be fully independent.

@Hipska Hipska closed this as completed May 7, 2021
@Stephan-Walter

Stephan-Walter commented Oct 11, 2021

Hi,

we are currently observing the same/similar issue with Telegraf version 1.20.0.

We used the following definition and observed very sparse data within InfluxDB: something like one data point for every agent every 20-50 minutes.

agents = [ "1a1","1a2","1a3","1a4","1a5","1a6","1a7","1a8","2a1","2a2","2a3","2a4","2a5","2a6","2a7","2a8"]
   # polling interval
   interval = "15s"
   ## Timeout for each SNMP query.
   timeout = "60s"
   ## Number of retries to attempt within timeout.
   retries = 3
   ## SNMP version, values can be 1, 2, or 3
   version = 2

   ## SNMP community string.
   community = "public"

   ## The GETBULK max-repetitions parameter
   max_repetitions = 10

   ## measurement name
   name = "internal"
   [inputs.snmp.tags]
    influxdb_database = "internal"
   [[inputs.snmp.field]]
     name = "hostname"
     oid = "SNMPv2-MIB::sysName.0"
     is_tag = true
   [[inputs.snmp.field]]
     name = "uptime"
     oid = "DISMAN-EVENT-MIB::sysUpTimeInstance"

   [[inputs.snmp.table]]
     ## measurement name
     name = "snmp"
     inherit_tags = [ "hostname" ]
     oid = "Some::Table1"

   [[inputs.snmp.table.field]]
     name = "SomeName1"
     oid = "Some::Name1"
     is_tag = true

   [[inputs.snmp.table]]
     ## measurement name
     name = "snmp"
     inherit_tags = [ "hostname" ]
     oid = "Some::Table2"

   [[inputs.snmp.table.field]]
     name = "SomeName2"
     oid = "Some::Name2"
     is_tag = true

A timed test execution of the configuration showed the following:

2021-10-08T10:39:02Z E! [inputs.snmp] Error in plugin: agent 1a7: performing get on field hostname: request timeout (after 3 retries)
2021-10-08T10:39:02Z E! [inputs.snmp] Error in plugin: agent 2a7: performing get on field hostname: request timeout (after 3 retries)
2021-10-08T10:47:02Z E! [inputs.snmp] Error in plugin: agent 1a7: gathering table snmp: performing bulk walk for field Name1: request timeout (after 3 retries)
2021-10-08T10:47:02Z E! [inputs.snmp] Error in plugin: agent 2a7: gathering table snmp: performing bulk walk for field Name1: request timeout (after 3 retries)
2021-10-08T10:55:02Z E! [inputs.snmp] Error in plugin: agent 2a7: gathering table snmp: performing bulk walk for field Name2: request timeout (after 3 retries)
2021-10-08T10:55:02Z E! [inputs.snmp] Error in plugin: agent 1a7: gathering table snmp: performing bulk walk for field Name2: request timeout (after 3 retries)
2021-10-08T10:55:02Z E! [telegraf] Error running agent: input plugins recorded 6 errors

real	24m0.331s
user	0m0.502s
sys	0m0.314s

My checks showed that the devices 1a7 and 2a7 were not pingable because of a network issue, but at first this wasn't clear, since the output above took a very long time to appear, so we thought that Telegraf was hanging entirely with the test command. Only after we decreased the interval and timeout to 5s did the execution time come down to

real	2m0.493s
user	0m0.219s
sys	0m0.185s

So we were able to see the root cause of the problem.

Right now it is not clear to us why we see such a long execution time when we define only a 5s timeout for the snmp block, especially since, as a result, we get data points for the other devices at only that same long interval.

Our expectation was that if some devices are not accessible, we would still get a new value in InfluxDB at least every timeout seconds.

It would be great if somebody could take a look at whether the problem was never really fixed, or whether another problem was introduced/discovered.

@Hipska please let me know if I should open a new ticket for this problem, but right now I think this is the same issue. So could you please reopen it?

Best Regards

@Hipska
Contributor

Hipska commented Oct 11, 2021

It is indeed not a new issue, see the comment here: #7300 (comment)
See also my previous comment on this issue; it is advised to split the configs if you have devices that are unresponsive.

@Stephan-Walter

Hi Hipska,

yes, I understand that a split into single inputs.snmp blocks would resolve the problem for the other agent definitions, but it feels wrong to me that two faulty agents with a timeout of 60s lead to a hanging block of 24 minutes. That is a factor of 12-24 longer than what could be expected. Or am I overlooking something?

@Hipska
Contributor

Hipska commented Oct 11, 2021

It is indeed strange, but see that comment: the total delay is something like 2 * timeout * retries * number_of_tables. So if you want to comment on that, you should go to #7300.

@Stephan-Walter

Stephan-Walter commented Oct 11, 2021

Ah ok, that seems to make sense.

We have two agents not responding, with two tables each and a timeout of 60s. And since the tests serialize, we can calculate:

2 agents * (2 * timeout * 3 retries * 2 tables) = 24 minutes (timeout = 60s) or 2 minutes (timeout = 5s)
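
A quick sanity check of that formula in code (a sketch; the exact attempt accounting inside the plugin is an assumption based on the comment above):

def worst_case_delay(agents_down, timeout_s, retries, tables):
    """Rough worst-case blocking time, per the formula quoted above.

    Assumes the unresponsive agents are handled serially and each
    table costs 2 * timeout * retries.
    """
    return agents_down * 2 * timeout_s * retries * tables

print(worst_case_delay(2, 60, 3, 2) / 60)  # -> 24.0 minutes
print(worst_case_delay(2, 5, 3, 2) / 60)   # -> 2.0 minutes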

Still, I wonder why the retries factor is there, since my understanding of retries was that they happen within the timeout, not after it, but OK.

So yes, the sparse output of the other agents every 24 or 2 minutes is the expected behaviour right now.

Since this is a really big issue and the maintenance effort quickly explodes, is there anything we can do to help change the actual behaviour?

@Hipska
Contributor

Hipska commented Oct 11, 2021

Anybody is free to create pull requests fixing the code and issues. I'm using Ansible to generate an snmp config file per agent, and it works great with 3k+ agents.

@Stephan-Walter

Yeah, we have also moved to a single agent per snmp entry now, with an auto-generated config. Nevertheless, it is the best solution.
