
inputs.snmp fails all agents if one agent does not respond in time #3823

Closed
aurimasplu opened this issue Feb 23, 2018 · 19 comments
Labels: area/snmp, bug

Comments

@aurimasplu

aurimasplu commented Feb 23, 2018

# Bug report
When running the inputs.snmp plugin and polling several devices (agents) within the same instance, if at least one device does not send back all requested information, all devices in that inputs.snmp instance fail. Other instances running at the same time are not impacted.
My case as an example: I am polling several hundred network devices with several inputs.snmp instances every 1 minute. Each instance is dedicated to a single metric.
When polling interface metrics on a large network device (a few hundred interfaces) located remotely with high latency (~100ms), there is simply not enough time for all requests and responses. And that is my problem to solve. But due to this condition, all other several hundred devices in the same instance fail, even though they send back all their data in time.
This log message is generated:
E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (1m0s)
And from the log it is impossible to tell which device is failing.

# Relevant telegraf.conf:

[[inputs.snmp]]
  agents = ["test1:161", "test2:161", "test5", "test3:161"]
  interval = "60s"
  timeout = "5s"
  retries = 3
  version = 2
  community = "SNMPcommunity"
  max_repetitions = 100

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifOperStatus"
      oid = "IF-MIB::ifOperStatus"
      is_tag = true

# System info:
Telegraf: 1.5.1-1
OS: RHEL 7.4

# Expected behavior:
Telegraf should fail only the device that does not respond in time, and log which device failed.
# Actual behavior:
The whole instance fails and only a generic message is generated.

@toni-moreno
Contributor

You can move to snmpcollector to gather SNMP metrics. The polling time or state of one device does not affect the other devices, and you have a nice web interface to check device runtime statistics (state, polling time, number of metrics, errors, etc.).

https://github.com/toni-moreno/snmpcollector

Check the wiki for configuration examples

https://github.com/toni-moreno/snmpcollector/wiki

@aurimasplu
Author

@toni-moreno, thanks for the advice, I will check it out, but I don't believe a bug report is the place to advertise your products.

@danielnelson danielnelson added the bug and area/snmp labels Mar 6, 2018
@adambaumeister

+1

However, I think the issue is more serious than implied by your report. I believe the collection is failing because the entire input is exceeding the collection interval. That is, if you've got 100 routers and it takes a couple of seconds to poll each one, then the complete collection run exceeds the interval even if each box responds quickly.

I have confirmed this in my environment, as I have the same problem even with no devices timing out, doing basically just ifTable queries, with as few as about 150 devices.

I've tried to resolve the issue by breaking my configuration into chunks of [[inputs.snmp]], but they all seem to be executed as if they were one, so you run into the same problem:

Mar 21 15:38:00 act-collector01 telegraf[6933]: 2018-03-21T04:38:00Z E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (2m0s)

Unfortunately, this makes Telegraf with the snmp plugin unusable for collecting network metrics via SNMP on a reasonably sized network, without a considerable amount of scripting to stand up additional Telegraf instances and delegate appropriately sized configurations to each.

My environment details:

  • Telegraf 1.5.3
  • 6 core xeon E5-2697 (virtualized under ESX)
  • 16GB Memory
  • SSD storage

Example file (replace the agent string with 100+ hostnames...):

[[inputs.snmp]]
    interval = "120s"
    agents = [ "spaghetti.csiro.au" ]
    version = 2
    community = "fake"
    name = "switch_snmp"
    timeout = "2s"
    retries = 1

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.table]]
    name = "cisco_physical_cpu"
    inherit_tags = [ "hostname" ]
    oid = "CISCO-PROCESS-MIB::cpmCPUTotalTable"
    index_as_tag = true

  # Poll an entire table for all of its fields
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifTable"

    # Interface tag - used to identify interface in metrics database
    # Mark the OID IF-MIB::ifDescr as "ifDescr" in the snmp table
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

  # IF-MIB::ifXTable contains newer High Capacity (HC) counters that do not overflow as fast for a few of the ifTable counters
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifXTable"

    # Interface tag - used to identify interface in metrics database
    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

@danielnelson
Contributor

Plugins don't fail if they exceed the interval; this is really just a warning message that the plugin was scheduled to run again but still has not completed. The current collection will continue to run, and the next one will not start until the first completes.

E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (2m0s)

The timeout in the configuration is per network operation, so if you have a 5s timeout and multiple retries and multiple fields, it can add up pretty fast and it is difficult to know how long it can take. This definitely needs to be improved; I think we should have one timeout that applies to an entire agent's work.

However, if you split the agents into separate plugins, each one does run independently. The workaround of splitting the agents into multiple smaller plugins should work, though it is obviously hard to do until we improve the log messages.

@adambaumeister

@danielnelson I agree a per-agent timeout value would be much better: basically a hard limit on how long a collection can take on a single host. You could even make it automatic, maybe by dividing the interval by the number of hosts divided by the number of parallel SNMP sessions.

I tried splitting my snmp inputs into groups of 50 hosts each, and it almost works, but I still get some groups that take too long to collect, so you end up with weird-looking graphs for certain devices. I wrote some automation to do this (see the sketch below) which, while nice, is probably too much additional overhead for the relatively simple task of collecting network metrics.
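
A minimal sketch of that kind of automation, assuming a flat list of hostnames and a shared table definition (the template body, chunk size, and hostnames are illustrative, not the actual script):

#!/usr/bin/env python3
"""Split a flat agent list into chunks of [[inputs.snmp]] blocks."""

CHUNK_SIZE = 50  # agents per [[inputs.snmp]] block (assumption)

# Hypothetical block body; a real config would repeat the full
# set of tables/fields in every block.
BLOCK_TEMPLATE = """[[inputs.snmp]]
  interval = "120s"
  timeout = "2s"
  retries = 1
  version = 2
  community = "fake"
  agents = [{agents}]

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifTable"
"""

def render_blocks(hosts, chunk_size=CHUNK_SIZE):
    """Yield one [[inputs.snmp]] block per chunk of hosts."""
    for i in range(0, len(hosts), chunk_size):
        chunk = hosts[i:i + chunk_size]
        agents = ", ".join(f'"{h}:161"' for h in chunk)
        yield BLOCK_TEMPLATE.format(agents=agents)

if __name__ == "__main__":
    hosts = [f"switch{n:03d}.example.net" for n in range(150)]
    print("\n".join(render_blocks(hosts)))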

@aurimasplu
Author

The issue does exist, and the plugin fails if you use lots of agents in one instance.

But I have solved my problem by creating a separate configuration file per device and putting them all in the /etc/telegraf/telegraf.d/ directory. Each file contains several snmp plugin instances, usually one per SNMP table. I have created several templates per device type, came up with a naming convention for the config files, and wrote a script which takes a list of devices and creates the config files (a sketch follows at the end of this comment). I made it really easy for myself to provision devices monitored by Telegraf.
So I am running two 8-core VMs and monitoring more than 2700 devices, polling them every 1 minute, quite smoothly :)

This way I isolated all "took longer to collect" problems to a per-device basis, but the logging improvement is still needed, because when some device takes longer to collect I don't know which one.

Now I only run into problems with very large devices. For example, I have several switches with more than 1200 interfaces, and Telegraf is not able to poll all of ifTable and ifXTable within one minute. But that is a different topic.
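
A minimal sketch of that per-device provisioning approach (the template body, file naming convention, and host list are illustrative assumptions, not the actual script):

#!/usr/bin/env python3
"""Write one Telegraf config file per device into telegraf.d/."""
from pathlib import Path

CONF_DIR = Path("/etc/telegraf/telegraf.d")

# Hypothetical per-device template; real templates would vary by device type.
DEVICE_TEMPLATE = """[[inputs.snmp]]
  agents = ["{host}:161"]
  interval = "60s"
  timeout = "5s"
  retries = 3
  version = 2
  community = "SNMPcommunity"

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"
"""

def provision(hosts):
    """Render and write snmp-<host>.conf for each device."""
    CONF_DIR.mkdir(parents=True, exist_ok=True)
    for host in hosts:
        (CONF_DIR / f"snmp-{host}.conf").write_text(DEVICE_TEMPLATE.format(host=host))

if __name__ == "__main__":
    provision(["test1", "test2", "test3"])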

@danielnelson danielnelson added this to the 1.8.0 milestone Jun 8, 2018
@danielnelson danielnelson modified the milestones: 1.8.0, 1.9.0 Sep 7, 2018
@mcaulifn

mcaulifn commented Oct 9, 2018

Have there been any updates on this? I'm seeing the same thing when one of three routers is down.

@russorat russorat modified the milestones: 1.9.0, 1.10 Oct 29, 2018
@russorat russorat modified the milestones: 1.10.0, 1.11.0 Jan 14, 2019
@danielnelson danielnelson removed their assignment May 10, 2019
@danielnelson danielnelson removed this from the 1.11.0 milestone May 10, 2019
@n1nj4888

n1nj4888 commented Jul 24, 2019

Any update on this?

I'm seeing the same behaviour: when multiple hosts are configured and one is down, the SNMP queries appear to be sent sequentially to the hosts, so if one or more hosts are down, the snmp plugin can time out on the offline hosts before actually completing the list of online hosts?

Is it not possible to run the snmp queries to all configured hosts / agents in parallel so that one host being down wouldn't affect gathering metrics for the others?

Thanks!

@danielnelson
Contributor

In a nutshell, the workaround is to talk to a single remote agent per plugin:

[[inputs.snmp]]
  agents = ["host1"]
  # other options
[[inputs.snmp]]
  agents = ["host2"]
  # other options

@n1nj4888

Ok, thanks @danielnelson. I understand the workaround, but if "# other options" is an extensive list of OIDs, managing any changes to those options would be complicated across a number of hosts.

Are there plans to make the snmp calls in parallel to the listed agents or, if a response is not received from AgentA, continue with AgentB before retrying AgentA?

Thanks!

@danielnelson
Contributor

Yes, the other options would be whatever tables/fields you want to collect, and they would need to be repeated. If you are reading this issue you probably have lots of agents, and for managing that I recommend using a templating program to generate your configuration.

To be clear, the plugin does make SNMP calls in parallel to the agents, but all agents must complete before the next collection will begin. This means one agent can theoretically hold up all the other agents for as long as the total timeout is set. This behavior probably won't change anytime soon. Placing them in separate plugin definitions will allow them to be fully independent.
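
A conceptual sketch of that scheduling behavior (plain Python, purely illustrative; Telegraf itself is written in Go and this is not its actual code):

#!/usr/bin/env python3
"""Mimic 'parallel gathers, but a barrier before the next collection'."""
import time
from concurrent.futures import ThreadPoolExecutor

def gather(agent):
    """Pretend to poll one agent; a dead agent burns the full timeout."""
    time.sleep(2.0 if agent == "dead-host" else 0.1)
    return f"{agent}: ok"

def collection_loop(agents, interval=1.0, rounds=2):
    for _ in range(rounds):
        start = time.monotonic()
        with ThreadPoolExecutor() as pool:
            # All agents are polled in parallel...
            results = list(pool.map(gather, agents))
        # ...but the next round cannot start until every agent returns,
        # so one slow agent delays fresh data for all the others.
        elapsed = time.monotonic() - start
        print(results, f"(round took {elapsed:.1f}s)")
        time.sleep(max(0.0, interval - elapsed))

if __name__ == "__main__":
    collection_loop(["host1", "host2", "dead-host"])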

@Hipska
Contributor

Hipska commented May 7, 2021

I just tested with the current Telegraf version, and here the metrics for agentB come through even if agentA is not responding, so it seems this issue isn't relevant anymore.

Note that this still holds:

To be clear, the plugin does make SNMP calls in parallel to the agents, but all agents must complete before the next collection will begin. This means one agent can theoretically hold up all the other agents for as long as the total timeout is set. This behavior probably won't change anytime soon. Placing them in separate plugin definitions will allow them to be fully independent.

@Hipska Hipska closed this as completed May 7, 2021
@Stephan-Walter

Stephan-Walter commented Oct 11, 2021

Hi,

we are currently observing the same/similar issue with Telegraf version 1.20.0.

We used the following definition and observed very sparse data within InfluxDB: something like one data point for every agent every 20-50 minutes.

agents = [ "1a1","1a2","1a3","1a4","1a5","1a6","1a7","1a8","2a1","2a2","2a3","2a4","2a5","2a6","2a7","2a8"]
   # polling interval
   interval = "15s"
   ## Timeout for each SNMP query.
   timeout = "60s"
   ## Number of retries to attempt within timeout.
   retries = 3
   ## SNMP version, values can be 1, 2, or 3
   version = 2

   ## SNMP community string.
   community = "public"

   ## The GETBULK max-repetitions parameter
   max_repetitions = 10

   ## measurement name
   name = "internal"
   [inputs.snmp.tags]
    influxdb_database = "internal"
   [[inputs.snmp.field]]
     name = "hostname"
     oid = "SNMPv2-MIB::sysName.0"
     is_tag = true
   [[inputs.snmp.field]]
     name = "uptime"
     oid = "DISMAN-EVENT-MIB::sysUpTimeInstance"

   [[inputs.snmp.table]]
     ## measurement name
     name = "snmp"
     inherit_tags = [ "hostname" ]
     oid = "Some::Table1"

   [[inputs.snmp.table.field]]
     name = "SomeName1"
     oid = "Some::Name1"
     is_tag = true

   [[inputs.snmp.table]]
     ## measurement name
     name = "snmp"
     inherit_tags = [ "hostname" ]
     oid = "Some::Table2"

   [[inputs.snmp.table.field]]
     name = "SomeName2"
     oid = "Some::Name2"
     is_tag = true

A timed test execution of the configuration showed the following:

2021-10-08T10:39:02Z E! [inputs.snmp] Error in plugin: agent 1a7: performing get on field hostname: request timeout (after 3 retries)
2021-10-08T10:39:02Z E! [inputs.snmp] Error in plugin: agent 2a7: performing get on field hostname: request timeout (after 3 retries)
2021-10-08T10:47:02Z E! [inputs.snmp] Error in plugin: agent 1a7: gathering table snmp: performing bulk walk for field Name1: request timeout (after 3 retries)
2021-10-08T10:47:02Z E! [inputs.snmp] Error in plugin: agent 2a7: gathering table snmp: performing bulk walk for field Name1: request timeout (after 3 retries)
2021-10-08T10:55:02Z E! [inputs.snmp] Error in plugin: agent 2a7: gathering table snmp: performing bulk walk for field Name2: request timeout (after 3 retries)
2021-10-08T10:55:02Z E! [inputs.snmp] Error in plugin: agent 1a7: gathering table snmp: performing bulk walk for field Name2: request timeout (after 3 retries)
2021-10-08T10:55:02Z E! [telegraf] Error running agent: input plugins recorded 6 errors

real	24m0.331s
user	0m0.502s
sys	0m0.314s

My checks showed that the devices 1a7 and 2a7 were not pingable because of a network issue, but at first this wasn't clear, since the output above took a very long time to appear, so we thought that Telegraf was hanging entirely with the test command. Only after we decreased the interval and timeout to 5s did the execution time come down to

real	2m0.493s
user	0m0.219s
sys	0m0.185s

So we were able to see the root cause of the problem.

Right now it is not clear to us why we see such a long execution time when we define only a 5s timeout for the snmp block, especially since, as a result, we get data points for the other devices at only that same long interval.

Our expectation was that if some devices are not accessible, we would still get a new value in InfluxDB at least every timeout seconds.

It would be great if somebody could take a look at whether the problem was never really fixed, or whether another problem was introduced/discovered.

@Hipska please let me know if I should open a new ticket for this problem, but right now I think this is the same issue. So could you please reopen it?

Best Regards

@Hipska
Contributor

Hipska commented Oct 11, 2021

It is indeed not a new issue, see the comment here: #7300 (comment)
See also my previous comment on this issue; it is advised to split the configs if you have devices that are unresponsive.

@Stephan-Walter

Hi Hipska,

yes, I understand that a split into single inputs.snmp blocks would resolve the problem for the other agent definitions, but it feels wrong to me that two faulty agents with a timeout of 60s lead to a hanging block of 24 minutes. That is a factor of 12-24 longer than what could be expected. Or am I overlooking something?

@Hipska
Contributor

Hipska commented Oct 11, 2021

It is indeed strange, but see that comment: the total delay is something like 2 * timeout * retries * number_of_tables. So if you want to comment on that, you should go to #7300.

@Stephan-Walter

Stephan-Walter commented Oct 11, 2021

Ah ok, that seems to make sense.

We have two agents not responding, with two tables each and a timeout of 60s. And since the tests serialize, we can calculate:

2 agents * (2 * timeout * 3 retries * 2 tables) = 24 minutes (timeout = 60s) or 2 minutes (timeout = 5s)
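
A quick sanity check of that formula in code (a sketch; the exact attempt accounting inside the plugin is an assumption based on the comment above):

def worst_case_delay(agents_down, timeout_s, retries, tables):
    """Rough worst-case blocking time, per the formula quoted above.

    Assumes the unresponsive agents are handled serially and each
    table costs 2 * timeout * retries.
    """
    return agents_down * 2 * timeout_s * retries * tables

print(worst_case_delay(2, 60, 3, 2) / 60)  # -> 24.0 minutes
print(worst_case_delay(2, 5, 3, 2) / 60)   # -> 2.0 minutes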

Still, I wonder why the retries factor is there, since my understanding of retries was that they happen within the timeout, not after it, but OK.

So yes, the sparse output of the other agents every 24 or 2 minutes is the expected behaviour right now.

Since this is a really big issue and the maintenance effort quickly explodes, is there anything we can do to help change the actual behaviour?

@Hipska
Contributor

Hipska commented Oct 11, 2021

Anybody is free to create pull requests fixing the code and issues. I'm using Ansible to generate an snmp config file per agent, and it works great with 3k+ agents.

@Stephan-Walter

Yeah, we have also moved to a single agent per snmp entry now, with an auto-generated config. Nevertheless, it is the best solution.
