inputs.snmp fails all agents if one agent does not respond in time #3823
Comments
You can move to snmpcollector to gather snmp metrics. Polling time or state does not affect the other devices, and you have a nice web interface to check device runtime statistics (state, polling time, number of metrics, errors, etc.). https://github.com/toni-moreno/snmpcollector Check the wiki for configuration examples |
toni-moreno, thanks for the advice, I will check it out, but I don't believe that bug reports are the place to advertise your products. |
+1 However, I think the issue is more serious than implied by your report. I believe the collection is failing because the input is exceeding the collection interval for the entire input. That is, if you've got 100 routers and it takes a couple of seconds to poll each one, completing the collection run exceeds the interval even though each box responds quickly. I have proven this in my environment, as I have the same problem even with no devices timing out, doing basically just ifTable queries, with as few as about 150 devices.
I've tried to resolve the issue by breaking my configuration into chunks of [[inputs.snmp]], but they all seem to be executed as if they were one, so you run into the same problem:
Mar 21 15:38:00 act-collector01 telegraf[6933]: 2018-03-21T04:38:00Z E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (2m0s)
Unfortunately, this makes Telegraf with the snmp plugin unusable for collecting network metrics via SNMP on a reasonably sized network without a considerable amount of scripting to stand up additional telegraf instances and delegate appropriately sized configurations to each. My environment details:
Example file (replace the agent string with 100+ hostnames...):
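(The original example file is not preserved in this thread; the following is only a minimal sketch of what such a single-block, many-agent configuration might look like, with the hostnames, community string and table choice assumed for illustration.)
[[inputs.snmp]]
  # replace with 100+ real hostnames
  agents = ["router001:161", "router002:161", "router003:161"]
  interval = "120s"
  timeout = "5s"
  retries = 3
  version = 2
  community = "public"
  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifTable"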
|
Plugins don't fail if they exceed the interval; this is really just a warning message that the plugin was scheduled to run again while it still has not completed. The plugin will continue to run, and it will not run again until the first collection completes.
The timeout in the configuration is per network operation, so if you have a 5s timeout and multiple retries and multiple fields, it can add up pretty fast and it is difficult to know how long a collection can take. This definitely needs to be improved; I think we should have one timeout that applies to an entire agent's work. However, if you split the agents into separate plugins, each one does run independently. The workaround of splitting the agents into multiple smaller plugins should work, though it is obviously hard to do until we improve the log messages. |
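As a rough illustration of how this can add up (assuming, for the sake of the example, that the worst case per request is about timeout * (retries + 1), which may not match the implementation exactly): with timeout = 5s, retries = 3 and, say, 20 separate table/field requests for one agent, a single unreachable agent could block for around 5s * 4 * 20 = 400s, already several times a typical one- or two-minute collection interval.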
@danielnelson I agree a per-agent timeout value would be much better, basically a hard limit on how long a collection can take on a single host. You could even make it automatic, maybe by dividing the interval by the number of hosts divided by the number of parallel SNMP sessions. I tried splitting my snmp inputs into groups of 50 hosts each and it almost works, but I still get some groups that take too long to collect, so you end up with weird-looking graphs for certain devices. I wrote some automation to do this which, while nice, is probably too much additional overhead for the relatively simple task of collecting network metrics. |
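Purely as an illustration of that idea (the numbers here are assumed, not taken from the thread): with a 60s interval, 100 hosts and 10 parallel SNMP sessions, such an automatic limit would work out to roughly 60s / (100 / 10) = 6s per host.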
The issue actually exists and the plugin fails if you use lots of agents in one instance. But I have solved my problem by creating a separate configuration file per device and putting them all in the /etc/telegraf/telegraf.d/ directory. Each file contains several snmp plugin instances, usually one per snmp table. I have created several templates per device type, came up with a naming convention for the config files, and wrote a script which takes a list of devices and creates the config files, so I made it really easy for myself to provision devices monitored by telegraf. This way I isolated all "took longer to collect" problems to a per-device basis, but the logging improvement is still needed, because if some device takes longer to collect I don't know which one. Now I only run into problems with very large devices; for example, I have several switches with more than 1200 interfaces and telegraf is not able to poll the whole ifTable and ifXTable within one minute. But this is a different topic. |
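As an illustration of that layout (the file name, device name and exact table choices below are assumed, not taken from the comment), one generated per-device file in /etc/telegraf/telegraf.d/ might look roughly like this:
# /etc/telegraf/telegraf.d/switch01.conf (hypothetical device)
[[inputs.snmp]]
  # one plugin instance per SNMP table
  agents = ["switch01:161"]
  version = 2
  community = "SNMPcommunity"
  timeout = "5s"
  retries = 3
  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifTable"

[[inputs.snmp]]
  agents = ["switch01:161"]
  version = 2
  community = "SNMPcommunity"
  timeout = "5s"
  retries = 3
  [[inputs.snmp.table]]
    name = "interface_x"
    oid = "IF-MIB::ifXTable"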
Have there been any updates on this? I'm seeing the same thing when one of three routers is down. |
Any update on this? I'm seeing the same behaviour: when multiple hosts are configured and one is down, the snmp queries appear to be sent sequentially to the hosts, so if one or more hosts are down, the snmp plugin can time out on the offline hosts before actually completing the list of online hosts. Is it not possible to run the snmp queries to all configured hosts / agents in parallel so that one host being down wouldn't affect gathering metrics for the others? Thanks! |
In a nutshell, the workaround is to talk to a single remote agent per plugin:
[[inputs.snmp]]
agents = ["host1"]
# other options
[[inputs.snmp]]
agents = ["host2"]
# other options |
Ok, thanks @danielnelson. I understand the workaround but, if "# other options" is an extensive list of OIDs, managing any changes to those options would be complicated across a number of hosts. Are there plans to make the snmp calls in parallel to the listed agents or, if a response is not received from AgentA, to continue with AgentB before retrying AgentA? Thanks! |
Yes, the other options would be whatever tables/fields you want to collect and would need to be repeated. If you are reading this issue you probably have lots of agents, and for managing that I recommend using a templating program to generate your configuration. To be clear, the plugin does make SNMP calls in parallel to the agents, but all agents must complete before the next collection will begin. This means one agent can theoretically hold up all the other agents for as long as the total timeout is set. This behavior probably won't change anytime soon. Placing agents in separate plugin definitions will allow them to be fully independent. |
I just tested with the current telegraf version, and here the metrics for agentB come through if agentA is not responding, so it seems this issue isn't relevant anymore. Note that this still applies:
|
Hi, we are currently observing the same/similar issue with telegraf version 1.20.0. We used the following definition and observed very sparse data within influxdb, something like one data point for every agent every 20-50 minutes.
A timed test execution of the configuration has shown the following
My checks showed that the devices 1a7 and 2a7 were not pingable because of a network issue, but at first this wasn't clear, since the output above took a very long time to appear, so we thought that telegraf was hanging completely with the test cmd. Only after we decreased the interval and timeout to 5s did the execution time come down to
So we were able to see the root cause of the problem. Right now it is not clear to us why we see such a long execution time when we only define a 5s timeout for the snmp block, especially since as a result we only get data points for the other devices at that same long interval. Our expectation was that if some devices are not accessible, we would still get a new value in influxdb at least every timeout seconds. It would be great if somebody could have a look at whether the problem was never really fixed, or whether another problem was introduced/discovered. @Hipska please let me know if I should open a new ticket for this problem, but right now I think this is the same issue. So could you please reopen it? Best Regards |
It is indeed not a new issue, see comment here: #7300 (comment) |
Hi Hipska, yes, I understand that a split into single inputs.snmp blocks would resolve the problem for the other agent definitions, but it feels wrong to me that two faulty agents with a timeout of 60s lead to a hanging block of 24 minutes. That is a factor of 12-24 longer than what could be expected. Or am I overlooking something? |
It is indeed strange, but see that comment, the total delay is something like this: |
Ah ok, that seems to make sense. We have two agents not responding, with two tables each and a timeout of 60s. And since the test run serialises the agents, we can calculate 2 agents * (2 * timeout * 3 * 2) = 24 minutes (timeout=60s) or 2 minutes (timeout=5s). Still I wonder why there is the ... So yes, the sparse output of the other agents every 24 or 2 minutes is the expected behaviour right now. Since this is a really big issue and the maintenance effort quickly explodes, is there anything we can do to help change the actual behaviour? |
Anybody is free to create pull requests for fixing code and issues. I'm using Ansible to generate an snmp config file per agent, and it works great with 3k+ agents. |
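As a rough sketch of that approach (the variable names, paths and table choice here are assumptions, not taken from the comment), a Jinja2 template could render one config file per agent into /etc/telegraf/telegraf.d/, for example via the ansible.builtin.template module in a loop over the agent list:
# templates/telegraf-snmp.conf.j2 (hypothetical; "item" is the current agent in the loop)
[[inputs.snmp]]
  agents = ["{{ item }}:161"]
  version = 2
  community = "{{ snmp_community }}"
  timeout = "5s"
  retries = 3
  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifTable"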
Yeah, we have also moved to a single agent per snmp entry now, with an auto-generated config. Nevertheless, it is the best solution. |
# Bug report
When running the inputs.snmp plugin and polling several devices (agents) within the same instance, if at least one device does not send back all requested information, all devices in that inputs.snmp instance fail. Other instances running at the same time are not impacted.
My case example: I am polling several hundred network devices with several inputs.snmp instances every 1 minute. Each instance is dedicated to a single metric.
When polling interface metrics on a large, remotely located network device (a few hundred interfaces) with high latency (~100ms), there simply is not enough time for all requests and responses. And this is my problem to solve. But due to this condition all the other few hundred devices in the same instance fail, even though they send back all their data in time.
A log message is generated:
E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (1m0s)
And from the log it is impossible to tell which device is failing.
# Relevant telegraf.conf:
[[inputs.snmp]]
agents = ["test1:161" , "test2:161" , "test5" , "test3:161"]
interval = "60s"
timeout = "5s"
retries = 3
version = 2
community = "SNMPcommunity"
max_repetitions = 100
# System info:
Telegraf: 1.5.1-1
OS: RHEL 7.4
# Expected behavior:
Telegraf should fail only the device that does not respond in time and generate a log message indicating which device failed.
# Actual behavior:
The whole instance fails and only a general message is generated.