inputs.snmp walk requests OIDs multiple times and causes high device load #10427
Comments
Hi Sven. No, that PR is unrelated - it is about MIB parsing.
This issue is about how the SNMP protocol itself is used - not the SMI/MIB parsing or the recent gosmi changes.
I have not tried the latest master, but there have not been any changes related to the SNMP protocol since 1.21.2 (which I am using).
On 13/01/2022, at 01:44, Sven Rebhan wrote:
I think PR #10206 should fix the issue. Have you tried latest master?
|
@nward yeah I saw it and thus deleted the comment. :-) |
Ahh sorry! I replied via email so didn’t catch that. |
Thank you for updating telegraf! I misunderstood your original issue as a problem with I believe walk and get are handled by the
Meaning this is an upstream issue. If you wouldn't mind opening an issue with them :) |
Hi @MyaLongmire Yes, Walk and Get are handled by Their That is not the appropriate method for regularly polling a table for which we know the structure. Instead, If it is useful, I can write some pseudocode of how this needs to operate. |
Telegraf is indeed using the mentioned walk function, see the following parts: telegraf/plugins/inputs/snmp/snmp.go Line 479 in 30d981d
telegraf/internal/snmp/wrapper.go Lines 23 to 30 in 30d981d
https://github.com/gosnmp/gosnmp/blob/96366f3fa26cabbea05d3d854ad8577650246ffd/gosnmp.go#L567-L573 Using GetBulk implies you know beforehand how many rows a table has. Telegraf does not know this, so walk is still the better option. |
Hi @Hipska, thanks for looking into this.
This is not correct - the packet's max_repetitions is what sets the size of a GetBulk response, and it is of course an option included in Telegraf - so I am not clear why you make this assertion. The correct way to poll a table, regardless of whether you know the number of rows, is to call GetBulk with the requested OIDs set to some columns, and the request's max_repetitions set to the desired number of responses in a single SNMP message divided by the number of columns you are requesting at once. For example, say you have a table with 3 columns (colA, colB, colC), and an unknown number of rows. You want to receive 10 values per response packet (as indicated by what the user configures max_repetitions to be in
You iterate through this, checking that each varbind is within the column requested. If not, you discard the value and move on. When you have completed looking at this response, you construct another GetBulk, with the request OIDs being the last "in-column" OIDs of each column. In this case, the request is the OIDs for If you have a GetBulk where all columns have a varbind which is not in the requested column, you have completed polling those columns. You then move on to the next set of columns, until you have polled all columns in the table. Some things to note which catch out some implementations:
|
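The multi-column polling loop described in the comment above can be sketched roughly as follows. This is a minimal Python simulation, not Telegraf or gosnmp code: `get_bulk`, `poll_table`, the in-memory `agent` list and the example OIDs are all hypothetical, the simulated GetBulk assumes non_repeaters = 0, and clamping max_repetitions to at least 1 is an assumption added here.

```python
from bisect import bisect_right

def get_bulk(agent, oids, max_repetitions):
    """Simulated GetBulk (non_repeaters=0) over a sorted list of (oid, value).

    Varbinds come back interleaved: repetition 1 for every requested OID,
    then repetition 2, and so on. (None, None) stands in for endOfMibView.
    """
    keys = [oid for oid, _ in agent]
    cursors = [bisect_right(keys, oid) for oid in oids]
    varbinds = []
    for _ in range(max_repetitions):
        for i in range(len(oids)):
            if cursors[i] < len(agent):
                varbinds.append(agent[cursors[i]])
                cursors[i] += 1
            else:
                varbinds.append((None, None))
    return varbinds

def poll_table(agent, columns, max_values):
    """Poll several columns per request, as described in the comment above."""
    # Assumption: never let max_repetitions drop to 0 for wide tables.
    reps = max(1, max_values // len(columns))
    last = {col: col for col in columns}  # next request OID for each column
    rows = {col: [] for col in columns}
    requests = 0
    while True:
        response = get_bulk(agent, [last[c] for c in columns], reps)
        requests += 1
        done = set()
        for n, (oid, value) in enumerate(response):
            col = columns[n % len(columns)]
            if col in done:
                continue  # already ran past the end of this column
            if oid is None or oid[:len(col)] != col:
                done.add(col)  # varbind is outside the column: discard it
                continue
            rows[col].append((oid, value))
            last[col] = oid  # last "in-column" OID seeds the next request
        if len(done) == len(columns):
            return rows, requests

# Hypothetical table: 3 columns, 7 rows, followed by one unrelated OID.
base = (1, 3, 6, 1, 2, 1, 2, 2, 1)
columns = [base + (10,), base + (11,), base + (12,)]
agent = sorted(
    [(col + (row,), row * 100) for col in columns for row in range(1, 8)]
    + [(base + (13, 1), 0)]
)
rows, requests = poll_table(agent, columns, max_values=10)
print(requests, {col[-1]: len(vals) for col, vals in rows.items()})
```

With 3 columns and 10 values per response, each request asks for 3 repetitions of each column, so the overrun at the end of the table is bounded by one partial response rather than repeated once per column.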
Hmm, okay you seem to know a bit on this subject. Would you be able to create a (draft) PR where you adapt the use of this method? That way, I can check the code and see if that would indeed be the better scenario. |
It is a curse :-) I will try - I don't know Go, but, that's never stopped me before. I have some rough sketches of the code that I put together over the weekend, but it will take some time to get working I am sure - it will require a rewrite of a lot of the module I think.. will see ! |
@nward not knowing Golang is not an issue. If you have something rough (where I can see the logic), please submit it as a draft and let me know. We can work out the details together if you like. |
Thanks @srebhan. I intend to get to this in the next week or so. For now we are going to production using collectd for SNMP polling and passing to telegraf - so I will have time once I have got this delivered. |
@nward - A cautious request :) Will you find time for this PR? /Henrik |
Hi @henriknoerr - I haven't had time yet, but I plan to in the next couple of weeks. We have so far used collectd sending data to telegraf for situations where polling SNMP tables has significant performance impacts on devices. |
@nward I'm still interested to see your solution.. |
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you! |
Relevant telegraf.conf
Logs from Telegraf
etc.
System info
Telegraf 1.21.2
Docker
Standard telegraf:latest (1.21.2)
Steps to reproduce
Expected behavior
I expect SNMP walk to request multiple columns at once, reducing max_repetitions appropriately (i.e. to int(max_repetitions / num_columns)).
Actual behavior
SNMP walk requests a single column at a time.
Because max_repetitions is set, GetBulk will return data past the requested column. SNMP bulkwalk implementations should request multiple columns at once, and set max_repetitions per request to int(max_repetitions / column count).
Telegraf's current implementation does not do this. Instead, it requests a single column at a time, and each column goes past the end by up to max_repetitions - 1 (i.e. if there is only a single row remaining to request).
See the following packet capture. This shows walking the IF-MIB::ifInOctets column, and then the IF-MIB::ifInUcastPkts column as two separate walks - note there is only a single OID in each GetBulk request.
When we reach the end of the IF-MIB::ifInOctets column, the GetBulk begins to return values from IF-MIB::ifInUcastPkts - as these are the next variables in the SNMP agent.
Then, we request IF-MIB::ifInUcastPkts from the top, and get values for the first several rows of the IF-MIB::ifInUcastPkts column a second time.
In the below capture, note the request at timestamp 15:16:31.263950 is asking for an OID which has been returned in a previous response (.1.3.6.1.2.1.2.2.1.11) - as it is walking a single column at a time, after walking .1.3.6.1.2.1.2.2.1.10.
I have included IF-MIB here as this is an SNMP table available on nearly every SNMP agent. It is not a very good example of this causing an issue on the device, as it is only returning a few extra responses and is quite a low % of "overrun". This is included to allow the issue to be reproduced by anyone.
Note the following example where this effect is much more pronounced. When walking JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsTable, which has 15 columns and on my device has a single row, telegraf runs separate walks for each column with a max_repetitions of 10. This causes the first and only entry of each column to be returned, along with the first and only entry in each of the next 9 columns. This repeats like this until column 8, and then we start also getting OIDs from outside the table until we finish walking the table with column 15.
This is very inefficient, as each OID in the table is returned from the device 10 times - and we are getting OIDs outside of the table up to 9 times each.
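To make the overrun concrete, here is a rough Python simulation of the single-column-at-a-time walk against a table shaped like the one above (15 columns, one row, max_repetitions of 10). It is a sketch under stated assumptions: `walk_column`, the in-memory `agent` list and every OID below are invented for illustration and are not the real Juniper OIDs, and the simulated agent simply returns the next max_repetitions OIDs in lexicographic order.

```python
from bisect import bisect_right
from collections import Counter

def walk_column(agent, column, max_repetitions):
    """Walk ONE column with simulated GetBulk requests.

    Returns every OID the agent sent back, including those past the
    end of the column, so the overrun can be counted.
    """
    keys = [oid for oid, _ in agent]
    cursor = column
    returned = []
    while True:
        start = bisect_right(keys, cursor)
        batch = agent[start:start + max_repetitions]
        returned.extend(oid for oid, _ in batch)
        in_column = [oid for oid, _ in batch if oid[:len(column)] == column]
        if len(in_column) < max_repetitions:
            return returned  # left the column (or hit the end of the MIB)
        cursor = in_column[-1]

# Hypothetical OIDs: a 15-column, single-row table plus 9 OIDs after it.
table = (1, 3, 6, 1, 4, 1, 99999, 1, 1)
columns = [table + (c,) for c in range(1, 16)]
outside = [(1, 3, 6, 1, 4, 1, 99999, 2, k) for k in range(1, 10)]
agent = sorted([(col + (0,), 0) for col in columns] + [(oid, 0) for oid in outside])

counts = Counter()
for col in columns:
    counts.update(walk_column(agent, col, max_repetitions=10))

total_returned = sum(counts.values())
useful = len(columns)  # one row per column is all that was actually wanted
print(total_returned, useful, max(counts.values()))
```

The counter records how many times the agent had to return each OID, which is the per-OID cost that the multi-column request pattern is meant to avoid.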
Additional info
No response