inputs.snmp walk requesting OIDs multiple times and causing significant device load #10420
Comments
If Telegraf wants to request individual columns, it should put all of the columns in a single GetBulk request, and then call GetBulk on the last value returned for each column in the previous response. It should set max_repetitions on a per-table basis to max_repetitions divided by the number of columns, i.e. for max_repetitions = 10 and 4 columns in one request, max_repetitions should be set to 2. This greatly reduces the number of OIDs returned from other tables, and with it the load on the device.
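A minimal Go sketch of the batched walk suggested above, run against a simulated agent rather than a real device. Everything in it is hypothetical and for illustration only: the `OID` type, the in-memory `agent` and its `getBulk` method (which mimics GetBulk with non-repeaters = 0, returning varbinds interleaved by repetition, with `nil` standing in for endOfMibView), `walkColumns`, and the table under the made-up enterprise OID 1.3.6.1.4.1.99999. It is not Telegraf's code and not the API of any SNMP library.

```go
package main

import (
	"fmt"
	"sort"
)

type OID []int // numeric SNMP object identifier

// less orders OIDs component by component, as SNMP agents do.
func less(a, b OID) bool {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return len(a) < len(b)
}

// hasPrefix reports whether o lies inside the subtree p (its first len(p) arcs equal p).
func hasPrefix(o, p OID) bool {
	return len(o) >= len(p) && !less(o[:len(p)], p) && !less(p, o[:len(p)])
}

// agent is a hypothetical in-memory SNMP agent holding a sorted OID list.
type agent struct {
	oids               []OID
	requests, returned int
}

// getBulk mimics GetBulk with non-repeaters = 0: varbinds are interleaved by repetition.
func (a *agent) getBulk(starts []OID, maxRep int) (out []OID) {
	a.requests++
	cur := append([]OID(nil), starts...)
	for r := 0; r < maxRep; r++ {
		for i, o := range cur {
			k := sort.Search(len(a.oids), func(x int) bool { return less(o, a.oids[x]) })
			cur[i] = nil // nil stands in for endOfMibView
			if o != nil && k < len(a.oids) {
				cur[i] = a.oids[k]
			}
			out = append(out, cur[i])
		}
	}
	a.returned += len(out)
	return out
}

type column struct { // one table column being walked
	prefix, cur OID // cur: where the walk resumes; nil once the column is finished
	values      []OID
}

// walkColumns puts every unfinished column in one GetBulk and splits max_repetitions between them.
func walkColumns(a *agent, cols []*column, maxRep int) {
	perCol := maxRep / len(cols) // e.g. 10 repetitions / 4 columns = 2 each
	if perCol < 1 {
		perCol = 1
	}
	for {
		live, starts := []*column{}, []OID{}
		for _, c := range cols {
			if c.cur != nil {
				live, starts = append(live, c), append(starts, c.cur)
			}
		}
		if len(live) == 0 {
			return
		}
		for j, o := range a.getBulk(starts, perCol) {
			c := live[j%len(live)] // each column resumes from its own last value
			if !hasPrefix(o, c.prefix) {
				c.cur = nil // left the column: never request it again
				continue
			}
			c.values, c.cur = append(c.values, o), o
		}
	}
}

// demo builds a hypothetical rows x n table under 1.3.6.1.4.1.99999.1, then one later OID.
func demo(rows, n int) (a *agent, cols []*column) {
	a = &agent{}
	for c := 1; c <= n; c++ {
		col := OID{1, 3, 6, 1, 4, 1, 99999, 1, 1, c}
		cols = append(cols, &column{prefix: col, cur: col})
		for r := 1; r <= rows; r++ {
			a.oids = append(a.oids, OID{1, 3, 6, 1, 4, 1, 99999, 1, 1, c, r})
		}
	}
	a.oids = append(a.oids, OID{1, 3, 6, 1, 4, 1, 99999, 2})
	return a, cols
}

func main() {
	a, cols := demo(3, 4)
	walkColumns(a, cols, 10)
	fmt.Println(a.requests, a.returned, len(cols[0].values)) // prints: 2 16 3
}
```

With 4 columns of 3 rows and max_repetitions = 10, each column gets 2 repetitions per request, and the walk finishes in 2 GetBulk requests returning 16 varbinds. Because a column is dropped as soon as it returns an OID outside itself, the overshoot past each column is paid only once, in the final request.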
We moved away from … For now I will be closing this issue as …
Hi @MyaLongmire, this is not an snmptranslate issue; it is an issue in how Telegraf's SNMP polling works. I have included the snmptranslate logs because they are the only SNMP logs I get out of Telegraf. Can you please re-open this issue?
Hi @nward, sorry, I misunderstood your issue. Thank you for updating Telegraf and for being very thorough in your new issue. Next time, please allow some time for our team to respond before opening another issue. Please understand that we cannot be online all the time and are trying our best to assist everyone.
Relevant telegraf.conf
Logs from Telegraf
Note: this is 1.20.4. I am unable to test whether this issue exists in 1.21.2, as the SNMP MIB parser still appears to be broken for Juniper MIBs.
These logs show snmptranslate running for each column, but they are not very useful; there doesn't appear to be detailed logging for the SNMP module.
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingTable"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "JUNIPER-MIB::jnxOperatingTable.1"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptable" "-Ch" "-Cl" "-c" "public" "127.0.0.1" "JUNIPER-MIB::jnxOperatingTable"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingDescr"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingContentsIndex"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingL1Index"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingL2Index"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingL3Index"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingState"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingTemp"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingCPU"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingISR"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingDRAMSize"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingBuffer"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingHeap"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingUpTime"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingLastRestart"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingMemory"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingStateOrdered"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingChassisId"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingChassisDescr"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingRestartTime"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating1MinLoadAvg"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating5MinLoadAvg"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating15MinLoadAvg"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating1MinAvgCPU"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating5MinAvgCPU"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating15MinAvgCPU"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingFRUPower"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingBufferCP"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingMemoryCP"
System info
Telegraf 1.20.4
Docker
No response
Steps to reproduce
Expected behavior
When walking a table, an SNMP manager (i.e. client) should call GetNext or GetBulk on the last OID in the previous response.
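The expected behaviour can be sketched in Go as a single-varbind walk that resumes from the last OID of each response and stops at the first OID outside the table. The simulated `agent`, its `getBulk` method, `walkTable`, `demoAgent`, and the table under the hypothetical enterprise OID 1.3.6.1.4.1.99999 are assumptions for illustration only, not Telegraf's implementation or a real SNMP library's API.

```go
package main

import (
	"fmt"
	"sort"
)

// OID is a numeric SNMP object identifier.
type OID []int

// less orders OIDs component by component, as SNMP agents do.
func less(a, b OID) bool {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return len(a) < len(b)
}

// hasPrefix reports whether o lies inside the subtree p.
func hasPrefix(o, p OID) bool {
	return len(o) >= len(p) && !less(o[:len(p)], p) && !less(p, o[:len(p)])
}

// agent is a hypothetical in-memory SNMP agent holding a sorted OID list.
type agent struct {
	oids               []OID
	requests, returned int
}

// getBulk mimics a single-varbind GetBulk: up to maxRep OIDs strictly after start.
func (a *agent) getBulk(start OID, maxRep int) []OID {
	a.requests++
	i := sort.Search(len(a.oids), func(x int) bool { return less(start, a.oids[x]) })
	end := i + maxRep
	if end > len(a.oids) {
		end = len(a.oids)
	}
	a.returned += end - i
	return a.oids[i:end]
}

// walkTable resumes each GetBulk from the last OID of the previous response
// and stops at the first OID outside the table, so it overshoots only once.
func walkTable(a *agent, table OID, maxRep int) []OID {
	var values []OID
	cur := table
	for {
		resp := a.getBulk(cur, maxRep)
		if len(resp) == 0 {
			return values // end of the agent's MIB view
		}
		for _, o := range resp {
			if !hasPrefix(o, table) {
				return values
			}
			values = append(values, o)
		}
		cur = resp[len(resp)-1]
	}
}

// demoAgent builds a hypothetical rows x cols table under 1.3.6.1.4.1.99999.1,
// followed by 20 unrelated OIDs.
func demoAgent(rows, cols int) *agent {
	a := &agent{}
	for c := 1; c <= cols; c++ {
		for r := 1; r <= rows; r++ {
			a.oids = append(a.oids, OID{1, 3, 6, 1, 4, 1, 99999, 1, 1, c, r})
		}
	}
	for n := 1; n <= 20; n++ {
		a.oids = append(a.oids, OID{1, 3, 6, 1, 4, 1, 99999, 2, n})
	}
	return a
}

func main() {
	a := demoAgent(1, 14) // one row, 14 columns
	values := walkTable(a, OID{1, 3, 6, 1, 4, 1, 99999, 1}, 10)
	fmt.Println(len(values), a.requests, a.returned) // prints: 14 2 20
}
```

For a one-row, 14-column table with max_repetitions = 10, this takes 2 requests returning 20 varbinds: the first returns 10 in-table values, the second the last 4 plus 6 OIDs past the table, which ends the walk.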
Actual behavior
Telegraf does not correctly implement walking whole SNMP tables - instead it treats each column as an SNMP table and walks each column independently, which has a high performance cost on the monitored devices.
If GetBulk returns values from the next column in the table, those values are ignored and the next column is fetched afresh.
In a table with few rows, the same columns may be returned many times - which may cause very high load on the SNMP agent (i.e. device).
For example, Juniper SRX branch devices not operating in a cluster have a JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsTable table with one entry (row) and 14 columns. With the default max_repetitions of 10, this should be fetched with 2 GetBulk requests: one returning the first 10 values, the other returning the final 4 followed by the next 6 entries in the SNMP tree. However, Telegraf instead sends 14 GetBulk requests, one per column, and each response contains the next 10 entries, which are then requested again in the next GetBulk request. With max_repetitions set higher, this effect gets significantly worse.
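The request counts for the SRX table follow from simple arithmetic. This sketch models both strategies under two stated assumptions: every GetBulk response is full (the agent has at least max_repetitions OIDs after the table, and no response is truncated to fit the PDU size), and the per-column behaviour is as described in this issue. `wholeTableWalk` and `perColumnWalk` are hypothetical helpers, not Telegraf functions.

```go
package main

import "fmt"

// wholeTableWalk models walking a rows x cols table by resuming each GetBulk
// from the last OID of the previous response: rows*cols/maxRep responses that
// stay inside the table, plus one that reaches past its end.
// Assumes every response is full (maxRep varbinds).
func wholeTableWalk(rows, cols, maxRep int) (requests, varbinds int) {
	requests = rows*cols/maxRep + 1
	return requests, requests * maxRep
}

// perColumnWalk models walking each column as an independent subtree,
// so every column pays for its own overshoot past its end.
// Assumes every response is full (maxRep varbinds).
func perColumnWalk(rows, cols, maxRep int) (requests, varbinds int) {
	requests = cols * (rows/maxRep + 1)
	return requests, requests * maxRep
}

func main() {
	for _, maxRep := range []int{10, 50} {
		wr, wv := wholeTableWalk(1, 14, maxRep) // SRX table: 1 row, 14 columns
		pr, pv := perColumnWalk(1, 14, maxRep)
		fmt.Printf("max_repetitions=%d whole-table: requests=%d varbinds=%d per-column: requests=%d varbinds=%d\n",
			maxRep, wr, wv, pr, pv)
	}
}
```

At max_repetitions = 10 this gives 2 requests and 20 varbinds for a whole-table walk versus 14 requests and 140 varbinds per column. At max_repetitions = 50 the whole-table walk needs a single request, while the per-column figure grows to 700 varbinds.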
Note that SNMP agents will correctly return values outside the table once the end of the table is reached. This should only ever happen once per walk of a table, because the walk detects that the OIDs in the response are outside the requested table. That is acceptable, even if requesting these values is expensive, since it only happens once. Telegraf's SNMP implementation does not work this way: it requests data past the end of the table over and over again.
Additional info
No response