inputs.snmp walk requesting OIDs multiple times and causing significant device load #10420

Closed
nward opened this issue Jan 11, 2022 · 4 comments
Labels
area/snmp bug unexpected problem or unintended behavior

Comments

nward commented Jan 11, 2022

Relevant telegraf.conf

[[inputs.snmp]]
    interval = "60s"
    agents = [ "1.2.3.4" ]
    version = 2
    community = "SECRET"

[[inputs.snmp.table]]
    inherit_tags = [ "hostname" ]
    oid = "JUNIPER-MIB::jnxOperatingTable"

    [[inputs.snmp.table.field]]
        oid = "JUNIPER-MIB::jnxOperatingDescr"
        is_tag = true

Logs from Telegraf

Note: this is Telegraf 1.20.4. I am unable to test whether this issue exists in 1.21.2, as the SNMP MIB parser still appears to be broken for Juniper MIBs.

These logs show snmptranslate running for each column, but they are not very useful for this problem. There doesn't appear to be detailed logging for the SNMP module.

2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingTable"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "JUNIPER-MIB::jnxOperatingTable.1"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptable" "-Ch" "-Cl" "-c" "public" "127.0.0.1" "JUNIPER-MIB::jnxOperatingTable"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingDescr"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingContentsIndex"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingL1Index"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingL2Index"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingL3Index"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingState"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingTemp"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingCPU"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingISR"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingDRAMSize"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingBuffer"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingHeap"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingUpTime"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingLastRestart"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingMemory"
2022-01-11T09:02:00Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingStateOrdered"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingChassisId"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingChassisDescr"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingRestartTime"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating1MinLoadAvg"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating5MinLoadAvg"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating15MinLoadAvg"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating1MinAvgCPU"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating5MinAvgCPU"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperating15MinAvgCPU"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingFRUPower"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingBufferCP"
2022-01-11T09:02:01Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "JUNIPER-MIB::jnxOperatingMemoryCP"

System info

Telegraf 1.20.4

Docker

No response

Steps to reproduce

  1. Configure telegraf to poll an SNMP table
  2. Look at tcpdump, and note that when walking a table the same OIDs are in more than one response

Expected behavior

When walking a table, an SNMP manager (i.e. client) should call GetNext or GetBulk on the last OID in the previous response.

Actual behavior

Telegraf does not correctly implement walking whole SNMP tables - instead it treats each column as an SNMP table and walks each column independently, which has a high performance cost on the monitored devices.

If a GetBulk response includes values from the next column in the table, those values are ignored, and the next column is fetched fresh.

In a table with few rows, the same columns may be returned many times - which may cause very high load on the SNMP agent (i.e. device).

For example, Juniper SRX branch devices that are not operating in a cluster have a JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsTable table with one entry (row) and 14 columns. With the default max_repetitions of 10, this should be fetched with 2 GetBulk requests: the first returns the first 10 values, and the second returns the final 4 plus the next 6 entries in the SNMP tree. Instead, Telegraf sends 14 GetBulk requests, one for each column, and each response contains the next 10 entries, most of which are then requested again in the next GetBulk request.

With max_repetitions set higher, this effect gets significantly worse.
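The request counts in this example can be checked with a small simulation. This is only a sketch, not Telegraf code: `SimulatedAgent`, `walk_subtree`, and the OID values are hypothetical, and the agent is modelled as a sorted list of OIDs whose GetBulk (single varbind, no non-repeaters) returns the next `max_repetitions` entries after the requested OID.

```python
from bisect import bisect_right


class SimulatedAgent:
    """Toy SNMP agent: a sorted list of OIDs (tuples of ints); values are omitted."""

    def __init__(self, oids):
        self.oids = sorted(oids)
        self.requests = 0   # number of GetBulk requests received
        self.varbinds = 0   # number of varbinds sent back in total

    def getbulk(self, oid, max_repetitions):
        """Return up to max_repetitions OIDs that sort strictly after `oid`."""
        self.requests += 1
        start = bisect_right(self.oids, oid)
        response = self.oids[start:start + max_repetitions]
        self.varbinds += len(response)
        return response


def in_subtree(oid, root):
    return oid[:len(root)] == root


def walk_subtree(agent, root, max_repetitions=10):
    """Walk `root`, always continuing from the last OID of the previous response."""
    found, cursor = [], root
    while True:
        response = agent.getbulk(cursor, max_repetitions)
        inside = [o for o in response if in_subtree(o, root)]
        found.extend(inside)
        # Stop at end of MIB (empty response) or once the response leaves the subtree.
        if not response or len(inside) < len(response):
            return found
        cursor = response[-1]


TABLE = (1, 3, 6, 1, 4, 1, 99999, 1)                      # hypothetical table OID
ENTRY = TABLE + (1,)
columns = [ENTRY + (c,) for c in range(1, 15)]            # 14 columns
table_oids = [col + (0,) for col in columns]              # one row
other_oids = [(1, 3, 6, 1, 4, 1, 99999, 2, n, 0) for n in range(1, 21)]

# Strategy 1: walk the table as one subtree.
whole = SimulatedAgent(table_oids + other_oids)
rows_whole = walk_subtree(whole, TABLE)

# Strategy 2: walk every column as if it were its own table.
per_col = SimulatedAgent(table_oids + other_oids)
rows_per_col = [o for col in columns for o in walk_subtree(per_col, col)]

print(whole.requests, whole.varbinds)      # 2 20
print(per_col.requests, per_col.varbinds)  # 14 140
```

Both strategies collect the same 14 values, but in this model the per-column walk makes the agent produce seven times as many varbinds, most of them duplicates or entries outside the table.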

Note that SNMP agents will correctly return values outside the table if the end of the table is reached - this should only ever happen once per walk of a table, as the walk detects that the OIDs in the response are outside the requested table. This is OK, even if requesting these values is expensive, as it only happens once. In telegraf's SNMP implementation, this is not the case - it requests data past the end of the table over and over again.
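Detecting that a response has left the table comes down to an OID subtree test. A minimal sketch of such a test follows; `parse_oid` and `is_within` are hypothetical helper names and the OIDs are made up. It compares OIDs one arc at a time, because a plain string prefix check would wrongly treat `...99999.10` as being inside `...99999.1`.

```python
def parse_oid(text):
    """Turn '.1.3.6.1.2.1' or '1.3.6.1.2.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))


def is_within(oid, table_root):
    """True if `oid` lies strictly under `table_root`, compared one arc at a time."""
    oid_arcs, root_arcs = parse_oid(oid), parse_oid(table_root)
    return len(oid_arcs) > len(root_arcs) and oid_arcs[:len(root_arcs)] == root_arcs


TABLE = "1.3.6.1.4.1.99999.1"  # hypothetical table OID

print(is_within("1.3.6.1.4.1.99999.1.1.5.0", TABLE))  # True: a cell of the table
print(is_within("1.3.6.1.4.1.99999.2.1.0", TABLE))    # False: past the end of the table
print(is_within("1.3.6.1.4.1.99999.10.1.0", TABLE))   # False, although str.startswith says True
```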

Additional info

No response

@nward nward added the bug unexpected problem or unintended behavior label Jan 11, 2022

nward commented Jan 11, 2022

If Telegraf wants to request individual columns, it should do this by putting the columns in a single GetBulk request - and then call GetBulk on the last value from each column in the previous response.

It should set max_repetitions on a per table basis to int(max_repetitions / number of columns).

I.e. for max_repetitions = 10, and 4 columns in one request, max_repetitions should be set to 2.

This greatly reduces the number of OIDs returned from other tables, which in turn reduces device load.
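The proposal above can be sketched as follows. This is an illustration under assumptions, not Telegraf's implementation: `per_column_repetitions` and `getbulk_columns` are hypothetical names, the OIDs are made up, the floor of 1 repetition is an added assumption for tables with more columns than max_repetitions, and the agent is simulated as a sorted list of OIDs. The simulated GetBulk follows the usual semantics for several repeating varbinds: each repetition returns the lexicographic successor of every cursor.

```python
from bisect import bisect_right


def per_column_repetitions(max_repetitions, n_columns):
    """int(max_repetitions / n_columns), floored at 1 (the floor is an assumption)."""
    return max(1, max_repetitions // n_columns)


def getbulk_columns(sorted_oids, cursors, repetitions):
    """Simulate one GetBulk carrying several varbinds (non-repeaters = 0).

    The response holds up to repetitions * len(cursors) varbinds, grouped here
    as one list per repetition. None stands in for endOfMibView.
    """
    rows, cursors = [], list(cursors)
    for _ in range(repetitions):
        row = []
        for cursor in cursors:
            i = bisect_right(sorted_oids, cursor)
            row.append(sorted_oids[i] if i < len(sorted_oids) else None)
        rows.append(row)
        # The next repetition continues from what this one returned.
        cursors = [nxt if nxt is not None else cur for nxt, cur in zip(row, cursors)]
    return rows


ENTRY = (1, 3, 6, 1, 4, 1, 99999, 1, 1)                   # hypothetical table entry OID
columns = [ENTRY + (c,) for c in range(1, 5)]             # 4 columns
oids = sorted(col + (row,) for col in columns for row in range(1, 4))  # 3 rows

reps = per_column_repetitions(10, len(columns))
first = getbulk_columns(oids, columns, reps)

print(reps)        # 2
print(len(first))  # 2 repetitions, each holding one value per column
```

In this model the first request returns rows 1 and 2 of all four columns, and every varbind in the response belongs to the table. The next request would continue from the last repetition's OIDs, one cursor per column.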

@MyaLongmire
Contributor

We moved away from snmptranslate due to heavy resource load. Since we are moving away from snmptranslate, we will no longer be patching that version and are instead moving forward with gosmi. If you have a specific issue with gosmi, please do open an issue and we can determine how to fix it going forward.

For now I will be closing this issue as snmptranslate is deprecated in telegraf.


nward commented Jan 11, 2022

Hi @MyaLongmire, this is not an snmptranslate issue. This is an issue in how Telegraf's SNMP polling works.

I have included the snmptranslate logs as they are the only logs I get out of telegraf for snmp.

Can you please re-open this issue?

@MyaLongmire
Contributor

Hi @nward, sorry, I misunderstood your issue. Thank you for updating Telegraf and being very thorough in your new issue. Next time, please allow some time for our team to respond before opening another issue. Please understand that we cannot be online all the time and are trying our best to assist everyone.
