inputs.snmp walk requests OIDs multiple times and causes high device load #10427

nward · 2022-01-12T04:00:30Z

Relevant telegraf.conf

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[inputs.snmp]]
  agents = ["udp://192.168.0.1:161"]
  version = 2
  path = ["/mibs"]
  community = "test"

  [[inputs.snmp.table]]
    oid = "IF-MIB::ifTable"
    name = "interface"

    [[inputs.snmp.table.field]]
      oid = "IF-MIB::ifDescr"
      name = "ifDescr"
      is_tag = true

  [[inputs.snmp.table]]
    oid = "JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsTable"

    [[inputs.snmp.table.field]]
      oid = "JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsEntry"
      is_tag = true

[[outputs.file]]
  files = ["stdout"]

Logs from Telegraf

❯ docker run -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro -v $PWD/mibs:/mibs:ro telegraf
2022-01-12T02:08:37Z I! Starting Telegraf 1.21.2
2022-01-12T02:08:37Z I! Using config file: /etc/telegraf/telegraf.conf
2022-01-12T02:08:37Z I! Loaded inputs: snmp
2022-01-12T02:08:37Z I! Loaded aggregators:
2022-01-12T02:08:37Z I! Loaded processors:
2022-01-12T02:08:37Z I! Loaded outputs: file
2022-01-12T02:08:37Z I! Tags enabled: host=04fe51ab0f31
2022-01-12T02:08:37Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"04fe51ab0f31", Flush Interval:10s
Parse module: /mibs/RFC-1212:63:33: unexpected ".." (expected ")")
2022-01-12T02:08:40Z W! [inputs.snmp] module RFC-1212 could not be loaded
interface,agent_host=192.168.0.1,host=04fe51ab0f31,ifDescr=ge-0/0/5,ifIndex=521 ifOperStatus=2i,ifInNUcastPkts=0i,ifOutDiscards=0i,ifOutErrors=0i,ifInUcastPkts=0i,ifOutOctets=0i,ifOutUcastPkts=0i,ifOutNUcastPkts=0i,ifSpecific=".0.0",ifPhysAddress="58:00:bb:df:ea:c1",ifAdminStatus=1i,ifInOctets=0i,ifInErrors=0i,ifInUnknownProtos=0i,ifOutQLen=0i,ifType=6i,ifMtu=1514i,ifSpeed=1000000000i,ifLastChange=9703i,ifInDiscards=0i 1641953333000000000

etc.

System info

Telegraf 1.21.2

Docker

Standard telegraf:latest (1.21.2)

Steps to reproduce

Configure telegraf to poll an SNMP table
Note with tcpdump that the SNMP table is being polled inefficiently, leading to data being requested multiple times

Expected behavior

I expect SNMP walk to request multiple columns at once, reducing max_repetitions appropriately (i.e. to int(max_repetitions / num_columns)).

Actual behavior

SNMP walk requests a single column at a time.

Because max_repetitions is set, GetBulk will return data past the requested column. SNMP bulkwalk implementations should request multiple columns at once, and set max_repetitions per request to int(max_repetitions/column count).

Telegraf's current implementation does not do this. Instead, it requests a single column at a time, and each column goes past the end by up to max_repetitions - 1 (i.e. if there is only a single row remaining to request).
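
To make the arithmetic concrete, here is a minimal sketch of the per-request calculation. The function name perColumnRepetitions is hypothetical (it is not part of Telegraf or gosnmp), and clamping the result to at least 1 is my own assumption so that a request always makes progress when there are more columns than max_repetitions.

```go
package main

import "fmt"

// perColumnRepetitions splits a varbind budget (the user's max_repetitions)
// across the columns requested together in one GetBulk.
// Hypothetical helper: the clamp to 1 is an assumption, not Telegraf behaviour.
func perColumnRepetitions(maxRepetitions, numColumns int) int {
	if numColumns <= 0 {
		return 0
	}
	// Integer division, i.e. int(max_repetitions / num_columns).
	r := maxRepetitions / numColumns
	if r < 1 {
		r = 1
	}
	return r
}

func main() {
	fmt.Println(perColumnRepetitions(10, 3))  // 3 repetitions, 9 varbinds per response
	fmt.Println(perColumnRepetitions(10, 15)) // clamped to 1 repetition, 15 varbinds per response
}
```

When the clamp applies, the response exceeds the configured budget; an implementation could instead split the columns into smaller sets and poll one set at a time.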

See the following packet capture. This shows walking the IF-MIB::ifInOctets column, and then the IF-MIB::ifInUcastPkts column as two separate walks - note there is only a single OID in each GetBulk request.
When we reach the end of the IF-MIB::ifInOctets column, the GetBulk response begins to return values from IF-MIB::ifInUcastPkts - as this is the next variable in the SNMP agent.
Then, we request IF-MIB::ifInUcastPkts from the top, and get values for the first several rows of the IF-MIB::ifInUcastPkts column a second time.

In the capture below, note that the request at timestamp 15:16:31.263950 asks for .1.3.6.1.2.1.2.2.1.11, whose first rows were already returned in the previous response (15:16:31.263247) - because Telegraf walks a single column at a time, it starts this column from the top after walking .1.3.6.1.2.1.2.2.1.10.

15:16:30.931772 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(29)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10
15:16:30.960187 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(195)  .1.3.6.1.2.1.2.2.1.10.4=0 .1.3.6.1.2.1.2.2.1.10.6=3841180218 .1.3.6.1.2.1.2.2.1.10.7=0 .1.3.6.1.2.1.2.2.1.10.8=0 .1.3.6.1.2.1.2.2.1.10.9=0 .1.3.6.1.2.1.2.2.1.10.10=0 .1.3.6.1.2.1.2.2.1.10.11=0 .1.3.6.1.2.1.2.2.1.10.12=0 .1.3.6.1.2.1.2.2.1.10.21=49036 .1.3.6.1.2.1.2.2.1.10.22=3841120174
15:16:30.960659 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(30)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.22
15:16:31.032817 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(198)  .1.3.6.1.2.1.2.2.1.10.251=0 .1.3.6.1.2.1.2.2.1.10.501=0 .1.3.6.1.2.1.2.2.1.10.502=0 .1.3.6.1.2.1.2.2.1.10.503=0 .1.3.6.1.2.1.2.2.1.10.504=0 .1.3.6.1.2.1.2.2.1.10.505=0 .1.3.6.1.2.1.2.2.1.10.506=0 .1.3.6.1.2.1.2.2.1.10.507=0 .1.3.6.1.2.1.2.2.1.10.508=709435975 .1.3.6.1.2.1.2.2.1.10.509=0
15:16:31.033360 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(31)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.509
15:16:31.097061 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(210)  .1.3.6.1.2.1.2.2.1.10.510=0 .1.3.6.1.2.1.2.2.1.10.511=0 .1.3.6.1.2.1.2.2.1.10.513=4210062172 .1.3.6.1.2.1.2.2.1.10.515=709435975 .1.3.6.1.2.1.2.2.1.10.516=3240068682 .1.3.6.1.2.1.2.2.1.10.517=2185747961 .1.3.6.1.2.1.2.2.1.10.518=0 .1.3.6.1.2.1.2.2.1.10.519=0 .1.3.6.1.2.1.2.2.1.10.520=0 .1.3.6.1.2.1.2.2.1.10.521=0
15:16:31.097564 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(31)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.521
15:16:31.221709 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(202)  .1.3.6.1.2.1.2.2.1.10.522=0 .1.3.6.1.2.1.2.2.1.10.523=0 .1.3.6.1.2.1.2.2.1.10.524=3239957758 .1.3.6.1.2.1.2.2.1.10.525=0 .1.3.6.1.2.1.2.2.1.10.526=0 .1.3.6.1.2.1.2.2.1.10.527=0 .1.3.6.1.2.1.2.2.1.10.528=0 .1.3.6.1.2.1.2.2.1.10.529=0 .1.3.6.1.2.1.2.2.1.10.530=0 .1.3.6.1.2.1.2.2.1.10.531=272993948
15:16:31.222397 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(31)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.531
15:16:31.263247 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(194)  .1.3.6.1.2.1.2.2.1.10.532=0 .1.3.6.1.2.1.2.2.1.10.533=0 .1.3.6.1.2.1.2.2.1.10.534=0 .1.3.6.1.2.1.2.2.1.10.536=0 .1.3.6.1.2.1.2.2.1.10.537=0 .1.3.6.1.2.1.2.2.1.10.538=0 .1.3.6.1.2.1.2.2.1.11.4=0 .1.3.6.1.2.1.2.2.1.11.6=83475283 .1.3.6.1.2.1.2.2.1.11.7=0 .1.3.6.1.2.1.2.2.1.11.8=0
15:16:31.263950 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(29)  N=0 M=10 .1.3.6.1.2.1.2.2.1.11
15:16:31.282304 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(191)  .1.3.6.1.2.1.2.2.1.11.4=0 .1.3.6.1.2.1.2.2.1.11.6=83475283 .1.3.6.1.2.1.2.2.1.11.7=0 .1.3.6.1.2.1.2.2.1.11.8=0 .1.3.6.1.2.1.2.2.1.11.9=0 .1.3.6.1.2.1.2.2.1.11.10=0 .1.3.6.1.2.1.2.2.1.11.11=0 .1.3.6.1.2.1.2.2.1.11.12=0 .1.3.6.1.2.1.2.2.1.11.21=118 .1.3.6.1.2.1.2.2.1.11.22=83475018

I have included IF-MIB here as this is an SNMP table available on nearly every SNMP agent. It is not a very good example of this causing an issue on the device, as it only returns a few extra responses and has quite a low % of "overrun". It is included to allow the issue to be reproduced by anyone.

Note the following example where this effect is much more pronounced. When walking JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsTable, which has 15 columns and on my device has a single row, Telegraf runs a separate walk for each column with a max_repetitions of 10. Each walk returns the first and only entry of its column, along with the first and only entry of each of the next 9 columns. This pattern repeats until column 8, after which we also start getting OIDs from outside the table, until we finish walking the table with column 15.

This is very inefficient, as each OID in the table is returned from the device up to 10 times - and we are getting OIDs outside of the table up to 9 times each.
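
The overrun can be estimated with a small simulation. This is only a sketch under simplifying assumptions: the oid type, getBulk and walkColumn are hypothetical stand-ins (not gosnmp or Telegraf code), OIDs are reduced to a column number and a row index, and the agent is modelled as a sorted list with 20 further entries after the table.

```go
package main

import "fmt"

// oid is a simplified stand-in for a table OID: column number, then row index.
type oid struct{ col, row int }

// less orders OIDs the way an agent does: by column first, then by row.
func less(a, b oid) bool {
	if a.col != b.col {
		return a.col < b.col
	}
	return a.row < b.row
}

// getBulk returns up to maxRep entries that sort strictly after `after`.
// mib must already be sorted.
func getBulk(mib []oid, after oid, maxRep int) []oid {
	var out []oid
	for _, o := range mib {
		if less(after, o) {
			out = append(out, o)
			if len(out) == maxRep {
				break
			}
		}
	}
	return out
}

// walkColumn mimics a single-column bulk walk and reports how many returned
// varbinds belonged to the column (useful) and how many were sent in total.
func walkColumn(mib []oid, col, maxRep int) (useful, total int) {
	cursor := oid{col, -1}
	for {
		resp := getBulk(mib, cursor, maxRep)
		total += len(resp)
		if len(resp) == 0 {
			return useful, total
		}
		for _, o := range resp {
			if o.col != col {
				return useful, total // left the column: the rest is overrun
			}
			useful++
			cursor = o
		}
	}
}

// singleRowTable builds columns 2..16 with one row each (the table),
// followed by 20 entries that stand for OIDs outside the table.
func singleRowTable() []oid {
	var mib []oid
	for c := 2; c <= 36; c++ {
		mib = append(mib, oid{c, 0})
	}
	return mib
}

// walkTable walks every column separately, as in the capture below.
func walkTable(mib []oid, firstCol, lastCol, maxRep int) (useful, total int) {
	for c := firstCol; c <= lastCol; c++ {
		u, t := walkColumn(mib, c, maxRep)
		useful += u
		total += t
	}
	return useful, total
}

func main() {
	useful, total := walkTable(singleRowTable(), 2, 16, 10)
	fmt.Printf("useful=%d total=%d\n", useful, total) // useful=15 total=150
}
```

In this model the agent sends 150 varbinds to deliver 15 values. The real capture contains a few additional requests, so the exact numbers differ, but the ratio is similar.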

16:25:22.636851 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(34)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1
16:25:22.813952 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single"
16:25:22.814526 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(36)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0
16:25:22.926949 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=18
16:25:22.927498 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.1
16:25:23.092537 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single"
16:25:23.093380 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetRequest(35)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.1
16:25:23.095980 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(35)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.1=[noSuchObject]
16:25:23.096559 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2
16:25:23.265454 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single"
16:25:23.266724 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3
16:25:23.434807 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18
16:25:23.435532 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4
16:25:23.647472 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0
16:25:23.648413 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5
16:25:23.815512 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0
16:25:23.816695 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6
16:25:23.982120 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0
16:25:23.983285 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7
16:25:24.263094 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0
16:25:24.263926 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8
16:25:24.418767 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=18
16:25:24.419594 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9
16:25:24.585044 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536
16:25:24.585625 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10
16:25:24.723222 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0
16:25:24.723924 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11
16:25:24.889485 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(253)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single"
16:25:24.890164 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12
16:25:25.023276 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21
16:25:25.023882 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13
16:25:25.116783 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(250)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.4.0=65536
16:25:25.117332 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14
16:25:25.265586 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(250)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.4.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.5.0=1
16:25:25.266332 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15
16:25:25.326821 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(250)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.4.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.5.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.6.0=0

Additional info

No response


nward · 2022-01-12T04:01:04Z

Note - this is a re-post of #10420 with additional information, as #10420 was closed after incorrect triage.

nward · 2022-01-12T13:11:54Z

Hi Sven. No, that PR is unrelated - it is about MIB parsing. This issue is about how the SNMP protocol itself is used - not the SMI/MIB parsing or the recent gosmi changes. I have not tried the latest master, but there have not been any changes related to the SNMP protocol since 1.21.2 (which I am using).

On 13/01/2022, at 01:44, Sven Rebhan wrote: I think PR #10206 should fix the issue. Have you tried latest master?

srebhan · 2022-01-12T13:13:24Z

@nward yeah I saw it and thus deleted the comment. :-)

nward · 2022-01-12T13:15:05Z

@nward yeah I saw it and thus deleted the comment. :-)

Ahh sorry! I replied via email so didn’t catch that.

MyaLongmire · 2022-01-12T18:53:17Z

Thank you for updating Telegraf! I misunderstood your original issue as a problem with snmptranslate or MIB parsing.

I believe walk and get are handled by the gosnmp library

Walk(string, gosnmp.WalkFunc) error
Get(oids []string) (*gosnmp.SnmpPacket, error)

Meaning this is an upstream issue. If you wouldn't mind opening an issue with them :)

nward · 2022-01-13T08:51:52Z

Hi @MyaLongmire

Yes, Walk and Get are handled by gosnmp - however this is not an issue with their project - their project has multiple methods to send SNMP requests, and telegraf is not using the best one for polling a table.

Their Walk function is a wrapper around GetNext/GetBulk and walks the subtree of the specified OID - a single OID. This is the sort of method you would use when exploring an SNMP agent - perhaps in some automated discovery, or a human dumping out what is available when trying to decide what to monitor.

That is not the appropriate method for regularly polling a table for which we know the structure. Instead, gosnmp.GetBulk should be used, passing multiple OIDs - one for each column, and repeating until each column has been completely fetched.
This is a lower level function, so additional logic is required - see walk.go in the gosnmp source for some examples, specifically the request loop, and checking each of the returned variables to make sure it is in the subtree of the requested OID.

If it is useful, I can write some pseudocode of how this needs to operate.

Hipska · 2022-01-14T09:19:22Z

Telegraf is indeed using the mentioned walk function, see the following parts:

telegraf/plugins/inputs/snmp/snmp.go

Line 479 in 30d981d

err := gs.Walk(oid, func(ent gosnmp.SnmpPDU) error {

telegraf/internal/snmp/wrapper.go

Lines 23 to 30 in 30d981d

    
// Walk wraps GoSNMP.Walk() or GoSNMP.BulkWalk(), depending on whether the
// connection is using SNMPv1 or newer.
func (gs GosnmpWrapper) Walk(oid string, fn gosnmp.WalkFunc) error {
	if gs.Version == gosnmp.Version1 {
		return gs.GoSNMP.Walk(oid, fn)
	}
	return gs.GoSNMP.BulkWalk(oid, fn)
}

https://github.com/gosnmp/gosnmp/blob/96366f3fa26cabbea05d3d854ad8577650246ffd/gosnmp.go#L567-L573

Using GetBulk implies you know beforehand how many rows a table has; this is not known for telegraf, so walk is still the better option.

nward · 2022-01-14T11:20:11Z

Hi @Hipska, thanks for looking in to this.

Using GetBulk implies you know beforehand how many rows a table has; this is not known for telegraf, so walk is still the better option.

This is not correct - the packet's max_repetitions sets how many repetitions a GetBulk response may contain, and it is of course already an option in Telegraf - so I am not clear why you make this assertion.
In your examples above, x.walk is called with getRequestType (the first parameter) set to GetBulkRequest. This causes GetBulk to be used - see walk.go

The correct way to poll a table, regardless of whether you know the number of rows, is to call GetBulk with the requested OIDs set to some columns, and the request's max_repetitions set to the desired number of responses in a single SNMP message divided by the number of columns you are requesting at once.

For example, say you have a table with 3 columns (colA, colB, colC), and an unknown number of rows. You want to receive 10 values per response packet (as indicated by what the user configures max_repetitions to be in inputs.snmp).
You construct a GetBulk request with three OIDs - [colA, colB, colC] - and max_repetitions set to 3. This allows each of the requested OIDs to repeat 3 times without going over 10 responses (i.e. you get 9).
You send the GetBulk, and get three values (varbinds in SNMP language) back per column.
The values in the response come in sets of three - so you get:

[
 [colA varbind 1, colB varbind 1, colC varbind 1],
 [colA varbind 2, colB varbind 2, colC varbind 2],
 [colA varbind 3, colB varbind 3, colC varbind 3]
]
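
The grouping above can be sketched in a few lines. The function groupByRepetition is a hypothetical name, and the varbinds are plain strings here rather than real SNMP PDUs; the only assumption about the protocol is that a GetBulk response with non-repeaters set to 0 lists one varbind per requested OID for the first repetition, then one per OID for the second, and so on.

```go
package main

import "fmt"

// groupByRepetition splits the flat varbind list of a GetBulk response
// (non-repeaters = 0) into one slice per repetition. Each slice holds one
// varbind per requested column, in request order.
func groupByRepetition(varbinds []string, numColumns int) [][]string {
	if numColumns <= 0 {
		return nil
	}
	var rows [][]string
	for start := 0; start < len(varbinds); start += numColumns {
		end := start + numColumns
		if end > len(varbinds) {
			end = len(varbinds) // the agent may return a truncated response
		}
		rows = append(rows, varbinds[start:end])
	}
	return rows
}

func main() {
	flat := []string{
		"colA.1", "colB.1", "colC.1",
		"colA.2", "colB.2", "colC.2",
		"colA.3", "colB.3", "colC.3",
	}
	for _, row := range groupByRepetition(flat, 3) {
		fmt.Println(row)
	}
}
```

Note that a group is only a repetition, not necessarily a table row: the OID of each varbind still has to be checked to find out which row it belongs to.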

You iterate through this, checking that each varbind is within the column requested. If not, you discard the value and move on.
If it is within the column, you store the value - of course.

When you have completed looking at this response, you construct another GetBulk, with the request OIDs being the last "in-column" OIDs of each column. In this case, the request is the OIDs for [colA varbind 3, colB varbind 3, colC varbind 3].

If you have a GetBulk where all columns have a varbind which is not in the requested column, you have completed polling those columns. You then move on to the next set of columns, until you have polled all columns in the table.

Some things to note which catch out some implementations:

- Columns may have different numbers of rows, so if the agent does not have values for some rows in every column, a response can look as though one or more columns have skipped rows. You must therefore use the varbind OID to match each value to its row; you cannot use an index you have generated yourself.
- Subsequent GetBulks for each set of columns may contain fewer columns: if a response has one column with values outside the requested column, you exclude that column from subsequent requests, and you should remember to update the packet's max_repetitions accordingly.
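
The whole procedure can be sketched against a simulated agent. Everything here is an illustrative assumption rather than Telegraf or gosnmp code: the oid type, next, getBulk and walkColumns are hypothetical names, OIDs are reduced to a column number and a row index, and all columns are polled as one set instead of being split into several sets.

```go
package main

import "fmt"

// oid is a simplified stand-in for a table OID: column number, then row index.
type oid struct{ col, row int }

// endOfMib stands for the endOfMibView marker an agent returns past the last OID.
var endOfMib = oid{-1, -1}

func less(a, b oid) bool {
	if a.col != b.col {
		return a.col < b.col
	}
	return a.row < b.row
}

// next returns the first entry of the sorted mib after `after`, or endOfMib.
func next(mib []oid, after oid) oid {
	for _, o := range mib {
		if less(after, o) {
			return o
		}
	}
	return endOfMib
}

// getBulk models a multi-OID GetBulk with non-repeaters = 0: each repetition
// returns the successor of every requested OID, interleaved in request order.
func getBulk(mib []oid, reqs []oid, maxRep int) []oid {
	cur := append([]oid(nil), reqs...)
	var out []oid
	for r := 0; r < maxRep; r++ {
		for i := range cur {
			n := next(mib, cur[i])
			if n != endOfMib {
				cur[i] = n
			}
			out = append(out, n)
		}
	}
	return out
}

// walkColumns polls all cols together. It returns the in-column OIDs it
// stored and the total number of varbinds the agent had to send.
func walkColumns(mib []oid, cols []int, budget int) (got []oid, total int) {
	active := append([]int(nil), cols...) // columns not yet finished
	cursor := make(map[int]oid)           // last in-column OID per column
	for _, c := range cols {
		cursor[c] = oid{c, -1}
	}
	for len(active) > 0 {
		maxRep := budget / len(active) // recomputed as columns drop out
		if maxRep < 1 {
			maxRep = 1 // assumption: always ask for at least one repetition
		}
		reqs := make([]oid, len(active))
		for i, c := range active {
			reqs[i] = cursor[c]
		}
		resp := getBulk(mib, reqs, maxRep)
		total += len(resp)
		done := make(map[int]bool)
		for i, vb := range resp {
			c := active[i%len(active)] // column this varbind answers
			if done[c] {
				continue
			}
			if vb.col != c { // outside the column (or endOfMib): column finished
				done[c] = true
				continue
			}
			got = append(got, vb) // the OID itself identifies the row
			cursor[c] = vb
		}
		var still []int
		for _, c := range active {
			if !done[c] {
				still = append(still, c)
			}
		}
		active = still
	}
	return got, total
}

func main() {
	// Columns 1..3 form the table; column 2 has no value for row 4.
	// Column 4 stands for OIDs outside the table.
	var mib []oid
	for c := 1; c <= 4; c++ {
		for r := 1; r <= 7; r++ {
			if c == 2 && r == 4 {
				continue
			}
			mib = append(mib, oid{c, r})
		}
	}
	got, total := walkColumns(mib, []int{1, 2, 3}, 10)
	fmt.Printf("stored=%d received=%d\n", len(got), total)
}
```

A real implementation would build the requests with gosnmp.GetBulk and compare full OID prefixes instead of column numbers, but the request loop and the bookkeeping per column would follow the same shape.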

Hipska · 2022-01-17T09:08:25Z

Hmm, okay you seem to know a bit on this subject. Would you be able to create a (draft) PR where you adapt the use of this method? That way, I can check the code and see if that would indeed be the better scenario.

nward · 2022-01-17T09:11:27Z

It is a curse :-)

I will try - I don't know Go, but that's never stopped me before. I have some rough sketches of the code that I put together over the weekend, but it will take some time to get working, I am sure - it will require a rewrite of a lot of the module, I think. We will see!

srebhan · 2022-01-24T09:04:33Z

@nward not knowing Golang is not an issue. If you have something rough (where I can see the logic), please submit it as a draft and let me know. We can work out the details together if you like.

nward · 2022-02-01T10:12:04Z

Thanks @srebhan. I intend to get to this in the next week or so.

For now we are going to production using collectd for SNMP polling and passing to telegraf - so I will have time once I have got this delivered.

henriknoerr · 2022-09-10T09:14:55Z

@nward - A cautious request :) Will you find time for this PR?
I know nothing of go, but we do use telegraf heavily in our company for snmp polling.
We are of course interested in the most efficient polling.

/Henrik

nward · 2022-09-21T10:41:21Z

Hi @henriknoerr - I haven't had time yet, but I plan to in the next couple of weeks.

We have so far used collectd sending data to telegraf for situations where polling SNMP tables has significant performance impacts on devices.

Hipska · 2023-07-14T15:11:33Z

@nward I'm still interested to see your solution..

telegraf-tiger · 2023-08-14T18:09:43Z

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!

nward added the bug unexpected problem or unintended behavior label Jan 12, 2022

telegraf-tiger bot added the area/snmp label Jan 12, 2022

nward mentioned this issue Jan 14, 2022

SNMP plugin GET for multiple values at once #3784

Open

Hipska self-assigned this Sep 12, 2022

Hipska removed their assignment Sep 23, 2022

Hipska added the waiting for response waiting for response from contributor label Jul 31, 2023

telegraf-tiger bot closed this as completed Aug 14, 2023
