inputs.snmp walk requests OIDs multiple times and causes high device load #10427

Closed
nward opened this issue Jan 12, 2022 · 16 comments
Labels
area/snmp; bug (unexpected problem or unintended behavior); waiting for response (waiting for response from contributor)

Comments


nward commented Jan 12, 2022

Relevant telegraf.conf

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[inputs.snmp]]
  agents = ["udp://192.168.0.1:161"]
  version = 2
  path = ["/mibs"]
  community = "test"

  [[inputs.snmp.table]]
    oid = "IF-MIB::ifTable"
    name = "interface"

    [[inputs.snmp.table.field]]
      oid = "IF-MIB::ifDescr"
      name = "ifDescr"
      is_tag = true

  [[inputs.snmp.table]]
    oid = "JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsTable"

    [[inputs.snmp.table.field]]
      oid = "JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsEntry"
      is_tag = true

[[outputs.file]]
  files = ["stdout"]

Logs from Telegraf

❯ docker run -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro -v $PWD/mibs:/mibs:ro telegraf
2022-01-12T02:08:37Z I! Starting Telegraf 1.21.2
2022-01-12T02:08:37Z I! Using config file: /etc/telegraf/telegraf.conf
2022-01-12T02:08:37Z I! Loaded inputs: snmp
2022-01-12T02:08:37Z I! Loaded aggregators:
2022-01-12T02:08:37Z I! Loaded processors:
2022-01-12T02:08:37Z I! Loaded outputs: file
2022-01-12T02:08:37Z I! Tags enabled: host=04fe51ab0f31
2022-01-12T02:08:37Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"04fe51ab0f31", Flush Interval:10s
Parse module: /mibs/RFC-1212:63:33: unexpected ".." (expected ")")
2022-01-12T02:08:40Z W! [inputs.snmp] module RFC-1212 could not be loaded
interface,agent_host=192.168.0.1,host=04fe51ab0f31,ifDescr=ge-0/0/5,ifIndex=521 ifOperStatus=2i,ifInNUcastPkts=0i,ifOutDiscards=0i,ifOutErrors=0i,ifInUcastPkts=0i,ifOutOctets=0i,ifOutUcastPkts=0i,ifOutNUcastPkts=0i,ifSpecific=".0.0",ifPhysAddress="58:00:bb:df:ea:c1",ifAdminStatus=1i,ifInOctets=0i,ifInErrors=0i,ifInUnknownProtos=0i,ifOutQLen=0i,ifType=6i,ifMtu=1514i,ifSpeed=1000000000i,ifLastChange=9703i,ifInDiscards=0i 1641953333000000000

etc.

System info

Telegraf 1.21.2

Docker

Standard telegraf:latest (1.21.2)

Steps to reproduce

  1. Configure telegraf to poll an SNMP table
  2. Note with tcpdump that the SNMP table is being polled inefficiently, leading to data being requested multiple times

Expected behavior

I expect SNMP walk to request multiple columns at once, reducing max_repetitions appropriately (i.e. to int(max_repetitions / num_columns)).

Actual behavior

SNMP walk requests a single column at a time.

Because max_repetitions is set, GetBulk will return data past the requested column. SNMP bulkwalk implementations should request multiple columns at once, and set max_repetitions per request to int(max_repetitions/column count).

Telegraf's current implementation does not do this. Instead, it requests a single column at a time, and each column goes past the end by up to max_repetitions - 1 (i.e. if there is only a single row remaining to request).
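
As a rough illustration of the arithmetic described above, the per-request value could be derived as follows. This is only a sketch: perRequestRepetitions is a hypothetical helper, not part of Telegraf or gosnmp, and clamping the result to at least 1 is an added assumption so that a request for more columns than max_repetitions still makes progress.

```go
package main

import "fmt"

// perRequestRepetitions is a hypothetical helper (not Telegraf or gosnmp code).
// It splits the configured max_repetitions budget across the columns that are
// requested together, i.e. int(max_repetitions / num_columns).
func perRequestRepetitions(maxRepetitions, numColumns int) int {
	if numColumns < 1 {
		numColumns = 1 // assumption: treat a degenerate column count as one column
	}
	reps := maxRepetitions / numColumns // integer division rounds down
	if reps < 1 {
		reps = 1 // assumption: always ask for at least one repetition
	}
	return reps
}

func main() {
	// 10 values per response shared between 3 columns gives 3 repetitions each.
	fmt.Println(perRequestRepetitions(10, 3))
}
```

With max_repetitions = 10 and 3 columns this gives 3 repetitions, so a response carries at most 9 values.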

See the following packet capture. It shows the IF-MIB::ifInOctets column and then the IF-MIB::ifInUcastPkts column being walked as two separate walks; note that there is only a single OID in each GetBulk request.
When we reach the end of the IF-MIB::ifInOctets column, the GetBulk response begins to return values from IF-MIB::ifInUcastPkts, as these are the next variables on the SNMP agent.
Then we request IF-MIB::ifInUcastPkts from the top, and get the values for the first several rows of that column a second time.

In the capture below, note that the request at timestamp 15:16:31.263950 asks for an OID, .1.3.6.1.2.1.2.2.1.11, whose values were already returned in the previous response. This happens because Telegraf walks a single column at a time and has just finished walking .1.3.6.1.2.1.2.2.1.10.

15:16:30.931772 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(29)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10
15:16:30.960187 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(195)  .1.3.6.1.2.1.2.2.1.10.4=0 .1.3.6.1.2.1.2.2.1.10.6=3841180218 .1.3.6.1.2.1.2.2.1.10.7=0 .1.3.6.1.2.1.2.2.1.10.8=0 .1.3.6.1.2.1.2.2.1.10.9=0 .1.3.6.1.2.1.2.2.1.10.10=0 .1.3.6.1.2.1.2.2.1.10.11=0 .1.3.6.1.2.1.2.2.1.10.12=0 .1.3.6.1.2.1.2.2.1.10.21=49036 .1.3.6.1.2.1.2.2.1.10.22=3841120174
15:16:30.960659 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(30)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.22
15:16:31.032817 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(198)  .1.3.6.1.2.1.2.2.1.10.251=0 .1.3.6.1.2.1.2.2.1.10.501=0 .1.3.6.1.2.1.2.2.1.10.502=0 .1.3.6.1.2.1.2.2.1.10.503=0 .1.3.6.1.2.1.2.2.1.10.504=0 .1.3.6.1.2.1.2.2.1.10.505=0 .1.3.6.1.2.1.2.2.1.10.506=0 .1.3.6.1.2.1.2.2.1.10.507=0 .1.3.6.1.2.1.2.2.1.10.508=709435975 .1.3.6.1.2.1.2.2.1.10.509=0
15:16:31.033360 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(31)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.509
15:16:31.097061 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(210)  .1.3.6.1.2.1.2.2.1.10.510=0 .1.3.6.1.2.1.2.2.1.10.511=0 .1.3.6.1.2.1.2.2.1.10.513=4210062172 .1.3.6.1.2.1.2.2.1.10.515=709435975 .1.3.6.1.2.1.2.2.1.10.516=3240068682 .1.3.6.1.2.1.2.2.1.10.517=2185747961 .1.3.6.1.2.1.2.2.1.10.518=0 .1.3.6.1.2.1.2.2.1.10.519=0 .1.3.6.1.2.1.2.2.1.10.520=0 .1.3.6.1.2.1.2.2.1.10.521=0
15:16:31.097564 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(31)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.521
15:16:31.221709 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(202)  .1.3.6.1.2.1.2.2.1.10.522=0 .1.3.6.1.2.1.2.2.1.10.523=0 .1.3.6.1.2.1.2.2.1.10.524=3239957758 .1.3.6.1.2.1.2.2.1.10.525=0 .1.3.6.1.2.1.2.2.1.10.526=0 .1.3.6.1.2.1.2.2.1.10.527=0 .1.3.6.1.2.1.2.2.1.10.528=0 .1.3.6.1.2.1.2.2.1.10.529=0 .1.3.6.1.2.1.2.2.1.10.530=0 .1.3.6.1.2.1.2.2.1.10.531=272993948
15:16:31.222397 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(31)  N=0 M=10 .1.3.6.1.2.1.2.2.1.10.531
15:16:31.263247 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(194)  .1.3.6.1.2.1.2.2.1.10.532=0 .1.3.6.1.2.1.2.2.1.10.533=0 .1.3.6.1.2.1.2.2.1.10.534=0 .1.3.6.1.2.1.2.2.1.10.536=0 .1.3.6.1.2.1.2.2.1.10.537=0 .1.3.6.1.2.1.2.2.1.10.538=0 .1.3.6.1.2.1.2.2.1.11.4=0 .1.3.6.1.2.1.2.2.1.11.6=83475283 .1.3.6.1.2.1.2.2.1.11.7=0 .1.3.6.1.2.1.2.2.1.11.8=0
15:16:31.263950 192.168.0.235.55798 > 192.168.0.1.161:  C="test" GetBulk(29)  N=0 M=10 .1.3.6.1.2.1.2.2.1.11
15:16:31.282304 192.168.0.1.161 > 192.168.0.235.55798:  C="test" GetResponse(191)  .1.3.6.1.2.1.2.2.1.11.4=0 .1.3.6.1.2.1.2.2.1.11.6=83475283 .1.3.6.1.2.1.2.2.1.11.7=0 .1.3.6.1.2.1.2.2.1.11.8=0 .1.3.6.1.2.1.2.2.1.11.9=0 .1.3.6.1.2.1.2.2.1.11.10=0 .1.3.6.1.2.1.2.2.1.11.11=0 .1.3.6.1.2.1.2.2.1.11.12=0 .1.3.6.1.2.1.2.2.1.11.21=118 .1.3.6.1.2.1.2.2.1.11.22=83475018

I have included IF-MIB here as it is an SNMP table available on nearly every SNMP agent. It is not a very good example of this causing a problem on the device, as only a few extra values are returned and the percentage of "overrun" is quite low. It is included so that anyone can reproduce the issue.

Note the following example, where this effect is much more pronounced. When walking JUNIPER-SRX5000-SPU-MONITORING-MIB::jnxJsSPUMonitoringObjectsTable, which has 15 columns and on my device has a single row, telegraf runs a separate walk for each column with a max_repetitions of 10. Each walk returns the first and only entry of the requested column, along with the first and only entry of each of the next 9 columns. This pattern repeats until column 8, after which we also start getting OIDs from outside the table, until we finish walking the table with column 15.

This is very inefficient, as each OID in the table is returned from the device 10 times - and we are getting OIDs outside of the table up to 9 times each.
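
To put rough numbers on the overrun, here is a small back-of-the-envelope model. It is a sketch under stated assumptions, not a measurement: singleColumnWalkCost is a hypothetical function, it assumes the agent always fills each response with max_repetitions varbinds (i.e. enough OIDs follow the column), and it assumes the walk of a column stops at the first response containing an out-of-column varbind, as seen in the captures.

```go
package main

import "fmt"

// singleColumnWalkCost is a hypothetical model of walking each column on its own.
// Assumptions: every response carries exactly maxRepetitions varbinds, and the
// walk of a column ends with the first response that runs past the column.
func singleColumnWalkCost(columns, rows, maxRepetitions int) (requests, returned, useful int) {
	// Full responses cover rows/maxRepetitions requests; one more request is
	// always needed to see a varbind from outside the column.
	perColumn := rows/maxRepetitions + 1
	requests = columns * perColumn
	returned = requests * maxRepetitions
	useful = columns * rows
	return requests, returned, useful
}

func main() {
	// The single-row, 15-column table from the capture above, max_repetitions = 10.
	req, ret, use := singleColumnWalkCost(15, 1, 10)
	fmt.Printf("requests=%d returned=%d useful=%d\n", req, ret, use)
}
```

For the 15-column, single-row table this model gives 15 requests returning 150 varbinds, of which 15 are needed. For one IF-MIB column with 46 rows it gives 5 requests and 50 varbinds, which matches the first capture.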

16:25:22.636851 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(34)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1
16:25:22.813952 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single"
16:25:22.814526 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(36)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0
16:25:22.926949 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=18
16:25:22.927498 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.1
16:25:23.092537 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single"
16:25:23.093380 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetRequest(35)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.1
16:25:23.095980 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(35)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.1=[noSuchObject]
16:25:23.096559 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2
16:25:23.265454 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.2.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single"
16:25:23.266724 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3
16:25:23.434807 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.3.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18
16:25:23.435532 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4
16:25:23.647472 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.4.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0
16:25:23.648413 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5
16:25:23.815512 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.5.0=28 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0
16:25:23.816695 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6
16:25:23.982120 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.6.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0
16:25:23.983285 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7
16:25:24.263094 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(252)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.7.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0
16:25:24.263926 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8
16:25:24.418767 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.8.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=18 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=18
16:25:24.419594 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9
16:25:24.585044 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.9.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536
16:25:24.585625 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10
16:25:24.723222 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.10.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0
16:25:24.723924 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11
16:25:24.889485 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(253)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.11.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single"
16:25:24.890164 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12
16:25:25.023276 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(248)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.12.0=20 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21
16:25:25.023882 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13
16:25:25.116783 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(250)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.13.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.4.0=65536
16:25:25.117332 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14
16:25:25.265586 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(250)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.14.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.4.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.5.0=1
16:25:25.266332 192.168.0.235.54911 > 192.168.0.1.161:  C="test" GetBulk(35)  N=0 M=10 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15
16:25:25.326821 192.168.0.1.161 > 192.168.0.235.54911:  C="test" GetResponse(250)  .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.15.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.1.1.16.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.2.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.3.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.1.0=0 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.2.0="single" .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.3.0=21 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.4.0=65536 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.5.0=1 .1.3.6.1.4.1.2636.3.39.1.12.1.4.1.6.0=0

Additional info

No response

@nward nward added the bug unexpected problem or unintended behavior label Jan 12, 2022
Author

nward commented Jan 12, 2022

Note - this is a re-post of #10420 with additional information, as #10420 was closed after incorrect triage.

Author

nward commented Jan 12, 2022 via email

Member

srebhan commented Jan 12, 2022

@nward yeah I saw it and thus deleted the comment. :-)

Author

nward commented Jan 12, 2022

@nward yeah I saw it and thus deleted the comment. :-)

Ahh sorry! I replied via email so didn’t catch that.

Contributor

MyaLongmire commented Jan 12, 2022

Thank you for updating telegraf! I misunderstood your original issue as a problem with snmptranslate or MIB parsing.

I believe walk and get are handled by the gosnmp library

Walk(string, gosnmp.WalkFunc) error
Get(oids []string) (*gosnmp.SnmpPacket, error)

Meaning this is an upstream issue. If you wouldn't mind opening an issue with them :)

Author

nward commented Jan 13, 2022

Hi @MyaLongmire

Yes, Walk and Get are handled by gosnmp. However, this is not an issue with their project: it offers multiple methods for sending SNMP requests, and telegraf is not using the best one for polling a table.

Their Walk function is a wrapper around GetNext/GetBulk and walks the subtree of the specified OID - a single OID. This is the sort of method you would use when exploring an SNMP agent - perhaps in some automated discovery, or a human dumping out what is available when trying to decide what to monitor.

That is not the appropriate method for regularly polling a table for which we know the structure. Instead, gosnmp.GetBulk should be used, passing multiple OIDs - one for each column, and repeating until each column has been completely fetched.
This is a lower level function, so additional logic is required - see walk.go in the gosnmp source for some examples, specifically the request loop, and checking each of the returned variables to make sure it is in the subtree of the requested OID.

If it is useful, I can write some pseudocode of how this needs to operate.
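
The subtree check mentioned above can be sketched in a few lines. This is an illustrative helper only: inSubtree is a hypothetical name, and it assumes both OIDs are given in the same dotted numeric form with a leading dot, as in the packet captures.

```go
package main

import (
	"fmt"
	"strings"
)

// inSubtree is a hypothetical helper that reports whether oid lies below root.
// The trailing dot matters: without it, column ...1.10 would wrongly match
// an OID under ...1.100.
func inSubtree(root, oid string) bool {
	return strings.HasPrefix(oid, root+".")
}

func main() {
	column := ".1.3.6.1.2.1.2.2.1.10" // the first column walked in the IF-MIB capture
	fmt.Println(inSubtree(column, ".1.3.6.1.2.1.2.2.1.10.22")) // still inside the column
	fmt.Println(inSubtree(column, ".1.3.6.1.2.1.2.2.1.11.4"))  // overrun into the next column
}
```

Any varbind for which the check fails marks the end of that column and should be discarded.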

Contributor

Hipska commented Jan 14, 2022

Telegraf is indeed using the mentioned walk function, see the following parts:

err := gs.Walk(oid, func(ent gosnmp.SnmpPDU) error {

// Walk wraps GoSNMP.Walk() or GoSNMP.BulkWalk(), depending on whether the
// connection is using SNMPv1 or newer.
func (gs GosnmpWrapper) Walk(oid string, fn gosnmp.WalkFunc) error {
	if gs.Version == gosnmp.Version1 {
		return gs.GoSNMP.Walk(oid, fn)
	}
	return gs.GoSNMP.BulkWalk(oid, fn)
}

https://github.com/gosnmp/gosnmp/blob/96366f3fa26cabbea05d3d854ad8577650246ffd/gosnmp.go#L567-L573

Using GetBulk implies you know beforehand how many rows a table has, this is not known for telegraf and so walk is still the better option.

Author

nward commented Jan 14, 2022

Hi @Hipska, thanks for looking into this.

Using GetBulk implies you know beforehand how many rows a table has, this is not known for telegraf and so walk is still the better option.

This is not correct. The packet's max_repetitions sets the size of the GetBulk response, and it is of course already an option in Telegraf, so I am not clear why you make this assertion.
In your examples above, x.walk is called with getRequestType (the first parameter) set to GetBulkRequest. This causes GetBulk to be used; see walk.go.

The correct way to poll a table, regardless of whether you know the number of rows, is to call GetBulk with the requested OIDs set to some of the columns, and the request's max_repetitions set to the desired number of values in a single SNMP message divided by the number of columns you are requesting at once.

For example, say you have a table with 3 columns (colA, colB, colC) and an unknown number of rows. You want to receive at most 10 values per response packet (as indicated by what the user configures max_repetitions to be in inputs.snmp).
You construct a GetBulk request with three OIDs, [colA, colB, colC], and max_repetitions set to 3. This allows each of the requested OIDs to repeat 3 times without going over 10 values (i.e. you get 9).
You send the GetBulk and get back three values (varbinds in SNMP language) per column.
The values in the response come in sets of three, so you get:

[
 [colA varbind 1, colB varbind 1, colC varbind 1],
 [colA varbind 2, colB varbind 2, colC varbind 2],
 [colA varbind 3, colB varbind 3, colC varbind 3]
]

You iterate through this, checking that each varbind is within the column requested. If not, you discard the value and move on.
If it is within the column, you store the value - of course.

When you have completed looking at this response, you construct another GetBulk, with the request OIDs being the last "in-column" OID of each column. In this case, the request contains the OIDs for [colA varbind 3, colB varbind 3, colC varbind 3].

If you have a GetBulk where all columns have a varbind which is not in the requested column, you have completed polling those columns. You then move on to the next set of columns, until you have polled all columns in the table.

Some things to note which catch out some implementations:

  1. Columns may have different numbers of rows, so you may get a response in which one or more columns appear to have skipped rows, if the agent does not have values for those rows in all columns. You must therefore use the varbind OID to match each value to its row; you cannot use an index you have generated yourself.
  2. Subsequent GetBulks for each set of columns may contain fewer columns, i.e. if you get a response in which one column has values outside the requested column, you exclude that column from subsequent requests, and you should remember to update the packet's max_repetitions accordingly.
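
The procedure described in this comment can be sketched roughly as below. This is a minimal simulation, not Telegraf or gosnmp code: fakeAgent, getBulk and pollColumns are hypothetical names, the agent is an in-memory sorted list rather than a real device, and for brevity all columns are requested in one set instead of being split into batches. A real implementation would send the requests through a library call such as gosnmp's GetBulk.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// varbind is a simplified stand-in for an SNMP variable binding.
type varbind struct {
	oid   string
	value int
}

// parseOID converts ".1.3.6" into []int{1, 3, 6}.
func parseOID(s string) []int {
	parts := strings.Split(strings.Trim(s, "."), ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		out[i], _ = strconv.Atoi(p)
	}
	return out
}

// lessOID reports whether a sorts before b in lexicographic OID order.
func lessOID(a, b string) bool {
	x, y := parseOID(a), parseOID(b)
	for i := 0; i < len(x) && i < len(y); i++ {
		if x[i] != y[i] {
			return x[i] < y[i]
		}
	}
	return len(x) < len(y)
}

// fakeAgent is a hypothetical in-memory agent; vars must be sorted by OID.
type fakeAgent struct{ vars []varbind }

// getBulk imitates a GetBulk with non-repeaters = 0: for each repetition it
// returns the successor of every requested OID, interleaved by column.
func (a fakeAgent) getBulk(oids []string, maxRep int) []varbind {
	cur := append([]string(nil), oids...)
	var out []varbind
	for r := 0; r < maxRep; r++ {
		for i, o := range cur {
			idx := sort.Search(len(a.vars), func(j int) bool { return lessOID(o, a.vars[j].oid) })
			if idx == len(a.vars) {
				out = append(out, varbind{oid: "endOfMibView"}) // nothing left on the agent
				continue
			}
			out = append(out, a.vars[idx])
			cur[i] = a.vars[idx].oid
		}
	}
	return out
}

// pollColumns fetches all columns together and returns the in-column values
// keyed by OID, plus the number of requests sent.
func pollColumns(a fakeAgent, columns []string, maxRepetitions int) (map[string]int, int) {
	values, requests := map[string]int{}, 0
	active := append([]string(nil), columns...) // column roots still being polled
	next := append([]string(nil), columns...)   // last in-column OID seen per column
	for len(active) > 0 {
		reps := maxRepetitions / len(active) // shrink or grow as columns drop out
		if reps < 1 {
			reps = 1 // assumption: always ask for at least one repetition
		}
		resp := a.getBulk(next, reps)
		requests++
		done := make([]bool, len(active))
		for i, vb := range resp {
			c := i % len(active) // varbinds arrive interleaved by column
			if done[c] || !strings.HasPrefix(vb.oid, active[c]+".") {
				done[c] = true // ran past the end of this column: discard
				continue
			}
			values[vb.oid] = vb.value // keyed by OID, so sparse rows stay aligned
			next[c] = vb.oid
		}
		var keepActive, keepNext []string
		for c := range active {
			if !done[c] {
				keepActive, keepNext = append(keepActive, active[c]), append(keepNext, next[c])
			}
		}
		active, next = keepActive, keepNext
	}
	return values, requests
}

func main() {
	agent := fakeAgent{vars: []varbind{
		{".1.1.10.1", 100}, {".1.1.10.2", 200}, {".1.1.10.3", 300},
		{".1.1.11.1", 11}, {".1.1.11.3", 33}, // second column has no row 2
		{".1.1.12.1", 999}, // outside the polled columns
	}}
	values, requests := pollColumns(agent, []string{".1.1.10", ".1.1.11"}, 4)
	fmt.Println(len(values), "values in", requests, "requests")
}
```

Values are stored by their full OID rather than by position, which covers note 1. The repetition count is recomputed each round from the columns still active, which covers note 2.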

Contributor

Hipska commented Jan 17, 2022

Hmm, okay you seem to know a bit on this subject. Would you be able to create a (draft) PR where you adapt the use of this method? That way, I can check the code and see if that would indeed be the better scenario.

Author

nward commented Jan 17, 2022

It is a curse :-)

I will try - I don't know Go, but, that's never stopped me before. I have some rough sketches of the code that I put together over the weekend, but it will take some time to get working I am sure - it will require a rewrite of a lot of the module I think.. will see !

Member

srebhan commented Jan 24, 2022

@nward not knowing Golang is not an issue. If you have something rough (where I can see the logic), please submit it as a draft and let me know. We can work out the details together if you like.

Author

nward commented Feb 1, 2022

Thanks @srebhan. I intend to get to this in the next week or so.

For now we are going to production using collectd for SNMP polling and passing to telegraf - so I will have time once I have got this delivered.

@henriknoerr

@nward - A cautious request :) Will you find time for this PR?
I know nothing of go, but we do use telegraf heavily in our company for snmp polling.
We are of course interested in the most efficient polling.

/Henrik

@Hipska Hipska self-assigned this Sep 12, 2022
Author

nward commented Sep 21, 2022

Hi @henriknoerr - I haven't had time yet, but I plan to in the next couple of weeks.

We have so far used collectd sending data to telegraf for situations where polling SNMP tables has significant performance impacts on devices.

@Hipska Hipska removed their assignment Sep 23, 2022
Contributor

Hipska commented Jul 14, 2023

@nward I'm still interested to see your solution..

@Hipska Hipska added the waiting for response waiting for response from contributor label Jul 31, 2023
@telegraf-tiger
Contributor

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!
