bulk walk timeout results in no data #6450

Hipska · 2019-09-26T13:24:36Z

Relevant telegraf.conf:

  [[inputs.snmp.table]]
    name = "storage"

    [[inputs.snmp.table.field]]
      oid = "HOST-RESOURCES-MIB::hrStorageDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "usage"
      oid = "JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed"

System info:

Telegraf 1.12.2 (git: HEAD 8b4c9a0)

Expected behavior:

Return the already received data (if any)

Actual behavior:

E! [inputs.snmp] Error in plugin: agent x.x.x.x:161: gathering table storage: performing bulk walk for field usage: Request timeout (after 3 retries)
E! [telegraf] Error running agent: One or more input plugins had an error

Additional info:

This is what is returned from snmpwalk or snmpbulkwalk:

... (removed outputs of index 1 to 50) ...
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.51 = Gauge32: 0
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.52 = Gauge32: 100
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.53 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.54 = Gauge32: 100
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.55 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.56 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.57 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.58 = Gauge32: 100
Timeout: No Response from x.x.x.x

Note that there are 60 indexes on this device, so indexes 59 and 60 are the ones having issues.


danielnelson · 2019-09-26T19:11:57Z

Thanks for the report. I don't think we should attempt to return the partial data, though. It could be missing tags, which would create new unwanted series and could prevent metric filtering from matching as expected, potentially skipping processors or routing to the wrong output.

Hipska · 2019-10-21T11:34:02Z

Hi, I think you could at least return the values that are complete? So in that example it would mean records with index 1 to 58.
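
A minimal sketch of what "return the values that are complete" could mean, assuming each field's walk result is held as a mapping from table index to value. This is not Telegraf's code; `complete_rows` and the sample data are hypothetical and only mirror the shape of the report (60 descriptions received, the `usage` walk cut off after index 58).

```python
def complete_rows(columns, required):
    """Return rows that have a value for every required field.

    columns maps field name -> {index: value}, as collected per column
    by a walk; a column cut short by a timeout simply has fewer indexes.
    """
    if not required:
        return {}
    # Start from the indexes of the first field, then keep only those
    # that also appear in every other required field.
    common = set(columns.get(required[0], {}))
    for field in required[1:]:
        common &= set(columns.get(field, {}))
    return {
        idx: {field: columns[field][idx] for field in required}
        for idx in sorted(common)
    }


# Hypothetical data shaped like the report: all 60 descriptions arrived,
# but the "usage" walk timed out after index 58.
descr = {i: f"storage-{i}" for i in range(1, 61)}
usage = {i: 8 for i in range(1, 59)}
rows = complete_rows(
    {"hrStorageDescr": descr, "usage": usage},
    ["hrStorageDescr", "usage"],
)
print(len(rows), min(rows), max(rows))  # prints: 58 1 58
```

Rows 59 and 60 are dropped because they lack the `usage` field, so no row is emitted with a missing tag or value.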

danielnelson · 2019-10-21T20:42:08Z

It seems that most of the time the data would be incomplete unless we were very close to finishing the table. Even if we decide that is the behavior we want, I'm not sure it will come up enough to be worth it.

IF-MIB::ifIndex.1 = INTEGER: 1
IF-MIB::ifIndex.2 = INTEGER: 2
IF-MIB::ifIndex.3 = INTEGER: 3
IF-MIB::ifIndex.4 = INTEGER: 4
IF-MIB::ifIndex.5 = INTEGER: 5
IF-MIB::ifIndex.6 = INTEGER: 6
IF-MIB::ifIndex.10 = INTEGER: 10
IF-MIB::ifIndex.12 = INTEGER: 12
IF-MIB::ifIndex.16 = INTEGER: 16
IF-MIB::ifIndex.17 = INTEGER: 17
IF-MIB::ifIndex.18 = INTEGER: 18
IF-MIB::ifDescr.1 = STRING: lo
IF-MIB::ifDescr.2 = STRING: eth0
IF-MIB::ifDescr.3 = STRING: wlan0
IF-MIB::ifDescr.4 = STRING: dummy0
#
# Removed 215 lines
#
IF-MIB::ifOutQLen.18 = Gauge32: 0
IF-MIB::ifSpecific.1 = OID: SNMPv2-SMI::zeroDotZero
#
# First completed row
#
IF-MIB::ifSpecific.2 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.3 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.4 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.5 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.6 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.10 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.12 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.16 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.17 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.18 = OID: SNMPv2-SMI::zeroDotZero
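
The walk above can be turned into a small calculation. It lists 11 interfaces and, counting the 215 removed lines, 242 responses in total, which matches the 22 columns of `ifTable`. The sketch below assumes a strictly column-by-column walk and uses a hypothetical helper, `complete_rows_after`, to count how many rows have every column filled after a given number of responses.

```python
def complete_rows_after(responses, rows, columns):
    """Rows with every column filled after `responses` answers of a
    column-by-column walk (all of column 1, then all of column 2, ...)."""
    # Every column except the last must be fully walked before any row
    # can be complete.
    filled_before_last = (columns - 1) * rows
    return max(0, min(rows, responses - filled_before_last))


ROWS, COLUMNS = 11, 22  # the ifTable walk above: 11 interfaces, 22 columns
for answered in (121, 231, 232, 242):
    print(answered, complete_rows_after(answered, ROWS, COLUMNS))
# prints:
# 121 0
# 231 0
# 232 1
# 242 11
```

Under these assumptions, a timeout anywhere in the first 231 of 242 responses leaves no complete row at all.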

I think what we may want to do for this issue is reconsider how the timeouts work in the SNMP plugin (#3823). For example, right now we have per-request timeouts, but perhaps with a full gather timeout instead the issue would be mitigated?
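
As a rough illustration of the difference between per-request timeouts and a full gather timeout, here is a sketch in which one overall budget caps every individual request. It is not how the plugin is implemented; `gather_with_deadline`, `send`, and `stub_send` are hypothetical names, and the stub stands in for a real SNMP request.

```python
import time


def gather_with_deadline(requests, send, gather_timeout, request_timeout,
                         clock=time.monotonic):
    """Run requests in order until done or the overall budget is spent.

    `send(request, timeout)` is a hypothetical callable that performs one
    request and raises TimeoutError if no reply arrives within `timeout`
    seconds. Returns (results, finished).
    """
    deadline = clock() + gather_timeout
    results = []
    for request in requests:
        remaining = deadline - clock()
        if remaining <= 0:
            return results, False
        try:
            # Never wait longer than what is left of the overall budget.
            results.append(send(request, min(request_timeout, remaining)))
        except TimeoutError:
            return results, False
    return results, True


# Stub agent for illustration: answers instantly, except "slow" never answers.
def stub_send(request, timeout):
    if request == "slow":
        raise TimeoutError(f"no reply within {timeout}s")
    return request.upper()


print(gather_with_deadline(["a", "b"], stub_send, 10.0, 5.0))
# prints: (['A', 'B'], True)
print(gather_with_deadline(["a", "slow", "b"], stub_send, 10.0, 5.0))
# prints: (['A'], False)
```

The `finished` flag leaves it to the caller to decide whether unfinished results are emitted or discarded.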

Hipska · 2019-10-22T08:41:28Z

In my situation, almost all of them would be complete except for the last 2. So I would like to have something implemented so that you at least get the complete ones, instead of getting nothing as a result like now. (Most of the data is actually already present; the plugin just discards it.)

But indeed, the referenced issue is a much bigger problem that should be fixed first. Wow!

danielnelson · 2019-10-22T17:53:08Z

So in your case, index 59 & 60 never reply no matter the timeout? Is this a bug in the device you are monitoring?

Hipska · 2019-11-04T16:13:08Z

Yes, it seems so, and I was hoping to get the results of the other indexes from Telegraf, as they do respond and are complete.

danielnelson · 2019-11-06T19:19:34Z

I believe changing this would significantly complicate the code for the plugin, since we would need to keep track of whether we have received all the data before we can emit the results. For this reason, I'm going to close this issue as something we won't fix, at least for now. If we hear more reports of this type of issue, we can reconsider.

Hipska · 2021-05-12T09:12:27Z

I just checked this again, and it seems that even 59 and 60 do respond, but after that we get a timeout. So it looks like a bug in the device: it does not cleanly end the walk for this sequence.
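
To make "not nicely ending a walk" concrete: a walk normally stops when the agent answers with an OID outside the requested subtree (or with endOfMibView). The sketch below simulates that with GETNEXT-style stepping rather than GETBULK, which is a simplification. `walk`, `make_agent`, and the short `ROOT` OID are hypothetical and are not the real Juniper OID or any library API.

```python
def walk(get_next, root):
    """Collect (oid, value) pairs under `root` using successive GETNEXT calls.

    `get_next(oid)` is a hypothetical callable returning the next
    (oid, value) pair after `oid`, or None for endOfMibView. It may raise
    TimeoutError when the agent does not answer.
    """
    rows, current = [], root
    while True:
        reply = get_next(current)
        if reply is None:
            break  # agent reported the end of its MIB view
        oid, value = reply
        if oid[:len(root)] != root:
            break  # first OID outside the subtree: normal end of a walk
        rows.append((oid, value))
        current = oid
    return rows


def make_agent(table, hang_after=None):
    """Simulated agent; it stops answering when asked what follows `hang_after`."""
    oids = sorted(table)

    def get_next(oid):
        if oid == hang_after:
            raise TimeoutError("no response")
        for candidate in oids:
            if candidate > oid:
                return candidate, table[candidate]
        return None

    return get_next


ROOT = (1, 3)  # hypothetical short subtree, not the real Juniper OID
table = {ROOT + (i,): 8 for i in (58, 59, 60)}
table[(1, 4, 1)] = "next subtree"

print(len(walk(make_agent(table), ROOT)))  # prints: 3
try:
    walk(make_agent(table, hang_after=ROOT + (60,)), ROOT)
except TimeoutError:
    print("timeout after the last row; collected rows are lost")
```

In the second call all three rows were received, but because the agent never sends the reply that would end the walk, the caller only sees the timeout.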

Hipska · 2021-12-01T10:02:48Z

@MyaLongmire why would changing the way you translate OIDs help with this issue?

nward · 2022-01-14T13:57:38Z

@Hipska This is an interesting issue - what device are you polling here, and are you still able to reproduce this? What is your max_repetitions set to? Are you able to share a capture of the walk where you get the timeout?

@MyaLongmire I don't believe #9518 resolved this issue

Hipska · 2024-04-19T13:04:07Z

I tried to reproduce this, but wasn't able to on any of the devices (SRX/MX/EX) I have access to, so I won't be able to test any PRs implementing this feature.

srebhan · 2024-04-19T15:29:12Z

Closing this for now. If someone comes across this issue, please reopen or open a new issue!

danielnelson added the discussion Topics for discussion label Sep 26, 2019

danielnelson added the area/snmp label Oct 21, 2019

danielnelson added feature request Requests for new plugin and for new features to existing plugins and removed discussion Topics for discussion labels Nov 6, 2019

danielnelson closed this as completed Nov 6, 2019

Hipska reopened this May 12, 2021

MyaLongmire mentioned this issue Sep 16, 2021

refactor: snmp to use gosmi #9518

Merged

2 tasks

reimda closed this as completed in #9518 Nov 30, 2021

Hipska reopened this Jan 14, 2022

srebhan closed this as completed Apr 19, 2024
