bulk walk timeout results in no data #6450

Closed
Hipska opened this issue Sep 26, 2019 · 12 comments · Fixed by #9518
Labels: area/snmp, feature request (Requests for new plugin and for new features to existing plugins)

Comments

@Hipska
Contributor

Hipska commented Sep 26, 2019

Relevant telegraf.conf:

  [[inputs.snmp.table]]
    name = "storage"

    [[inputs.snmp.table.field]]
      oid = "HOST-RESOURCES-MIB::hrStorageDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "usage"
      oid = "JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed"

System info:

Telegraf 1.12.2 (git: HEAD 8b4c9a0)

Expected behavior:

Return the already received data (if any)

Actual behavior:

E! [inputs.snmp] Error in plugin: agent x.x.x.x:161: gathering table storage: performing bulk walk for field usage: Request timeout (after 3 retries)
E! [telegraf] Error running agent: One or more input plugins had an error

Additional info:

This is what is returned from snmpwalk or snmpbulkwalk:

... (removed outputs of index 1 to 50) ...
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.51 = Gauge32: 0
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.52 = Gauge32: 100
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.53 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.54 = Gauge32: 100
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.55 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.56 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.57 = Gauge32: 8
JUNIPER-HOSTRESOURCES-MIB::jnxHrStoragePercentUsed.58 = Gauge32: 100
Timeout: No Response from x.x.x.x

Note that there are 60 indexes on this device, so indexes 59 and 60 are the ones having issues.
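
To make the expected behavior concrete, here is a minimal sketch of a walk that keeps whatever arrived before the timeout and returns it alongside the error. It is not the plugin's code: `simulatedWalk`, `walkKeepPartial`, and the `value` type are hypothetical names, and the gauge values are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout stands in for the "Request timeout" error from the report.
var errTimeout = errors.New("request timeout")

// value is one result received during a walk: row index and gauge value.
type value struct {
	Index int
	Used  int
}

// simulatedWalk is a hypothetical stand-in for a bulk walk. It delivers
// rows 1..answered to fn and then fails with errTimeout if the device
// stops answering before row `total`.
func simulatedWalk(total, answered int, fn func(value)) error {
	for i := 1; i <= total; i++ {
		if i > answered {
			return errTimeout
		}
		fn(value{Index: i, Used: 8}) // placeholder gauge value
	}
	return nil
}

// walkKeepPartial collects what arrived and returns it together with any
// error, leaving the caller to decide whether partial data is usable.
func walkKeepPartial(total, answered int) ([]value, error) {
	var got []value
	err := simulatedWalk(total, answered, func(v value) {
		got = append(got, v)
	})
	return got, err
}

func main() {
	got, err := walkKeepPartial(60, 58)
	fmt.Printf("received %d values, err: %v\n", len(got), err)
}
```

With 60 rows and a device that stops after row 58, the caller sees 58 values plus the timeout error instead of only the error.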

@danielnelson
Contributor

Thanks for the report. I don't think we should attempt to return the partial data, though. It could be missing tags, which would create new unwanted series and could prevent metric filtering from matching as expected, potentially skipping processors or routing to the wrong output.

danielnelson added the discussion label Sep 26, 2019
@Hipska
Contributor Author

Hipska commented Oct 21, 2019

Hi, I think you could at least return the rows that are complete? In this example, that would mean the records with index 1 to 58.
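
One way to read "complete" here is: keep only the row indexes that have a value in every configured field, so a row missing its tag column is never emitted. The following is a minimal sketch under that assumption. `walkResult` and `completeRows` are hypothetical names, not part of the plugin, and the storage descriptions are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// walkResult holds the values received for one column, keyed by row index
// (the OID suffix, for example "58").
type walkResult map[string]string

// completeRows returns the row indexes that have a value in every column.
func completeRows(columns map[string]walkResult) []string {
	counts := make(map[string]int)
	for _, col := range columns {
		for idx := range col {
			counts[idx]++
		}
	}
	rows := []string{}
	for idx, n := range counts {
		if n == len(columns) {
			rows = append(rows, idx)
		}
	}
	sort.Strings(rows)
	return rows
}

func main() {
	columns := map[string]walkResult{
		// Tag column: walked completely (descriptions are invented).
		"hrStorageDescr": {"57": "/var", "58": "/tmp", "59": "/mfs", "60": "/jail"},
		// Value column: the walk timed out after index 58.
		"jnxHrStoragePercentUsed": {"57": "8", "58": "100"},
	}
	fmt.Println(completeRows(columns))
}
```

In this made-up input, rows 59 and 60 are dropped because the value column never delivered them, while rows 57 and 58 carry both the tag and the value.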

@danielnelson
Contributor

It seems that most of the time the data would be incomplete unless we were very close to finishing the table. Even if we decide that is the behavior we want, I'm not sure it will come up enough to be worth it.

IF-MIB::ifIndex.1 = INTEGER: 1
IF-MIB::ifIndex.2 = INTEGER: 2
IF-MIB::ifIndex.3 = INTEGER: 3
IF-MIB::ifIndex.4 = INTEGER: 4
IF-MIB::ifIndex.5 = INTEGER: 5
IF-MIB::ifIndex.6 = INTEGER: 6
IF-MIB::ifIndex.10 = INTEGER: 10
IF-MIB::ifIndex.12 = INTEGER: 12
IF-MIB::ifIndex.16 = INTEGER: 16
IF-MIB::ifIndex.17 = INTEGER: 17
IF-MIB::ifIndex.18 = INTEGER: 18
IF-MIB::ifDescr.1 = STRING: lo
IF-MIB::ifDescr.2 = STRING: eth0
IF-MIB::ifDescr.3 = STRING: wlan0
IF-MIB::ifDescr.4 = STRING: dummy0
#
# Removed 215 lines
#
IF-MIB::ifOutQLen.18 = Gauge32: 0
IF-MIB::ifSpecific.1 = OID: SNMPv2-SMI::zeroDotZero
#
# First completed row
#
IF-MIB::ifSpecific.2 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.3 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.4 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.5 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.6 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.10 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.12 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.16 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.17 = OID: SNMPv2-SMI::zeroDotZero
IF-MIB::ifSpecific.18 = OID: SNMPv2-SMI::zeroDotZero

I think what we may want to do for this issue is reconsider how the timeouts work in the SNMP plugin (#3823). For example, right now we have per-request timeouts, but perhaps with a full gather timeout instead the issue would be mitigated?
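
As an illustration of the difference between per-request timeouts and a single gather-wide deadline, here is a minimal sketch using a shared `context` deadline. It only shows that the total time is bounded; whether that would mitigate this issue is the open question above. `request` and `gatherWithDeadline` are hypothetical helpers, and the latencies simulate a device rather than talk to one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// request is a hypothetical stand-in for one SNMP request that takes
// `latency` to be answered. It gives up as soon as ctx is done.
func request(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// gatherWithDeadline issues the requests in order under one shared deadline.
// It returns how many requests completed and whether the deadline expired.
func gatherWithDeadline(deadlineMs int, latenciesMs []int) (int, bool) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(deadlineMs)*time.Millisecond)
	defer cancel()

	completed := 0
	for _, ms := range latenciesMs {
		err := request(ctx, time.Duration(ms)*time.Millisecond)
		if errors.Is(err, context.DeadlineExceeded) {
			return completed, true
		}
		completed++
	}
	return completed, false
}

func main() {
	// Two fast replies, then an agent that stops answering.
	n, expired := gatherWithDeadline(200, []int{10, 10, 5000})
	fmt.Println(n, expired)
}
```

The whole gather stops at the shared deadline no matter how many individual requests are outstanding, and the caller still knows how many requests completed before it expired.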

@Hipska
Contributor Author

Hipska commented Oct 22, 2019

In my situation, almost all of the rows would be complete except for the last 2. So I would like something implemented so that you at least get the complete ones, instead of getting nothing as is the case now. (Most of the data is actually already present; the plugin just discards it.)

But indeed, the referenced issue is a much bigger problem that should be fixed first. Wow!

@danielnelson
Contributor

So in your case, index 59 & 60 never reply no matter the timeout? Is this a bug in the device you are monitoring?

@Hipska
Contributor Author

Hipska commented Nov 4, 2019

Yes it seems so, and I was hoping to get the results of the other indexes from Telegraf as they do respond and are complete.

danielnelson added the feature request label and removed the discussion label Nov 6, 2019
@danielnelson
Contributor

I believe changing this would significantly complicate the code for the plugin, since we would need to keep track of whether we have received all the data so we can emit the results. For this reason, I'm going to close this issue as something we won't fix, at least for now. If we hear more reports of this type of issue, we can reconsider.

@Hipska
Contributor Author

Hipska commented May 12, 2021

I just checked this again, and it seems that even 59 and 60 do respond, but after that we get a timeout. So it seems to be a bug in the device: it does not cleanly end a walk for this sequence.
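
For context on what a clean end of a walk looks like: a walk normally stops when the agent answers with an OID outside the requested subtree, or with endOfMibView in SNMPv2c/v3. If the agent goes silent instead, the client only sees a timeout. The following is a minimal sketch of that termination check. `inSubtree` and `walkEnded` are hypothetical helpers, not Telegraf or gosnmp functions.

```go
package main

import (
	"fmt"
	"strings"
)

// inSubtree reports whether oid lies strictly below root. Both are numeric
// OIDs in dotted form with a leading dot.
func inSubtree(root, oid string) bool {
	return strings.HasPrefix(oid, strings.TrimSuffix(root, ".")+".")
}

// walkEnded reports whether a response signals the normal end of a walk:
// either the agent returned endOfMibView or the next OID left the subtree.
func walkEnded(root, nextOID string, endOfMibView bool) bool {
	return endOfMibView || !inSubtree(root, nextOID)
}

func main() {
	const ifDescr = ".1.3.6.1.2.1.2.2.1.2" // IF-MIB::ifDescr

	fmt.Println(walkEnded(ifDescr, ".1.3.6.1.2.1.2.2.1.2.18", false)) // still inside ifDescr
	fmt.Println(walkEnded(ifDescr, ".1.3.6.1.2.1.2.2.1.3.1", false))  // reached ifType.1
	fmt.Println(walkEnded(ifDescr, ".1.3.6.1.2.1.2.2.1.20.1", false)) // column 20, not column 2
}
```

The trailing dot in the prefix comparison matters: without it, column 20 would be mistaken for part of column 2.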

@Hipska
Contributor Author

Hipska commented Dec 1, 2021

@MyaLongmire why would changing the way you translate OIDs help with this issue?

@nward

nward commented Jan 14, 2022

@Hipska This is an interesting issue - what device are you polling here, and are you still able to reproduce this? What is your max_repetitions set to? Are you able to share a capture of the walk where you get the timeout?

@MyaLongmire I don't believe #9518 resolved this issue

Hipska reopened this Jan 14, 2022
@Hipska
Contributor Author

Hipska commented Apr 19, 2024

I tried to reproduce this, but wasn't able to on any of the devices (SRX/MX/EX) I have access to, so I won't be able to test any PRs implementing this feature.

@srebhan
Member

srebhan commented Apr 19, 2024

Closing this for now. If someone comes across this issue, please reopen or open a new issue!

srebhan closed this as completed Apr 19, 2024