Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Revert breaking gosmi change #9518 #10428

Closed
wants to merge 11 commits into from

Conversation

nward
Copy link

@nward nward commented Jan 12, 2022

Required for all PRs:

I have ticked the above - however I have simply reverted commits so have not personally made these changes.

resolves #10309 #10346 #10397

gosmi is not a drop in replacement for snmptranslate:

  • It handles MIB files differently (files must be named the same as the module, which is very unusual and causes a reasonable impact for Telegraf users who are unfamiliar with SNMP internals)
  • It does not parse MIB files the same when it is able to find them - see issues in the gosmi project from telegraf users
  • gosmi was previously a wrapper around libsmi, in which case it would be a drop in replacement - however that has not been the case since v0.2.0

I don't think there is a design problem with moving to gosmi - running snmptranslate many times certainly is wasteful - however it need further testing with a wide range of MIBs to ensure compatibility.
As it is significant change (i.e. MIB files must be renamed and some MIBs no longer work) it should not have been introduced as a refactor:. In its current state it is as BREAKING CHANGE:. Users expect that breaking changes should be major version bumps (i.e. 2.x).

Once the breaking changes are fixed and a very wide range of MIBs have been tested for parity, this could be re-introduced and included in a minor version. I expect many contributions to the gosmi project will be required to resolve these issues. I would be able to contribute to this testing - however I am not a go developer other than making occasional tweaks here and there.

@telegraf-tiger
Copy link
Contributor

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Jan 12, 2022
@nward
Copy link
Author

nward commented Jan 12, 2022

!signed-cla

@powersj
Copy link
Contributor

powersj commented Jan 12, 2022

@nward,

First, thanks for taking the time to file issues against Telegraf. I apologize that the migration to gosmi has caused problems with your setup and configurations. We make every effort to keep Telegraf backward compatible, while continuing to improve the project.

The work to transition to gosmi started last summer. The migration has undergone our testing and testing with community members to try as many devices, and scenarios are possible. As you are well aware, there are some scenarios where gosmi has not been able to read files, and in the cases where those were reported to us, we have worked to resolve those issues.

We would value your help and assistance in resolving the remaining issues. Please continue to file issues that you see with the latest, preferably nightly, version of Telegraf, with whatever MIB files or error messages you see to continue to improve the situation for all users of Telegraf. In the future, we’d welcome a conversation prior to you (or anyone) wasting effort preparing a revert like this

That said, I am going to go ahead and close this PR. Thanks!

@powersj powersj closed this Jan 12, 2022
@nward
Copy link
Author

nward commented Jan 13, 2022

Hi @powersj - sorry, I didn't intend for this PR to be a wasted effort.

I appreciate there has been a lot of effort and learning here, of course the "Simple" in SNMP is rarely true. As I mentioned, it does seem to be the right direction to not run snmptranslate many times (etc.).

However, my view remains that telegraf should revert these changes and not include them in a released version until further compatibility testing can be completed - there have been a number of issues reported so far, which demonstrate that gosmi is not compatible with snmptranslate. I don't know if this means that gosmi is not the right choice, or, if telegraf needs to contribute fixes to gosmi in order to get it to be more compatible, or, if telegraf needs its own code to work around compatibility issues facing gosmi.

SNMP is complicated, and users rarely have a good understanding of it - especially MIB files - despite it being critical for many organisations. Making breaking changes in this area is very impactful so needs to be done a lot more carefully, and at a minimum with a lot of very big upgrade warnings.

I have read the original PR implementing this change, and do not see any information in the discussion re. the testing with many devices and scenarios - not saying it was not done, but, I'm saying I think it is reasonable to raise this PR in absence of that testing being published. Can you describe which vendors/scenarios were tested? I see Cisco, Juniper, Synology MIB issues

@reimda
Copy link
Contributor

reimda commented Jan 13, 2022

@nward Switching to the gosmi parser has fixed problems for other users, specifically when there are many instances of the snmp plugin and many lookups to do. In such cases telegraf startup could take 20+ minutes. With the gosmi parser those startup times dropped to a few seconds. We don't want to lose this benefit.

Telegraf with gosmi does seem to be more picky than when it used netsnmp to parse mibs. That is causing trouble for you and others. We need to find a way to have both benefits: quick startup and mib parsing/translation that isn't more picky than before.

We've fixed one issue with the new gosmi translation that was causing panics. (This should resolve #10346 and #10397) I'm not sure what's causing the remaining parsing/translation pickiness but if we can figure it out I think there will be no need to revert. It could be something in the way telegraf is using gosmi or it could be an issue in gosmi itself. Maybe there's a relatively simple fix telegraf can make so filenames don't have to match the module. To find out, the developers first need to reproduce the problems users are having.

Currently we're hearing reports of incompatibility with popular devices but the reports don't have enough information to reproduce the problems. For example, the last issue this PR is marked to resolve is #10309. It gives a lot of good reproduction information including the oid that is not being translated correctly, but it doesn't include which mib files are being used or how they were installed, which is vital to debugging.

Could you open a new issue to document the specific problems you are running into with translation? Please include which mibs you're using, where you got them, which oids you've noticed aren't translating, and how you trigger the problem with telegraf. I saw you're using juniper in your issue about high device load (10427) so I assume your translation problems are also with juniper mibs. Could you include a link to download the mibs you're using from juniper? I assume it's somewhere on https://www.juniper.net/documentation/software/index_mibs.html but there are multiple products there with multiple versions each. It would be best to be sure we have the exact setup as you when we try to reproduce what you're seeing.

I'm willing to work on building a test suite for telegraf's snmp plugins that includes test cases which download mibs for commonly used devices, install them, then use them with telegraf to translate oids. A few clear/complete bug reports from users like you would be very helpful in building these tests. They'll help fix your issue and prevent future incompatibilities.

Thanks!

@nward
Copy link
Author

nward commented Jan 14, 2022

@nward Switching to the gosmi parser has fixed problems for other users, specifically when there are many instances of the snmp plugin and many lookups to do. In such cases telegraf startup could take 20+ minutes. With the gosmi parser those startup times dropped to a few seconds. We don't want to lose this benefit.

I'm not actually sure that this is true, because in the current implementation gosmi is parsing all MIBs, which is not required for polling (it is currently required for inputs.snmp_trap, but I expect design changes could resolve this). In my environment, with all current Cisco, Juniper, and Opengear MIBs, and polling maybe 10 tables, the startup time is significantly longer than it was previously - snmptranslate is fast.

If I was polling a huge number of tables, then yes I expect we would see longer startup times with snmptranslate.

I will collect some data on this, and raise an issue re. performance if I can do so reproducibly.

Telegraf with gosmi does seem to be more picky than when it used netsnmp to parse mibs. That is causing trouble for you and others. We need to find a way to have both benefits: quick startup and mib parsing/translation that isn't more picky than before.

We've fixed one issue with the new gosmi translation that was causing panics. (This should resolve #10346 and #10397) I'm not sure what's causing the remaining parsing/translation pickiness but if we can figure it out I think there will be no need to revert. It could be something in the way telegraf is using gosmi or it could be an issue in gosmi itself. Maybe there's a relatively simple fix telegraf can make so filenames don't have to match the module. To find out, the developers first need to reproduce the problems users are having.

Currently we're hearing reports of incompatibility with popular devices but the reports don't have enough information to reproduce the problems. For example, the last issue this PR is marked to resolve is #10309. It gives a lot of good reproduction information including the oid that is not being translated correctly, but it doesn't include which mib files are being used or how they were installed, which is vital to debugging.

Could you open a new issue to document the specific problems you are running into with translation? Please include which mibs you're using, where you got them, which oids you've noticed aren't translating, and how you trigger the problem with telegraf. I saw you're using juniper in your issue about high device load (10427) so I assume your translation problems are also with juniper mibs. Could you include a link to download the mibs you're using from juniper? I assume it's somewhere on https://www.juniper.net/documentation/software/index_mibs.html but there are multiple products there with multiple versions each. It would be best to be sure we have the exact setup as you when we try to reproduce what you're seeing.

I'm willing to work on building a test suite for telegraf's snmp plugins that includes test cases which download mibs for commonly used devices, install them, then use them with telegraf to translate oids. A few clear/complete bug reports from users like you would be very helpful in building these tests. They'll help fix your issue and prevent future incompatibilities.

You will see a few issues raised re. the module naming across a few vendors MIBs - I have raised a covering issue describing the problem. I will have more to come.

@reimda reimda mentioned this pull request Jan 21, 2022
@henriknoerr
Copy link

henriknoerr commented Jan 28, 2022

I really support nwards statements - We have stayed on the older release, simply because we can no longer add MIBs and trust Telegraf to run without kernel panics.
I had one report and seen several here, which then asks us to create the bug report to gosmi.
One report here I had fixed was fixed but then the next MIB file comes crashing telegraf.

We as users use the telegraf package, and have no knowledge of various dependencies. When Telegraf switches a dependency, it is seen from the outside as a regression in Telegraf, and seen by the number of bug reports, I do not feel this change has been tested much.

Not really helping here, just offering my point of view.

Best regards,

@reimda
Copy link
Contributor

reimda commented Mar 19, 2022

i @nward, yesterday I merged a change similar to this revert (#10802). It adds back MIB translation through netsnmp's snmptranslate. It also adds a setting that lets people choose to use gosmi but makes netsnmp the default.

When you submitted this PR I thought that it would be more disruptive to users to revert than to fix gosmi's incompatibilities, and that we could fix the incompatibilites quickly. It's clear that we're not fixing them quickly enough and I don't want gosmi to keep people from upgrading.

The change is in the nightly build now so it's available for testing. Could you test it and see if it acts like 1.20.4 for you? We'd like to release it with 1.23.0 which is planned for next Wednesday, March 23. If you have time before then to test and let us know how it goes, it would help us gain confidence in the change and flush out bugs.

Nightly builds: https://github.com/influxdata/telegraf/blob/master/docs/NIGHTLIES.md

@nward
Copy link
Author

nward commented Mar 19, 2022

@reimda Thanks for keeping me in the loop on that - that sounds like a smart move and I welcome the change.

I think there's an opportunity to be a lot smarter in how the problem of slow snmp parsing performance is addressed - rather than just implementing gosmi with the same logic as before.

It's Saturday afternoon here and I'm heading out for the evening but will see if I can do some testing on this tomorrow. I need to have a look at another issue I submitted re. snmp table polling performance so I can get set up and look at both at the same time.

@reimda
Copy link
Contributor

reimda commented Mar 19, 2022

We do need to be more careful addressing performance problems. Performance fixes should never change functionality.

Thanks for your patience and for whatever testing you're able to do especially on the weekend!

@nward
Copy link
Author

nward commented Mar 20, 2022

I've fed it my 1.20.4 config, it ran snmptranslate a bunch, and is collecting data just fine. LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/snmp fix pr to fix corresponding bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SNMP Traps from Juniper not fully translated
5 participants