-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Revert breaking gosmi change #9518 #10428
Conversation
…nfluxdata#10346) (influxdata#10354)" This reverts commit 457c98f.
…influxdata#10339)" This reverts commit 7f20e71.
This reverts commit d8cc355.
This reverts commit 8a9fbfd.
This reverts commit 1e81d98.
This reverts commit 49562e8.
This reverts commit 973ffba.
This reverts commit e00147d.
…)" This reverts commit 2d420fb.
This reverts commit b4ef429.
This reverts commit 7675ce6.
Thanks so much for the pull request! |
!signed-cla |
📦 Looks like new artifacts were built from this PR. Expand this list to get them here ! 🐯Artifact URLs |
First, thanks for taking the time to file issues against Telegraf. I apologize that the migration to gosmi has caused problems with your setup and configurations. We make every effort to keep Telegraf backward compatible, while continuing to improve the project. The work to transition to gosmi started last summer. The migration has undergone our testing and testing with community members to try as many devices, and scenarios are possible. As you are well aware, there are some scenarios where gosmi has not been able to read files, and in the cases where those were reported to us, we have worked to resolve those issues. We would value your help and assistance in resolving the remaining issues. Please continue to file issues that you see with the latest, preferably nightly, version of Telegraf, with whatever MIB files or error messages you see to continue to improve the situation for all users of Telegraf. In the future, we’d welcome a conversation prior to you (or anyone) wasting effort preparing a revert like this That said, I am going to go ahead and close this PR. Thanks! |
Hi @powersj - sorry, I didn't intend for this PR to be a wasted effort. I appreciate there has been a lot of effort and learning here, of course the "Simple" in SNMP is rarely true. As I mentioned, it does seem to be the right direction to not run snmptranslate many times (etc.). However, my view remains that telegraf should revert these changes and not include them in a released version until further compatibility testing can be completed - there have been a number of issues reported so far, which demonstrate that gosmi is not compatible with snmptranslate. I don't know if this means that gosmi is not the right choice, or, if telegraf needs to contribute fixes to gosmi in order to get it to be more compatible, or, if telegraf needs its own code to work around compatibility issues facing gosmi. SNMP is complicated, and users rarely have a good understanding of it - especially MIB files - despite it being critical for many organisations. Making breaking changes in this area is very impactful so needs to be done a lot more carefully, and at a minimum with a lot of very big upgrade warnings. I have read the original PR implementing this change, and do not see any information in the discussion re. the testing with many devices and scenarios - not saying it was not done, but, I'm saying I think it is reasonable to raise this PR in absence of that testing being published. Can you describe which vendors/scenarios were tested? I see Cisco, Juniper, Synology MIB issues |
@nward Switching to the gosmi parser has fixed problems for other users, specifically when there are many instances of the snmp plugin and many lookups to do. In such cases telegraf startup could take 20+ minutes. With the gosmi parser those startup times dropped to a few seconds. We don't want to lose this benefit. Telegraf with gosmi does seem to be more picky than when it used netsnmp to parse mibs. That is causing trouble for you and others. We need to find a way to have both benefits: quick startup and mib parsing/translation that isn't more picky than before. We've fixed one issue with the new gosmi translation that was causing panics. (This should resolve #10346 and #10397) I'm not sure what's causing the remaining parsing/translation pickiness but if we can figure it out I think there will be no need to revert. It could be something in the way telegraf is using gosmi or it could be an issue in gosmi itself. Maybe there's a relatively simple fix telegraf can make so filenames don't have to match the module. To find out, the developers first need to reproduce the problems users are having. Currently we're hearing reports of incompatibility with popular devices but the reports don't have enough information to reproduce the problems. For example, the last issue this PR is marked to resolve is #10309. It gives a lot of good reproduction information including the oid that is not being translated correctly, but it doesn't include which mib files are being used or how they were installed, which is vital to debugging. Could you open a new issue to document the specific problems you are running into with translation? Please include which mibs you're using, where you got them, which oids you've noticed aren't translating, and how you trigger the problem with telegraf. I saw you're using juniper in your issue about high device load (10427) so I assume your translation problems are also with juniper mibs. Could you include a link to download the mibs you're using from juniper? I assume it's somewhere on https://www.juniper.net/documentation/software/index_mibs.html but there are multiple products there with multiple versions each. It would be best to be sure we have the exact setup as you when we try to reproduce what you're seeing. I'm willing to work on building a test suite for telegraf's snmp plugins that includes test cases which download mibs for commonly used devices, install them, then use them with telegraf to translate oids. A few clear/complete bug reports from users like you would be very helpful in building these tests. They'll help fix your issue and prevent future incompatibilities. Thanks! |
I'm not actually sure that this is true, because in the current implementation gosmi is parsing all MIBs, which is not required for polling (it is currently required for If I was polling a huge number of tables, then yes I expect we would see longer startup times with snmptranslate. I will collect some data on this, and raise an issue re. performance if I can do so reproducibly.
You will see a few issues raised re. the module naming across a few vendors MIBs - I have raised a covering issue describing the problem. I will have more to come. |
I really support nwards statements - We have stayed on the older release, simply because we can no longer add MIBs and trust Telegraf to run without kernel panics. We as users use the telegraf package, and have no knowledge of various dependencies. When Telegraf switches a dependency, it is seen from the outside as a regression in Telegraf, and seen by the number of bug reports, I do not feel this change has been tested much. Not really helping here, just offering my point of view. Best regards, |
i @nward, yesterday I merged a change similar to this revert (#10802). It adds back MIB translation through netsnmp's snmptranslate. It also adds a setting that lets people choose to use gosmi but makes netsnmp the default. When you submitted this PR I thought that it would be more disruptive to users to revert than to fix gosmi's incompatibilities, and that we could fix the incompatibilites quickly. It's clear that we're not fixing them quickly enough and I don't want gosmi to keep people from upgrading. The change is in the nightly build now so it's available for testing. Could you test it and see if it acts like 1.20.4 for you? We'd like to release it with 1.23.0 which is planned for next Wednesday, March 23. If you have time before then to test and let us know how it goes, it would help us gain confidence in the change and flush out bugs. Nightly builds: https://github.com/influxdata/telegraf/blob/master/docs/NIGHTLIES.md |
@reimda Thanks for keeping me in the loop on that - that sounds like a smart move and I welcome the change. I think there's an opportunity to be a lot smarter in how the problem of slow snmp parsing performance is addressed - rather than just implementing gosmi with the same logic as before. It's Saturday afternoon here and I'm heading out for the evening but will see if I can do some testing on this tomorrow. I need to have a look at another issue I submitted re. snmp table polling performance so I can get set up and look at both at the same time. |
We do need to be more careful addressing performance problems. Performance fixes should never change functionality. Thanks for your patience and for whatever testing you're able to do especially on the weekend! |
I've fed it my 1.20.4 config, it ran snmptranslate a bunch, and is collecting data just fine. LGTM! |
Required for all PRs:
I have ticked the above - however I have simply reverted commits so have not personally made these changes.
resolves #10309 #10346 #10397
gosmi is not a drop in replacement for snmptranslate:
I don't think there is a design problem with moving to gosmi - running snmptranslate many times certainly is wasteful - however it need further testing with a wide range of MIBs to ensure compatibility.
As it is significant change (i.e. MIB files must be renamed and some MIBs no longer work) it should not have been introduced as a
refactor:
. In its current state it is asBREAKING CHANGE:
. Users expect that breaking changes should be major version bumps (i.e. 2.x).Once the breaking changes are fixed and a very wide range of MIBs have been tested for parity, this could be re-introduced and included in a minor version. I expect many contributions to the gosmi project will be required to resolve these issues. I would be able to contribute to this testing - however I am not a go developer other than making occasional tweaks here and there.