Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: snmp to use gosmi #9518

Merged
merged 75 commits into from
Nov 30, 2021
Merged

refactor: snmp to use gosmi #9518

merged 75 commits into from
Nov 30, 2021

Conversation

MyaLongmire
Copy link
Contributor

@MyaLongmire MyaLongmire commented Jul 19, 2021

Required for all PRs:

  • Updated associated README.md.
  • Wrote appropriate unit tests.

resolves #3784
resolves #6450
resolves #5720

No longer calls snmptranslate and parses the output. Now it relies on gosmi library to parse mib files directly to increase performance. Moved all gosmi code into internal/snmp

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Jul 19, 2021
@MyaLongmire MyaLongmire changed the title Refactor snmp to use gosmi refactor: snmp to use gosmi Sep 15, 2021
@Hipska Hipska self-requested a review September 17, 2021 07:17
@Hipska Hipska linked an issue Sep 17, 2021 that may be closed by this pull request
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
@henriknoerr
Copy link

I am unsure of the progress of this PR - is it only the tests missing?
Our telegraf environment is at a state where we have +15 minutes reload time of the agents.
If it is only tests I'll cherrypick this PR and go from there.

@Hipska
Copy link
Contributor

Hipska commented Oct 11, 2021

@henriknoerr did you also test how long it takes to startup alone?

The reason I'm asking is because if you have lots of unresponding snmp devices, it takes a while for telegraf to shut down since it will wait for timeouts and retries on those..

@henriknoerr
Copy link

@henriknoerr did you also test how long it takes to startup alone?

I did not know the shutdown process waited for all timeouts - I will look into this.
Thanks

@MyaLongmire MyaLongmire marked this pull request as ready for review October 11, 2021 19:50
@MyaLongmire MyaLongmire requested a review from Hipska October 11, 2021 19:51
@henriknoerr
Copy link

@henriknoerr did you also test how long it takes to startup alone?

The reason I'm asking is because if you have lots of unresponding snmp devices, it takes a while for telegraf to shut down since it will wait for timeouts and retries on those..

And yes indeed, I was blinded by this snmp MIB PR. Fixing retries and timeout values made a huge difference.
Thank you for pointing me in the right direction.

plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
@MyaLongmire
Copy link
Contributor Author

If a node can't be translated it gives an error. For example translating: Could not find node for OID 999

@Hipska
Copy link
Contributor

Hipska commented Oct 12, 2021

Correct, but we have the OID, so telegraf can continue and do its work.

@MyaLongmire
Copy link
Contributor Author

I changed to error to be a warning so telegraf can continue doing its thing. I also added .999 back to the tests. Sorry for the confusion, I was miss understanding the problem :)

@MyaLongmire MyaLongmire requested a review from Hipska October 12, 2021 23:39
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
@MyaLongmire
Copy link
Contributor Author

That is super weird. I have no idea what happened. Dave and I will try and fix it. Sorry for the confusion :)

@reimda reimda force-pushed the snmp-gosmi-rework branch from 34d34a7 to 0ea43b3 Compare October 13, 2021 17:57
@Hipska Hipska mentioned this pull request Oct 14, 2021
2 tasks
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Outdated Show resolved Hide resolved
@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Nov 4, 2021

🥳 This pull request decreases the Telegraf binary size by -2.47 % for linux amd64 (new size: 127.5 MB, nightly size 130.7 MB)

📦 Looks like new artifacts were built from this PR.

Expand this list to get them here! 🐯

Artifact URLs

DEB RPM TAR GZ ZIP
amd64.deb aarch64.rpm darwin_amd64.tar.gz windows_amd64.zip
arm64.deb armel.rpm freebsd_amd64.tar.gz windows_i386.zip
armel.deb armv6hl.rpm freebsd_armv7.tar.gz
armhf.deb i386.rpm freebsd_i386.tar.gz
i386.deb ppc64le.rpm linux_amd64.tar.gz
mips.deb s390x.rpm linux_arm64.tar.gz
mipsel.deb x86_64.rpm linux_armel.tar.gz
ppc64el.deb linux_armhf.tar.gz
s390x.deb linux_i386.tar.gz
linux_mips.tar.gz
linux_mipsel.tar.gz
linux_ppc64le.tar.gz
linux_s390x.tar.gz
static_linux_amd64.tar.gz

@Hipska
Copy link
Contributor

Hipska commented Nov 8, 2021

Why can't it be added without changing functionality?

@MyaLongmire
Copy link
Contributor Author

MyaLongmire commented Nov 29, 2021

@Hipska
@reimda and I paired on this and came up with a solution based on multiple factors:

  1. we have no control over gosmi being global and we are not switching libraries now
  2. snmp_trap has a path setting and that needs to continue to be supported
  3. path in clientconfig should match snmp_trap setting "path" so we have the option in the future to make snmp_trap use clientconfig (outside the scope of this pr)
  4. not wanting a global path setting in the agent

Our solution was to have one global gosmi instance and each plugin will append to its path. While refacorting the paths all gosmi code was moved into internal/snmp/mib.go

Hopefully, this solution puts your worries to rest about the global config paths :)

Copy link
Contributor

@Hipska Hipska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename the filename of internal/snmp/mib.go into translate.go or smi.go which better fits the contained code.

plugins/inputs/snmp/README.md Show resolved Hide resolved
plugins/inputs/snmp/README.md Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp.go Show resolved Hide resolved
internal/snmp/mib.go Outdated Show resolved Hide resolved
plugins/inputs/snmp/snmp_test.go Show resolved Hide resolved
Copy link
Contributor

@Hipska Hipska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, seems ready to me. Don't forget to also revert the other section.

plugins/inputs/snmp/README.md Outdated Show resolved Hide resolved
@Hipska Hipska requested review from powersj and srebhan November 30, 2021 15:54
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @MyaLongmire, I think LoadMibsFromPath requires proper locking and protection of gosmi.Init() to be executed only once.

// must init, append path for each directory, load module for every file
// or gosmi will fail without saying why
func LoadMibsFromPath(paths []string, log telegraf.Logger) error {
gosmi.Init()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS, this function is called from each inputs.snmp plugin. Is this correct? If so, this means that you call gosmi.Init() multiple time with gosmi having a global state if my interpretation is correct...

If this is the case, chances are high that you get side-effects between the different plugin-instances potentially ranging from race-conditions when adding the modules to mutual overwriting of the loaded plugins when Init() is called multiple times... To reduce those effects you should init gosmi only once (using sync.Once) and guard this loading function with a lock. I don't know SNMP, but let's hope that there are no collisions possible between multiple (contradicting) mib-definitions...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do call gosmi.Init more than once but the library has a built-in check to see if it has already been initialized or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also added a mutex

Copy link
Member

@srebhan srebhan Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @MyaLongmire, I don't see any check... When calling gosmi.Init() it calls this function which in turn calls smi.Init() that can be found here. Already at this point, multiple files are reread etc. Additionally internal.Init() is called also modifying data. While I agree that currently nothing "critical" is changed there it won't hurt to be on the safe side by

Suggested change
gosmi.Init()
once.Do(gosmi.Init)

and place var once sync.Once in the package scope, to be sure we call that function only once at least in that package. It would be even better to make sure it's not called elsewhere, but that's probably out of scope for this PR...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added in pr 10199

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gosmi.Init() always calls smi.Init("gosmi") which always calls internal.Init(handleName) with the same handleName. That function does check whether it already has an internal handle of that name. That's why it looked to me like calling gosmi.Init() multiple times is safe.

It's a good idea to be defensive in our code and add sync.Once() anyway. I'm not sure gosmi intends it to be safe to call gosmi.Init() multiple times. I didn't see a guarantee in the docs, so even if it is now it might not be in a future version of gosmi.

internal/snmp/translate.go Show resolved Hide resolved
@telegraf-tiger
Copy link
Contributor

@reimda reimda merged commit 7675ce6 into master Nov 30, 2021
@reimda reimda deleted the snmp-gosmi-rework branch November 30, 2021 22:47
nward added a commit to nward/telegraf that referenced this pull request Jan 12, 2022
phemmer added a commit to phemmer/telegraf that referenced this pull request Feb 18, 2022
* origin/master: (133 commits)
  chore: restart service if it is already running and upgraded via RPM (influxdata#9970)
  feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10237)
  fix: Handle duplicate registration of protocol-buffer files gracefully. (influxdata#10188)
  fix(http_listener_v2): fix panic on close (influxdata#10132)
  feat: add Vault input plugin (influxdata#10198)
  feat: support aws managed service for prometheus (influxdata#10202)
  fix: Make telegraf compile on Windows with golang 1.16.2 (influxdata#10246)
  Update changelog
  feat: Modbus add per-request tags (influxdata#10231)
  fix: Implement NaN and inf handling for elasticsearch output (influxdata#10196)
  feat: add nomad input plugin (influxdata#10106)
  fix: Print loaded plugins and deprecations for once and test (influxdata#10205)
  fix: eliminate MIB dependency for ifname processor (influxdata#10214)
  feat: Optimize locking for SNMP MIBs loading. (influxdata#10206)
  feat: Add SMART plugin concurrency configuration option, nvme-cli v1.14+ support and lint fixes. (influxdata#10150)
  feat: update configs (influxdata#10236)
  fix(inputs/kube_inventory): set TLS server name config properly (influxdata#9975)
  fix: Sudden close of Telegraf caused by OPC UA input plugin (influxdata#10230)
  fix: bump github.com/eclipse/paho.mqtt.golang from 1.3.0 to 1.3.5 (influxdata#9913)
  fix: json_v2 parser timestamp setting (influxdata#10221)
  fix: ensure graylog spec fields not prefixed with '_' (influxdata#10209)
  docs: remove duplicate links in CONTRIBUTING.md (influxdata#10218)
  fix: pool detection and metrics gathering for ZFS >= 2.1.x (influxdata#10099)
  fix: parallelism fix for ifname processor (influxdata#10007)
  chore: Forbids "log" package only for aggregators, inputs, outputs, parsers and processors (influxdata#10191)
  docs: address documentation gap when running telegraf in k8s (influxdata#10215)
  feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10211)
  fix: mqtt topic extracting no longer requires all three fields (influxdata#10208)
  fix: windows service - graceful shutdown of telegraf (influxdata#9616)
  feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10201)
  feat: Modbus support multiple slaves (gateway feature) (influxdata#9279)
  fix: Revert unintented corruption of the Makefile from influxdata#10200. (influxdata#10203)
  chore: remove triggering update-config bot in CI (influxdata#10195)
  Update changelog
  feat: Implement deprecation infrastructure (influxdata#10200)
  fix: extra lock on init for safety (influxdata#10199)
  fix: resolve influxdata#10027 (influxdata#10112)
  fix: register bigquery to output plugins influxdata#10177 (influxdata#10178)
  fix: sysstat use unique temp file vs hard-coded (influxdata#10165)
  refactor: snmp to use gosmi (influxdata#9518)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/snmp feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
6 participants