Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(inputs.nvidia_smi): Add startup_error_behavior config option #14680

Merged
merged 15 commits into from
Feb 15, 2024

Conversation

smokhov
Copy link
Contributor

@smokhov smokhov commented Feb 5, 2024

Summary

As per #14603 adding a capability to ignore absence of GPUs e.g. in compute clusters that share a telegraf.conf file across multiple nodes via an optional startup_error_behavior parameter. The default is to error out as before.

Checklist

  • No AI generated code was used in this PR

Related issues

addresses NVIDIA part of #14603
Once accepted the AMDGPU one will follow

@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Feb 5, 2024

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Feb 5, 2024
@smokhov
Copy link
Contributor Author

smokhov commented Feb 5, 2024

!signed-cla

@smokhov
Copy link
Contributor Author

smokhov commented Feb 5, 2024

@powersj @srebhan -- is this a correct approach, somewhat?

@powersj
Copy link
Contributor

powersj commented Feb 5, 2024

Sven and I will be in person this week and will chat about this and get back to you. Thanks!

@smokhov smokhov marked this pull request as ready for review February 6, 2024 01:15
@powersj
Copy link
Contributor

powersj commented Feb 12, 2024

Hi @smokhov,

Thanks again for the PR.

Last week we met face-to-face and disucssed a number of larger issues that we want to resolve. Allowing users to avoid errors like this was one such topic. We wish to add a more generic way to manage these types of scenarios, but we are not opposed to taking this PR.

What we do want to see is the common use of options that other plugins are already using. For example, the use of error or ignore which are found in the mongodb, sql, and vsphere plugins:

  ## Specifies plugin behavior regarding disconnected servers
  ## Available choices :
  ##   - error: telegraf will return an error on startup if one the servers is unreachable
  ##   - ignore: telegraf will ignore unreachable servers on both startup and gather
  # disconnected_servers_behavior = "error"

Obviously, we are not trying to connect to a disconnected server in this case, but instead ignore a missing binary. I would propose:

  ## Specifies plugin behavior regarding missing nvidia-smi binary
  ## Available choices:
  ##   - error: telegraf will return an error on startup
  ##   - ignore: telegraf will ignore this plugin
  # startup_error_behavior = "error"

@srebhan, are you good with this proposal?

@powersj powersj added the waiting for response waiting for response from contributor label Feb 12, 2024
@smokhov
Copy link
Contributor Author

smokhov commented Feb 13, 2024

Thank you for the comments and consideration. I am all for standardization; surely can rename the parameter to whatever is more approriate.

Separately, I've been also futzing with linting trying to get the CI tests to pass... The 3 that don't appear to be due to a long code line that has a long string (166 characters, not sure what's the actual limit) in one of the tests. How to break it apart to make the linter happy?

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Feb 13, 2024
@smokhov
Copy link
Contributor Author

smokhov commented Feb 13, 2024

Sorry I guess my comment just removed the label

@powersj
Copy link
Contributor

powersj commented Feb 13, 2024

plugins/inputs/nvidia_smi/nvidia_smi_test.go:23 lll line is 166 characters

require.NoError(t, plugin.Init(), "nvidia-smi not found in /random/non-existent/path and not in PATH; please make sure nvidia-smi is installed and/or is in PATH")

In this case, you can break the description string up across multiple lines.

@powersj
Copy link
Contributor

powersj commented Feb 13, 2024

ok and Sven and I chatted and want to go with startup_error_behavior I've updated my proposal above.

@powersj powersj added the waiting for response waiting for response from contributor label Feb 13, 2024
Update tests to hopefully make linker happy.
@smokhov smokhov changed the title feat(inputs.nvidia_smi): add if_not_found ignore option feat(inputs.nvidia_smi): Add startup_error_behavior config option Feb 14, 2024
@smokhov
Copy link
Contributor Author

smokhov commented Feb 14, 2024

Looks like it worked, over to you, @powersj @srebhan ! Thanks!

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Feb 14, 2024
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

I went through and made a couple changes locally, but can't seem to push my changes. Below is my patch. Essentially I made a few changes:

  1. Make sure in the Gather function we do actually ignore this plugin.
  2. Update the switch statement to check for unknown error behavior options
  3. Break out all the test cases into individual cases
Details

From 5f3def1ea63035b6ccfd0f7de9533331334103e0 Mon Sep 17 00:00:00 2001
From: Josh Powers <powersj@fastmail.com>
Date: Wed, 14 Feb 2024 08:09:41 -0700
Subject: [PATCH] Split out the test cases Ensure we ignore on each other Cover
 invalid case

---
 plugins/inputs/nvidia_smi/nvidia_smi.go      | 21 ++++----
 plugins/inputs/nvidia_smi/nvidia_smi_test.go | 50 +++++++++++++++-----
 2 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/plugins/inputs/nvidia_smi/nvidia_smi.go b/plugins/inputs/nvidia_smi/nvidia_smi.go
index bafaf3df5..78953678a 100644
--- a/plugins/inputs/nvidia_smi/nvidia_smi.go
+++ b/plugins/inputs/nvidia_smi/nvidia_smi.go
@@ -32,7 +32,8 @@ type NvidiaSMI struct {
 	StartupErrorBehavior string          `toml:"startup_error_behavior"`
 	Log                  telegraf.Logger `toml:"-"`
 
-	once sync.Once
+	ignorePlugin bool
+	once         sync.Once
 }
 
 func (*NvidiaSMI) SampleConfig() string {
@@ -42,17 +43,16 @@ func (*NvidiaSMI) SampleConfig() string {
 func (smi *NvidiaSMI) Init() error {
 	if _, err := os.Stat(smi.BinPath); os.IsNotExist(err) {
 		binPath, err := exec.LookPath("nvidia-smi")
-
-		// fail-fast
 		if err != nil {
 			switch smi.StartupErrorBehavior {
 			case "ignore":
-				// Ignore the error and move on
-				smi.Log.Warnf("GPU SMI not found on the system, ignoring: %s", err)
+				smi.ignorePlugin = true
+				smi.Log.Warnf("nvidia-smi not found on the system, ignoring: %s", err)
 				return nil
 			case "", "error":
-			default:
 				return fmt.Errorf("nvidia-smi not found in %q and not in PATH; please make sure nvidia-smi is installed and/or is in PATH", smi.BinPath)
+			default:
+				return fmt.Errorf("unknown startup behavior setting: %s", smi.StartupErrorBehavior)
 			}
 		}
 		smi.BinPath = binPath
@@ -63,6 +63,10 @@ func (smi *NvidiaSMI) Init() error {
 
 // Gather implements the telegraf interface
 func (smi *NvidiaSMI) Gather(acc telegraf.Accumulator) error {
+	if smi.ignorePlugin {
+		return nil
+	}
+
 	// Construct and execute metrics query
 	data, err := internal.CombinedOutputTimeout(exec.Command(smi.BinPath, "-q", "-x"), time.Duration(smi.Timeout))
 	if err != nil {
@@ -123,9 +127,8 @@ func (smi *NvidiaSMI) parse(acc telegraf.Accumulator, data []byte) error {
 func init() {
 	inputs.Add("nvidia_smi", func() telegraf.Input {
 		return &NvidiaSMI{
-			BinPath:              "/usr/bin/nvidia-smi",
-			StartupErrorBehavior: "error",
-			Timeout:              config.Duration(5 * time.Second),
+			BinPath: "/usr/bin/nvidia-smi",
+			Timeout: config.Duration(5 * time.Second),
 		}
 	})
 }
diff --git a/plugins/inputs/nvidia_smi/nvidia_smi_test.go b/plugins/inputs/nvidia_smi/nvidia_smi_test.go
index c92986622..f08d6fb41 100644
--- a/plugins/inputs/nvidia_smi/nvidia_smi_test.go
+++ b/plugins/inputs/nvidia_smi/nvidia_smi_test.go
@@ -11,25 +11,49 @@ import (
 	"github.com/stretchr/testify/require"
 )
 
-func TestStartPluginIfGPUNotFound(t *testing.T) {
-	plugin := &NvidiaSMI{Log: &testutil.Logger{}}
-
-	plugin.StartupErrorBehavior = "error"
-	plugin.BinPath = "/usr/bin/nvidia-smi"
-	require.NoError(t, plugin.Init())
+func TestErrorBehaviorError(t *testing.T) {
+	// make sure we can't find nvidia-smi in $PATH somewhere
+	os.Unsetenv("PATH")
+	plugin := &NvidiaSMI{
+		BinPath:              "/random/non-existent/path",
+		Log:                  &testutil.Logger{},
+		StartupErrorBehavior: "error",
+	}
+	require.Error(t, plugin.Init())
+}
 
+func TestErrorBehaviorDefault(t *testing.T) {
 	// make sure we can't find nvidia-smi in $PATH somewhere
 	os.Unsetenv("PATH")
+	plugin := &NvidiaSMI{
+		BinPath: "/random/non-existent/path",
+		Log:     &testutil.Logger{},
+	}
+	require.Error(t, plugin.Init())
+}
 
-	plugin.StartupErrorBehavior = "ignore"
-	plugin.BinPath = "/random/non-existent/path"
+func TestErorBehaviorIgnore(t *testing.T) {
+	// make sure we can't find nvidia-smi in $PATH somewhere
+	os.Unsetenv("PATH")
+	plugin := &NvidiaSMI{
+		BinPath:              "/random/non-existent/path",
+		Log:                  &testutil.Logger{},
+		StartupErrorBehavior: "ignore",
+	}
 	require.NoError(t, plugin.Init())
+	acc := testutil.Accumulator{}
+	require.NoError(t, plugin.Gather(&acc))
+}
 
-	plugin.StartupErrorBehavior = "error"
-	plugin.BinPath = "/random/non-existent/path"
-	errMsg := "nvidia-smi not found in /random/non-existent/path and not in PATH; " +
-		"please make sure nvidia-smi is installed and/or is in PATH"
-	require.NoError(t, plugin.Init(), errMsg)
+func TestErrorBehaviorInvalidOption(t *testing.T) {
+	// make sure we can't find nvidia-smi in $PATH somewhere
+	os.Unsetenv("PATH")
+	plugin := &NvidiaSMI{
+		BinPath:              "/random/non-existent/path",
+		Log:                  &testutil.Logger{},
+		StartupErrorBehavior: "giveup",
+	}
+	require.Error(t, plugin.Init())
 }
 
 func TestGatherValidXML(t *testing.T) {
-- 
4.43.1

@powersj powersj added the waiting for response waiting for response from contributor label Feb 14, 2024
@smokhov
Copy link
Contributor Author

smokhov commented Feb 14, 2024

I went through and made a couple changes locally, but can't seem to push my changes.

Weird. I don't see (and don't recall seeing it when I made the PR) the checkbox to "allow maintainers to make changes" or something of that sort. Not sure what's up with that.

Below is my patch. Essentially I made a few changes:

Thank you, let me apply and push it shortly.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Feb 14, 2024
@telegraf-tiger
Copy link
Contributor

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

🥳 This pull request decreases the Telegraf binary size by -3.50 % for linux amd64 (new size: 220.6 MB, nightly size 228.6 MB)

📦 Click here to get additional PR build artifacts

Artifact URLs

DEB RPM TAR GZ ZIP
amd64.deb aarch64.rpm darwin_amd64.tar.gz windows_amd64.zip
arm64.deb armel.rpm darwin_arm64.tar.gz windows_arm64.zip
armel.deb armv6hl.rpm freebsd_amd64.tar.gz windows_i386.zip
armhf.deb i386.rpm freebsd_armv7.tar.gz
i386.deb ppc64le.rpm freebsd_i386.tar.gz
mips.deb riscv64.rpm linux_amd64.tar.gz
mipsel.deb s390x.rpm linux_arm64.tar.gz
ppc64el.deb x86_64.rpm linux_armel.tar.gz
riscv64.deb linux_armhf.tar.gz
s390x.deb linux_i386.tar.gz
linux_mips.tar.gz
linux_mipsel.tar.gz
linux_ppc64le.tar.gz
linux_riscv64.tar.gz
linux_s390x.tar.gz

@smokhov
Copy link
Contributor Author

smokhov commented Feb 14, 2024

The two ci tests barfing out don't appear to be related to the plugin in question.

@smokhov smokhov requested a review from powersj February 14, 2024 20:08
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I agree the errors are unrelated, I've kicked them off again.

@powersj powersj removed their assignment Feb 14, 2024
@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Feb 14, 2024
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks for your contribution @smokhov!

@srebhan srebhan merged commit 2c815e4 into influxdata:master Feb 15, 2024
26 checks passed
@github-actions github-actions bot added this to the v1.30.0 milestone Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants