-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(inputs.nvidia_smi): Add startup_error_behavior config option #14680
Conversation
Thanks so much for the pull request! |
!signed-cla |
Sven and I will be in person this week and will chat about this and get back to you. Thanks! |
Hi @smokhov, Thanks again for the PR. Last week we met face-to-face and disucssed a number of larger issues that we want to resolve. Allowing users to avoid errors like this was one such topic. We wish to add a more generic way to manage these types of scenarios, but we are not opposed to taking this PR. What we do want to see is the common use of options that other plugins are already using. For example, the use of
Obviously, we are not trying to connect to a disconnected server in this case, but instead ignore a missing binary. I would propose: ## Specifies plugin behavior regarding missing nvidia-smi binary
## Available choices:
## - error: telegraf will return an error on startup
## - ignore: telegraf will ignore this plugin
# startup_error_behavior = "error" @srebhan, are you good with this proposal? |
Thank you for the comments and consideration. I am all for standardization; surely can rename the parameter to whatever is more approriate. Separately, I've been also futzing with linting trying to get the CI tests to pass... The 3 that don't appear to be due to a long code line that has a long string (166 characters, not sure what's the actual limit) in one of the tests. How to break it apart to make the linter happy? |
Sorry I guess my comment just removed the label |
require.NoError(t, plugin.Init(), "nvidia-smi not found in /random/non-existent/path and not in PATH; please make sure nvidia-smi is installed and/or is in PATH") In this case, you can break the description string up across multiple lines. |
ok and Sven and I chatted and want to go with |
Update tests to hopefully make linker happy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
I went through and made a couple changes locally, but can't seem to push my changes. Below is my patch. Essentially I made a few changes:
- Make sure in the Gather function we do actually ignore this plugin.
- Update the switch statement to check for unknown error behavior options
- Break out all the test cases into individual cases
Details
From 5f3def1ea63035b6ccfd0f7de9533331334103e0 Mon Sep 17 00:00:00 2001
From: Josh Powers <powersj@fastmail.com>
Date: Wed, 14 Feb 2024 08:09:41 -0700
Subject: [PATCH] Split out the test cases Ensure we ignore on each other Cover
invalid case
---
plugins/inputs/nvidia_smi/nvidia_smi.go | 21 ++++----
plugins/inputs/nvidia_smi/nvidia_smi_test.go | 50 +++++++++++++++-----
2 files changed, 49 insertions(+), 22 deletions(-)
diff --git a/plugins/inputs/nvidia_smi/nvidia_smi.go b/plugins/inputs/nvidia_smi/nvidia_smi.go
index bafaf3df5..78953678a 100644
--- a/plugins/inputs/nvidia_smi/nvidia_smi.go
+++ b/plugins/inputs/nvidia_smi/nvidia_smi.go
@@ -32,7 +32,8 @@ type NvidiaSMI struct {
StartupErrorBehavior string `toml:"startup_error_behavior"`
Log telegraf.Logger `toml:"-"`
- once sync.Once
+ ignorePlugin bool
+ once sync.Once
}
func (*NvidiaSMI) SampleConfig() string {
@@ -42,17 +43,16 @@ func (*NvidiaSMI) SampleConfig() string {
func (smi *NvidiaSMI) Init() error {
if _, err := os.Stat(smi.BinPath); os.IsNotExist(err) {
binPath, err := exec.LookPath("nvidia-smi")
-
- // fail-fast
if err != nil {
switch smi.StartupErrorBehavior {
case "ignore":
- // Ignore the error and move on
- smi.Log.Warnf("GPU SMI not found on the system, ignoring: %s", err)
+ smi.ignorePlugin = true
+ smi.Log.Warnf("nvidia-smi not found on the system, ignoring: %s", err)
return nil
case "", "error":
- default:
return fmt.Errorf("nvidia-smi not found in %q and not in PATH; please make sure nvidia-smi is installed and/or is in PATH", smi.BinPath)
+ default:
+ return fmt.Errorf("unknown startup behavior setting: %s", smi.StartupErrorBehavior)
}
}
smi.BinPath = binPath
@@ -63,6 +63,10 @@ func (smi *NvidiaSMI) Init() error {
// Gather implements the telegraf interface
func (smi *NvidiaSMI) Gather(acc telegraf.Accumulator) error {
+ if smi.ignorePlugin {
+ return nil
+ }
+
// Construct and execute metrics query
data, err := internal.CombinedOutputTimeout(exec.Command(smi.BinPath, "-q", "-x"), time.Duration(smi.Timeout))
if err != nil {
@@ -123,9 +127,8 @@ func (smi *NvidiaSMI) parse(acc telegraf.Accumulator, data []byte) error {
func init() {
inputs.Add("nvidia_smi", func() telegraf.Input {
return &NvidiaSMI{
- BinPath: "/usr/bin/nvidia-smi",
- StartupErrorBehavior: "error",
- Timeout: config.Duration(5 * time.Second),
+ BinPath: "/usr/bin/nvidia-smi",
+ Timeout: config.Duration(5 * time.Second),
}
})
}
diff --git a/plugins/inputs/nvidia_smi/nvidia_smi_test.go b/plugins/inputs/nvidia_smi/nvidia_smi_test.go
index c92986622..f08d6fb41 100644
--- a/plugins/inputs/nvidia_smi/nvidia_smi_test.go
+++ b/plugins/inputs/nvidia_smi/nvidia_smi_test.go
@@ -11,25 +11,49 @@ import (
"github.com/stretchr/testify/require"
)
-func TestStartPluginIfGPUNotFound(t *testing.T) {
- plugin := &NvidiaSMI{Log: &testutil.Logger{}}
-
- plugin.StartupErrorBehavior = "error"
- plugin.BinPath = "/usr/bin/nvidia-smi"
- require.NoError(t, plugin.Init())
+func TestErrorBehaviorError(t *testing.T) {
+ // make sure we can't find nvidia-smi in $PATH somewhere
+ os.Unsetenv("PATH")
+ plugin := &NvidiaSMI{
+ BinPath: "/random/non-existent/path",
+ Log: &testutil.Logger{},
+ StartupErrorBehavior: "error",
+ }
+ require.Error(t, plugin.Init())
+}
+func TestErrorBehaviorDefault(t *testing.T) {
// make sure we can't find nvidia-smi in $PATH somewhere
os.Unsetenv("PATH")
+ plugin := &NvidiaSMI{
+ BinPath: "/random/non-existent/path",
+ Log: &testutil.Logger{},
+ }
+ require.Error(t, plugin.Init())
+}
- plugin.StartupErrorBehavior = "ignore"
- plugin.BinPath = "/random/non-existent/path"
+func TestErorBehaviorIgnore(t *testing.T) {
+ // make sure we can't find nvidia-smi in $PATH somewhere
+ os.Unsetenv("PATH")
+ plugin := &NvidiaSMI{
+ BinPath: "/random/non-existent/path",
+ Log: &testutil.Logger{},
+ StartupErrorBehavior: "ignore",
+ }
require.NoError(t, plugin.Init())
+ acc := testutil.Accumulator{}
+ require.NoError(t, plugin.Gather(&acc))
+}
- plugin.StartupErrorBehavior = "error"
- plugin.BinPath = "/random/non-existent/path"
- errMsg := "nvidia-smi not found in /random/non-existent/path and not in PATH; " +
- "please make sure nvidia-smi is installed and/or is in PATH"
- require.NoError(t, plugin.Init(), errMsg)
+func TestErrorBehaviorInvalidOption(t *testing.T) {
+ // make sure we can't find nvidia-smi in $PATH somewhere
+ os.Unsetenv("PATH")
+ plugin := &NvidiaSMI{
+ BinPath: "/random/non-existent/path",
+ Log: &testutil.Logger{},
+ StartupErrorBehavior: "giveup",
+ }
+ require.Error(t, plugin.Init())
}
func TestGatherValidXML(t *testing.T) {
--
4.43.1
Weird. I don't see (and don't recall seeing it when I made the PR) the checkbox to "allow maintainers to make changes" or something of that sort. Not sure what's up with that.
Thank you, let me apply and push it shortly. |
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 🥳 This pull request decreases the Telegraf binary size by -3.50 % for linux amd64 (new size: 220.6 MB, nightly size 228.6 MB) 📦 Click here to get additional PR build artifactsArtifact URLs |
The two ci tests barfing out don't appear to be related to the plugin in question. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I agree the errors are unrelated, I've kicked them off again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thanks for your contribution @smokhov!
Summary
As per #14603 adding a capability to ignore absence of GPUs e.g. in compute clusters that share a
telegraf.conf
file across multiple nodes via an optionalstartup_error_behavior
parameter. The default is to error out as before.Checklist
Related issues
addresses NVIDIA part of #14603
Once accepted the AMDGPU one will follow