Telegraf 1.19 - Windows smart plugin does not work correctly with adaptec raid #9417

Closed
marianob85 opened this issue Jun 23, 2021 · 1 comment · Fixed by #10150
Labels
area/smart, bug, platform/windows

Comments

marianob85 commented Jun 23, 2021

Relevant telegraf.conf:

 # Read metrics from storage devices supporting S.M.A.R.T.
 [[inputs.smart]]
   ## Optionally specify the path to the smartctl executable
   path_smartctl = "C:/Program Files/smartmontools/bin/smartctl.exe"

   ## Optionally specify the path to the nvme-cli executable
   # path_nvme = "/usr/bin/nvme"

   ## Optionally specify if vendor specific attributes should be propagated for NVMe disk case
   ## ["auto-on"] - automatically find and enable additional vendor specific disk info
   ## ["vendor1", "vendor2", ...] - e.g. "Intel" enable additional Intel specific disk info
   # enable_extensions = ["auto-on"]

   ## On most platforms used cli utilities requires root access.
   ## Setting 'use_sudo' to true will make use of sudo to run smartctl or nvme-cli.
   ## Sudo must be configured to allow the telegraf user to run smartctl or nvme-cli
   ## without a password.
   use_sudo = false

   ## Skip checking disks in this power mode. Defaults to
   ## "standby" to not wake up disks that have stopped rotating.
   ## See --nocheck in the man pages for smartctl.
   ## smartctl version 5.41 and 5.42 have faulty detection of
   ## power mode and might require changing this value to
   ## "never" depending on your disks.
   nocheck = "never"

   ## Gather all returned S.M.A.R.T. attribute metrics and the detailed
   ## information from each drive into the 'smart_attribute' measurement.
   attributes = false

   ## Optionally specify devices to exclude from reporting if disks auto-discovery is performed.
   # excludes = [ "/dev/pass6" ]

   ## Optionally specify devices and device type, if unset
   ## a scan (smartctl --scan and smartctl --scan -d nvme) for S.M.A.R.T. devices will be done
   ## and all found will be included except for the excluded in excludes.
   # devices = [ "/dev/ada0 -d atacam", "/dev/nvme0"]
   devices = [ "/dev/sda -d aacraid,0,0,0", "/dev/sda -d aacraid,0,0,1", "/dev/sda -d aacraid,0,0,2", "/dev/sda -d aacraid,0,0,3", "/dev/csmi1,2", "/dev/csmi1,3"]

   ## Timeout for the cli command to complete.
   timeout = "30s"

System info:

Windows Server 2019
smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-2019] (sf-7.2-1)
Adaptec RAID 2405 (4 drives in RAID 0)

Steps to reproduce:

Run telegraf from PowerShell with the smart config above:
& 'C:\Program Files\telegraf\telegraf.exe' --test --debug --input-filter smart

Expected behavior:

smart_device,device=sda,enabled=Enabled,host=gcf-server-b,model=WD5000AAKX-0,power=ACTIVE,serial_no=WD-WMAYUW891053 exit_status=0i,health_ok=true,temp_c=0i 1624436434000000000
smart_device,device=sda,enabled=Enabled,host=gcf-server-b,model=ST3500418AS,power=ACTIVE,serial_no=5VMCLCS4 exit_status=0i,health_ok=true,temp_c=0i 1624436434000000000
smart_device,device=sda,enabled=Enabled,host=gcf-server-b,model=ST3500418AS,power=ACTIVE,serial_no=9VMKW2FV exit_status=8i,health_ok=false,temp_c=0i 1624436435000000000
smart_device,device=sda,enabled=Enabled,host=gcf-server-b,model=WD5000AAKX-0,power=ACTIVE,serial_no=WD-WMAYUW767828 exit_status=0i,health_ok=true,temp_c=0i 1624436435000000000
smart_device,device=csmi1,2,enabled=Enabled,host=gcf-server-b,model=KINGSTON\ SVP200S360G,power=ACTIVE,serial_no=50026B7223013010,wwn=50026b7223013010 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=29i 1624436435000000000
smart_device,device=csmi1,3,enabled=Enabled,host=gcf-server-b,model=KINGSTON\ SVP200S360G,power=ACTIVE,serial_no=50026B7223013805,wwn=50026b7223013805 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=29i 1624436435000000000

Actual behavior:

smart_device,device=sda,host=gcf-server-b exit_status=2i 1624435449000000000
smart_device,device=sda,host=gcf-server-b exit_status=2i 1624435449000000000
smart_device,device=sda,host=gcf-server-b exit_status=2i 1624435449000000000
smart_device,device=csmi1,3,enabled=Enabled,host=gcf-server-b,model=KINGSTON\ SVP200S360G,power=ACTIVE,serial_no=50026B7223013805,wwn=50026b7223013805 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=29i 1624435449000000000
smart_device,device=csmi1,2,enabled=Enabled,host=gcf-server-b,model=KINGSTON\ SVP200S360G,power=ACTIVE,serial_no=50026B7223013010,wwn=50026b7223013010 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=29i 1624435449000000000
smart_device,device=sda,enabled=Enabled,host=gcf-server-b,model=WD5000AAKX-0,power=ACTIVE,serial_no=WD-WMAYUW767828 exit_status=0i,health_ok=true,temp_c=0i 1624435449000000000

Additional info:

The problem is that the getAttributes function in smart.go calls gatherDisk asynchronously. Unfortunately, parallel access to the RAID controller causes errors, and the only way to get the output shown in "Expected behavior" is to completely remove the asynchronous functionality from getAttributes (smart.go).
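To illustrate the idea of querying the disks one at a time, here is a minimal sketch in Go. It is not the plugin's code: `gatherFunc`, `gatherSerial`, and `fakeGather` are hypothetical names made up for this example, and the real gatherDisk has a different signature. It only shows the shape of a loop in which each device is queried after the previous one has finished.

```go
package main

import "fmt"

// gatherFunc is a hypothetical stand-in for the plugin's per-device
// smartctl invocation; the real gatherDisk signature is different.
type gatherFunc func(device string) string

// gatherSerial queries the devices one after another on the calling
// goroutine, so no two requests reach the controller at the same time.
func gatherSerial(devices []string, gather gatherFunc) []string {
	results := make([]string, 0, len(devices))
	for _, device := range devices {
		results = append(results, gather(device))
	}
	return results
}

func main() {
	devices := []string{
		"/dev/sda -d aacraid,0,0,0",
		"/dev/sda -d aacraid,0,0,1",
	}
	// fakeGather only echoes its argument; a real implementation would run smartctl.
	fakeGather := func(device string) string { return "queried " + device }
	for _, line := range gatherSerial(devices, fakeGather) {
		fmt.Println(line)
	}
}
```

The trade-off is that a full gather now takes the sum of the per-device query times instead of roughly the slowest one, which should be acceptable for a handful of disks with the 30s timeout above.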

marianob85 added the bug label Jun 23, 2021
marianob85 changed the title from "Telegraf 1.19 - Windows smart does not work correctly with adaptec raid" to "Telegraf 1.19 - Windows smart plugin does not work correctly with adaptec raid" Jun 23, 2021
zak-pawel commented

It seems that, for now, you can find a workaround here: #8684
