[input.smart]: Specifying multiple drives in the devices= parameter fails when the '-d' parameter is used to differentiate drives within a hardware RAID array #8684

Closed
douginoz opened this issue Jan 13, 2021 · 7 comments · Fixed by #10150
Labels
area/smart bug unexpected problem or unintended behavior

Comments

@douginoz

Relevant telegraf.conf:

[[inputs.smart]]
path_smartctl = "/usr/sbin/smartctl"
path_nvme = "/usr/sbin/nvme"
enable_extensions = ["auto-on"]
use_sudo = true
attributes = true
#
#  List all 24 drives within the Areca RAID controller:
devices = [ "/dev/sg3 -d areca,2/2", "/dev/sg3 -d areca,3/2"]
#
#  Note that the above was truncated for brevity.  The remainder of that line is below for all drives:
#, "/dev/sg3 -d areca,3/2", "/dev/sg3 -d areca,4/2", "/dev/sg3 -d areca,5/2", "/dev/sg3 -d areca,6/2", "/dev/sg3 -d areca,7/2", "/dev/sg3 -d areca,8/2", "/dev/sg3 -d areca,9/2", "/dev/sg3 -d areca,10/2", "/dev/sg3 -d areca,11/2", "/dev/sg3 -d areca,12/2", "/dev/sg3 -d areca,13/2", "/dev/sg3 -d areca,14/2", "/dev/sg3 -d areca,15/2", "/dev/sg3 ->

System info:

Telegraf 1.17.0 (git: HEAD 3f7a54c)
Linux sophie 5.4.0-58-generic #64-Ubuntu SMP (Ubuntu 20.04)
smartmontools release 7.1 dated 2019-12-30 at 15:00:11 UTC
smartmontools SVN rev 5022 dated 2019-12-30 at 15:00:49
smartmontools build host: x86_64-pc-linux-gnu

Docker

Docker not being used.

Steps to reproduce:

  1. Specifying the individual drives within the Areca RAID array in the [[inputs.smart]] section of telegraf.conf fails
  2. Any one of the following lines works:
devices = [ "/dev/sg3 -d areca,2/2" ]             # A single drive from the Areca array, indicated via 'areca,2/2'
devices = [ "/dev/sg3 -d areca,7/2" ]           # A different drive from the Areca array - 'areca 7/2'
devices = [ "/dev/sg3 -d areca,1/2", "/dev/sdd" ]   # Multiple drives, so long as only one is from the Areca array
devices = [ "/dev/sdd", "/dev/sdd"]                # Duplicating the same non-Areca drive seems to work without error

  3. But the following doesn't work:
    devices = [ "/dev/sg3 -d areca,2/2", "/dev/sg3 -d areca,3/2" ] # Two drives from within the Areca array

Expected behaviour:

Smartctl can retrieve data from individual drives within a hardware array provided the '--device=' parameter is correct.
For Areca arrays, the following is correct syntax:

smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief --device=areca,1/2 /dev/sg3

This returns comprehensive data about the specific drive within the array:

root@sophie:/etc/telegraf# smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief --device=areca,1/2 /dev/sg3
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-58-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HDN721010ALE604
Serial Number:    1SJS3J5Z
LU WWN Device Id: 5 000cca 26be6b0c3
Firmware Version: 83XN
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan 13 01:13:50 2021 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  --S---   135   135   054    -    92
  3 Spin_Up_Time            POS---   175   175   024    -    406 (Average 344)
  4 Start_Stop_Count        -O--C-   100   100   000    -    99
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         -O-R--   100   100   067    -    0
  8 Seek_Time_Performance   --S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   098   098   000    -    16232
 10 Spin_Retry_Count        -O--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    99
 22 Unknown_Attribute       PO---K   100   100   025    -    100
192 Power-Off_Retract_Count -O--CK   100   100   000    -    823
193 Load_Cycle_Count        -O--C-   100   100   000    -    823
194 Temperature_Celsius     -O----   166   166   000    -    36 (Min/Max 17/59)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0

When a different value is used, Smartctl correctly retrieves the data for the different disk within the array:

root@sophie:/etc/telegraf# smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief --device=areca,7/2 /dev/sg3
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-58-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HDN721010ALE604
Serial Number:    1SJTUGKZ
LU WWN Device Id: 5 000cca 26be77786
Firmware Version: 83XN
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan 13 01:14:54 2021 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  --S---   134   134   054    -    96
  3 Spin_Up_Time            POS---   178   178   024    -    316 (Average 424)
  4 Start_Stop_Count        -O--C-   100   100   000    -    109
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         -O-R--   100   100   067    -    0
  8 Seek_Time_Performance   --S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   098   098   000    -    20286
 10 Spin_Retry_Count        -O--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    109
 22 Unknown_Attribute       PO---K   100   100   025    -    100
192 Power-Off_Retract_Count -O--CK   100   100   000    -    1020
193 Load_Cycle_Count        -O--C-   100   100   000    -    1020
194 Temperature_Celsius     -O----   166   166   000    -    36 (Min/Max 16/62)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0

The [[inputs.smart]] plugin allows for specifying device and device type:

#   ## Optionally specify devices and device type, if unset
#   ## a scan (smartctl --scan and smartctl --scan -d nvme) for S.M.A.R.T. devices will be done
#   ## and all found will be included except for the excluded in excludes.
#   # devices = [ "/dev/ada0 -d atacam", "/dev/nvme0"]
devices = [ "/dev/sg3 -d areca,12/2" ]

The above works correctly:

root@sophie:/etc/telegraf# nano telegraf.conf
root@sophie:/etc/telegraf# telegraf  --test| grep smart
2021-01-13T09:17:43Z I! Starting Telegraf 1.17.0
2021-01-13T09:17:43Z I! Using config file: /etc/telegraf/telegraf.conf
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=POSR--,host=sophie,id=1,model=ST10000NM0086-2AA101,name=Raw_Read_Error_Rate,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=344184i,threshold=44i,value=100i,worst=64i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=PO----,host=sophie,id=3,model=ST10000NM0086-2AA101,name=Spin_Up_Time,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=0i,value=93i,worst=92i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=4,model=ST10000NM0086-2AA101,name=Start_Stop_Count,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=51i,threshold=20i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=PO--CK,host=sophie,id=5,model=ST10000NM0086-2AA101,name=Reallocated_Sector_Ct,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=48i,threshold=10i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=POSR--,host=sophie,id=7,model=ST10000NM0086-2AA101,name=Seek_Error_Rate,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=427333388i,threshold=45i,value=86i,worst=60i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=9,model=ST10000NM0086-2AA101,name=Power_On_Hours,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=13568i,threshold=0i,value=85i,worst=85i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=PO--C-,host=sophie,id=10,model=ST10000NM0086-2AA101,name=Spin_Retry_Count,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=97i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=12,model=ST10000NM0086-2AA101,name=Power_Cycle_Count,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=53i,threshold=20i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=184,model=ST10000NM0086-2AA101,name=End-to-End_Error,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=99i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=187,model=ST10000NM0086-2AA101,name=Reported_Uncorrect,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=0i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=188,model=ST10000NM0086-2AA101,name=Command_Timeout,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=0i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O-RCK,host=sophie,id=189,model=ST10000NM0086-2AA101,name=High_Fly_Writes,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=68i,threshold=0i,value=32i,worst=32i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=Past,flags=-O---K,host=sophie,id=190,model=ST10000NM0086-2AA101,name=Airflow_Temperature_Cel,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=35i,threshold=40i,value=65i,worst=39i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=191,model=ST10000NM0086-2AA101,name=G-Sense_Error_Rate,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=11417i,threshold=0i,value=95i,worst=95i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=192,model=ST10000NM0086-2AA101,name=Power-Off_Retract_Count,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=555i,threshold=0i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=193,model=ST10000NM0086-2AA101,name=Load_Cycle_Count,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=605i,threshold=0i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O---K,host=sophie,id=194,model=ST10000NM0086-2AA101,name=Temperature_Celsius,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=35i,threshold=0i,value=35i,worst=61i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O-RC-,host=sophie,id=195,model=ST10000NM0086-2AA101,name=Hardware_ECC_Recovered,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=344184i,threshold=0i,value=100i,worst=64i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--C-,host=sophie,id=197,model=ST10000NM0086-2AA101,name=Current_Pending_Sector,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=0i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=----C-,host=sophie,id=198,model=ST10000NM0086-2AA101,name=Offline_Uncorrectable,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=0i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-OSRCK,host=sophie,id=199,model=ST10000NM0086-2AA101,name=UDMA_CRC_Error_Count,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=0i,value=200i,worst=200i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=PO---K,host=sophie,id=200,model=ST10000NM0086-2AA101,name=Multi_Zone_Error_Rate,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=0i,threshold=1i,value=100i,worst=100i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=------,host=sophie,id=240,model=ST10000NM0086-2AA101,name=Head_Flying_Hours,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=48344677i,threshold=0i,value=100i,worst=253i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=------,host=sophie,id=241,model=ST10000NM0086-2AA101,name=Total_LBAs_Written,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=140759859579i,threshold=0i,value=100i,worst=253i 1610529464000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=------,host=sophie,id=242,model=ST10000NM0086-2AA101,name=Total_LBAs_Read,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,raw_value=399844847220i,threshold=0i,value=100i,worst=253i 1610529464000000000
> smart_device,capacity=10000831348736,device=sg3,enabled=Enabled,host=sophie,model=ST10000NM0086-2AA101,serial_no=ZA29PM9W,wwn=5000c500b375b078 exit_status=32i,health_ok=true,read_error_rate=344184i,seek_error_rate=427333388i,temp_c=35i,udma_crc_errors=0i 1610529464000000000

As do any of the examples previously documented above.
However, the following doesn't work:

devices = [ "/dev/sg3 -d areca,1/2", "/dev/sg3 -d areca,7/2" ]

Actual behavior:

The following is the output from specifying more than one drive within the areca array:

root@sophie:/etc/telegraf# telegraf  --test| grep smart
2021-01-13T09:20:36Z I! Starting Telegraf 1.17.0
2021-01-13T09:20:36Z I! Using config file: /etc/telegraf/telegraf.conf
> smart_device,device=sg3,host=sophie exit_status=2i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=PO-R--,host=sophie,id=1,model=HGST\ HDN721010ALE604,name=Raw_Read_Error_Rate,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=16i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=--S---,host=sophie,id=2,model=HGST\ HDN721010ALE604,name=Throughput_Performance,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=92i,threshold=54i,value=135i,worst=135i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=POS---,host=sophie,id=3,model=HGST\ HDN721010ALE604,name=Spin_Up_Time,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=406i,threshold=24i,value=175i,worst=175i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--C-,host=sophie,id=4,model=HGST\ HDN721010ALE604,name=Start_Stop_Count,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=99i,threshold=0i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=PO--CK,host=sophie,id=5,model=HGST\ HDN721010ALE604,name=Reallocated_Sector_Ct,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=5i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O-R--,host=sophie,id=7,model=HGST\ HDN721010ALE604,name=Seek_Error_Rate,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=67i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=--S---,host=sophie,id=8,model=HGST\ HDN721010ALE604,name=Seek_Time_Performance,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=18i,threshold=20i,value=128i,worst=128i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--C-,host=sophie,id=9,model=HGST\ HDN721010ALE604,name=Power_On_Hours,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=16232i,threshold=0i,value=98i,worst=98i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--C-,host=sophie,id=10,model=HGST\ HDN721010ALE604,name=Spin_Retry_Count,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=60i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=12,model=HGST\ HDN721010ALE604,name=Power_Cycle_Count,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=99i,threshold=0i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=PO---K,host=sophie,id=22,model=HGST\ HDN721010ALE604,name=Unknown_Attribute,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=100i,threshold=25i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=192,model=HGST\ HDN721010ALE604,name=Power-Off_Retract_Count,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=823i,threshold=0i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--C-,host=sophie,id=193,model=HGST\ HDN721010ALE604,name=Load_Cycle_Count,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=823i,threshold=0i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O----,host=sophie,id=194,model=HGST\ HDN721010ALE604,name=Temperature_Celsius,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=36i,threshold=0i,value=166i,worst=166i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O--CK,host=sophie,id=196,model=HGST\ HDN721010ALE604,name=Reallocated_Event_Count,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=0i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O---K,host=sophie,id=197,model=HGST\ HDN721010ALE604,name=Current_Pending_Sector,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=0i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=---R--,host=sophie,id=198,model=HGST\ HDN721010ALE604,name=Offline_Uncorrectable,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=0i,value=100i,worst=100i 1610529637000000000
> smart_attribute,capacity=10000831348736,device=sg3,enabled=Enabled,fail=-,flags=-O-R--,host=sophie,id=199,model=HGST\ HDN721010ALE604,name=UDMA_CRC_Error_Count,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,raw_value=0i,threshold=0i,value=200i,worst=200i 1610529637000000000
> smart_device,capacity=10000831348736,device=sg3,enabled=Enabled,host=sophie,model=HGST\ HDN721010ALE604,serial_no=1SJS3J5Z,wwn=5000cca26be6b0c3 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1610529637000000000

Only the first drive is reported. Subsequent drives within the same array are not: for each of them, only an 'exit_status=2i' line is returned.
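For readers interpreting the exit_status field: smartctl's exit status is a bitmask, documented in the return values section of its man page. The sketch below decodes it, under the assumption that the plugin passes smartctl's exit status through unchanged into exit_status. The names decodeExitStatus and smartctlExitBits are made up for illustration and are not part of Telegraf.

```go
package main

import "fmt"

// smartctlExitBits paraphrases the per-bit meanings from the smartctl man page.
// Index i describes bit i of the exit status.
var smartctlExitBits = []string{
	"command line did not parse",
	"device open failed, no IDENTIFY DEVICE data, or device in low-power mode",
	"a SMART or other ATA command failed, or a SMART data checksum error occurred",
	"SMART status check returned DISK FAILING",
	"prefail attributes found at or below threshold",
	"DISK OK, but some attributes were at or below threshold in the past",
	"device error log contains records of errors",
	"device self-test log contains records of errors",
}

// decodeExitStatus (hypothetical helper) returns the meaning of every bit set in status.
func decodeExitStatus(status int) []string {
	var meanings []string
	for bit, meaning := range smartctlExitBits {
		if status&(1<<uint(bit)) != 0 {
			meanings = append(meanings, meaning)
		}
	}
	return meanings
}

func main() {
	// 2 is the value seen for the failing Areca drives; 32 is the value seen
	// in the earlier single-drive run that worked.
	for _, status := range []int{2, 32} {
		fmt.Println(status, decodeExitStatus(status))
	}
}
```

Read this way, exit_status=2i means smartctl could not open the device (or got no identify data), while the exit_status=32i in the working single-drive output is consistent with the fail=Past value on Airflow_Temperature_Cel.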

Strangely, the following works:
devices = [ "/dev/sdf", "/dev/sdf", "/dev/sdf", "/dev/sdf" ]
so it doesn't appear to be a problem with handling duplicate drives.

Additional info:

Issue #4720, logged by @sachaz on Sept 19, 2018, was identical, except that he was using an HP Smart Array instead of an Areca. The principle is the same: you must use the '-d' parameter to specify individual drives within the array.

@douginoz douginoz added the bug unexpected problem or unintended behavior label Jan 13, 2021
@douginoz
Author

Still plugging away at it. Looking at similar issues previously logged, I tried their (failed) workaround attempts and found the same result. The following should work according to documentation, but it still returns the same error:

[[inputs.smart]]
  name_override = "raid_disk1"
  path_smartctl = "/usr/sbin/smartctl"
  path_nvme = "/usr/sbin/nvme"
  enable_extensions = ["auto-on"]
  use_sudo = true
  attributes = false
  devices = [ "/dev/sg3 --device=areca,1/2" ]

[[inputs.smart]]
  name_override = "raid_disk2"
  path_smartctl = "/usr/sbin/smartctl"
  path_nvme = "/usr/sbin/nvme"
  enable_extensions = ["auto-on"]
  use_sudo = true
  attributes = false
  devices = [ "/dev/sg3 --device=areca,2/2" ]
  [inputs.smart.tags]
    tag2 = "2"

@douginoz
Author

After leaving things overnight, I was surprised to come back and see that my grafana charts show data for all the drives in my array (24 of them!). It appears that, each time the query is made, the error occurs for every disk except one seemingly random one. So, over time, every drive gets reported, just not all of them on every run.
This can be demonstrated by doing a manual test query:
$ telegraf --input-filter smart --test

and repeating it a few times. Every time it runs it produces the errors for all the drives, except one, and it's always a different one:


root@sophie:/home/moa/scripts# telegraf --input-filter smart --test
2021-01-17T03:44:24Z I! Starting Telegraf 1.17.0
2021-01-17T03:44:24Z I! Using config file: /etc/telegraf/telegraf.conf
> smart_device,device=sg3,host=sophie,volume=000.06 exit_status=2i 1610855064000000000
> smart_device,device=sg3,host=sophie,volume=PassThr.01 exit_status=2i 1610855064000000000
> smart_device,device=sg3,host=sophie,volume=000.09 exit_status=2i 1610855064000000000
> smart_device,device=sg3,host=sophie,volume=000.13 exit_status=2i 1610855064000000000
> smart_device,device=sg3,host=sophie,volume=000.15 exit_status=2i 1610855064000000000
...   [same for 23 out of 24 drives]
> smart_device,capacity=10000831348736,device=sg3,enabled=Enabled,host=sophie,model=HGST\ HDN721010ALE604,serial_no=1SbJ2arZ,volume=000.02,wwn=5000cca26be37f54 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=32i,udma_crc_errors=0i 1610855064000000000

and, repeated, shows a different drive's results:

root@sophie:/home/moa/scripts# telegraf --input-filter smart --test
2021-01-17T03:44:24Z I! Starting Telegraf 1.17.0
2021-01-17T03:44:24Z I! Using config file: /etc/telegraf/telegraf.conf
> smart_device,device=sg3,host=sophie,volume=000.06 exit_status=2i 1610855064000000000
> smart_device,device=sg3,host=sophie,volume=PassThr.01 exit_status=2i 1610855064000000000
> smart_device,device=sg3,host=sophie,volume=000.09 exit_status=2i 1610855064000000000
...
> smart_device,capacity=10000831348736,device=sg3,enabled=Enabled,host=sophie,model=HGST\ HDN721010ALE604,serial_no=1SJJ2SSZ,volume=000.02,wwn=5000cca26be37f54 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=32i,udma_crc_errors=0i 1610855064000000000

So, while not operating correctly, at least I can get some data.

Here's an extract from the [[inputs.smart]] sections. I'm only showing the first 2 drives, but there are 24 in total:

[[inputs.smart]]
  path_smartctl = "/usr/sbin/smartctl"
  path_nvme = "/usr/sbin/nvme"
  enable_extensions = ["auto-on"]
  use_sudo = true
  attributes = true
  devices = [ "/dev/sg3 --device=areca,1/2" ]
  [inputs.smart.tags]
    volume = "000.01"

[[inputs.smart]]
  path_smartctl = "/usr/sbin/smartctl"
  path_nvme = "/usr/sbin/nvme"
  enable_extensions = ["auto-on"]
  use_sudo = true
  attributes = true
  devices = [ "/dev/sg3 --device=areca,2/2" ]
  [inputs.smart.tags]
    volume = "000.02"

@p-zak
Collaborator

p-zak commented Jan 18, 2021

@douginoz
Can you try putting your devices into Telegraf's config using the format suggested by smartmontools?

For example:

devices = [ "-d areca,2/2 /dev/sg3", "-d areca,3/2 /dev/sg3" ]

@douginoz
Author

Same result:
telegraf.conf:
devices = [ "-d areca,1/2 /dev/sg3", "-d areca,2/2 /dev/sg3", "-d areca,3/2 /dev/sg3", "-d areca,4/2 /dev/sg3", "-d areca,5/2 /dev/sg3", "-d areca,6/2 /dev/sg3" ]

root@sophie:/etc/telegraf# telegraf --input-filter smart --test
2021-01-19T03:47:30Z I! Starting Telegraf 1.17.0
2021-01-19T03:47:30Z I! Using config file: /etc/telegraf/telegraf.conf
> smart_device,device=-d,host=sophie exit_status=2i 1611028051000000000
> smart_device,device=-d,host=sophie exit_status=2i 1611028051000000000
> smart_device,device=-d,host=sophie exit_status=2i 1611028051000000000
> smart_device,device=-d,host=sophie exit_status=2i 1611028051000000000
> smart_device,device=-d,host=sophie exit_status=2i 1611028051000000000
> smart_attribute,capacity=10000831348736,device=-d,enabled=Enabled,fail=-,flags=PO-R--,host=sophie,id=1,model=HGST\ HDN721010ALE604,name=Raw_Read_Error_Rate,serial_no=1SH3TMEZ,wwn=5000cca26bcfd110 exit_status=0i,raw_value=0i,threshold=16i,value=100i,worst=100i 1611028051000000000

My best guess right now is that the code considers "device=/dev/sda" different from "device=/dev/sdb", but doesn't consider "/dev/sg3 -d areca,2/2" different than "/dev/sg3 -d areca,3/2" and gets some sort of collision.
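A side observation on the output above: the tag changed from device=sg3 to device=-d once the entries started with "-d". That would be consistent with the device tag being derived from the first space-separated token of each entry. The sketch below illustrates that assumption only; deviceTag is a hypothetical name and this is not claimed to be the plugin's actual code.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// deviceTag (hypothetical) derives a tag from a devices= entry by taking the
// first space-separated token and keeping only its base name.
// This is an assumption that happens to match the tags seen in this thread.
func deviceTag(entry string) string {
	first := strings.Split(entry, " ")[0]
	return path.Base(first)
}

func main() {
	entries := []string{
		"/dev/sg3 -d areca,2/2",
		"/dev/sg3 -d areca,3/2",
		"-d areca,2/2 /dev/sg3",
	}
	for _, e := range entries {
		fmt.Printf("%q -> device=%s\n", e, deviceTag(e))
	}
}
```

Under this assumption, every Areca entry ends up with the same device tag regardless of its areca,N/2 suffix, so the tag alone cannot tell the drives apart. The volume tags used in the per-drive sections earlier in the thread work around that.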

@KubaTrojan
Contributor

Hello @douginoz!

We suspect that your issue with the Areca RAID may be on the driver's side.
Currently, the SMART plugin gathers data from all specified devices simultaneously, because every call to the smartctl tool is made in a separate goroutine for better performance. That behavior can cause a problem if parallel reads of the drives behind the Areca RAID controller are not permitted for some reason. The failed workaround you mentioned, with two instances of the plugin each pointing at a different Areca drive, supports this hypothesis.

To check if the above is true, I kindly ask you to change one line in your local telegraf smart plugin code located in your_path_to_telegraf/telegraf/plugins/inputs/smart/smart.go file.

In lines 505 - 507 the plugin iterates over the devices and gathers each one in its own goroutine. You just need to remove the "go" prefix.

for _, device := range devices {
	go gatherDisk(acc, m.Timeout, m.UseSudo, m.Attributes, m.PathSmartctl, m.Nocheck, device, &wg)
}

So change line 506 from:

go gatherDisk(acc, m.Timeout, m.UseSudo, m.Attributes, m.PathSmartctl, m.Nocheck, device, &wg)

To:

gatherDisk(acc, m.Timeout, m.UseSudo, m.Attributes, m.PathSmartctl, m.Nocheck, device, &wg)

Then build your local telegraf with the make command, run your configuration with the new telegraf binary, and check whether the Areca problem still exists.

If this approach works, a configuration option controlling concurrency should perhaps be added.

@douginoz
Author

douginoz commented Feb 4, 2021

Thanks for this.
[edit: including my process for others to follow in future]

I'm struggling to follow your instructions though. I think you're assuming that I normally run telegraf etc. from my own compiled source code. I don't. I simply downloaded the ubuntu telegraf and influx etc. from the website and installed them as per the website instructions.
I've now found the git source, did a 'git clone' into /opt/telegraf, and changed the line in smart.go. I needed to install Go before the 'make' command would run ('apt install golang-go'). But the 'make' command then fails:

go build -ldflags " -X main.commit=3b8df55b -X main.branch=master -X main.goos=linux -X main.goarch=amd64" ./cmd/telegraf
build github.com/influxdata/telegraf/cmd/telegraf: cannot load hash/maphash: malformed module path "hash/maphash": missing dot in first path element
make[1]: *** [Makefile:90: telegraf] Error 1
make: *** [Makefile:67: all] Error 2

I removed the installed version of Go (# apt remove golang-go;rm -R /usr/local/go), then manually installed the latest one (Go 1.15), which works:

root@sophie:/opt/telegraf# make
go mod download
go build -ldflags " -X main.commit=3b8df55b -X main.branch=master -X main.goos=linux -X main.goarch=amd64" ./cmd/telegraf
root@sophie:/opt/telegraf#

@douginoz
Author

douginoz commented Feb 5, 2021

I've made the change from
go gatherDisk(acc, m.Timeout, m.UseSudo, m.Attributes, m.PathSmartctl, m.Nocheck, device, &wg)
to
gatherDisk(acc, m.Timeout, m.UseSudo, m.Attributes, m.PathSmartctl, m.Nocheck, device, &wg)

And it appears to be working. Doing a test run
/opt/telegraf/telegraf --config=/etc/telegraf/telegraf.conf --input-filter smart --test

now produces a LOT of data, for each drive in the array.

I assume this change has a performance penalty but it's one that I'm happy to put up with!
