Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smart Input Plugin Extra Attributes #6079

Merged
merged 14 commits into from
Jul 12, 2019
Merged

Smart Input Plugin Extra Attributes #6079

merged 14 commits into from
Jul 12, 2019

Conversation

GeorgeMac
Copy link
Contributor

@GeorgeMac GeorgeMac commented Jul 6, 2019

This PR attempts to deliver most of #6041

Currently this PR supports:

  • Media and Data Integrity Errors (count)
  • Error Information Log Entries (count)
  • Available Spare (percentage)
  • Critical Warning (integer bit set)

Outstanding Questions:

  1. Documentation for smart attributes seem to be a little elusive to me. Can anyone point me in the direction of the attributes id's for all the above attributes. Perhaps you know @coccobill1 ?

  2. Critical Warning appears to be a bit flag printed as a hex value. I have struggled to find decent documentation around the meaning of all these attributes. Maybe someone can point me in the best direction of better docs. I have this https://www.intel.com/content/dam/support/us/en/documents/solid-state-drives/Intel_SSD_Smart_Attrib_for_PCIe.pdf for example.

Given we know the definitive meaning (and that there is one) for what each flag means, do we have any suggestions on how to communicate that best to telegraf? cc @danielnelson

What I mean is, for example, would we want to communicate each flags as individual distinct fields. Each field with a boolean value (on or off)?

e.g. The reading 0x01 given the first bit in the byte represents available spare is below threshold (as I have currently taken from the shared intel doc above) would lead to a field like Critical_Warning_Available_Spare_Below_Threshold with a raw_value=1?

  1. I couldn't find any decent examples of the extra fields to build a sensible test from or have a clear idea on the format of the value in the output. So I haven't taken them on yet. These are:
  • Reallocated Sector Count
  • Wear Leveling Count
  • Uncorrectable Errors Count

I could implement them all as just counts and fabricate a test and hope they exist as described here if people are happy with this.

I could be completely missing the point here 😂 so any advice welcomed.

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated. (there currently doesn't seem to exist any documentation at this fidelity)
  • Has appropriate unit tests.

@GeorgeMac GeorgeMac changed the title Gm/smart input extra attrs Smart Input Plugin Extra Attributes Jul 6, 2019
@GeorgeMac
Copy link
Contributor Author

Small update: https://nvmexpress.org/wp-content/uploads/NVM-Express-Management-Interface-1.1-Ratified.pdf

This document seems to correlate well r.e. the meanings of the critical warning flags for nvme.

@coccobill1
Copy link

First off I'd like to say that a lot of paid software don't have support like this even when showing the money. Greatly appreciated guys!

I'm certainly no expert on s.m.a.r.t. data, but from my experience they are quite a wild west, a lot of manufacturer specific stuff and seems impossible to find up to date information on them. On Samsung's page the only document I can find is this, and it's clearly outdated for the NVME drives:

http://downloadcenter.samsung.com/content/UM/201711/20171115115128872/SSD_Application_Note_SMART_final.pdf

That NVME specification is probably as good as it gets for a standardized model, but assuming all vendors stick to it might be optimistic. Smartctl doesn't display the ID's for the attributes on my NVME drive.

Regarding 3. I have no idea whether those attributes are used anymore with NVME drives, but they used to be key attributes to monitor with sata/ide drives, so if they are still valid they would be a great addition.

@GeorgeMac
Copy link
Contributor Author

Hey @coccobill1 👋 Thank you for the kind words. That was my first week at Influx and first thing contributed as an employee. So it meant a lot 🙇

That document you linked it super helpful as well. Cross-referencing all these different docs helps to find some sensible answer.
I had the same issue with my macbook nvme drive. No attribute IDs. Actually looking at the smart input plugin implementation the IDs for nvme are hard-coded rather than gathered from the output of smartctl. So just need to make sure I am hard-coded it correctly.
I will get another opinion from some influxers on what to do with those extra attributes as well and get back to you.

@GeorgeMac
Copy link
Contributor Author

Update: refactored the collection in order to make it more extensible. New matches for SAS and NVME attributes can be quickly added via the sasNvmeAttributes map. This also reduces N regexp's per line down to one for smart attribute extraction.

plugins/inputs/smart/smart.go Outdated Show resolved Hide resolved
.gitignore Outdated Show resolved Hide resolved
plugins/inputs/smart/smart.go Outdated Show resolved Hide resolved
plugins/inputs/smart/smart.go Outdated Show resolved Hide resolved
@danielnelson danielnelson added this to the 1.12.0 milestone Jul 8, 2019
@danielnelson danielnelson added area/smart feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin labels Jul 8, 2019
@GeorgeMac
Copy link
Contributor Author

Update: I added the following critical warning fields with boolean values:

  • Critical_Warning_Spare_Treshold
  • Critical_Warning_Temperature_Above_or_Under_Threshold
  • Critical_Warning_Reliability_Degraded
  • Critical_Warning_Read_Only
  • Critical_Warning_Volative_Memory_Backup_Failed

continue
}
acc.AddFields("smart_attribute", map[string]interface{}{
"raw_value": flags&flag > 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't be add this as a bool since we already have it as an integer, and InfluxDB won't allow us to have mixed types. Since we are attempting to model the NVMe log as smart_attributes, I would just do it like this and document or link to the values in the README (see net_response for an example):

smart_attribute,name=critical_warning raw_value=0i

Often with enum style data we do something like this with a tag + integer field, but I don't think we should do it here since it doesn't fit into the existing model very well:

smart_attribute,name=critical\ warning,warning=reliability\ degraded warning_code=0i

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielnelson ahh I see what you mean! Do you think having it as an integer is a little hard to interpret? Since it represents so many states / combinations of flags? i.e. 8 means just read only, but 9 means read only and available spare threshold exceeded. I suppose no one is screaming for this warning fidelity at the moment, so I am probably just splitting hairs 😂 . They can atleast do something like alert when any critical warning occurs with a simple >0 check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, this isn't an enum it's a bit set, so it wouldn't really work to pair it with a human readable name=foo tag. I also agree, it's pretty difficult to deal with a large list of integers. Still because this is a smart_attribute, I think we should start with just the single field, but read on for more discussion of booleans and bit sets.

One reason we have often favored integer values over booleans is that Grafana/Chronograf aren't very good at displaying booleans outside of tables. UI support is always changing though, so it may be worth looking into what support there is now. Many of the outputs also do not support booleans, and require users to use the enum processor.

I think there are 2 things that a user would be interested in with a bitset: is the item in an error state (non-zero), and what flags are set. Checking for an error state is not always something that makes sense depending on the semantics of the bit set, but I think it often does. If we were going to encode a bit set for InfluxDB, we could do it as multiple fields, one for the bit set encoded an integer, and one for each flag in the bit set:

foo critical_warning=3i,critical_warning_spare_threshold=true,critical_warning_temperature_above_or_under_threshold=true

However, this might be tricky to query, since you would need to grab all bit set fields and check each one to see if it is set. Probably only something you want to do if you are checking for a particular flag. It occurs to me that the set type idea for tags we discussed could help here, imagine:

foo,critical_warning=spare_threshold,critical_warning=temperature_above_or_under_threshold critical_warning=3i

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I like it. I will simplify critical warning to just an int for now then.

@danielnelson danielnelson merged commit 43c16aa into influxdata:master Jul 12, 2019
@GeorgeMac GeorgeMac deleted the gm/smart-input-extra-attrs branch July 15, 2019 09:18
bitcharmer pushed a commit to bitcharmer/telegraf that referenced this pull request Oct 18, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/smart feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants