-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redfish plugin for hardware monitoring #7082
Conversation
Signed-off-by: admin account <vzwadmin@tpa_poc_influxdb_1-tampa.vici.verizon.com>
Signed-off-by: admin account <vzwadmin@tpa_poc_influxdb_1-tampa.vici.verizon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
I've included a list of recommendations, as well we may want to remove the .dell_power.json.swp file as it doesn't seem to be relevant to the tests.
plugins/inputs/redfish/redfish.go
Outdated
func getThermal(host, username, password, id string) error { | ||
r := fmt.Sprint(host, "/redfish/v1/Chassis/", id, "/Thermal") | ||
client := &http.Client{} | ||
http.DefaultTransport.(*http.Transport).TLSClientConfig = &tls.Config{InsecureSkipVerify: true} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple issues here:
- client should be set globally and reused; it's thread safe (goroutine safe) by default.
- We probably don't want to use InsecureSkipVerify by default, as it doesn't validate anything about where it's connecting to. If possible we want to default to something a little more secure and give users the option to set this, with similar scary naming.
- Rather than modifying the TLS config on the default transport, you want to set TLS config on the client's transport that you're creating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably made this a little less secure by turning TLS off by default. Do we know if Redfish typically has https available by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the docs I read and by some practical knowledge I understood that redfish only supports https by default and even python redfish library only supports https protocol, I have changed the code accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The produced data model doesn't align very closely with what I see in the Redfish specification, I'd like to mirror the types as I think this will make it easier to extend to new types and understand for users familiar with Redfish. The most straightforward way to me seems to be to one metric for each supported resource type. For example we could have one type for Thermal, one for Power.
plugins/inputs/redfish/README.md
Outdated
|
||
### Metrics for Dell Servers | ||
|
||
- cpu_temperatures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W should name these redfish_thermal
, redfish_power
which will fit in well with the output of other plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The produced data model doesn't align very closely with what I see in the Redfish specification, I'd like to mirror the types as I think this will make it easier to extend to new types and understand for users familiar with Redfish. The most straightforward way to me seems to be to one metric for each supported resource type. For example we could have one type for Thermal, one for Power.
We have two type data coming from each resource(thermal,power) , if we send all the data from each resource to a single measurement it would be a problem as in the thermal resource the number of fans and CPUs temperature varies, for example there might be 50 CPUs but only 7 Fans, Similar for Power resource as well. It would be appropriate to store them in different measurements like redfish_thermal_temperatures, redfish_thermal_fans,redfish_power_powersupplies, redfish_power_voltages. What would you think of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're pretty much there. A few small comments.
plugins/inputs/redfish/redfish.go
Outdated
State string | ||
Health string | ||
} | ||
type speed struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct casing isn't consistent. Use Capital first letter for public structs, lowercase for private.
plugins/inputs/redfish/redfish.go
Outdated
} | ||
|
||
} else { | ||
return fmt.Errorf("received status code %d (%s), expected 200", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stops processing as soon as it finds a url that doesn't return a 200. This is probably okay, but you might want to return whatever data you can manage, instead of no data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be appropriate to stop the process when it does not return 200, in some scenarios for example user might give wrong credentials or ID, the process will stop at 1st API call and also if there is a proxy or any connectivity problem the process will be stopped at 1st API call itself. there might be rare situations where we receive data for one API and not for other APIs.
plugins/inputs/redfish/redfish.go
Outdated
Status CpuStatus `json:"Status"` | ||
} | ||
type Payload struct { | ||
Temperatures []*Cpu `json:",omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to use slices of structs instead of slices of pointers to structs. This way you don't need to worry about the nil pointer case
plugins/inputs/redfish/redfish.go
Outdated
} | ||
if (len(r.ClientConfig.TLSCA) == 0) || (len(r.ClientConfig.TLSCert) == 0 && len(r.ClientConfig.TLSKey) == 0) { | ||
var insecuretls tls.ClientConfig | ||
insecuretls.InsecureSkipVerify = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the tls.ClientConfig
type to provide the insecure option and feed whatever configuration it provides into the http.Transport. Change the host
option to be a base url so that it will include the scheme:
address = "https://127.0.0.1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it as suggested by you but by default redfish only supports HTTPS protocol, if we just feed tls.ClientConfig config to http.Transport , users might not provide TLS configuration in conf file and errors might occur (errors like cannot validate certificate, etc might occur.)
plugins/inputs/redfish/redfish.go
Outdated
if len(r.Host) == 0 || len(r.BasicAuthUsername) == 0 || len(r.BasicAuthPassword) == 0 { | ||
return fmt.Errorf("Did not provide IP or username and password") | ||
} | ||
if len(r.Id) == 0 { | ||
return fmt.Errorf("Did not provide all the ID of the resource") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for Host and ID in the Init() function. Don't check for the username/password, but set it on the request only if one of the options is non-empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to check Host, username/password and ID in the Init() function(included username/password check as well because for redfish APIs the authorisation is mandatory)
plugins/inputs/redfish/redfish.go
Outdated
resp, err := r.client.Do(req) | ||
if err != nil { | ||
return err | ||
} | ||
if resp.StatusCode == 200 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to close the response body anytime Do returns without error. This will ensure that connections are properly reused and not "leaked". The easiest way to do this is to use defer, but to do so you will need to split the request code into it's own function, since defer code runs when the function ends.
plugins/inputs/redfish/redfish.go
Outdated
if resp.StatusCode == 200 { | ||
body, err := ioutil.ReadAll(resp.Body) | ||
if err != nil { | ||
return fmt.Errorf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return err
plugins/inputs/redfish/redfish.go
Outdated
tags["room"] = payload.Location.PostalAddress.Room | ||
tags["rack"] = payload.Location.Placement.Rack | ||
tags["row"] = payload.Location.Placement.Row | ||
tags["severity"] = Severity(j.UpperThresholdCritical, j.UpperThresholdFatal, j.ReadingCelsius) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove the severity tag, we avoid adding computed values since they are best calculated at query time or optionally in a processor. Especially since it seems that this is a new concept, not part of the redfish specification.
plugins/inputs/redfish/README.md
Outdated
|
||
### Example Output For Dell | ||
``` | ||
redfish_thermal_temperatures,source=test-hostname,name=CPU1\ Temp,source_ip=http://190.0.0.1,host=test-telegraf,datacenter="Tampa",health="OK",rack="12",room="tbc",row="3",state="Enabled",severity="OK" temperature=41,upper_threshold_critical=59,upper_threshold_fatal=64 1582114112000000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we talked about this earlier, but I'm having second thoughts about the source_ip
tag. It seems that we don't necessarily have an IP Address, it could just as easily be a hostname. The source_ip
also doesn't necessarily correspond to the source
tag, so I feel the name doesn't fit well.
How about we call it address
instead?
Let's also make sure it is only the hostname, not a URL.
@@ -0,0 +1,329 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename dell_computer_system.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed it as dell_systems.json as it sounds more appropriate.
@@ -0,0 +1,187 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename dell_chassis
@@ -0,0 +1,319 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename hp_computer_system.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed it as hp_systems.json as it sounds more appropriate.
plugins/inputs/redfish/redfish.go
Outdated
|
||
func (r *Redfish) Gather(acc telegraf.Accumulator) error { | ||
var url []string | ||
var payload Payload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsing all the data into a single Payload type makes it more difficult to determine where the various bits of data are sourced from. It also increases the chances of value conflicts across resources and will make it difficult to support multiple Power or Thermal resources per ComputerSystem. Instead of using this shared type we should parse each resource into its own dedicated type.
I suggest we relay out of the Gather flow as follows:
-
Create a
func get(url, type struct) error
that will do requests and parse the json into the provided struct. Having this function will help with the response body closing mentioned in the review. -
In Gather Get the ComputerSystem, Chassis resources. In the future it would be nice to follow the @odata.i links in case a system has multiple Thermal or Power resources, or the ID is not shared for all resources, but it is not required now.
-
Next get the Thermal resource, then I suggest another function that creates the tags and fields using it and the Chassis and ComputerSystem types, and call AddFields.
-
Finally get the Power resource, also in another function create the tags and fields using it and the Chassis and ComputerSystem types, and call AddFields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About parsing data into a single Payload was suggested by Steven the other reviewer, previously the json data was parsed into different structs, What shall we do now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, don't worry about doing this refactor.
plugins/inputs/redfish/README.md
Outdated
password = "test" | ||
|
||
## Resource Id for redfish APIs | ||
id="System.Embedded.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ComputerSystem naming is taken from Redfish Resource Guid, do you have a link to where the ServerSystem is described? Also, don't forget to update this README.
plugins/inputs/redfish/redfish.go
Outdated
} | ||
tlsCfg, err := r.ClientConfig.TLSConfig() | ||
if err != nil { | ||
return fmt.Errorf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return err
, use this pattern throughout the PR
plugins/inputs/redfish/redfish.go
Outdated
|
||
func (r *Redfish) Init() error { | ||
if len(r.Address) == 0 || len(r.BasicAuthUsername) == 0 || len(r.BasicAuthPassword) == 0 { | ||
return fmt.Errorf("Did not provide IP or username and password") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lowercase error message: https://github.com/golang/go/wiki/CodeReviewComments#error-strings
plugins/inputs/redfish/README.md
Outdated
|
||
### Example Output | ||
``` | ||
redfish_thermal_temperatures,source=test-hostname,name=CPU1\ Temp,address=http://190.0.0.1,host=test-telegraf,datacenter="Tampa",health="OK",rack="12",room="tbc",row="3",state="Enabled" reading_celsius=41,upper_threshold_critical=59,upper_threshold_fatal=64 1582114112000000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to update this section
Merged for 1.15.0, excellent work @sarvanikonda |
Hey @sarvanikonda, I'm trying to get the plugin working with this simulator, and I noticed that the HP test examples don't have a file for the Chassis resource. Do these exist, there is a link listed in the System resource to |
Hey @danielnelson from the Chassis resource I am fetching Location details, for HP the /redfish/v1/Chassis/1/ exists but the location details in the resource are not provided by HP redfish, I used the existing the hp_power.json resource file in testdata as input to the /redfish/v1/Chassis/1/ endpoint in redfish_test.go file as the Location data is anyway not available if this is going to confuse people I will add the Chassis resource file in test data and change the redfish_test.go file. |
Got it, yeah let's add this file for completeness when you have some free time. What I found with this simulator is that the ID is not necessarily the same between the System and the Chassis endpoints. I am making a small change to the plugin to start with the System and use it's Chassis link. This requires a Chassis response but for now I can just toss in a placeholder. |
Hi @danielnelson I was also planning to add LowerThresholdCritical and LowerThresholdFatal values for temperatures,fans and voltages, I see that a PR has been raised by you #7722, would you like to add these values in 7722 PR ? or shall I wait to 7722 PR be merged to master and then add the values and raise a PR. |
I added these values to #7722 and also adjusted some field types match what is listed in the schema. This will probably require you to drop some measurements from your database but better now than after the release. |
Hey @danielnelson could you be more specific about dropping some measurements? and thank you very much for adding the lower threshold values, but in the code these values are also added in structs but not added in gather function? |
On the fields that I switched to be a float, you will get "type conflict error" when you insert into InfluxDB. To clear this up you have a few options but the most straightforward way is to delete the old metrics in the
Uh oh, me trying to move to fast and failing badly. |
oh, got it now, when you mentioned to drop measurements I thought you were thinking to delete some measurements in the code. |
This PR consists of redfish APIs for hardware monitoring of DELL and HP servers.
Required for all PRs: