Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Power Measurement State to DCMI collector #160

Merged
merged 1 commit into from
Aug 30, 2023

Conversation

k0ste
Copy link
Contributor

@k0ste k0ste commented Jun 12, 2023

Added DCMI metric to definitely determine avail of Power Measurement feature

In some cases, for example with cheap Power Supply Units power consumption may be not available at all, to control this added ipmi_dcmi_power_measurement_state metric. The value is 1 for power measure is available on, and 0 otherwise.

The main point for this changes is a try to knowledge: the measurements isn't accessed due PSU broken (FW/cable/etc issues) or not avail at all on system level (and system know about that)

The host with "normal PSU":

root@host# curl -Ss "http://127.0.0.1:9291/ipmi?module=default&target=10.249.0.9" | grep -i dcmi
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts.
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 353
# HELP ipmi_dcmi_power_measurement_state Availability of Power Measurement (0=false, 1=true).
# TYPE ipmi_dcmi_power_measurement_state gauge
ipmi_dcmi_power_measurement_state 1
ipmi_up{collector="dcmi"} 1

The host with "strange PSU's"

root@host# curl -Ss "http://127.0.0.1:9291/ipmi?module=default&target=10.249.0.6" | grep -i dcmi
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts.
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 0
# HELP ipmi_dcmi_power_measurement_state Availability of Power Measurement (0=false, 1=true).
# TYPE ipmi_dcmi_power_measurement_state gauge
ipmi_dcmi_power_measurement_state 0
ipmi_up{collector="dcmi"} 1

@k0ste
Copy link
Contributor Author

k0ste commented Jun 24, 2023

Ping @bitfehler

@bitfehler
Copy link
Contributor

Hi @k0ste, sorry and thanks for the ping. Thanks for the patch! In general, this is a welcome addition. One issue though: I'd much prefer the metric to simply not be there if it is unsupported. Could this be done, or is there a reason why the additional metric is needed? For example, is it possible for the ipmi_dcmi_power_measurement_state to change in the lifetime of a PSU? From what I understand it just depends on the hardware and never changes? If so, I'd prefer using the result of the availability check to simply drop the power consumption metric completely if not available.

@k0ste
Copy link
Contributor Author

k0ste commented Jun 29, 2023

Hi, @bitfehler! When I was imagine, how's better to handle this, I'm also thinked to this side, but:

  1. Zero watts is also a metric, i.e. 0 != Nan
  2. This will be breaking changes - some teams may use ipmi_dcmi_power_consumption_watts == 0 as query, and this queries should be rewritten (to unless(), I guess)
  3. If we drop the ipmi_dcmi_power_consumption_watts, when the measurement is zero, this will not help to debug: why the power metrics was absent or lowered:
    • This because the "Intel Cloud Redundancy" is not setup'ed? [by default, Intel servers use power only from one PSU, we use raw ipmi command for turn on "shared mode"
    • This because something bad between PSU and Motherboard? In case, when ipmi_dcmi_power_measurement_state == 1 & ipmi_dcmi_power_consumption_watts == 0
    • This because measurement impossible on system level? In case, when ipmi_dcmi_power_measurement_state == 0 & ipmi_dcmi_power_consumption_watts == 0

Thanks!

@bitfehler
Copy link
Contributor

Zero watts is also a metric, i.e. 0 != Nan

Absolutely!

3. If we drop the ipmi_dcmi_power_consumption_watts, when the measurement is zero

I was actually thinking more like this: do perform the test for active/not available that you added, if "active" output metric regardless of value, if not available drop metric. Shouldn't that do?

@k0ste
Copy link
Contributor Author

k0ste commented Jun 30, 2023

I was actually thinking more like this: do perform the test for active/not available that you added, if "active" output metric regardless of value, if not available drop metric. Shouldn't that do?

The 3rd point ☝️ exactly about this idea 🌚

@bitfehler
Copy link
Contributor

I am not sure I understand. Your 3rd point says

If we drop the ipmi_dcmi_power_consumption_watts, when the measurement is zero

but my suggestion is to only drop it if the new test you added in this PR says not available (i.e. when ipmi_dcmi_power_measurement_state would be 0).

As I understand it, with your current patch, the decision table would be:

ipmi_dcmi_power_consumption_watts ipmi_dcmi_power_measurement_state Deduction
0 0 BMC/PSU don't support it
0 1 something is amiss, investigation needed
>0 0 impossible
>0 1 metric as intended

Or am I missing something?

My suggestion would simply be to translate this into the following decision table:

ipmi_dcmi_power_consumption_watts Deduction
N/A BMC/PSU don't support it
0 something is amiss, investigation needed
> 0 metric as intended

Note the absence of an "impossible" state, hence I think it would make for a better solution. Sorry if I am misunderstanding s/t, maybe you can elaborate on your third point?

@k0ste
Copy link
Contributor Author

k0ste commented Jul 14, 2023

Note the absence of an "impossible" state, hence I think it would make for a better solution. Sorry if I am misunderstanding s/t, maybe you can elaborate on your third point?

You understanding is totally correctly. (But) If we translate "my table" to "your table" - this will be breaking changes in current how ipmi_exporter performed, where ipmi_dcmi_power_consumption_watts is always present

@k0ste
Copy link
Contributor Author

k0ste commented Jul 15, 2023

If the project okay for small breaking changes, I can prepare the release 1.7 with notes to CHANGELOG 🙂

@bitfehler
Copy link
Contributor

@k0ste a breaking change is totally fine. can you update the patch accordingly? thanks a lot!

@k0ste
Copy link
Contributor Author

k0ste commented Aug 10, 2023

@bitfehler updated! The checks:

Host with Power Measurement

root@host# curl -Ss "http://localhost:9291/ipmi?module=default&target=10.249.0.9" | grep -i power
# HELP ipmi_chassis_power_state Current power state (1=on, 0=off).
# TYPE ipmi_chassis_power_state gauge
ipmi_chassis_power_state 1
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts (no value if Power Measurement is not avail)
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 372
ipmi_sensor_state{id="4024",name="PS1 Status",type="Power Supply"} 0
ipmi_sensor_state{id="4091",name="PS2 Status",type="Power Supply"} 0
ipmi_sensor_value{id="4024",name="PS1 Status",type="Power Supply"} NaN
ipmi_sensor_value{id="4091",name="PS2 Status",type="Power Supply"} NaN

Host without Power Measurement

root@host# curl -Ss "http://localhost:9291/ipmi?module=default&target=10.249.0.6" | grep -i power
# HELP ipmi_chassis_power_state Current power state (1=on, 0=off).
# TYPE ipmi_chassis_power_state gauge
ipmi_chassis_power_state 1
ipmi_sensor_state{id="4158",name="PS1 Status",type="Power Supply"} 0
ipmi_sensor_state{id="4225",name="PS2 Status",type="Power Supply"} 0
ipmi_sensor_value{id="4158",name="PS1 Status",type="Power Supply"} NaN
ipmi_sensor_value{id="4225",name="PS2 Status",type="Power Supply"} NaN

@k0ste
Copy link
Contributor Author

k0ste commented Aug 22, 2023

ping @bitfehler

Copy link
Contributor

@bitfehler bitfehler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! Some details to iron out, but should be easy!

collector_dcmi.go Outdated Show resolved Hide resolved
collector_dcmi.go Outdated Show resolved Hide resolved
docs/metrics.md Outdated Show resolved Hide resolved
freeipmi/freeipmi.go Outdated Show resolved Hide resolved
Measurement feature is not present. Before this change - the value was zero

Signed-off-by: Konstantin Shalygin <k0ste@k0ste.ru>
@k0ste
Copy link
Contributor Author

k0ste commented Aug 30, 2023

Changes maden 🙏

Copy link
Contributor

@bitfehler bitfehler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks a bunch!

@bitfehler bitfehler merged commit 34ff637 into prometheus-community:master Aug 30, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants