Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hostmetrics] change cpu.load.average metric units #29915

Merged
merged 1 commit into from
Jan 5, 2024

Conversation

povilasv
Copy link
Contributor

Description:

The system.cpu.load_average.5m uses 1 as unit, which is reserved for utilization metrics - https://opentelemetry.io/docs/specs/semconv/general/metrics/#instrument-units

CPU Load Average does not track utilization and can go above

Link to tracking Issue:

#29914

Testing:

Documentation:

  • generated

@povilasv povilasv marked this pull request as ready for review December 15, 2023 13:15
@povilasv povilasv requested a review from dmitryax as a code owner December 15, 2023 13:15
@povilasv povilasv requested a review from a team December 15, 2023 13:15
@dmitryax
Copy link
Member

Given the definition

CPU load is the number of processes which are being executed by CPU or waiting to be executed by CPU. So CPU load average is the average number of processes being or waiting executed over past 1, 5 and 15 minutes.

Should the unit be {cpu}? @braydonk @mx-psi WDYT?

@bogdandrutu
Copy link
Member

"default unit 1" it is default for all the metrics not only for utilization. See https://github.com/open-telemetry/semantic-conventions/blob/b2e2c64b3f16b3d70b867cbaefc01936f23c1f4c/docs/general/metrics.md?plain=1#L195 and can read also https://unitsofmeasure.org/ucum.

@bogdandrutu
Copy link
Member

@dmitryax we can have it as "{cpu}" but we should not have as empty string.

@braydonk
Copy link
Contributor

braydonk commented Dec 15, 2023

Should the unit be {cpu}?

My instinct is probably not? {cpu} is used elsewhere to represent a number of CPUs, which doesn't feel appropriate for load average.

1 actually would make sense as a unit in a system with a single CPU (a good explanation is in the uptime manpage). The reason this can exceed 1 is because it can go up to 1 for every CPU on the system. This has similarities to a utilization metric, but it's not exactly the same. I wonder if there are other scenarios like this, it feels unique and nothing in the linked semantic conventions docs are standing out to me as a rule to follow here.

I am not sure how to handle this scenario as this is a strange metric. If I were to suggest something now, I would say it should be its own unit {load_avg}. I'm curious to hear other suggestions on this.

@povilasv
Copy link
Contributor Author

povilasv commented Dec 18, 2023

Linux Load averages don't measure CPU, I've seen load averages in 1000s and stuff like that, when there a lot of processes in D state, due to waiting for i/o or release of kernel lock.

Good article about this is -> https://tanelpoder.com/posts/high-system-load-low-cpu-utilization-on-linux/

Also found a good post that explain what actually Linux Load Averages measure -> https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

Linux load averages are "system load averages" that show the running thread (task) demand on the system as an average number of running plus waiting threads. This measures demand, which can be greater than what the system is currently processing. Most tools show three averages, for 1, 5, and 15 minutes:

So the unit here seems to be threads?

For Windows, we seem to compute the average using Processor Queue Length, which is:

The Processor Queue Length is the number of threads that are ready to run on the processor but cannot because of another active thread.

Unit also seems to be threads.

So my suggestion for unit is {thread}.

WDYT?

@mx-psi
Copy link
Member

mx-psi commented Dec 18, 2023

+1 to doing {thread} after reading @povilasv's links

@braydonk
Copy link
Contributor

braydonk commented Dec 18, 2023

Thanks for the additional documentation, it seems I misunderstood what the load average really meant.

{thread} is already a unit in process metrics for the process.threads metric.* In that metric, the unit refers to an actual exact number of threads, whereas this metric refers to an average. So similarly to {cpu} as a unit, using {thread} here would be a clash with the other usage of that as a unit.

* The unit there is {threads} plural, which I'm pretty sure it's not supposed to be. That will probably be changed in semantic conventions updates along with the name of the metric to process.thread.count.

@povilasv
Copy link
Contributor Author

povilasv commented Dec 19, 2023

I don't understand the issue with clashing units? Why it's an issue?

Taking simple example if one metric gives my count of current threads running, the unit is {thread}, if another metrics gives me average amount of threads in 5 minutes window the unit is still {thread}?

Ref: https://stats.stackexchange.com/questions/187918/does-average-have-units

@braydonk
Copy link
Contributor

Maybe it's fine then. I can't see any other example in semantic conventions of this case ever occurring (using the same unit for a specific number or average) but if that's not against the rules then {thread} is fine as a unit.

@povilasv
Copy link
Contributor Author

FYI I've changed the unit to {thread}, let me know what you think @dmitryax :)

Copy link
Contributor

github-actions bot commented Jan 5, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 5, 2024
@dmitryax dmitryax removed the Stale label Jan 5, 2024
Copy link
Member

@dmitryax dmitryax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{thread} sounds good to me

@dmitryax dmitryax merged commit 72dd3f8 into open-telemetry:main Jan 5, 2024
89 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 5, 2024
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this pull request Jan 10, 2024
)

**Description:** 
The system.cpu.load_average.5m uses 1 as unit, which is reserved for
utilization metrics -
https://opentelemetry.io/docs/specs/semconv/general/metrics/#instrument-units

CPU Load Average does not track utilization and can go above 

**Link to tracking Issue:** open-telemetry#29914
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants