mixin: Fix cpu usage graph #3109

Open
wants to merge 1 commit into
base: master
7 changes: 1 addition & 6 deletions docs/node-mixin/lib/prom-mixin.libsonnet
@@ -66,17 +66,12 @@ local table = grafana70.panel.table;
datasource='$datasource',
span=6,
format='percentunit',
max=1,
min=0,
stack=true,
)
.addTarget(prometheus.target(
|||
(
(1 - sum without (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])))
/ ignoring(cpu) group_left
count without (cpu, mode) (node_cpu_seconds_total{%(nodeExporterSelector)s, mode="idle", instance="$instance", %(clusterLabel)s="$cluster"})
)
(1 - sum without (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])))
Member

@SuperQ SuperQ Sep 3, 2024

I'm pretty sure we don't want to include iowait and steal here.

If we want this to be stacked CPU utilization, we probably want this:

Suggested change
(1 - sum without (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])))
clamp(
avg without (mode) (
1-rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode="idle", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])
),
0,
1
)

Member

I vaguely remember a discussion about steal, but I don't remember the details. (It feels like I included steal for a reason.)

Why would you also toss iowait? Isn't the whole idea here to find out if you don't utilize your CPUs for whatever reason, including being stuck waiting for IO?

Member

@SuperQ SuperQ Sep 3, 2024

iowait is not CPU time. It's accounted for in CPU metrics, but it's actually phantom time when nothing is done. You can have both 100% idle and 100% iowait, leading to -100% CPU use in this calculation.
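
To make the arithmetic concrete, here is a minimal Python sketch of the two formulas for a single CPU. The input rates are hypothetical numbers chosen to match the scenario described above, not measurements, and the function names `usage_original` and `usage_idle_only` are illustrative only.

```python
# Hypothetical per-second rates for one CPU, expressed as fractions of a second.
# These inputs are illustrative; they are not taken from a real node.

def usage_original(rates):
    """1 minus the summed idle, iowait and steal rates (the expression in the diff)."""
    return 1 - sum(rates.get(mode, 0.0) for mode in ("idle", "iowait", "steal"))

def usage_idle_only(rates):
    """1 minus the idle rate, clamped to [0, 1] (the shape of the suggested change)."""
    return min(1.0, max(0.0, 1 - rates.get("idle", 0.0)))

case = {"idle": 1.0, "iowait": 1.0, "steal": 0.0}
print(usage_original(case))   # -1.0
print(usage_idle_only(case))  # 0.0
```

With these inputs the first formula reports -100% usage, while the idle-only formula stays within the 0 to 1 range that the panel's `percentunit` format expects.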

I think this was an attempt to do what the kubernetes-mixin is doing, but it has the regexp inverted.

See: kubernetes-mixin/rules/node.libsonnet

Member

The code at https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/sched/cputime.c?id=HEAD#n222 looks more like the wait time is added either to idle or to iowait. But I'm just shooting in the dark here. Maybe that's not the relevant code, or the value gets changed later on its way to node_exporter.
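
That reading can be sketched as a toy model in Python. This is a simplification under the assumption that the linked function is the relevant code path; `account_idle_tick`, the `cpustat` dict, and the integer `nr_iowait` argument are illustrative stand-ins, not kernel API.

```python
def account_idle_tick(cpustat, tick, nr_iowait):
    """Credit one idle tick to exactly one bucket (simplified model)."""
    # If any task on this CPU is blocked on I/O, the tick counts as iowait;
    # otherwise it counts as idle. It is never credited to both.
    bucket = "iowait" if nr_iowait > 0 else "idle"
    cpustat[bucket] += tick
    return cpustat

cpustat = {"idle": 0, "iowait": 0}
# Ten idle ticks; tasks are blocked on I/O during four of them.
for pending_io in [0, 0, 1, 1, 0, 2, 0, 0, 1, 0]:
    account_idle_tick(cpustat, 1, pending_io)

total = cpustat["idle"] + cpustat["iowait"]
print(cpustat, total)  # {'idle': 6, 'iowait': 4} 10
```

In this model idle and iowait together never exceed the elapsed idle ticks, so both could not reach 100% over the same interval. Whether that also holds for the values node_exporter exposes is the open question in this thread.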

||| % config,
legendFormat='{{cpu}}',
intervalFactor=5,