Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Node exporter monitoring mixin #941

Closed
wants to merge 13 commits into from

Conversation

tomwilkie
Copy link
Member

@tomwilkie tomwilkie commented May 9, 2018

This is currently a work in progress.

Takes a bunch of the non-kubernetes-specific alerts and rules from the kubernetes-mixin and combines them with alerts from various places.

This is not supposed to be complete, just a starting point. For instructions on how to use it, best see the kubernetes-mixin README.

@tomwilkie tomwilkie changed the title [WIPNode exporte [WIP] Node exporter monitoring mixin May 9, 2018
Copy link
Contributor

@brian-brazil brian-brazil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I'm behind the idea, this set of rules is inconsistent and does not follow our best practices in a few areas. I'd suggest a read of https://prometheus.io/docs/practices/rules/

node-mixin/alerts/alerts.libsonnet Show resolved Hide resolved
node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
node-mixin/dashboards/node.libsonnet Outdated Show resolved Hide resolved
node-mixin/dashboards/node.libsonnet Outdated Show resolved Hide resolved
node-mixin/dashboards/node.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Outdated Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
@discordianfish
Copy link
Member

Nice! Can you add a few words/link to the readme how to use them?

@tomwilkie
Copy link
Member Author

Sure! Still a bit WIP, I need to come back and update lots of things for the latest release. Instructions will be mostly same as the kubernetes-mixin, so I'm thinking of centralising them to prevent problems keeping everything up to date.

node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
node-mixin/alerts/alerts.libsonnet Show resolved Hide resolved
node-mixin/config.libsonnet Show resolved Hide resolved
node-mixin/config.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
node-mixin/rules/rules.libsonnet Show resolved Hide resolved
@metalmatze
Copy link
Member

I'd be happy to continue with this PR, if @tomwilkie allows and it somehow is possible for me to push changes. ☺️

@discordianfish
Copy link
Member

I'd really like to see this revived too! @tomwilkie What is the state of this? Can @metalmatze help with this?

tomwilkie and others added 11 commits November 19, 2018 16:46
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
@metalmatze
Copy link
Member

I've updated the rules and alerts for v0.16. Additionally I've rebased my commits to sign them off.

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
@beorn7
Copy link
Member

beorn7 commented Jul 1, 2019

@metalmatze what's your state on this? As I have recently worked on (prometheus/prometheus#4474 , I would give this PR a similar treatment (respond to open comments, general grooming, add README, Makefile, etc.). I just would like to avoid the two of us working on it concurrently. So please let me know if it is OK if I take the lock here. And if you have any work on this not yet pushed, now would be a good time to do so.
Thanks!

@metalmatze
Copy link
Member

Other than the few commits that I've pushed, I did not put more time since then into it. Feel free to go ahead! 😊

@beorn7
Copy link
Member

beorn7 commented Jul 2, 2019

As you can see from the issues referencing this above, this PR is getting quite a bit of popularity.
My plan is to first clean up formal concerns in here and then take comments about more alerts to add and how to improve alerts herein.

It might also be more workable to merge this in a not-blatantly-wrong-but-still-WIP state and then let several other people create PRs affecting subsets of the alerts/dashboards.

Copy link
Contributor

@brian-brazil brian-brazil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few small things I missed last time.

// Full utilisation would be all disks on each node spending an average of
// 1 sec per second doing I/O, normalize by node count for stacked charts
g.queryPanel(|||
instance:node_disk_utilisation:sum_irate / scalar(sum(up{%(nodeExporterSelector)s}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sum of irate is just :irate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's still the sum over all devices, right? Let's discuss in my follow-up PR. This rule name has changed completely anyway.

)
.addPanel(
g.panel('CPU Saturation (Load1)') +
g.queryPanel('instance:node_cpu_saturation_load1:{instance="$instance"}', 'Saturation') +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing an operation, and load is not a useful saturation indication.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the name already, and I've added a TODO to the dashboard where this is used, questioning if load average dashboards are useful at all.

@brian-brazil
Copy link
Contributor

Yeah, let's fix the various problems this already has before expanding further.

@beorn7
Copy link
Member

beorn7 commented Jul 2, 2019

To clarify: I'm working on the formal and fundamental problems right now. Will drop a note here when done, and then we can decide if we want to refine things in this PR or better in later, separate PRs.

@metalmatze
Copy link
Member

Sounds all good to me! Thanks for picking up this PR! 🎉

@beorn7
Copy link
Member

beorn7 commented Jul 16, 2019

FYI: Since I cannot update Tom's branch, I'll create a new PR based on this one, with all the changes I have made included. Stay tuned…

@beorn7
Copy link
Member

beorn7 commented Jul 16, 2019

New PR with all comments either completely addressed or at least acknowledged with a TODO is here: #1429 . I'll close this .

@beorn7 beorn7 closed this Jul 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants