-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Node exporter monitoring mixin #941
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I'm behind the idea, this set of rules is inconsistent and does not follow our best practices in a few areas. I'd suggest a read of https://prometheus.io/docs/practices/rules/
Nice! Can you add a few words/link to the readme how to use them? |
Sure! Still a bit WIP, I need to come back and update lots of things for the latest release. Instructions will be mostly same as the kubernetes-mixin, so I'm thinking of centralising them to prevent problems keeping everything up to date. |
I'd be happy to continue with this PR, if @tomwilkie allows and it somehow is possible for me to push changes. |
I'd really like to see this revived too! @tomwilkie What is the state of this? Can @metalmatze help with this? |
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
I've updated the rules and alerts for v0.16. Additionally I've rebased my commits to sign them off. |
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
@metalmatze what's your state on this? As I have recently worked on (prometheus/prometheus#4474 , I would give this PR a similar treatment (respond to open comments, general grooming, add README, Makefile, etc.). I just would like to avoid the two of us working on it concurrently. So please let me know if it is OK if I take the lock here. And if you have any work on this not yet pushed, now would be a good time to do so. |
Other than the few commits that I've pushed, I did not put more time since then into it. Feel free to go ahead! 😊 |
As you can see from the issues referencing this above, this PR is getting quite a bit of popularity. It might also be more workable to merge this in a not-blatantly-wrong-but-still-WIP state and then let several other people create PRs affecting subsets of the alerts/dashboards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small things I missed last time.
// Full utilisation would be all disks on each node spending an average of | ||
// 1 sec per second doing I/O, normalize by node count for stacked charts | ||
g.queryPanel(||| | ||
instance:node_disk_utilisation:sum_irate / scalar(sum(up{%(nodeExporterSelector)s})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sum of irate is just :irate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's still the sum over all devices, right? Let's discuss in my follow-up PR. This rule name has changed completely anyway.
) | ||
.addPanel( | ||
g.panel('CPU Saturation (Load1)') + | ||
g.queryPanel('instance:node_cpu_saturation_load1:{instance="$instance"}', 'Saturation') + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing an operation, and load is not a useful saturation indication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the name already, and I've added a TODO to the dashboard where this is used, questioning if load average dashboards are useful at all.
Yeah, let's fix the various problems this already has before expanding further. |
To clarify: I'm working on the formal and fundamental problems right now. Will drop a note here when done, and then we can decide if we want to refine things in this PR or better in later, separate PRs. |
Sounds all good to me! Thanks for picking up this PR! 🎉 |
FYI: Since I cannot update Tom's branch, I'll create a new PR based on this one, with all the changes I have made included. Stay tuned… |
New PR with all comments either completely addressed or at least acknowledged with a TODO is here: #1429 . I'll close this . |
This is currently a work in progress.
Takes a bunch of the non-kubernetes-specific alerts and rules from the kubernetes-mixin and combines them with alerts from various places.
This is not supposed to be complete, just a starting point. For instructions on how to use it, best see the kubernetes-mixin README.