add prometheus rule for OOMKilled pods #800
Conversation
Signed-off-by: DavidSpek <vanderspek.david@gmail.com>
/cc @paulfantom @arajkumar @povilasv since you reviewed and approved a similar PR recently.
Wouldn't this already be covered by this alert? https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26
reason="CrashLoopBackOff" will not cover reason="OOMKilled"; they are two different scenarios. A pod could be in CrashLoopBackOff for many reasons, e.g. failing health probes, whereas OOMKilled is specific to an otherwise healthy pod being killed when it reaches its memory limit.
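For context, the two conditions come from different kube-state-metrics series, so one does not imply the other. A rough sketch of the distinction (the exact expression behind the linked alert may differ):

```promql
# Crash looping is surfaced through the container *waiting* reason:
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1

# An OOMKill only shows up in the *last terminated* reason, which the
# expression above never inspects:
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```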
This looks good @davidspek. We have a need for this alert as well, and it would be great to have it bundled rather than having to add it as a customization.
Co-authored-by: Retna <76952128+Retna-Gjensidige@users.noreply.github.com>
Co-authored-by: Edward Grönroos <edward@gronroos.se>
But a Pod could also be crash looping due to too many OOMKills, so it does cover it, no?
It can be difficult to detect after the fact that a pod crashed because it was OOMKilled. I think it warrants a dedicated alert so you can more easily take appropriate action (like increasing the replicas or the memory limit). Bunching it up with other crash loops, which can happen for any number of reasons, makes it difficult to find pods whose memory limits are set too low.
Yes, as a whole, a crash loop will "cover" OOMKills, but you have to ask yourself what the point of having alerts is. You want the receiving party to quickly identify the issue and resolve it, and the more specific the alerts are, the better. As mentioned before, a crash loop is generic and can have many causes; there is no mistaking an OOMKill alert.
Reread this: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit. Two points from the above document that I don't like about this alert:
Thanks for the link, it was a good read 👍. On "When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier": speaking from my own experience, I am in an oncall rotation, and this alert is set as warning, so it will not trigger a page/SMS/email 😅. On "Symptoms are a better way to capture more problems more comprehensively and robustly with less effort": personally, I consider OOMKilled a symptom, just as a crash loop is another symptom.
The main reason for this is also that an OOMKilled pod can be seen as an event you wouldn't want to go unnoticed, even though it might not cause a crash loop. Even without a crash loop happening, you probably still want to know an OOMKill occurred. The alert might not be perfectly set up for this, but I'm open to discussing what it should look like.
This PR has been automatically marked as stale because it has not had recent activity. The next time this stale check runs, the stale label will be removed if there is new activity. Thank you for your contributions!
Signed-off-by: DavidSpek vanderspek.david@gmail.com
This PR adds a Prometheus rule to catch pods that have been OOMKilled.
Happy to improve or change the expression if anybody has suggestions.
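As a sketch of the kind of expression such a rule could use (the 10-minute window and the exact label matching below are assumptions, not necessarily what this PR ships):

```promql
# Fire when a container restarted recently AND kube-state-metrics reports
# that its last termination reason was OOMKilled.
(
  increase(kube_pod_container_status_restarts_total[10m]) > 0
)
and on (namespace, pod, container)
(
  kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
)
```

In a rule file this would sit in the `expr` field, with `severity: warning` as discussed above so it notifies without paging.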