
Conversation

@davidspek davidspek commented Oct 20, 2022

Signed-off-by: DavidSpek <vanderspek.david@gmail.com>

This PR adds a Prometheus rule to catch pods that have been OOMKilled.
Happy to improve or change the expression if anybody has suggestions.

Signed-off-by: DavidSpek <vanderspek.david@gmail.com>
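
For context, here is a minimal sketch of what such a rule could look like as a plain Prometheus rule file. This is not necessarily the exact expression in this PR: the alert name KubeContainerOOMKilled, the 10-minute window, and the pairing of the kube-state-metrics series kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason are illustrative assumptions.

```yaml
groups:
  - name: kubernetes-apps
    rules:
      # Illustrative only: fires when a container restarted in the last
      # 10 minutes and kube-state-metrics reports its last termination
      # reason as OOMKilled.
      - alert: KubeContainerOOMKilled
        expr: |
          (kube_pod_container_status_restarts_total
             - kube_pod_container_status_restarts_total offset 10m >= 1)
          and ignoring (reason)
          min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
        labels:
          severity: warning
        annotations:
          summary: >-
            Container {{ $labels.container }} in pod
            {{ $labels.namespace }}/{{ $labels.pod }} was recently OOMKilled.
```

The `and ignoring (reason)` join is needed because the restart counter carries no `reason` label while the last-terminated-reason series does.
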
@davidspek
Author

/cc @paulfantom @arajkumar @povilasv since you reviewed and approved a similar PR recently.

@povilasv
Contributor

Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

@Retna-Gjensidige

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

reason="CrashLoopBackOff" will not cover reason="OOMKilled". They are two different scenarios. A pod could be in CrashLoopBackOff for many reasons, e.g. failing health probes, while an OOMKill is specific to an otherwise healthy pod being killed when it reaches its memory limit.

@edwardgronroos

This looks good @davidspek. We need this alert as well, and it would be great to have it bundled rather than having to add it as a customization.

davidspek and others added 2 commits October 27, 2022 09:30
Co-authored-by: Retna <76952128+Retna-Gjensidige@users.noreply.github.com>
Co-authored-by: Edward Grönroos <edward@gronroos.se>
@davidspek
Author

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

It is important to know specifically that a container was OOMKilled so you can adjust its memory limits accordingly.

@povilasv
Contributor

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

> reason="CrashLoopBackOff" will not cover reason="OOMKilled". They are two different scenarios. A pod could be in CrashLoopBackOff for many reasons, e.g. failing health probes, while an OOMKill is specific to an otherwise healthy pod being killed when it reaches its memory limit.

But a pod could also be crash looping because of repeated OOMKills, so it does cover it, no?

@davidspek
Author

It can be difficult to detect after the fact that a pod crashed because it was OOMKilled. I think it warrants a dedicated alert so you can more easily take appropriate action (like increasing the replica count or the memory limit). Bunching it up with other crash loops, which can happen for any number of reasons, makes it difficult to find pods whose memory limits are set too low.
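
As an example of that follow-up action, once a container is flagged as OOMKilled you can compare its observed peak memory against its configured limit to decide on a new value. A sketch using the standard cAdvisor and kube-state-metrics series (the 1d window and the label filters are assumptions, adjust to your setup):

```promql
# Peak working-set memory over the last day as a fraction of the memory
# limit, per container; values at or near 1 suggest the limit is too low.
max by (namespace, pod, container) (
  max_over_time(container_memory_working_set_bytes{container!="", container!="POD"}[1d])
)
/
max by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="memory"}
)
```

A new limit somewhat above the observed peak, plus headroom for spikes, is a reasonable starting point.
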

@Retna-Gjensidige

Retna-Gjensidige commented Nov 3, 2022

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

> reason="CrashLoopBackOff" will not cover reason="OOMKilled". They are two different scenarios. A pod could be in CrashLoopBackOff for many reasons, e.g. failing health probes, while an OOMKill is specific to an otherwise healthy pod being killed when it reaches its memory limit.

> But a pod could also be crash looping because of repeated OOMKills, so it does cover it, no?

Yes, a crash-loop alert will "cover" OOMKills as a whole, but you have to ask yourself what the point of having alerts is: you want the receiving party to quickly identify the issue and resolve it. The more specific the alerts, the better. As mentioned before, a crash loop is generic and can have many causes; there is no mistaking an OOMKill alert.

@povilasv
Contributor

povilasv commented Nov 3, 2022

Reread this: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

Two points from the above document that capture what I don't like about this alert:

  • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
  • Does it detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?

@Retna-Gjensidige

> Reread this: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
>
> Two points from the above document that capture what I don't like about this alert:
>
> • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
> • Does it detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?

Thanks for the link. It was a good read 👍
The points you mention are under:

> When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier

Speaking from my own experience, I am in an oncall rotation. This alert is set at warning severity, so it will not trigger a page/SMS/email 😅.
That said, this alert does address the following point:

> Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.

Personally, I consider OOMKilled a symptom, just as a crash loop is another symptom.

@davidspek
Author

The main reason for this is also that an OOMKill can be seen as an event you wouldn't want to miss, even though it might not cause a crash loop. Even without a crash loop happening, you probably still want to know that an OOMKill occurred. The alert might not be perfectly set up for this yet, but I'm open to discussing what it should look like.
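
For that case (an OOM kill inside the container that does not take down the main process, so no restart and no crash loop), the cAdvisor counter container_oom_events_total can help, assuming your kubelet/cAdvisor version exposes it, since it counts OOM kills in the container's cgroup even when the container keeps running:

```promql
# OOM kills in the last 15 minutes, whether or not the container restarted.
increase(container_oom_events_total{container!=""}[15m]) > 0
```
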

@github-actions

This PR has been automatically marked as stale because it has not had any activity in the past 30 days.

The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity.

Thank you for your contributions!

@github-actions github-actions bot added the stale label Sep 19, 2024
@github-actions github-actions bot closed this Sep 26, 2024
