
Conversation

@davidspek davidspek commented Oct 20, 2022

Signed-off-by: DavidSpek <vanderspek.david@gmail.com>

This PR adds a Prometheus rule to catch pods that have been OOMKilled.
Happy to improve or change the expression if anybody has suggestions.

Signed-off-by: DavidSpek <vanderspek.david@gmail.com>
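
For context, here is a minimal sketch of what such a rule could look like as a plain Prometheus rule file. This is not necessarily the exact expression in this PR: the alert name KubeContainerOOMKilled, the 10-minute window, and the pairing of the kube-state-metrics series kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason are illustrative assumptions.

```yaml
groups:
  - name: kubernetes-apps
    rules:
      # Illustrative only: fires when a container restarted in the last
      # 10 minutes and kube-state-metrics reports its last termination
      # reason as OOMKilled.
      - alert: KubeContainerOOMKilled
        expr: |
          (kube_pod_container_status_restarts_total
             - kube_pod_container_status_restarts_total offset 10m >= 1)
          and ignoring (reason)
          min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
        labels:
          severity: warning
        annotations:
          summary: >-
            Container {{ $labels.container }} in pod
            {{ $labels.namespace }}/{{ $labels.pod }} was recently OOMKilled.
```

The `and ignoring (reason)` join is needed because the restart counter carries no `reason` label while the last-terminated-reason series does.
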
@davidspek
Author

/cc @paulfantom @arajkumar @povilasv since you reviewed and approved a similar PR recently.

@povilasv
Contributor

Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

@Retna-Gjensidige

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

reason="CrashLoopBackOff" will not cover reason="OOMKilled". They are two different scenarios. A pod could be in CrashLoopBackOff for many reasons, e.g. failing health probes, while an OOMKill is specific to an otherwise healthy pod being killed when it reaches its memory limit.

@edwardgronroos

This looks good @davidspek. We need this alert as well, and it would be great to have it bundled rather than having to add it as a customization.

davidspek and others added 2 commits October 27, 2022 09:30
Co-authored-by: Retna <76952128+Retna-Gjensidige@users.noreply.github.com>
Co-authored-by: Edward Grönroos <edward@gronroos.se>
@davidspek
Author

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

It is important to know specifically that a container was OOMKilled so you can adjust its memory limits accordingly.

@povilasv
Contributor

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

> reason="CrashLoopBackOff" will not cover reason="OOMKilled". They are two different scenarios. A pod could be in CrashLoopBackOff for many reasons, e.g. failing health probes, while an OOMKill is specific to an otherwise healthy pod being killed when it reaches its memory limit.

But a pod could also be crash looping because of repeated OOMKills, so it does cover it, no?

@davidspek
Author

It can be difficult to detect after the fact that a pod crashed because it was OOMKilled. I think it warrants a dedicated alert so you can more easily take appropriate action (like increasing the replica count or the memory limit). Bunching it up with other crash loops, which can happen for any number of reasons, makes it difficult to find pods whose memory limits are set too low.
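
As an example of that follow-up action, once a container is flagged as OOMKilled you can compare its observed peak memory against its configured limit to decide on a new value. A sketch using the standard cAdvisor and kube-state-metrics series (the 1d window and the label filters are assumptions, adjust to your setup):

```promql
# Peak working-set memory over the last day as a fraction of the memory
# limit, per container; values at or near 1 suggest the limit is too low.
max by (namespace, pod, container) (
  max_over_time(container_memory_working_set_bytes{container!="", container!="POD"}[1d])
)
/
max by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="memory"}
)
```

A new limit somewhat above the observed peak, plus headroom for spikes, is a reasonable starting point.
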

@Retna-Gjensidige

Retna-Gjensidige commented Nov 3, 2022

> Wouldn't this be covered by this alert: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L15-L26?

> reason="CrashLoopBackOff" will not cover reason="OOMKilled". They are two different scenarios. A pod could be in CrashLoopBackOff for many reasons, e.g. failing health probes, while an OOMKill is specific to an otherwise healthy pod being killed when it reaches its memory limit.

> But a pod could also be crash looping because of repeated OOMKills, so it does cover it, no?

Yes, a crash-loop alert will "cover" OOMKills as a whole, but you have to ask yourself what the point of having alerts is: you want the receiving party to quickly identify the issue and resolve it. The more specific the alerts, the better. As mentioned before, a crash loop is generic and can have many causes; there is no mistaking an OOMKill alert.

@povilasv
Contributor

povilasv commented Nov 3, 2022

Reread this: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

Two points from the above document that capture what I don't like about this alert:

  • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
  • Does it detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?

@Retna-Gjensidige

> Reread this: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
>
> Two points from the above document that capture what I don't like about this alert:
>
> • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
> • Does it detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?

Thanks for the link. It was a good read 👍
The points you mention are under:

> When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier

Speaking from my own experience, I am in an oncall rotation. This alert is set at warning severity, so it will not trigger a page/SMS/email 😅.
That said, this alert does address the following point:

> Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.

Personally, I consider OOMKilled a symptom, just as a crash loop is another symptom.

@davidspek
Author

The main reason for this is also that an OOMKill can be seen as an event you wouldn't want to miss, even though it might not cause a crash loop. Even without a crash loop happening, you probably still want to know that an OOMKill occurred. The alert might not be perfectly set up for this yet, but I'm open to discussing what it should look like.
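
For that case (an OOM kill inside the container that does not take down the main process, so no restart and no crash loop), the cAdvisor counter container_oom_events_total can help, assuming your kubelet/cAdvisor version exposes it, since it counts OOM kills in the container's cgroup even when the container keeps running:

```promql
# OOM kills in the last 15 minutes, whether or not the container restarted.
increase(container_oom_events_total{container!=""}[15m]) > 0
```
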

@github-actions

This PR has been automatically marked as stale because it has not had any activity in the past 30 days.

The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity.

Thank you for your contributions!

@github-actions github-actions bot added the stale label Sep 19, 2024
@github-actions github-actions bot closed this Sep 26, 2024
