Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mixin: Add and update alerts #2644

Merged
merged 19 commits into from
Jul 4, 2023
Merged

Conversation

v-zhuravlev
Copy link
Contributor

@v-zhuravlev v-zhuravlev commented Mar 27, 2023

  • e1f3b37 Add %(nodeExporterSelector)s to Network and conntrack alerts
  • af23597 Add NodeDiskIOSaturation alert
  • e2e7864 Set 'at' everywhere as preposition for instance
  • 282c129 Decrease NodeNetwork*Errs pending period
  • 94535a9 Add failed systemd service alert
  • f685865 Add CPU and memory alerts
  • 4d19e84 Decrease NodeFilesystem pending time to 15m
  • 0336d13 Add mountpoint to NodeFilesystem alerts

@v-zhuravlev v-zhuravlev changed the title Mixin: Alerts: Add mountpoint to NodeFilesystem alerts Mixin: Add and update alerts Mar 27, 2023
@v-zhuravlev v-zhuravlev force-pushed the mixin_alerts branch 7 times, most recently from 7b78643 to 4a83ed2 Compare March 27, 2023 22:58
@@ -309,6 +309,102 @@
description: 'File descriptors limit at {{ $labels.instance }} is currently at {{ printf "%.2f" $value }}%.',
},
},
{
alert: 'NodeCPUHighUsage',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High CPU usage is not a problem and can just be an indicator or properly utilizing your machine, so I'd remove these

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps, as long as we can alert on high system load(saturation).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, CPU usage is good. :) I mean, this would be a case for the "info" level alerts that I like to promote, but I don't think we have them here in the mixin.

(Info level alerts notify nobody, but you could look at the alerts page while troubleshooting. They point to things that are not problems per se and might be OK, but which you might be interested while there is an actual incident happening.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'd be fine with a 'info' level severity. No reason to now just introduce that now that we're on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make it an info according to this guideline:

info for alerts that do not require any action by itself but mark something as “out of the ordinary”. Those alerts aren’t usually routed anywhere, but can be inspected during troubleshooting.

Copy link

@jcpunk jcpunk May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a warning if the usage stays above 98% for 1h would be viable? That would be a case where the host is at capacity and scheduling more tasks there would result in performance degradation. It is a risk folks can accept but something that should be considered as part of the capacity plan.

},
},
{
alert: 'NodeSystemSaturation',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this, this miiight make sense but I also leaning towards not doing this. @SuperQ wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is is really helpful to detect system performance degradation, https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is are there in pragmatic reality scenarios where a high load is nothing to warn about. And the article explains how other, non cpu related saturation will also increase this. (But thanks for the link, super interesting to see the actual patch that introduced this confusion :))

Copy link
Contributor Author

@v-zhuravlev v-zhuravlev Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this alert triggers on load per core. There are always exceptions, in such situations 'silence' could also help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is are there in pragmatic reality scenarios where a high load is nothing to warn about. And the article explains how other, non cpu related saturation will also increase this. (But thanks for the link, super interesting to see the actual patch that introduced this confusion :))

that's why this alert is called NodeSystemSaturation, not CPUsaturation, btw :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally remember scenarios where the load metric was high in legitimate use cases without being a problem, but that's long ago, and a lot has changed in the kernel since then. I heard opinions that load average is now somewhat useful as a metric, others state the opposite, and I don't feel qualified to make the call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also info severity? Alerting/paging on this kinda goes against alerting on actual impact (as oppose to alert on response times of services running on the overloaded node)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the load metric here actually tells us about actual saturation, then I would say "warning" is fine. It is an actionable alert then, at least for the many scenarios where you don't want to run your systems over-saturated all the time. It's just not urgent enough to wake someone up.

IMHO "info" level alerts are for conditions that are completely fine on their own but could be hints towards a possible cause while an incident is happening.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah lets convert it to warning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, decreased to warning, agree

@discordianfish
Copy link
Member

Thanks! I think these need a bit discussion. @SuperQ @pgier @beorn7 wdyt about these?

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff. But for some alerts, I just don't feel qualified to judge, see comments.

I have a few punctuation nits, and more importantly by making some of the new parameters configurable, we can avoid any trouble for those that would like to keep the old ones or don't like the new alerts.

docs/node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
docs/node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
@@ -53,13 +53,13 @@
node_filesystem_readonly{%(nodeExporterSelector)s,%(fsSelector)s,%(fsMountpointSelector)s} == 0
)
||| % $._config,
'for': '30m',
'for': '15m',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about the exact time here. Maybe it's worth making it configurable, similar to fsSpaceAvailableWarningThreshold? In a way, the one depends on the other. If your fsSpaceAvailableWarningThreshold is conservative, you might be more relaxed about the for time and vice versa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if it is really necessary to tune for timings as well, if you really need to tune it you can patch with jsonnnet. So I would keep it static.

AFAIK, the idea of for is to be sure that problem is permanent and won't resolve itself. At the same time, if it is a real problem, 30mins feels too long period to wait (especially for actually actionable alerts like 'Filesystem is at critical level and could fill up any minute' Plus, if disk is rather small and filling up fast, it may fill up to 100% faster than 30minutes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I agree with your reasoning, but you can use the same argument to set it to 5m or an hour… the idea here is to find a good middle ground between noisy alerts and alerting too late. I don't know if the original value was based on a lot of research, and I was hoping you could present numbers like e.g. "With 15m in our prod setup, we got just 5% more false positives, but 40% less false negatives".

@@ -71,13 +71,13 @@
node_filesystem_readonly{%(nodeExporterSelector)s,%(fsSelector)s,%(fsMountpointSelector)s} == 0
)
||| % $._config,
'for': '30m',
'for': '15m',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@@ -129,13 +129,13 @@
node_filesystem_readonly{%(nodeExporterSelector)s,%(fsSelector)s,%(fsMountpointSelector)s} == 0
)
||| % $._config,
'for': '1h',
'for': '15m',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. Curious why we had this different from the file space.

Copy link
Contributor Author

@v-zhuravlev v-zhuravlev Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the idea of 'for' is to be sure that problem is permanent and won't resolve itself. At the same time, if it is a real actionable problem, 1h could be too long period to wait. Plus, if disk is rather small and filling up fast, it may fill up to 100% faster than 1h.

docs/node-mixin/alerts/alerts.libsonnet Show resolved Hide resolved
@@ -309,6 +309,102 @@
description: 'File descriptors limit at {{ $labels.instance }} is currently at {{ printf "%.2f" $value }}%.',
},
},
{
alert: 'NodeCPUHighUsage',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, CPU usage is good. :) I mean, this would be a case for the "info" level alerts that I like to promote, but I don't think we have them here in the mixin.

(Info level alerts notify nobody, but you could look at the alerts page while troubleshooting. They point to things that are not problems per se and might be OK, but which you might be interested while there is an actual incident happening.)

},
},
{
alert: 'NodeSystemSaturation',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally remember scenarios where the load metric was high in legitimate use cases without being a problem, but that's long ago, and a lot has changed in the kernel since then. I heard opinions that load average is now somewhat useful as a metric, others state the opposite, and I don't feel qualified to make the call.

docs/node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
docs/node-mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
@v-zhuravlev
Copy link
Contributor Author

Ok, I made all new alerts thresholds configurable, please let me know what you think.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this looks good to me. As said, I don't feel strongly about the for timing. From my point of view, I cannot say if the one is better than the other (but I do like that this PR makes them consistent – that one higher for duration for FDs was weird).

I also cannot really comment on the alert based on load. @discordianfish and @SuperQ I leave it to you to make the call.

Everything else is LGTM from my side.

@discordianfish
Copy link
Member

Beside this one comment, LGTM. I'm also fine with the 'for' changes. 1h/30m feel unusual long. Let's go with the proposed 'for' times for now, we can make it configurable later.

@discordianfish
Copy link
Member

Requires rebase

@v-zhuravlev
Copy link
Contributor Author

Rebased

@SuperQ
Copy link
Member

SuperQ commented May 24, 2023

Sorry, this needs another rebase in order to pick up some unrelated bug fixes.

@v-zhuravlev
Copy link
Contributor Author

np, rebased just now

v-zhuravlev and others added 19 commits June 29, 2023 23:26
This helps to identify alerting filesystem.

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
30m is too long and there is a risk of running out of disk space/inodes completely if something is filling up disk very fast (like log file).

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
@discordianfish
Copy link
Member

@v-zhuravlev Let me know when this is ready to get reviewed

@v-zhuravlev
Copy link
Contributor Author

v-zhuravlev commented Jul 1, 2023 via email

Copy link
Member

@discordianfish discordianfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM!

Copy link
Member

@discordianfish discordianfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM!

@SuperQ SuperQ merged commit ed57c15 into prometheus:master Jul 4, 2023
v-zhuravlev added a commit to grafana/node_exporter that referenced this pull request Jul 15, 2023
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants