feat: node exporter mixin large update #2665

Closed
wants to merge 15 commits into from

v-zhuravlev
Contributor

@v-zhuravlev v-zhuravlev commented Apr 21, 2023

  1. This update introduces a three-tier view of Linux nodes:
  • Fleet view (top): see a group of your Linux instances at once
  • Node overview: see a specific node at a glance
  • Drill-down: a set of dashboards for deep analysis with advanced metrics

Links and data links are provided for better navigation between views.

Checklist:

  • Convert graph panels to timeseries panels with default style (opacity, tooltip, legend position, etc).
  • Add info row to overview dashboard
  • Add linux network dashboard
    • Add interfaces overview panel
    • Add oper status timeline
    • Add Sockstat/Netstat metrics to network dashboard
  • Add Advanced CPU and system dash
  • Add Advanced Memory dash
  • Add Fleet overview dash
    • Add overview fleet table
    • Add common CPU graph (top25)
    • Add common memory graph (top25)
    • Add common network graph (top25)
    • Add common Disk / FS graph (top25)
  • Add annotations
    • Reboot detected
    • Kernel change detected
    • OOM kill detected
  • Add job/cluster variable support for additional grouping

Various dashboard improvements:

  • Change 'logical core' line style to dotted
  • Update Disk I/O time metric to dots
  • Move dashboard parameters, such as tags and timezone, to _config
  • Convert gauges to stat for memory usage panel
  • Add CPU usage stat panel
  • Add dashboards and data links to navigate between dashboards

@discordianfish
Member

Very nice! This needs a thorough review though. For this, can you first clean up the commit history and add DCO sign-off, then ping us when you're ready to get this reviewed?

Member

@SuperQ SuperQ left a comment

This needs a DCO sign-off. You can use git commit -s --amend to add it.
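
Note that `git commit -s --amend` only rewrites the most recent commit. For a branch with several commits, `git rebase --signoff` can add the trailer to every commit in a range. The following is a minimal sketch in a throwaway repository; the user name, email, file names, and commit messages are placeholders, not taken from this pull request.

```shell
#!/bin/sh
# Demo in a temporary repository; identity and file names are placeholders.
set -eu

repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Example Dev"
git -C "$repo" config user.email "dev@example.com"

# One base commit plus two feature commits, none signed off yet.
for n in base one two; do
  echo "$n" > "$repo/$n.txt"
  git -C "$repo" add "$n.txt"
  git -C "$repo" commit -q -m "commit $n"
done

# Rewrite the last two commits, appending a Signed-off-by trailer to each.
git -C "$repo" rebase -q --signoff HEAD~2

# Print subject and body of each commit to inspect the trailers.
git -C "$repo" log --format='%s%n%b'
```

On a real branch, the argument to `git rebase --signoff` would be the upstream base of the pull request rather than `HEAD~2`, and the rewritten branch would then need a force push.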

@v-zhuravlev
Contributor Author

v-zhuravlev commented May 24, 2023

yes, will clean this up after alerts #2644 is merged.

rgeyer and others added 7 commits July 15, 2023 10:51
Add UIDs to all dashboards.
Add units and descriptions to all panels which were missing them.
Modify alerts descriptions and summaries as needed for linting.

Signed-off-by: Ryan J. Geyer <me@ryangeyer.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
* Add mountpoint to NodeFilesystem alerts
This helps to identify alerting filesystem.

* Decrease NodeFilesystem pending time to 15m
30m is too long: there is a risk of running out of disk space or inodes completely if something (such as a log file) is filling up the disk very quickly.

* Add CPU and memory alerts
* Add failed systemd service alert
* Decrease NodeNetwork*Errs pending period
* Set 'at' everywhere as preposition for instance
* Add NodeDiskIOSaturation alert
* Add %(nodeExporterSelector)s to Network and conntrack alerts
* Add diskDevice selector
* Fix NodeMemoryHighUtilization alert
* Add NodeSystemSaturation and NodeMemoryMajorPagesFaults
* Decrease NodeSystemdServiceFailed severity to warning
* Extend alert description
* Add comma after 'mounted on'
* Add thresholds for memory alerts
* Add thresholds for memory, disk and system alerts
* Set severity of NodeCPUHighUsage to info
* Convert graph panels to timeseries panel
...With default style (opacity, tooltip etc).
Also:
Change 'logical core' line style to dotted
Update Disk I/O time metric to dots
* Move dashboard parameters to config
* Add overview row
* Add Cpu Usage stat panel
* Add network dash
- Add interfaces overview panel
- Add oper status timeline
- Add common lib with reused elements (templates, queries)
- Add common panels with shared style to be used across this mixin
* Remove external panels lib
* Add fleet dashboard
* Update fleet dash
* Add CPU and memory to fleet
* Add common cpu/memory/disk/network panels on fleet
* add network errors panel as points
* Fix alerts column in fleet table
* Add support for multiple group and instance labels
* Add sockstat to network dashboard
* Add netstat to network dashboard
* Change span to gridPos. Make overview row smaller.
- gridPos supports tiny panel heights.
* add reboot annotation
* Add system dashboard
* add filesystem row
* Add disk and fs dashboard
* Add memory dashboard
* Add memory generic counters to memory dashboard
* Update common lib
* Update OOM killer panel
* Add common annotations: kernelChange, OOMkill
* Add mountpoint to NodeFilesystem alerts
- This helps to identify alerting filesystem.
* Add CPU and memory alerts
* Add failed systemd service alert
* Decrease NodeNetwork*Errs pending period
* Set 'at' everywhere as preposition for instance
* Add NodeDiskIOSaturation alert
* Add %(nodeExporterSelector)s to Network and conntrack alerts
* Add diskDevice selector
* Fix NodeMemoryHighUtilization alert
* Add NodeSystemSaturation and NodeMemoryMajorPagesFaults
* Decrease NodeSystemdServiceFailed severity to warning
* Remove unused import
* Add ability to set custom dashboardUID
* Add mountpoint to NodeFilesystem alerts
* Add failed systemd service alert
* Remove systemd panel
- systemd collector is disabled by default
* Add some lint exclusions.
- Add UIDs to all dashboards.
- Add units and descriptions to all panels which were missing them.
- Modify alerts descriptions and summaries as needed for linting.
* Add multi-cluster dashboard lint exclusions
* Extend alert description
* Add thresholds for memory, disk and system alerts
* Set severity of NodeCPUHighUsage to info
* Fix broken diskSpaceUsage link
* Fix cpuIdle panel units
* Change cpuUsage to use $__rate_interval
* Fix cpu usage (replace with nodeQuerySelector)
* Fix units (seconds->s)
* Fix iops units
* Add %(nodeQuerySelector)s to alerts queries
* Add support for multi in job
* Fix Pagesout metric
* Add total and available memory metrics
* Update context switches description
* Add network descriptions
* Change separator in AxisLabel from / to |
* Update network descriptions
* Add timezone metric

---------

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Ryan J. Geyer <me@ryangeyer.com>
Instead, one can redefine grafanaDashboardIDs in _config

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
to stay under mimir's default limit of 20 alerts per group.

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
This fixes an issue where, after selecting a node with a specific datasource, the link did not use that datasource and therefore showed no data.

Signed-off-by: Emily Ahlstrand Rager <emily.rager@grafana.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
@v-zhuravlev
Contributor Author

v-zhuravlev commented Jul 15, 2023

@discordianfish, @SuperQ, Hi!
Rebased and DCO signed.

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
@v-zhuravlev v-zhuravlev requested a review from SuperQ July 24, 2023 19:57
* Add node-observ-lib

* Remove trends support (not in 10.0 schema)

* Make filteringSelector for logs dashboard configurable

* Temp change dependency (until PR is merged for commonlib)

* Refactor config

* Update jsonnetfile.json

* Update README

* Add separate loki example

* Add sep file example
* Add gitignore to node-observ-lib

* Fix typo in node default filteringSelector

* Prep alert group names for macos

* Add macos-observ-lib

* Change overview dashboard:
show networkErrorsAndDroppedPerSec instead of networkErrorPerSec for Linux/MacOS

* Add more alerts

* Move alerts to sep file

* Breaking: Update layout

To allow importing linux locally from macos

* Bring back NodeFilesystemAlmostOutOfFiles alert

* Show only errors when they occur

* Only show network interfaces whose traffic changed at least once during the selected dashboard interval
@v-zhuravlev
Contributor Author

v-zhuravlev commented Nov 29, 2023

Closing in favor of #2861
It is much cleaner now thanks to grafonnet. (IMHO)

5 participants