CP-8403 Adding Telegraf-based metric collection. #81

Merged: 1 commit merged into delphix:master on Jun 28, 2022

Conversation

scottmdlpx

@scottmdlpx scottmdlpx commented Apr 15, 2022

Initial addition of configuration and control files to enable performance metric collection using the Telegraf agent.
See also IDEA-2835 : Improving Support Bundle Performance Metrics

Includes:

  • Service definition and startup script for "delphix-telegraf"
  • Modified version of "estat" adding JSON output via a "-j" option
  • A "perf_playbook" wrapper script to enable/disable enhanced collection
  • Configuration file sections (combined on startup)
  • Simple wrappers to facilitate parsing of "nfs_threads", "zpool iostat -o",
    and "zcache stats -a" outputs

The service starts with a "base" set of metrics, adds Object Storage metrics
when Object Storage is detected, and adds Performance Playbook commands if the
playbook has been enabled (manually). The config is reassembled on each startup.
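As a rough illustration of the "reassembled on each startup" behaviour, the sketch below concatenates the section files named in this PR into a single telegraf.conf. It is not the actual delphix-telegraf-service script (which is a bash script); the function name `assemble_config` and the two boolean flags are hypothetical, and how Object Storage is detected or how the playbook state is stored is not shown here.

```python
from pathlib import Path


def assemble_config(conf_dir: Path, dose_detected: bool, playbook_enabled: bool) -> str:
    """Concatenate config sections in a fixed order and write telegraf.conf.

    Hypothetical sketch: the flags stand in for whatever detection logic the
    real startup script uses.
    """
    # The base section is always included.
    sections = [conf_dir / "telegraf.base"]
    # Optional sections are appended only when their condition holds.
    if dose_detected:
        sections.append(conf_dir / "telegraf.inputs.dose")
    if playbook_enabled:
        sections.append(conf_dir / "telegraf.inputs.playbook")

    # Join sections with exactly one newline between them.
    combined = "\n".join(p.read_text().rstrip("\n") for p in sections) + "\n"
    (conf_dir / "telegraf.conf").write_text(combined)
    return combined
```

Because the combined file is rebuilt from scratch every time, enabling or disabling a section only requires changing the condition and restarting the service, which matches the restart visible in the `perf_playbook enable` / `disable` transcripts.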

File paths intended:

/opt/delphix/server/bin/delphix-telegraf-service
/lib/systemd/system/delphix-telegraf.service
/usr/bin/estat
/etc/telegraf/nfs-threads.sh
/opt/delphix/server/bin/perf_playbook
/etc/telegraf/telegraf.base
/etc/telegraf/telegraf.inputs.dose
/etc/telegraf/telegraf.inputs.playbook
/etc/telegraf/zcache-stats.sh
/etc/telegraf/zpool-iostat-o.sh

This configuration records 4 output files (rotated on size), covering main
metrics, aggregate statistics (min, max, mean, stddev), and Playbook outputs,
to enable independent retention periods.
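For readers unfamiliar with the aggregate statistics mentioned above, here is a minimal sketch of computing min, max, mean, and stddev over one window of samples. In the PR the aggregation is done by Telegraf itself; the `aggregate` function is hypothetical, and the use of the sample (n-1) standard deviation is an assumption.

```python
import statistics
from typing import Dict, Sequence


def aggregate(samples: Sequence[float]) -> Dict[str, float]:
    """Summarise one collection window as min/max/mean/stddev."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        # Sample standard deviation is undefined for a single value,
        # so report 0.0 in that case (an arbitrary choice for this sketch).
        "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }
```

Storing such summaries in a separate file from the raw metrics is what allows the two to be kept for different lengths of time.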

@jwk404 jwk404 left a comment

Very happy to see this. Thanks for adding it.

@scottmdlpx scottmdlpx requested a review from jwk404 April 16, 2022 13:03
@scottmdlpx scottmdlpx requested a review from jwk404 June 13, 2022 15:06
@scottmdlpx scottmdlpx changed the title from "Initial addition of configuration and control files to enable" to "Adding Telegraf-based metric collection." Jun 21, 2022
@scottmdlpx scottmdlpx changed the title from "Adding Telegraf-based metric collection." to "CP-8403 Adding Telegraf-based metric collection." Jun 23, 2022
@scottmdlpx

ab-pre-push Success:
http://selfservice.jenkins.delphix.com/job/appliance-build-orchestrator-pre-push/2284/

Cloned a VM from the build and:

$ sudo systemctl status delphix-telegraf
● delphix-telegraf.service - Delphix Telegraf Metric Collection Agent
     Loaded: loaded (/lib/systemd/system/delphix-telegraf.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-06-27 08:18:48 UTC; 4h 13min ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 4201 (delphix-telegra)
      Tasks: 12 (limit: 8751)
     Memory: 91.5M
     CGroup: /system.slice/delphix-telegraf.service
             ├─4201 /bin/bash /usr/bin/delphix-telegraf-service
             └─4472 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf

Output files are being logged under /var/log/telegraf:

$ pwd
/var/log/telegraf
$ ls
metric_aggregates.json                        metrics_estat.json
metrics.json                                  metrics_zfs.json

I also pulled a Support Bundle and verified these are included.
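To inspect these files once they are extracted from a Support Bundle, something like the following sketch could be used. It assumes each line is one JSON object with "name", "tags", "fields", and "timestamp" keys, as produced by Telegraf's JSON serializer; the actual layout depends on the output settings in this PR's config, and the helper names `iter_metrics` and `count_by_name` are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_metrics(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSON line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Tolerate partial lines, e.g. one cut off when a file was rotated.
            continue
        if isinstance(record, dict):
            yield record


def count_by_name(lines: Iterable[str]) -> Counter:
    """Count records per measurement name."""
    return Counter(m.get("name", "<unnamed>") for m in iter_metrics(lines))
```

A file object can be passed directly, for example `count_by_name(open("/var/log/telegraf/metrics.json"))`, since iterating a text file yields its lines.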

Enabling the performance playbook scripts:

$ sudo perf_playbook enable
Mon 27 Jun 2022 12:32:56 PM UTC
Enabling Performance Playbook Metrics
$ sudo systemctl status delphix-telegraf
● delphix-telegraf.service - Delphix Telegraf Metric Collection Agent
     Loaded: loaded (/lib/systemd/system/delphix-telegraf.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-06-27 12:32:56 UTC; 3s ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 40001 (delphix-telegra)
      Tasks: 29 (limit: 8751)
     Memory: 282.2M
     CGroup: /system.slice/delphix-telegraf.service
             ├─40001 /bin/bash /usr/bin/delphix-telegraf-service
             ├─40117 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf
             ├─40124 python3 /usr/bin/estat nfs -jm 10
             ├─40125 python3 /usr/bin/estat iscsi -jm 10
             ├─40126 python3 /usr/bin/estat zpl -jm 10
             ├─40127 python3 /usr/bin/estat backend-io -jm 10
             ├─40130 python3 /usr/bin/estat zvol -jm 10
             ├─40131 python3 /usr/bin/estat zio -jm 10
             ├─40135 /bin/sh /etc/telegraf/nfs-threads.sh
             ├─40136 python3 /usr/bin/nfs_threads
             └─40137 grep -E --line-buffered -v thr

and disabling them again:

$ sudo perf_playbook disable
Mon 27 Jun 2022 12:33:09 PM UTC
Disabling Performance Playbook Metrics
$ sudo systemctl status delphix-telegraf
● delphix-telegraf.service - Delphix Telegraf Metric Collection Agent
     Loaded: loaded (/lib/systemd/system/delphix-telegraf.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-06-27 12:33:09 UTC; 3s ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 40215 (delphix-telegra)
      Tasks: 9 (limit: 8751)
     Memory: 24.0M
     CGroup: /system.slice/delphix-telegraf.service
             ├─40215 /bin/bash /usr/bin/delphix-telegraf-service
             └─40323 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf

I also tested streaming to InfluxDB directly by uncommenting the relevant config and this works.

Initial addition of configuration and control files to enable
performance metric collection using the Telegraf agent.
See also IDEA-2835 : Improving Support Bundle Performance Metrics

Includes:
- Service definition and startup script for "delphix-telegraf"
- Modified version of "estat" adding JSON output via a "-j" option
- A "perf_playbook" wrapper script to enable/disable enhanced collection
- Configuration file sections (combined on startup)
- Simple wrappers to facilitate parsing of "nfs_threads", "zpool iostat -o",
and "zcache stats -a" outputs

The service starts with a "base" set of metrics, but will include Object Storage
metrics when it is detected, and will include Performance Playbook commands
if that has been enabled (manually). The config is reassembled each startup.

File paths intended:

/opt/delphix/server/bin/delphix-telegraf-service
/lib/systemd/system/delphix-telegraf.service
/etc/telegraf/nfs-threads.sh
/opt/delphix/server/bin/perf_playbook
/etc/telegraf/telegraf.base
/etc/telegraf/telegraf.inputs.dose
/etc/telegraf/telegraf.inputs.playbook
/etc/telegraf/zcache-stats.sh
/etc/telegraf/zpool-iostat-o.sh

This configuration records 4 output files (rotated on size) for main metrics,
aggregate statistics (min,max,mean,stddev) and Playbook outputs to enable
independent retention periods.
@scottmdlpx scottmdlpx merged commit 92b0686 into delphix:master Jun 28, 2022
5 participants