From 01dd9590d825fb85ff9f432a3c50b3d28f07692e Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Tue, 21 Jul 2020 15:21:16 +0200 Subject: [PATCH 01/82] README: add info for the new backfill metrics --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 1d1938b..2036339 100644 --- a/README.md +++ b/README.md @@ -59,6 +59,9 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ * **(Backfill) Last cycle**: Time in microseconds of last backfilling cycle. * **(Backfill) Mean cycle**: Mean of backfilling scheduling cycles in microseconds since last reset. * **(Backfill) Depth mean**: Mean of processed jobs during backfilling scheduling cycles since last reset. +* **(Backfill) Total Backfilled Jobs** (since last Slurm start): number of jobs started thanks to backfilling since last Slurm start. +* **(Backfill) Total Backfilled Jobs** (since last stats cycle start): number of jobs started thanks to backfilling since last time stats were reset. +* **(Backfill) Total backfilled heterogeneous Job components**: number of heterogeneous job components started thanks to backfilling since last Slurm start.
[Information extracted from the SLURM **sdiag** command](https://slurm.schedmd.com/sdiag.html) From 905a083a3511341da4b4d149fb49f1432c84f851 Mon Sep 17 00:00:00 2001 From: jamesbeedy Date: Sun, 16 Aug 2020 18:39:23 +0000 Subject: [PATCH 02/82] add snap packaging and docs --- .gitignore | 1 + packaging/snap/README.md | 261 +++++++++++++++++++++++++++++++++++++++ snap/snapcraft.yaml | 27 ++++ 3 files changed, 289 insertions(+) create mode 100644 packaging/snap/README.md create mode 100644 snap/snapcraft.yaml diff --git a/.gitignore b/.gitignore index e660fd9..9a5346d 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,2 @@ bin/ +*.snap diff --git a/packaging/snap/README.md b/packaging/snap/README.md new file mode 100644 index 0000000..7d25d95 --- /dev/null +++ b/packaging/snap/README.md @@ -0,0 +1,261 @@ +# Building the prometheus-slurm-exporter snap + +### Prereqs +* [snapcraft](https://snapcraft.io) + ```bash + sudo snap install snapcraft --classic + ``` +* [lxd](https://linuxcontainers.org/) + ```bash + sudo snap install lxd + ``` + +### Build +From the root of this project: +```bash +snapcraft --use-lxd +``` + +### Install locally built snap +```bash +sudo snap install prometheus-slurm-exporter_`git describe --tags`_amd64.snap +``` + +### Verify install +Curl the metrics endpoint to verify things are working. +```bash +$ curl 127.0.0.1:8080/metrics +# HELP go_gc_duration_seconds A summary of the GC invocation durations. +# TYPE go_gc_duration_seconds summary +go_gc_duration_seconds{quantile="0"} 0 +go_gc_duration_seconds{quantile="0.25"} 0 +go_gc_duration_seconds{quantile="0.5"} 0 +go_gc_duration_seconds{quantile="0.75"} 0 +go_gc_duration_seconds{quantile="1"} 0 +go_gc_duration_seconds_sum 0 +go_gc_duration_seconds_count 0 +# HELP go_goroutines Number of goroutines that currently exist. +# TYPE go_goroutines gauge +go_goroutines 11 +# HELP go_info Information about the Go environment. 
+# TYPE go_info gauge +go_info{version="go1.14.7"} 1 +# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use. +# TYPE go_memstats_alloc_bytes gauge +go_memstats_alloc_bytes 2.639e+06 +# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed. +# TYPE go_memstats_alloc_bytes_total counter +go_memstats_alloc_bytes_total 2.639e+06 +# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. +# TYPE go_memstats_buck_hash_sys_bytes gauge +go_memstats_buck_hash_sys_bytes 3698 +# HELP go_memstats_frees_total Total number of frees. +# TYPE go_memstats_frees_total counter +go_memstats_frees_total 1668 +# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started. +# TYPE go_memstats_gc_cpu_fraction gauge +go_memstats_gc_cpu_fraction 0 +# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. +# TYPE go_memstats_gc_sys_bytes gauge +go_memstats_gc_sys_bytes 3.436808e+06 +# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use. +# TYPE go_memstats_heap_alloc_bytes gauge +go_memstats_heap_alloc_bytes 2.639e+06 +# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. +# TYPE go_memstats_heap_idle_bytes gauge +go_memstats_heap_idle_bytes 6.2619648e+07 +# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. +# TYPE go_memstats_heap_inuse_bytes gauge +go_memstats_heap_inuse_bytes 3.899392e+06 +# HELP go_memstats_heap_objects Number of allocated objects. +# TYPE go_memstats_heap_objects gauge +go_memstats_heap_objects 17262 +# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. +# TYPE go_memstats_heap_released_bytes gauge +go_memstats_heap_released_bytes 6.258688e+07 +# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. 
+# TYPE go_memstats_heap_sys_bytes gauge +go_memstats_heap_sys_bytes 6.651904e+07 +# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection. +# TYPE go_memstats_last_gc_time_seconds gauge +go_memstats_last_gc_time_seconds 0 +# HELP go_memstats_lookups_total Total number of pointer lookups. +# TYPE go_memstats_lookups_total counter +go_memstats_lookups_total 0 +# HELP go_memstats_mallocs_total Total number of mallocs. +# TYPE go_memstats_mallocs_total counter +go_memstats_mallocs_total 18930 +# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. +# TYPE go_memstats_mcache_inuse_bytes gauge +go_memstats_mcache_inuse_bytes 13888 +# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. +# TYPE go_memstats_mcache_sys_bytes gauge +go_memstats_mcache_sys_bytes 16384 +# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. +# TYPE go_memstats_mspan_inuse_bytes gauge +go_memstats_mspan_inuse_bytes 89624 +# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. +# TYPE go_memstats_mspan_sys_bytes gauge +go_memstats_mspan_sys_bytes 98304 +# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. +# TYPE go_memstats_next_gc_bytes gauge +go_memstats_next_gc_bytes 4.473924e+06 +# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. +# TYPE go_memstats_other_sys_bytes gauge +go_memstats_other_sys_bytes 1.771662e+06 +# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator. +# TYPE go_memstats_stack_inuse_bytes gauge +go_memstats_stack_inuse_bytes 589824 +# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. +# TYPE go_memstats_stack_sys_bytes gauge +go_memstats_stack_sys_bytes 589824 +# HELP go_memstats_sys_bytes Number of bytes obtained from system. 
+# TYPE go_memstats_sys_bytes gauge +go_memstats_sys_bytes 7.243572e+07 +# HELP go_threads Number of OS threads created. +# TYPE go_threads gauge +go_threads 10 +# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. +# TYPE process_cpu_seconds_total counter +process_cpu_seconds_total 0.22 +# HELP process_max_fds Maximum number of open file descriptors. +# TYPE process_max_fds gauge +process_max_fds 1024 +# HELP process_open_fds Number of open file descriptors. +# TYPE process_open_fds gauge +process_open_fds 16 +# HELP process_resident_memory_bytes Resident memory size in bytes. +# TYPE process_resident_memory_bytes gauge +process_resident_memory_bytes 1.179648e+07 +# HELP process_start_time_seconds Start time of the process since unix epoch in seconds. +# TYPE process_start_time_seconds gauge +process_start_time_seconds 1.59760273488e+09 +# HELP process_virtual_memory_bytes Virtual memory size in bytes. +# TYPE process_virtual_memory_bytes gauge +process_virtual_memory_bytes 1.33564416e+09 +# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes. +# TYPE process_virtual_memory_max_bytes gauge +process_virtual_memory_max_bytes -1 +# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served. +# TYPE promhttp_metric_handler_requests_in_flight gauge +promhttp_metric_handler_requests_in_flight 1 +# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code. 
+# TYPE promhttp_metric_handler_requests_total counter +promhttp_metric_handler_requests_total{code="200"} 1 +promhttp_metric_handler_requests_total{code="500"} 0 +promhttp_metric_handler_requests_total{code="503"} 0 +# HELP slurm_cpus_alloc Allocated CPUs +# TYPE slurm_cpus_alloc gauge +slurm_cpus_alloc 0 +# HELP slurm_cpus_idle Idle CPUs +# TYPE slurm_cpus_idle gauge +slurm_cpus_idle 8 +# HELP slurm_cpus_other Mix CPUs +# TYPE slurm_cpus_other gauge +slurm_cpus_other 0 +# HELP slurm_cpus_total Total CPUs +# TYPE slurm_cpus_total gauge +slurm_cpus_total 8 +# HELP slurm_nodes_alloc Allocated nodes +# TYPE slurm_nodes_alloc gauge +slurm_nodes_alloc 0 +# HELP slurm_nodes_comp Completing nodes +# TYPE slurm_nodes_comp gauge +slurm_nodes_comp 0 +# HELP slurm_nodes_down Down nodes +# TYPE slurm_nodes_down gauge +slurm_nodes_down 0 +# HELP slurm_nodes_drain Drain nodes +# TYPE slurm_nodes_drain gauge +slurm_nodes_drain 0 +# HELP slurm_nodes_err Error nodes +# TYPE slurm_nodes_err gauge +slurm_nodes_err 0 +# HELP slurm_nodes_fail Fail nodes +# TYPE slurm_nodes_fail gauge +slurm_nodes_fail 0 +# HELP slurm_nodes_idle Idle nodes +# TYPE slurm_nodes_idle gauge +slurm_nodes_idle 1 +# HELP slurm_nodes_maint Maint nodes +# TYPE slurm_nodes_maint gauge +slurm_nodes_maint 0 +# HELP slurm_nodes_mix Mix nodes +# TYPE slurm_nodes_mix gauge +slurm_nodes_mix 0 +# HELP slurm_nodes_resv Reserved nodes +# TYPE slurm_nodes_resv gauge +slurm_nodes_resv 0 +# HELP slurm_queue_cancelled Cancelled jobs in the cluster +# TYPE slurm_queue_cancelled gauge +slurm_queue_cancelled 0 +# HELP slurm_queue_completed Completed jobs in the cluster +# TYPE slurm_queue_completed gauge +slurm_queue_completed 0 +# HELP slurm_queue_completing Completing jobs in the cluster +# TYPE slurm_queue_completing gauge +slurm_queue_completing 0 +# HELP slurm_queue_configuring Configuring jobs in the cluster +# TYPE slurm_queue_configuring gauge +slurm_queue_configuring 0 +# HELP slurm_queue_failed Number of failed jobs 
+# TYPE slurm_queue_failed gauge +slurm_queue_failed 0 +# HELP slurm_queue_node_fail Number of jobs stopped due to node fail +# TYPE slurm_queue_node_fail gauge +slurm_queue_node_fail 0 +# HELP slurm_queue_pending Pending jobs in queue +# TYPE slurm_queue_pending gauge +slurm_queue_pending 0 +# HELP slurm_queue_pending_dependency Pending jobs because of dependency in queue +# TYPE slurm_queue_pending_dependency gauge +slurm_queue_pending_dependency 0 +# HELP slurm_queue_preempted Number of preempted jobs +# TYPE slurm_queue_preempted gauge +slurm_queue_preempted 0 +# HELP slurm_queue_running Running jobs in the cluster +# TYPE slurm_queue_running gauge +slurm_queue_running 0 +# HELP slurm_queue_suspended Suspended jobs in the cluster +# TYPE slurm_queue_suspended gauge +slurm_queue_suspended 0 +# HELP slurm_queue_timeout Jobs stopped by timeout +# TYPE slurm_queue_timeout gauge +slurm_queue_timeout 0 +# HELP slurm_scheduler_backfill_depth_mean Information provided by the Slurm sdiag command, scheduler backfill mean depth +# TYPE slurm_scheduler_backfill_depth_mean gauge +slurm_scheduler_backfill_depth_mean 0 +# HELP slurm_scheduler_backfill_last_cycle Information provided by the Slurm sdiag command, scheduler backfill last cycle time in (microseconds) +# TYPE slurm_scheduler_backfill_last_cycle gauge +slurm_scheduler_backfill_last_cycle 0 +# HELP slurm_scheduler_backfill_mean_cycle Information provided by the Slurm sdiag command, scheduler backfill mean cycle time in (microseconds) +# TYPE slurm_scheduler_backfill_mean_cycle gauge +slurm_scheduler_backfill_mean_cycle 481 +# HELP slurm_scheduler_backfilled_heterogeneous_total Information provided by the Slurm sdiag command, number of heterogeneous job components started thanks to backfilling since last Slurm start +# TYPE slurm_scheduler_backfilled_heterogeneous_total gauge +slurm_scheduler_backfilled_heterogeneous_total 0 +# HELP slurm_scheduler_backfilled_jobs_since_cycle_total Information provided by the Slurm 
sdiag command, number of jobs started thanks to backfilling since last time stats where reset +# TYPE slurm_scheduler_backfilled_jobs_since_cycle_total gauge +slurm_scheduler_backfilled_jobs_since_cycle_total 0 +# HELP slurm_scheduler_backfilled_jobs_since_start_total Information provided by the Slurm sdiag command, number of jobs started thanks to backfilling since last slurm start +# TYPE slurm_scheduler_backfilled_jobs_since_start_total gauge +slurm_scheduler_backfilled_jobs_since_start_total 0 +# HELP slurm_scheduler_cycle_per_minute Information provided by the Slurm sdiag command, number scheduler cycles per minute +# TYPE slurm_scheduler_cycle_per_minute gauge +slurm_scheduler_cycle_per_minute 1 +# HELP slurm_scheduler_dbd_queue_size Information provided by the Slurm sdiag command, length of the DBD agent queue +# TYPE slurm_scheduler_dbd_queue_size gauge +slurm_scheduler_dbd_queue_size 0 +# HELP slurm_scheduler_last_cycle Information provided by the Slurm sdiag command, scheduler last cycle time in (microseconds) +# TYPE slurm_scheduler_last_cycle gauge +slurm_scheduler_last_cycle 40 +# HELP slurm_scheduler_mean_cycle Information provided by the Slurm sdiag command, scheduler mean cycle time in (microseconds) +# TYPE slurm_scheduler_mean_cycle gauge +slurm_scheduler_mean_cycle 481 +# HELP slurm_scheduler_queue_size Information provided by the Slurm sdiag command, length of the scheduler queue +# TYPE slurm_scheduler_queue_size gauge +slurm_scheduler_queue_size 0 +# HELP slurm_scheduler_threads Information provided by the Slurm sdiag command, number of scheduler threads +# TYPE slurm_scheduler_threads gauge +slurm_scheduler_threads 4 +``` diff --git a/snap/snapcraft.yaml b/snap/snapcraft.yaml new file mode 100644 index 0000000..c9a5112 --- /dev/null +++ b/snap/snapcraft.yaml @@ -0,0 +1,27 @@ +name: prometheus-slurm-exporter +summary: Prometheus Slurm Exporter +description: | + Prometheus collector and exporter for metrics extracted from the Slurm resource 
scheduling system. + +adopt-info: prometheus-slurm-exporter + +grade: stable +confinement: classic + +base: core20 + +apps: + prometheus-slurm-exporter: + daemon: simple + environment: + PATH: $PATH:/snap/bin + command: bin/prometheus-slurm-exporter + +parts: + prometheus-slurm-exporter: + source: https://github.com/vpenso/prometheus-slurm-exporter.git + plugin: go + go-channel: 1.14/stable + override-build: | + snapcraftctl build + snapcraftctl set-version `git describe --tags` From 9e20cbb27d57b8abd386f8aa048d2b3d42cccd58 Mon Sep 17 00:00:00 2001 From: jamesbeedy Date: Sun, 16 Aug 2020 18:58:50 +0000 Subject: [PATCH 03/82] enhance readme --- packaging/snap/README.md | 229 ++++++--------------------------------- 1 file changed, 31 insertions(+), 198 deletions(-) diff --git a/packaging/snap/README.md b/packaging/snap/README.md index 7d25d95..db9a3e4 100644 --- a/packaging/snap/README.md +++ b/packaging/snap/README.md @@ -1,4 +1,7 @@ # Building the prometheus-slurm-exporter snap +Packaging and delivering the prometheus-slurm-exporter as a snap provides users of prometheus-slurm-exporter +a hardened, streamlined, and idempotent experience when consuming this software. See [snapcraft](https://snapcraft.io/) for more information on snaps. + ### Prereqs * [snapcraft](https://snapcraft.io) @@ -11,10 +14,15 @@ ``` ### Build -From the root of this project: +From the root of this project, build the snap: ```bash snapcraft --use-lxd ``` +Once the snap build has completed, list the current working directory to see the resultant snap artifact. +```bash +$ ls -la *.snap +-rw-r--r-- 1 bdx bdx 5562368 Aug 16 18:19 prometheus-slurm-exporter_0.11-1-g01dd959_amd64.snap +``` ### Install locally built snap ```bash @@ -22,138 +30,21 @@ sudo snap install prometheus-slurm-exporter_`git describe --tags`_amd64.snap ``` ### Verify install -Curl the metrics endpoint to verify things are working. +Use `ps` to verify the process is running. 
+```bash +$ ps aux | grep prometheus | head -1 +root 2271391 0.0 0.0 1453596 14012 ? SLsl 18:32 0:00 /snap/prometheus-slurm-exporter/x1/bin/prometheus-slurm-exporter +``` + +Use `netstat` to verify that the installed `prometheus-slurm-exporter` snap process is listening on port 8080. +```bash +$ sudo netstat -peanut | grep prometheus +tcp6 0 0 :::8080 :::* LISTEN 0 15042010 2271391/prometheus-slurm-exporter +``` + +Lastly, curl the metrics endpoint. ```bash $ curl 127.0.0.1:8080/metrics -# HELP go_gc_duration_seconds A summary of the GC invocation durations. -# TYPE go_gc_duration_seconds summary -go_gc_duration_seconds{quantile="0"} 0 -go_gc_duration_seconds{quantile="0.25"} 0 -go_gc_duration_seconds{quantile="0.5"} 0 -go_gc_duration_seconds{quantile="0.75"} 0 -go_gc_duration_seconds{quantile="1"} 0 -go_gc_duration_seconds_sum 0 -go_gc_duration_seconds_count 0 -# HELP go_goroutines Number of goroutines that currently exist. -# TYPE go_goroutines gauge -go_goroutines 11 -# HELP go_info Information about the Go environment. -# TYPE go_info gauge -go_info{version="go1.14.7"} 1 -# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use. -# TYPE go_memstats_alloc_bytes gauge -go_memstats_alloc_bytes 2.639e+06 -# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed. -# TYPE go_memstats_alloc_bytes_total counter -go_memstats_alloc_bytes_total 2.639e+06 -# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. -# TYPE go_memstats_buck_hash_sys_bytes gauge -go_memstats_buck_hash_sys_bytes 3698 -# HELP go_memstats_frees_total Total number of frees. -# TYPE go_memstats_frees_total counter -go_memstats_frees_total 1668 -# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started. 
-# TYPE go_memstats_gc_cpu_fraction gauge -go_memstats_gc_cpu_fraction 0 -# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. -# TYPE go_memstats_gc_sys_bytes gauge -go_memstats_gc_sys_bytes 3.436808e+06 -# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use. -# TYPE go_memstats_heap_alloc_bytes gauge -go_memstats_heap_alloc_bytes 2.639e+06 -# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. -# TYPE go_memstats_heap_idle_bytes gauge -go_memstats_heap_idle_bytes 6.2619648e+07 -# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. -# TYPE go_memstats_heap_inuse_bytes gauge -go_memstats_heap_inuse_bytes 3.899392e+06 -# HELP go_memstats_heap_objects Number of allocated objects. -# TYPE go_memstats_heap_objects gauge -go_memstats_heap_objects 17262 -# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. -# TYPE go_memstats_heap_released_bytes gauge -go_memstats_heap_released_bytes 6.258688e+07 -# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. -# TYPE go_memstats_heap_sys_bytes gauge -go_memstats_heap_sys_bytes 6.651904e+07 -# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection. -# TYPE go_memstats_last_gc_time_seconds gauge -go_memstats_last_gc_time_seconds 0 -# HELP go_memstats_lookups_total Total number of pointer lookups. -# TYPE go_memstats_lookups_total counter -go_memstats_lookups_total 0 -# HELP go_memstats_mallocs_total Total number of mallocs. -# TYPE go_memstats_mallocs_total counter -go_memstats_mallocs_total 18930 -# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. -# TYPE go_memstats_mcache_inuse_bytes gauge -go_memstats_mcache_inuse_bytes 13888 -# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. 
-# TYPE go_memstats_mcache_sys_bytes gauge -go_memstats_mcache_sys_bytes 16384 -# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. -# TYPE go_memstats_mspan_inuse_bytes gauge -go_memstats_mspan_inuse_bytes 89624 -# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. -# TYPE go_memstats_mspan_sys_bytes gauge -go_memstats_mspan_sys_bytes 98304 -# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. -# TYPE go_memstats_next_gc_bytes gauge -go_memstats_next_gc_bytes 4.473924e+06 -# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. -# TYPE go_memstats_other_sys_bytes gauge -go_memstats_other_sys_bytes 1.771662e+06 -# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator. -# TYPE go_memstats_stack_inuse_bytes gauge -go_memstats_stack_inuse_bytes 589824 -# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. -# TYPE go_memstats_stack_sys_bytes gauge -go_memstats_stack_sys_bytes 589824 -# HELP go_memstats_sys_bytes Number of bytes obtained from system. -# TYPE go_memstats_sys_bytes gauge -go_memstats_sys_bytes 7.243572e+07 -# HELP go_threads Number of OS threads created. -# TYPE go_threads gauge -go_threads 10 -# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. -# TYPE process_cpu_seconds_total counter -process_cpu_seconds_total 0.22 -# HELP process_max_fds Maximum number of open file descriptors. -# TYPE process_max_fds gauge -process_max_fds 1024 -# HELP process_open_fds Number of open file descriptors. -# TYPE process_open_fds gauge -process_open_fds 16 -# HELP process_resident_memory_bytes Resident memory size in bytes. -# TYPE process_resident_memory_bytes gauge -process_resident_memory_bytes 1.179648e+07 -# HELP process_start_time_seconds Start time of the process since unix epoch in seconds. 
-# TYPE process_start_time_seconds gauge -process_start_time_seconds 1.59760273488e+09 -# HELP process_virtual_memory_bytes Virtual memory size in bytes. -# TYPE process_virtual_memory_bytes gauge -process_virtual_memory_bytes 1.33564416e+09 -# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes. -# TYPE process_virtual_memory_max_bytes gauge -process_virtual_memory_max_bytes -1 -# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served. -# TYPE promhttp_metric_handler_requests_in_flight gauge -promhttp_metric_handler_requests_in_flight 1 -# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code. -# TYPE promhttp_metric_handler_requests_total counter -promhttp_metric_handler_requests_total{code="200"} 1 -promhttp_metric_handler_requests_total{code="500"} 0 -promhttp_metric_handler_requests_total{code="503"} 0 -# HELP slurm_cpus_alloc Allocated CPUs -# TYPE slurm_cpus_alloc gauge -slurm_cpus_alloc 0 -# HELP slurm_cpus_idle Idle CPUs -# TYPE slurm_cpus_idle gauge -slurm_cpus_idle 8 -# HELP slurm_cpus_other Mix CPUs -# TYPE slurm_cpus_other gauge -slurm_cpus_other 0 -# HELP slurm_cpus_total Total CPUs # TYPE slurm_cpus_total gauge slurm_cpus_total 8 # HELP slurm_nodes_alloc Allocated nodes @@ -177,67 +68,9 @@ slurm_nodes_fail 0 # HELP slurm_nodes_idle Idle nodes # TYPE slurm_nodes_idle gauge slurm_nodes_idle 1 -# HELP slurm_nodes_maint Maint nodes -# TYPE slurm_nodes_maint gauge -slurm_nodes_maint 0 -# HELP slurm_nodes_mix Mix nodes -# TYPE slurm_nodes_mix gauge -slurm_nodes_mix 0 -# HELP slurm_nodes_resv Reserved nodes -# TYPE slurm_nodes_resv gauge -slurm_nodes_resv 0 -# HELP slurm_queue_cancelled Cancelled jobs in the cluster -# TYPE slurm_queue_cancelled gauge -slurm_queue_cancelled 0 -# HELP slurm_queue_completed Completed jobs in the cluster -# TYPE slurm_queue_completed gauge -slurm_queue_completed 0 -# HELP slurm_queue_completing Completing jobs in the cluster 
-# TYPE slurm_queue_completing gauge -slurm_queue_completing 0 -# HELP slurm_queue_configuring Configuring jobs in the cluster -# TYPE slurm_queue_configuring gauge -slurm_queue_configuring 0 -# HELP slurm_queue_failed Number of failed jobs -# TYPE slurm_queue_failed gauge -slurm_queue_failed 0 -# HELP slurm_queue_node_fail Number of jobs stopped due to node fail -# TYPE slurm_queue_node_fail gauge -slurm_queue_node_fail 0 -# HELP slurm_queue_pending Pending jobs in queue -# TYPE slurm_queue_pending gauge -slurm_queue_pending 0 -# HELP slurm_queue_pending_dependency Pending jobs because of dependency in queue -# TYPE slurm_queue_pending_dependency gauge -slurm_queue_pending_dependency 0 -# HELP slurm_queue_preempted Number of preempted jobs -# TYPE slurm_queue_preempted gauge -slurm_queue_preempted 0 -# HELP slurm_queue_running Running jobs in the cluster -# TYPE slurm_queue_running gauge -slurm_queue_running 0 -# HELP slurm_queue_suspended Suspended jobs in the cluster -# TYPE slurm_queue_suspended gauge -slurm_queue_suspended 0 -# HELP slurm_queue_timeout Jobs stopped by timeout -# TYPE slurm_queue_timeout gauge -slurm_queue_timeout 0 -# HELP slurm_scheduler_backfill_depth_mean Information provided by the Slurm sdiag command, scheduler backfill mean depth -# TYPE slurm_scheduler_backfill_depth_mean gauge -slurm_scheduler_backfill_depth_mean 0 -# HELP slurm_scheduler_backfill_last_cycle Information provided by the Slurm sdiag command, scheduler backfill last cycle time in (microseconds) -# TYPE slurm_scheduler_backfill_last_cycle gauge -slurm_scheduler_backfill_last_cycle 0 -# HELP slurm_scheduler_backfill_mean_cycle Information provided by the Slurm sdiag command, scheduler backfill mean cycle time in (microseconds) -# TYPE slurm_scheduler_backfill_mean_cycle gauge -slurm_scheduler_backfill_mean_cycle 481 -# HELP slurm_scheduler_backfilled_heterogeneous_total Information provided by the Slurm sdiag command, number of heterogeneous job components started thanks to 
backfilling since last Slurm start -# TYPE slurm_scheduler_backfilled_heterogeneous_total gauge -slurm_scheduler_backfilled_heterogeneous_total 0 -# HELP slurm_scheduler_backfilled_jobs_since_cycle_total Information provided by the Slurm sdiag command, number of jobs started thanks to backfilling since last time stats where reset -# TYPE slurm_scheduler_backfilled_jobs_since_cycle_total gauge -slurm_scheduler_backfilled_jobs_since_cycle_total 0 -# HELP slurm_scheduler_backfilled_jobs_since_start_total Information provided by the Slurm sdiag command, number of jobs started thanks to backfilling since last slurm start + +... + # TYPE slurm_scheduler_backfilled_jobs_since_start_total gauge slurm_scheduler_backfilled_jobs_since_start_total 0 # HELP slurm_scheduler_cycle_per_minute Information provided by the Slurm sdiag command, number scheduler cycles per minute @@ -252,10 +85,10 @@ slurm_scheduler_last_cycle 40 # HELP slurm_scheduler_mean_cycle Information provided by the Slurm sdiag command, scheduler mean cycle time in (microseconds) # TYPE slurm_scheduler_mean_cycle gauge slurm_scheduler_mean_cycle 481 -# HELP slurm_scheduler_queue_size Information provided by the Slurm sdiag command, length of the scheduler queue -# TYPE slurm_scheduler_queue_size gauge -slurm_scheduler_queue_size 0 -# HELP slurm_scheduler_threads Information provided by the Slurm sdiag command, number of scheduler threads -# TYPE slurm_scheduler_threads gauge -slurm_scheduler_threads 4 +... 
+``` + +To uninstall the prometheus-slurm-exporter snap: +```bash +sudo snap remove prometheus-slurm-exporter ``` From 2654586d765b48c5520147e790fcbbf8a1a14729 Mon Sep 17 00:00:00 2001 From: jamesbeedy Date: Sun, 16 Aug 2020 19:09:30 +0000 Subject: [PATCH 04/82] enhance readme --- packaging/snap/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/packaging/snap/README.md b/packaging/snap/README.md index db9a3e4..5ebef11 100644 --- a/packaging/snap/README.md +++ b/packaging/snap/README.md @@ -26,8 +26,10 @@ $ ls -la *.snap ### Install locally built snap ```bash -sudo snap install prometheus-slurm-exporter_`git describe --tags`_amd64.snap +sudo snap install prometheus-slurm-exporter_`git describe --tags`_amd64.snap --classic --dangerous ``` +* `--classic` - this snap need runs in classic mode to allow it to find the slurm commands in the system. +* `--dangerous` - because we are installing this snap from a local resource and sha can't be verified by the snapstore. ### Verify install Use `ps` to verify the process is running. From 78a2bb25b5ec962a150057ce6b1e6d776dc83c0d Mon Sep 17 00:00:00 2001 From: jamesbeedy Date: Mon, 17 Aug 2020 15:51:12 +0000 Subject: [PATCH 05/82] cleanup docs --- packaging/snap/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/packaging/snap/README.md b/packaging/snap/README.md index 5ebef11..c4cef12 100644 --- a/packaging/snap/README.md +++ b/packaging/snap/README.md @@ -28,7 +28,7 @@ $ ls -la *.snap ```bash sudo snap install prometheus-slurm-exporter_`git describe --tags`_amd64.snap --classic --dangerous ``` -* `--classic` - this snap need runs in classic mode to allow it to find the slurm commands in the system. +* `--classic` - this snap uses classic confinement to allow it to find the slurm commands in the system. * `--dangerous` - because we are installing this snap from a local resource and sha can't be verified by the snapstore. 
### Verify install From 1277b79b2216a8cba6008de9292c20442d03625d Mon Sep 17 00:00:00 2001 From: Matthew Tse <66440247+mtpdt@users.noreply.github.com> Date: Tue, 25 Aug 2020 16:31:08 -0400 Subject: [PATCH 06/82] Utilize a faster node metric query method. --- nodes.go | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/nodes.go b/nodes.go index f20b2c9..f8001eb 100644 --- a/nodes.go +++ b/nodes.go @@ -22,6 +22,7 @@ import ( "os/exec" "regexp" "sort" + "strconv" "strings" ) @@ -67,7 +68,9 @@ func ParseNodesMetrics(input []byte) *NodesMetrics { for _, line := range lines_uniq { if strings.Contains(line, ",") { - state := strings.Split(line, ",")[1] + split := strings.Split(line, ",") + count, _ := strconv.ParseFloat(strings.TrimSpace(split[0]), 64) + state := split[1] alloc := regexp.MustCompile(`^alloc`) comp := regexp.MustCompile(`^comp`) down := regexp.MustCompile(`^down`) @@ -80,25 +83,25 @@ func ParseNodesMetrics(input []byte) *NodesMetrics { resv := regexp.MustCompile(`^res`) switch { case alloc.MatchString(state) == true: - nm.alloc++ + nm.alloc += count case comp.MatchString(state) == true: - nm.comp++ + nm.comp += count case down.MatchString(state) == true: - nm.down++ + nm.down += count case drain.MatchString(state) == true: - nm.drain++ + nm.drain += count case fail.MatchString(state) == true: - nm.fail++ + nm.fail += count case err.MatchString(state) == true: - nm.err++ + nm.err += count case idle.MatchString(state) == true: - nm.idle++ + nm.idle += count case maint.MatchString(state) == true: - nm.maint++ + nm.maint += count case mix.MatchString(state) == true: - nm.mix++ + nm.mix += count case resv.MatchString(state) == true: - nm.resv++ + nm.resv += count } } } @@ -107,7 +110,7 @@ func ParseNodesMetrics(input []byte) *NodesMetrics { // Execute the sinfo command and return its output func NodesData() []byte { - cmd := exec.Command("sinfo", "-h", "-o %n,%T") + cmd := exec.Command("sinfo", "-h", "-o %D,%T") 
stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) From 7a1021b28cfb1eb4fe8971932ad72f7c05162a82 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Sat, 29 Aug 2020 18:49:55 +0200 Subject: [PATCH 07/82] README: add note about Snap packaging --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 2036339..7d0936d 100644 --- a/README.md +++ b/README.md @@ -75,6 +75,10 @@ counted with this parameter almost always indicates three issues: Consult the [following document](packaging/rpm/README.md) under the ``packaging/rpm`` subdirectory. +## Distribute the exporter as a Snap package + +Consult the [following document](packaging/snap/README.md). **NOTE**: this method requires the use of [Snap](https://snapcraft.io), which is built by [Canonical](https://canonical.com). + ## How to build the exporter from the sources ### Debian From 23e5207ede0da9dc37d9b8ced193edae211a5b86 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 08:56:34 +0200 Subject: [PATCH 08/82] simple example for setup of an development environment, remove some useless files --- CONTRIBUTORS.md | 5 ----- DEVELOPMENT.md | 37 +++++++++++++++++++++++++++++++++++++ source_me.sh | 43 ------------------------------------------- 3 files changed, 37 insertions(+), 48 deletions(-) delete mode 100644 CONTRIBUTORS.md create mode 100644 DEVELOPMENT.md delete mode 100644 source_me.sh diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md deleted file mode 100644 index ed1a43b..0000000 --- a/CONTRIBUTORS.md +++ /dev/null @@ -1,5 +0,0 @@ -# List of Contributors - -* [Victor Penso](https://github.com/vpenso) -* [Matteo Dessalvi](https://github.com/mtds) - diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md new file mode 100644 index 0000000..4843b83 --- /dev/null +++ b/DEVELOPMENT.md @@ -0,0 +1,37 @@ +Install Go from source: + +```bash +export VERSION=1.13 OS=linux ARCH=amd64 +wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz +tar -xzvf 
go$VERSION.$OS-$ARCH.tar.gz +export PATH=$PWD/go/bin:$PATH +``` + +Development: + +```bash +# clone the source code +git clone https://github.com/vpenso/prometheus-slurm-exporter.git +cd prometheus-slurm-exporter +# download dependencies +export GOPATH=$PWD/go/modules +go mod download +``` + +Build and executer the exporter: + +``` +# build the exporter +go build -o bin/prometheus-slurm-exporter {main,cpus,nodes,queue,scheduler}.go +# start the exporter (foreground) +bin/prometheus-slurm-exporter +... +# query all metrics (default port) +curl http://localhost:8080/metrics +``` + +Run all tests included in `_test.go` files: + +```bash +go test -v *.go +``` diff --git a/source_me.sh b/source_me.sh deleted file mode 100644 index 9640e02..0000000 --- a/source_me.sh +++ /dev/null @@ -1,43 +0,0 @@ -# -# Copyright 2012-2017 Victor Penso -# -# This program is free software: you can redistribute it and/or modify -# it under the terms of the GNU General Public License as published by -# the Free Software Foundation, either version 3 of the License, or -# (at your option) any later version. -# -# This program is distributed in the hope that it will be useful, -# but WITHOUT ANY WARRANTY; without even the implied warranty of -# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -# GNU General Public License for more details. -# -# You should have received a copy of the GNU General Public License -# along with this program. If not, see <http://www.gnu.org/licenses/>.
-# - -# Find the correct path even if dereferenced by a link -__source=$0 - -if [[ "$__source" == *bash* ]]; then - __source=${BASH_SOURCE[0]} -fi - -__dir="$( dirname $__source )" -while [ -h $__source ] -do - __source="$( readlink "$__source" )" - [[ $__source != /* ]] && __source="$__dir/$__source" - __dir="$( cd -P "$( dirname "$__source" )" && pwd )" -done -__dir="$( cd -P "$( dirname "$__source" )" && pwd )" - -export SCRIPTS=$__dir - -unset __dir -unset __source - -export GOPATH=$SCRIPTS:/usr/share/gocode -export PATH=$SCRIPTS/bin:$PATH - -PATH=$(echo "$PATH" | awk -v RS=':' -v ORS=":" '!a[$1]++{if (NR > 1) printf ORS; printf $a[$1]}') - From 7b8c796e4360bb5019f3a1251a15e819d6bf4835 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 09:04:40 +0200 Subject: [PATCH 09/82] elaborate a bit more on the development setup... --- DEVELOPMENT.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 4843b83..413840d 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -1,3 +1,8 @@ +## Development + +Setup the development environment on a node with access to the Slurm user +commnad-line interface, in particular with the `sinfo` and `squeue` commands. 
+ Install Go from source: ```bash @@ -7,7 +12,10 @@ tar -xzvf go$VERSION.$OS-$ARCH.tar.gz export PATH=$PWD/go/bin:$PATH ``` -Development: +_Alternatively install Go from a package of your Linux distribution._ + +Use Git to clone the source code the exporter, and download all Go dependency +libraries: ```bash # clone the source code @@ -18,9 +26,9 @@ export GOPATH=$PWD/go/modules go mod download ``` -Build and executer the exporter: +### Build -``` +```bash # build the exporter go build -o bin/prometheus-slurm-exporter {main,cpus,nodes,queue,scheduler}.go # start the exporter (foreground) From be41c934b6e61d196d7c96f9c17ca016730865ff Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 09:07:59 +0200 Subject: [PATCH 10/82] remove build instructions... --- README.md | 163 ------------------------------------------------------ 1 file changed, 163 deletions(-) diff --git a/README.md b/README.md index 7d0936d..80c763d 100644 --- a/README.md +++ b/README.md @@ -79,169 +79,6 @@ Consult the [following document](packaging/rpm/README.md) under the ``packaging/ Consult the [following document](packaging/snap/README.md). **NOTE**: this method requires the use of [Snap](https://snapcraft.io), which is built by [Canonical](https://canonical.com). -## How to build the exporter from the sources - -### Debian - -Install the Prometheus [Go client library](https://github.com/prometheus/client_golang) - - >>> apt install golang-github-prometheus-client-golang-dev - -Use the [Makefile](Makefile) to build and test the code. - -**Debian Jessie**: in this release, the Prometheus client library package was available only through the backport archives but the Debian maintainers discontinued it, as explained [here](https://lists.debian.org/debian-backports-announce/2018/07/msg00000.html). Now only __Debian Stretch__ is supported with the previous build method. - -### CentOS - -Under CentOS not all the GOlang dependencies are available as packages. 
- -**GOPATH**: Since ``go`` version _1.13_ it is better to host the modules in a separate directory otherwise this will generate an error message: _$GOPATH/go.mod exists but should not_ - -In order to use the [Makefile](Makefile) provided with this repository you can proceed as follows: - -1. Install the Golang compiler plus GIT and make: -```bash -yum install git golang-bin make -``` - -2. Clone this repo and change into the source directory: -```bash -git clone https://github.com/vpenso/prometheus-slurm-exporter.git -cd prometheus-slurm-exporter -``` - -3. Build a module cache to host the necessary Golang dependencies using the [Go modules](https://blog.golang.org/using-go-modules): -```bash -GOPATH=/tmp/go-modules-cache go mod download -go: finding github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751 -go: finding github.com/alecthomas/units v0.0.0-20190717042225-c3de453c63f4 -go: finding github.com/beorn7/perks v1.0.1 -go: finding github.com/cespare/xxhash/v2 v2.1.0 -go: finding github.com/davecgh/go-spew v1.1.1 -go: finding github.com/go-kit/kit v0.9.0 -go: finding github.com/go-logfmt/logfmt v0.4.0 -go: finding github.com/go-stack/stack v1.8.0 -go: finding github.com/gogo/protobuf v1.1.1 -go: finding github.com/golang/protobuf v1.3.2 -go: finding github.com/google/go-cmp v0.3.0 -go: finding github.com/google/gofuzz v1.0.0 -go: finding github.com/json-iterator/go v1.1.7 -go: finding github.com/julienschmidt/httprouter v1.2.0 -go: finding github.com/konsorten/go-windows-terminal-sequences v1.0.1 -go: finding github.com/kr/logfmt v0.0.0-20140226030751-b84e30acd515 -go: finding github.com/matttproud/golang_protobuf_extensions v1.0.1 -go: finding github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd -go: finding github.com/modern-go/reflect2 v1.0.1 -go: finding github.com/mwitkow/go-conntrack v0.0.0-20161129095857-cc309e4a2223 -go: finding github.com/pkg/errors v0.8.1 -go: finding github.com/pmezard/go-difflib v1.0.0 -go: finding 
github.com/prometheus/client_golang v1.2.1 -go: finding github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4 -go: finding github.com/prometheus/common v0.7.0 -go: finding github.com/prometheus/procfs v0.0.5 -go: finding github.com/sirupsen/logrus v1.4.2 -go: finding github.com/stretchr/objx v0.1.1 -go: finding github.com/stretchr/testify v1.3.0 -go: finding golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2 -go: finding golang.org/x/net v0.0.0-20190613194153-d28f0bde5980 -go: finding golang.org/x/sync v0.0.0-20181221193216-37e7f081c4d4 -go: finding golang.org/x/sys v0.0.0-20191010194322-b09406accb47 -go: finding golang.org/x/text v0.3.0 -go: finding gopkg.in/alecthomas/kingpin.v2 v2.2.6 -go: finding gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 -go: finding gopkg.in/yaml.v2 v2.2.2 -``` - -4. Build the executable binary: -```bash -go build -go: downloading github.com/prometheus/client_golang v1.2.1 -go: downloading github.com/prometheus/common v0.7.0 -go: extracting github.com/prometheus/common v0.7.0 -go: downloading github.com/sirupsen/logrus v1.4.2 -go: downloading gopkg.in/alecthomas/kingpin.v2 v2.2.6 -go: extracting github.com/prometheus/client_golang v1.2.1 -go: downloading github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4 -go: downloading github.com/beorn7/perks v1.0.1 -go: downloading github.com/prometheus/procfs v0.0.5 -go: downloading github.com/cespare/xxhash/v2 v2.1.0 -go: downloading github.com/golang/protobuf v1.3.2 -go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.1 -go: extracting github.com/beorn7/perks v1.0.1 -go: extracting gopkg.in/alecthomas/kingpin.v2 v2.2.6 -go: downloading github.com/alecthomas/units v0.0.0-20190717042225-c3de453c63f4 -go: downloading github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751 -go: extracting github.com/sirupsen/logrus v1.4.2 -go: extracting github.com/cespare/xxhash/v2 v2.1.0 -go: downloading golang.org/x/sys 
v0.0.0-20191010194322-b09406accb47 -go: extracting github.com/alecthomas/units v0.0.0-20190717042225-c3de453c63f4 -go: extracting github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4 -go: extracting github.com/matttproud/golang_protobuf_extensions v1.0.1 -go: extracting github.com/prometheus/procfs v0.0.5 -go: extracting github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751 -go: extracting github.com/golang/protobuf v1.3.2 -go: extracting golang.org/x/sys v0.0.0-20191010194322-b09406accb47 -go: finding github.com/prometheus/client_golang v1.2.1 -go: finding github.com/prometheus/common v0.7.0 -go: finding github.com/sirupsen/logrus v1.4.2 -go: finding gopkg.in/alecthomas/kingpin.v2 v2.2.6 -go: finding github.com/beorn7/perks v1.0.1 -go: finding github.com/cespare/xxhash/v2 v2.1.0 -go: finding github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4 -go: finding github.com/golang/protobuf v1.3.2 -go: finding github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751 -go: finding golang.org/x/sys v0.0.0-20191010194322-b09406accb47 -go: finding github.com/alecthomas/units v0.0.0-20190717042225-c3de453c63f4 -go: finding github.com/prometheus/procfs v0.0.5 -go: finding github.com/matttproud/golang_protobuf_extensions v1.0.1 -``` - -5. Run the test ( **optional** ): if Slurm command line tools (``sinfo``, ``squeue``, etc.) are not available the test will fail! 
-```bash -GOPATH=/tmp/gopath-for-cache make test -=== RUN TestCPUsMetrics ---- PASS: TestCPUsMetrics (0.00s) - cpus_test.go:29: &{alloc:5725 idle:877 other:34 total:6636} -=== RUN TestCPUssGetMetrics ---- PASS: TestCPUssGetMetrics (0.01s) - cpus_test.go:33: &{alloc:18956 idle:7852 other:12408 total:39216} -=== RUN TestNodesMetrics ---- PASS: TestNodesMetrics (0.03s) - nodes_test.go:29: &{alloc:250 comp:0 down:67 drain:28 err:0 fail:1 idle:319 maint:0 mix:44 resv:0} -=== RUN TestNodesGetMetrics ---- PASS: TestNodesGetMetrics (0.10s) - nodes_test.go:33: &{alloc:328 comp:0 down:230 drain:66 err:0 fail:0 idle:53 maint:0 mix:71 resv:0} -=== RUN TestParseQueueMetrics ---- PASS: TestParseQueueMetrics (0.01s) - queue_test.go:29: &{pending:4 pending_dep:0 running:28 suspended:1 cancelled:1 completing:2 completed:1 configuring:1 failed:1 timeout:1 preempted:1 node_fail:1} -=== RUN TestQueueGetMetrics ---- PASS: TestQueueGetMetrics (0.28s) - queue_test.go:33: &{pending:8280 pending_dep:3 running:7132 suspended:0 cancelled:1 completing:0 completed:180 configuring:0 failed:245 timeout:2 preempted:0 node_fail:0} -=== RUN TestSchedulerMetrics ---- PASS: TestSchedulerMetrics (0.02s) - scheduler_test.go:29: &{threads:3 queue_size:0 last_cycle:97209 mean_cycle:74593 cycle_per_minute:63 backfill_last_cycle:1.94289e+06 backfill_mean_cycle:1.96082e+06 backfill_depth_mean:29324} -=== RUN TestSchedulerGetMetrics ---- PASS: TestSchedulerGetMetrics (0.03s) - scheduler_test.go:33: &{threads:3 queue_size:0 last_cycle:20982 mean_cycle:32874 cycle_per_minute:23 backfill_last_cycle:991389 backfill_mean_cycle:1.7385e+06 backfill_depth_mean:11320} -PASS -ok github.com/vpenso/prometheus-slurm-exporter 0.495s -``` - -## Command line options - -The following is the list of the command line options available on this exporter: - -```bash -:~$ prometheus-slurm-exporter -h -Usage of ./prometheus-slurm-exporter: - -listen-address string - The address to listen on for HTTP requests. 
(default ":8080") - -log.format value - Set the log target and format. Example: "logger:syslog?appname=bob&local=7" or "logger:stdout?json=true" (default "logger:stderr") - -log.level value - Only log messages with the given severity or above. Valid levels: [debug, info, warn, error, fatal] (default "info") -``` - ## Installation After successfully ran ``make``, you will have a binary called ``prometheus-slurm-exporter`` under the ``bin/`` subdirectory in your local copy of this repository. You can now copy this binary wherever you have installed the Slurm utilities (sinfo,squeue, sdiag) and then put it into execution, either interactively or through a Systemd unit (an example is available [here](lib/systemd/prometheus-slurm-exporter.service)). From d0812b7254a249b2012f3a444164f95511731283 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 09:18:35 +0200 Subject: [PATCH 11/82] clean up documentation about packages --- README.md | 7 ------- packages/README.md | 8 ++++++++ {packaging => packages}/rpm/README.md | 0 .../rpm/prometheus-slurm-exporter.spec | 0 {packaging => packages}/snap/README.md | 0 5 files changed, 8 insertions(+), 7 deletions(-) create mode 100644 packages/README.md rename {packaging => packages}/rpm/README.md (100%) rename {packaging => packages}/rpm/prometheus-slurm-exporter.spec (100%) rename {packaging => packages}/snap/README.md (100%) diff --git a/README.md b/README.md index 80c763d..09d00a8 100644 --- a/README.md +++ b/README.md @@ -71,13 +71,6 @@ counted with this parameter almost always indicates three issues: * the database is either down or unreachable; * the status of the Slurm accounting DB may be inconsistent (e.g. ``sreport`` missing data, weird utilization of the cluster, etc.). -## How to build an RPM package from the relases - -Consult the [following document](packaging/rpm/README.md) under the ``packaging/rpm`` subdirectory. 
- -## Distribute the exporter as a Snap package - -Consult the [following document](packaging/snap/README.md). **NOTE**: this method requires the use of [Snap](https://snapcraft.io), which is built by [Canonical](https://canonical.com). ## Installation diff --git a/packages/README.md b/packages/README.md new file mode 100644 index 0000000..2764bb5 --- /dev/null +++ b/packages/README.md @@ -0,0 +1,8 @@ +# Packages + +* Build RPM packages from + [rpm/prometheus-slurm-exporter.spec](rpm/prometheus-slurm-exporter.spec) + following documentation in [rpm/README.md](rpm/README.md]). +* Build a [Snap](https://snapcraft.io) package from + [../snap/snapcraft.yaml](../snap/snapcraft.yaml) following documentation in + [snap/README.md](snap/README.md). diff --git a/packaging/rpm/README.md b/packages/rpm/README.md similarity index 100% rename from packaging/rpm/README.md rename to packages/rpm/README.md diff --git a/packaging/rpm/prometheus-slurm-exporter.spec b/packages/rpm/prometheus-slurm-exporter.spec similarity index 100% rename from packaging/rpm/prometheus-slurm-exporter.spec rename to packages/rpm/prometheus-slurm-exporter.spec diff --git a/packaging/snap/README.md b/packages/snap/README.md similarity index 100% rename from packaging/snap/README.md rename to packages/snap/README.md From 403a6156df8ce8fbf96f7cd0481d89009852d791 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 09:30:26 +0200 Subject: [PATCH 12/82] working on development docs... --- DEVELOPMENT.md | 32 +++++++++++++++++++++++++------- README.md | 7 ------- 2 files changed, 25 insertions(+), 14 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 413840d..22e499d 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -1,7 +1,6 @@ -## Development - Setup the development environment on a node with access to the Slurm user -commnad-line interface, in particular with the `sinfo` and `squeue` commands. 
+command-line interface, in particular with the `sinfo`, `squeue`, and `sdiag` +commands. Install Go from source: @@ -14,8 +13,8 @@ export PATH=$PWD/go/bin:$PATH _Alternatively install Go from a package of your Linux distribution._ -Use Git to clone the source code the exporter, and download all Go dependency -libraries: +Use Git to clone the source code of the exporter, and download all Go dependency +modules: ```bash # clone the source code @@ -28,18 +27,37 @@ go mod download ### Build +Build the exporter: + ```bash -# build the exporter go build -o bin/prometheus-slurm-exporter {main,cpus,nodes,queue,scheduler}.go -# start the exporter (foreground) +``` + +Start the exporter (foreground), and query all metrics: + +```bash bin/prometheus-slurm-exporter ... # query all metrics (default port) curl http://localhost:8080/metrics ``` +### Tests + Run all tests included in `_test.go` files: ```bash go test -v *.go ``` + +### Development + +References: + +* [GOlang Package Documentation](https://godoc.org/github.com/prometheus/client_golang/prometheus) +* [Metric Types](https://prometheus.io/docs/concepts/metric_types/) +* [Writing Exporters](https://prometheus.io/docs/instrumenting/writing_exporters/) +* [Available Exporters](https://prometheus.io/docs/instrumenting/exporters/) + + + diff --git a/README.md b/README.md index 09d00a8..1aa8818 100644 --- a/README.md +++ b/README.md @@ -121,13 +121,6 @@ The following are screenshots of the dashboard: ![Status of the Jobs](images/Job_Status.png) ![SLURM Scheduler Information](images/Scheduler_Info.png) -## Prometheus references - -* [GOlang Package Documentation](https://godoc.org/github.com/prometheus/client_golang/prometheus) -* [Metric Types](https://prometheus.io/docs/concepts/metric_types/) -* [Writing Exporters](https://prometheus.io/docs/instrumenting/writing_exporters/) -* [Available Exporters](https://prometheus.io/docs/instrumenting/exporters/) - ## License From 08b5cc0f8a454087c2f6bc0b9e39a9161c112179 Mon Sep 17 
00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 09:36:47 +0200 Subject: [PATCH 13/82] rewrite installation instructions --- README.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 1aa8818..74f7f27 100644 --- a/README.md +++ b/README.md @@ -74,7 +74,11 @@ counted with this parameter almost always indicates three issues: ## Installation -After successfully ran ``make``, you will have a binary called ``prometheus-slurm-exporter`` under the ``bin/`` subdirectory in your local copy of this repository. You can now copy this binary wherever you have installed the Slurm utilities (sinfo,squeue, sdiag) and then put it into execution, either interactively or through a Systemd unit (an example is available [here](lib/systemd/prometheus-slurm-exporter.service)). +Read [DEVELOPMENT.md](DEVELOPMENT.md) in order to build the Prometheus Slurm +Exporter. After a successful build copy the executable +`bin/prometheus-slurm-exporte` to a node with access to the Slurm command-line +interface. A [Systemd Unit][sdu] file to run the executable as service is +available in [lib/systemd/prometheus-slurm-exporter.service](lib/systemd/prometheus-slurm-exporter.service). 
## Prometheus Configuration for the SLURM exporter From b3efb876a83fbeb4d765f2250461bd6f77ebb9e3 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 09:41:01 +0200 Subject: [PATCH 14/82] cosmetics --- DEVELOPMENT.md | 13 ++++--------- README.md | 9 +++++---- 2 files changed, 9 insertions(+), 13 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 22e499d..1f0c138 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -42,14 +42,6 @@ bin/prometheus-slurm-exporter curl http://localhost:8080/metrics ``` -### Tests - -Run all tests included in `_test.go` files: - -```bash -go test -v *.go -``` - ### Development References: @@ -59,5 +51,8 @@ References: * [Writing Exporters](https://prometheus.io/docs/instrumenting/writing_exporters/) * [Available Exporters](https://prometheus.io/docs/instrumenting/exporters/) +Run all tests included in `_test.go` files: - +```bash +go test -v *.go +``` diff --git a/README.md b/README.md index 74f7f27..64ca918 100644 --- a/README.md +++ b/README.md @@ -117,18 +117,19 @@ Checking prometheus.yml ## Grafana Dashboard -A [dashboard](https://grafana.com/dashboards/4323) is available in order to visualize the exported metrics through [Grafana](https://grafana.com). - -The following are screenshots of the dashboard: +A [dashboard](https://grafana.com/dashboards/4323) is available in order to +visualize the exported metrics through [Grafana](https://grafana.com): ![Status of the Nodes](images/Node_Status.png) + ![Status of the Jobs](images/Job_Status.png) + ![SLURM Scheduler Information](images/Scheduler_Info.png) ## License -Copyright 2017 Victor Penso, Matteo Dessalvi +Copyright 2017-2020 Victor Penso, Matteo Dessalvi This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
From 54e5e201bdf422e35412d010655b4927b53876ef Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 09:42:27 +0200 Subject: [PATCH 15/82] add missing link to systemd service units --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 64ca918..18614d5 100644 --- a/README.md +++ b/README.md @@ -80,6 +80,8 @@ Exporter. After a successful build copy the executable interface. A [Systemd Unit][sdu] file to run the executable as service is available in [lib/systemd/prometheus-slurm-exporter.service](lib/systemd/prometheus-slurm-exporter.service). +[sdu]: https://www.freedesktop.org/software/systemd/man/systemd.service.html + ## Prometheus Configuration for the SLURM exporter It is strongly advisable to configure the Prometheus server with the following parameters: From af888b747f1962729734911e7c803d9524fbcc13 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 11:17:18 +0200 Subject: [PATCH 16/82] start to work on collecting account metrics --- accounts.go | 55 ++++++++++++++++++++++++++++++++++++++++++++++++ accounts_test.go | 24 +++++++++++++++++++++ 2 files changed, 79 insertions(+) create mode 100644 accounts.go create mode 100644 accounts_test.go diff --git a/accounts.go b/accounts.go new file mode 100644 index 0000000..ead8a47 --- /dev/null +++ b/accounts.go @@ -0,0 +1,55 @@ +/* Copyright 2020 Victor Penso + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see <http://www.gnu.org/licenses/>.
*/ + +package main + +import ( + "io/ioutil" + "os/exec" + "log" + "strings" +) + +func AccountsData() []byte { + cmd := exec.Command("squeue", "-o '%A|%a|%u|%T|%C'") + stdout, err := cmd.StdoutPipe() + if err != nil { + log.Fatal(err) + } + if err := cmd.Start(); err != nil { + log.Fatal(err) + } + out, _ := ioutil.ReadAll(stdout) + if err := cmd.Wait(); err != nil { + log.Fatal(err) + } + return out +} + +type AccountMetrics struct { + jobs float64 +} + +func ParseAccountsMetrics(input []byte) map[string]*AccountMetrics { + accounts := make(map[string]*AccountMetrics) + lines := strings.Split(string(input), "\n") + for _, line := range lines { + if strings.Contains(line,"|") { + log.Debug(line) + } + } + return accounts +} + diff --git a/accounts_test.go b/accounts_test.go new file mode 100644 index 0000000..6853ab3 --- /dev/null +++ b/accounts_test.go @@ -0,0 +1,24 @@ +/* Copyright 2020 Victor Penso + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see <http://www.gnu.org/licenses/>.
*/ + +package main + +import ( + "testing" +) + +func TestParseAccountsMetrics(t *testing.T) { + t.Logf("%+v", ParseAccountsMetrics(AccountsData())) +} From 1cde386b4c149d78d4710407f9c0b65532d5eed9 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 7 Oct 2020 14:39:54 +0200 Subject: [PATCH 17/82] parsing slurm account data --- accounts.go | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/accounts.go b/accounts.go index ead8a47..660af35 100644 --- a/accounts.go +++ b/accounts.go @@ -23,7 +23,7 @@ import ( ) func AccountsData() []byte { - cmd := exec.Command("squeue", "-o '%A|%a|%u|%T|%C'") + cmd := exec.Command("squeue", "-h", "-o '%A|%a|%T'") stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) @@ -39,15 +39,30 @@ func AccountsData() []byte { } type AccountMetrics struct { - jobs float64 + resv float64 } -func ParseAccountsMetrics(input []byte) map[string]*AccountMetrics { - accounts := make(map[string]*AccountMetrics) +func ParseAccountsMetrics(input []byte) map[string]map[string]int { + accounts := make(map[string]map[string]int) lines := strings.Split(string(input), "\n") for _, line := range lines { if strings.Contains(line,"|") { - log.Debug(line) + log.Print(line) + + account := strings.Split(line,"|")[1] + _,key := accounts[account] + if !key { + accounts[account] = make(map[string]int) + } + + state := strings.Split(line,"|")[2] + state = strings.ToLower(state) + _,key = accounts[account][state] + if !key { + accounts[account][state] = 1 + } else { + accounts[account][state] += 1 + } } } return accounts From 7327cb31450559d6011910127eaa8e6fa3694ad7 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 8 Oct 2020 12:50:12 +0200 Subject: [PATCH 18/82] first working prototype --- accounts.go | 54 ++++++++++++++++++++++++++++++++++++++---------- accounts_test.go | 1 + main.go | 3 ++- 3 files changed, 44 insertions(+), 14 deletions(-) diff --git a/accounts.go b/accounts.go index 660af35..70886aa 100644
--- a/accounts.go +++ b/accounts.go @@ -20,6 +20,8 @@ import ( "os/exec" "log" "strings" + "regexp" + "github.com/prometheus/client_golang/prometheus" ) func AccountsData() []byte { @@ -38,33 +40,59 @@ func AccountsData() []byte { return out } -type AccountMetrics struct { - resv float64 +type JobMetrics struct { + pending float64 + running float64 } -func ParseAccountsMetrics(input []byte) map[string]map[string]int { - accounts := make(map[string]map[string]int) +func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { + accounts := make(map[string]*JobMetrics) lines := strings.Split(string(input), "\n") for _, line := range lines { if strings.Contains(line,"|") { - log.Print(line) - account := strings.Split(line,"|")[1] _,key := accounts[account] if !key { - accounts[account] = make(map[string]int) + accounts[account] = &JobMetrics{0,0} } - state := strings.Split(line,"|")[2] state = strings.ToLower(state) - _,key = accounts[account][state] - if !key { - accounts[account][state] = 1 - } else { - accounts[account][state] += 1 + running := regexp.MustCompile(`^running`) + pending := regexp.MustCompile(`^pending`) + switch { + case running.MatchString(state) == true: + accounts[account].running++ + case pending.MatchString(state) == true: + accounts[account].pending++ } } } return accounts } +type AccountsCollector struct { + running *prometheus.Desc + pending *prometheus.Desc +} + +func NewAccountsCollector() *AccountsCollector { + labels := []string{"account"} + return &AccountsCollector{ + running: prometheus.NewDesc("slurm_accounts_jobs_running", "Running jobs for account", labels, nil), + pending: prometheus.NewDesc("slurm_accounts_jobs_pending", "Running jobs for account", labels, nil), + } +} + +func (ac *AccountsCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- ac.running + ch <- ac.pending +} + +func (ac *AccountsCollector) Collect(ch chan<- prometheus.Metric) { + am := ParseAccountsMetrics(AccountsData()) + for a := range am { + 
log.Print(a) + ch <- prometheus.MustNewConstMetric(ac.running, prometheus.GaugeValue, am[a].running, a) + ch <- prometheus.MustNewConstMetric(ac.pending, prometheus.GaugeValue, am[a].pending, a) + } +} diff --git a/accounts_test.go b/accounts_test.go index 6853ab3..0413d03 100644 --- a/accounts_test.go +++ b/accounts_test.go @@ -22,3 +22,4 @@ import ( func TestParseAccountsMetrics(t *testing.T) { t.Logf("%+v", ParseAccountsMetrics(AccountsData())) } + diff --git a/main.go b/main.go index 2e45d2e..d612990 100644 --- a/main.go +++ b/main.go @@ -1,4 +1,4 @@ -/* Copyright 2017 Victor Penso, Matteo Dessalvi +/* Copyright 2017-2020 Victor Penso, Matteo Dessalvi This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by @@ -29,6 +29,7 @@ func init() { prometheus.MustRegister(NewQueueCollector()) // from queue.go prometheus.MustRegister(NewNodesCollector()) // from nodes.go prometheus.MustRegister(NewCPUsCollector()) // from cpus.go + prometheus.MustRegister(NewAccountsCollector()) // from accounts.go } var listenAddress = flag.String( From 52f84a5fc5c8d15df566b946f588cce40021bfe0 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 8 Oct 2020 12:53:35 +0200 Subject: [PATCH 19/82] cosmetics, adjust make file --- Makefile | 2 +- accounts.go | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Makefile b/Makefile index 1c2ac7e..c9793bc 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ PROJECT_NAME = prometheus-slurm-exporter ifndef GOPATH GOPATH=$(shell pwd):/usr/share/gocode endif -GOFILES=cpus.go main.go nodes.go queue.go scheduler.go +GOFILES=accounts.go cpus.go main.go nodes.go queue.go scheduler.go GOBIN=bin/$(PROJECT_NAME) build: diff --git a/accounts.go b/accounts.go index 70886aa..c3d9687 100644 --- a/accounts.go +++ b/accounts.go @@ -78,8 +78,8 @@ type AccountsCollector struct { func NewAccountsCollector() *AccountsCollector { labels := []string{"account"} return 
&AccountsCollector{ - running: prometheus.NewDesc("slurm_accounts_jobs_running", "Running jobs for account", labels, nil), - pending: prometheus.NewDesc("slurm_accounts_jobs_pending", "Running jobs for account", labels, nil), + running: prometheus.NewDesc("slurm_account_jobs_running", "Running jobs for account", labels, nil), + pending: prometheus.NewDesc("slurm_account_jobs_pending", "Pending jobs for account", labels, nil), } } From d782cc27f0a0d964bde7eb907eb70107a604ee5b Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 8 Oct 2020 13:07:41 +0200 Subject: [PATCH 20/82] add some more states --- accounts.go | 39 +++++++++++++++++++++++++++++++-------- 1 file changed, 31 insertions(+), 8 deletions(-) diff --git a/accounts.go b/accounts.go index c3d9687..89471f9 100644 --- a/accounts.go +++ b/accounts.go @@ -41,8 +41,11 @@ func AccountsData() []byte { } type JobMetrics struct { + cancelled float64 + completed float64 pending float64 running float64 + suspended float64 } func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { @@ -53,17 +56,26 @@ func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { account := strings.Split(line,"|")[1] _,key := accounts[account] if !key { - accounts[account] = &JobMetrics{0,0} + accounts[account] = &JobMetrics{0,0,0,0,0} } state := strings.Split(line,"|")[2] state = strings.ToLower(state) - running := regexp.MustCompile(`^running`) + cancelled := regexp.MustCompile(`^cancelled`) + completed := regexp.MustCompile(`^completed`) pending := regexp.MustCompile(`^pending`) + running := regexp.MustCompile(`^running`) + suspended := regexp.MustCompile(`^suspended`) switch { - case running.MatchString(state) == true: - accounts[account].running++ + case cancelled.MatchString(state) == true: + accounts[account].cancelled++ + case completed.MatchString(state) == true: + accounts[account].completed++ case pending.MatchString(state) == true: accounts[account].pending++ + case running.MatchString(state) == true: + 
accounts[account].running++ + case suspended.MatchString(state) == true: + accounts[account].suspended++ } } } @@ -71,28 +83,39 @@ func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { } type AccountsCollector struct { - running *prometheus.Desc + cancelled *prometheus.Desc + completed *prometheus.Desc pending *prometheus.Desc + running *prometheus.Desc + suspended *prometheus.Desc } func NewAccountsCollector() *AccountsCollector { labels := []string{"account"} return &AccountsCollector{ + cancelled: prometheus.NewDesc("slurm_account_jobs_cancelled", "Cancelled jobs for account", labels, nil), + completed: prometheus.NewDesc("slurm_account_jobs_completed", "Completed jobs for account", labels, nil), running: prometheus.NewDesc("slurm_account_jobs_running", "Running jobs for account", labels, nil), pending: prometheus.NewDesc("slurm_account_jobs_pending", "Pending jobs for account", labels, nil), + suspended: prometheus.NewDesc("slurm_account_jobs_suspended", "Suspended jobs for account", labels, nil), } } func (ac *AccountsCollector) Describe(ch chan<- *prometheus.Desc) { - ch <- ac.running + ch <- ac.cancelled + ch <- ac.completed ch <- ac.pending + ch <- ac.running + ch <- ac.suspended } func (ac *AccountsCollector) Collect(ch chan<- prometheus.Metric) { am := ParseAccountsMetrics(AccountsData()) for a := range am { - log.Print(a) - ch <- prometheus.MustNewConstMetric(ac.running, prometheus.GaugeValue, am[a].running, a) + ch <- prometheus.MustNewConstMetric(ac.cancelled, prometheus.GaugeValue, am[a].cancelled, a) + ch <- prometheus.MustNewConstMetric(ac.completed, prometheus.GaugeValue, am[a].completed, a) ch <- prometheus.MustNewConstMetric(ac.pending, prometheus.GaugeValue, am[a].pending, a) + ch <- prometheus.MustNewConstMetric(ac.running, prometheus.GaugeValue, am[a].running, a) + ch <- prometheus.MustNewConstMetric(ac.suspended, prometheus.GaugeValue, am[a].suspended, a) } } From ca20fecd82f5f6af24173b752c40f61fa6ad3685 Mon Sep 17 00:00:00 2001 
From: Victor Penso Date: Fri, 9 Oct 2020 06:35:04 +0200 Subject: [PATCH 21/82] limit to metrics relevant over time --- accounts.go | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/accounts.go b/accounts.go index 89471f9..c71a934 100644 --- a/accounts.go +++ b/accounts.go @@ -41,8 +41,6 @@ func AccountsData() []byte { } type JobMetrics struct { - cancelled float64 - completed float64 pending float64 running float64 suspended float64 @@ -56,20 +54,14 @@ func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { account := strings.Split(line,"|")[1] _,key := accounts[account] if !key { - accounts[account] = &JobMetrics{0,0,0,0,0} + accounts[account] = &JobMetrics{0,0,0} } state := strings.Split(line,"|")[2] state = strings.ToLower(state) - cancelled := regexp.MustCompile(`^cancelled`) - completed := regexp.MustCompile(`^completed`) pending := regexp.MustCompile(`^pending`) running := regexp.MustCompile(`^running`) suspended := regexp.MustCompile(`^suspended`) switch { - case cancelled.MatchString(state) == true: - accounts[account].cancelled++ - case completed.MatchString(state) == true: - accounts[account].completed++ case pending.MatchString(state) == true: accounts[account].pending++ case running.MatchString(state) == true: @@ -83,8 +75,6 @@ func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { } type AccountsCollector struct { - cancelled *prometheus.Desc - completed *prometheus.Desc pending *prometheus.Desc running *prometheus.Desc suspended *prometheus.Desc @@ -93,8 +83,6 @@ type AccountsCollector struct { func NewAccountsCollector() *AccountsCollector { labels := []string{"account"} return &AccountsCollector{ - cancelled: prometheus.NewDesc("slurm_account_jobs_cancelled", "Cancelled jobs for account", labels, nil), - completed: prometheus.NewDesc("slurm_account_jobs_completed", "Completed jobs for account", labels, nil), running: prometheus.NewDesc("slurm_account_jobs_running", "Running jobs for account", 
labels, nil), pending: prometheus.NewDesc("slurm_account_jobs_pending", "Pending jobs for account", labels, nil), suspended: prometheus.NewDesc("slurm_account_jobs_suspended", "Suspended jobs for account", labels, nil), @@ -102,8 +90,6 @@ func NewAccountsCollector() *AccountsCollector { } func (ac *AccountsCollector) Describe(ch chan<- *prometheus.Desc) { - ch <- ac.cancelled - ch <- ac.completed ch <- ac.pending ch <- ac.running ch <- ac.suspended @@ -112,8 +98,6 @@ func (ac *AccountsCollector) Describe(ch chan<- *prometheus.Desc) { func (ac *AccountsCollector) Collect(ch chan<- prometheus.Metric) { am := ParseAccountsMetrics(AccountsData()) for a := range am { - ch <- prometheus.MustNewConstMetric(ac.cancelled, prometheus.GaugeValue, am[a].cancelled, a) - ch <- prometheus.MustNewConstMetric(ac.completed, prometheus.GaugeValue, am[a].completed, a) ch <- prometheus.MustNewConstMetric(ac.pending, prometheus.GaugeValue, am[a].pending, a) ch <- prometheus.MustNewConstMetric(ac.running, prometheus.GaugeValue, am[a].running, a) ch <- prometheus.MustNewConstMetric(ac.suspended, prometheus.GaugeValue, am[a].suspended, a) From 87bd341f847bd47b999603f96f42d7e81bdd04c3 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Fri, 9 Oct 2020 12:05:53 +0200 Subject: [PATCH 22/82] add collector for user specific metrics --- Makefile | 2 +- main.go | 1 + users.go | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 92 insertions(+), 1 deletion(-) create mode 100644 users.go diff --git a/Makefile b/Makefile index c9793bc..d52bbcd 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ PROJECT_NAME = prometheus-slurm-exporter ifndef GOPATH GOPATH=$(shell pwd):/usr/share/gocode endif -GOFILES=accounts.go cpus.go main.go nodes.go queue.go scheduler.go +GOFILES=accounts.go cpus.go main.go nodes.go queue.go scheduler.go users.go GOBIN=bin/$(PROJECT_NAME) build: diff --git a/main.go b/main.go index d612990..d1fd9ec 100644 --- a/main.go +++ b/main.go @@ -30,6 +30,7 @@ 
func init() { prometheus.MustRegister(NewNodesCollector()) // from nodes.go prometheus.MustRegister(NewCPUsCollector()) // from cpus.go prometheus.MustRegister(NewAccountsCollector()) // from accounts.go + prometheus.MustRegister(NewUsersCollector()) // from users.go } var listenAddress = flag.String( diff --git a/users.go b/users.go new file mode 100644 index 0000000..02b8505 --- /dev/null +++ b/users.go @@ -0,0 +1,90 @@ +/* Copyright 2020 Victor Penso + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see . 
*/ + +package main + +import ( + "io/ioutil" + "os/exec" + "log" + "strings" + "regexp" + "github.com/prometheus/client_golang/prometheus" +) + +func UsersData() []byte { + cmd := exec.Command("squeue", "-h", "-o '%A|%u|%T|%C'") + stdout, err := cmd.StdoutPipe() + if err != nil { + log.Fatal(err) + } + if err := cmd.Start(); err != nil { + log.Fatal(err) + } + out, _ := ioutil.ReadAll(stdout) + if err := cmd.Wait(); err != nil { + log.Fatal(err) + } + return out +} + +type UserJobMetrics struct { + running float64 +} + +func ParseUsersMetrics(input []byte) map[string]*UserJobMetrics { + users := make(map[string]*UserJobMetrics) + lines := strings.Split(string(input), "\n") + for _, line := range lines { + if strings.Contains(line,"|") { + user := strings.Split(line,"|")[1] + _,key := users[user] + if !key { + users[user] = &UserJobMetrics{0} + } + state := strings.Split(line,"|")[2] + state = strings.ToLower(state) + running := regexp.MustCompile(`^running`) + switch { + case running.MatchString(state) == true: + users[user].running++ + } + } + } + return users +} + +type UsersCollector struct { + running *prometheus.Desc +} + +func NewUsersCollector() *UsersCollector { + labels := []string{"user"} + return &UsersCollector { + running: prometheus.NewDesc("slurm_user_jobs_running", "Running jobs for user", labels, nil), + } +} + +func (uc *UsersCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- uc.running +} + +func (uc *UsersCollector) Collect(ch chan<- prometheus.Metric) { + um := ParseUsersMetrics(UsersData()) + for u := range um { + ch <- prometheus.MustNewConstMetric(uc.running, prometheus.GaugeValue, um[u].running, u) + } +} + From c60f0a319c541519a885dc1fd7c50f818a8e118e Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Fri, 9 Oct 2020 12:22:24 +0200 Subject: [PATCH 23/82] add number of pending/suspended jobs per user --- users.go | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/users.go b/users.go index 
02b8505..0ca9a97 100644 --- a/users.go +++ b/users.go @@ -41,7 +41,9 @@ func UsersData() []byte { } type UserJobMetrics struct { + pending float64 running float64 + suspended float64 } func ParseUsersMetrics(input []byte) map[string]*UserJobMetrics { @@ -52,14 +54,20 @@ func ParseUsersMetrics(input []byte) map[string]*UserJobMetrics { user := strings.Split(line,"|")[1] _,key := users[user] if !key { - users[user] = &UserJobMetrics{0} + users[user] = &UserJobMetrics{0,0,0} } state := strings.Split(line,"|")[2] state = strings.ToLower(state) + pending := regexp.MustCompile(`^pending`) running := regexp.MustCompile(`^running`) + suspended := regexp.MustCompile(`^suspended`) switch { + case pending.MatchString(state) == true: + users[user].pending++ case running.MatchString(state) == true: users[user].running++ + case suspended.MatchString(state) == true: + users[user].suspended++ } } } @@ -67,24 +75,32 @@ func ParseUsersMetrics(input []byte) map[string]*UserJobMetrics { } type UsersCollector struct { + pending *prometheus.Desc running *prometheus.Desc + suspended *prometheus.Desc } func NewUsersCollector() *UsersCollector { labels := []string{"user"} return &UsersCollector { + pending: prometheus.NewDesc("slurm_user_jobs_pending", "Pending jobs for user", labels, nil), running: prometheus.NewDesc("slurm_user_jobs_running", "Running jobs for user", labels, nil), + suspended: prometheus.NewDesc("slurm_user_jobs_suspended", "Suspended jobs for user", labels, nil), } } func (uc *UsersCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- uc.pending ch <- uc.running + ch <- uc.suspended } func (uc *UsersCollector) Collect(ch chan<- prometheus.Metric) { um := ParseUsersMetrics(UsersData()) for u := range um { + ch <- prometheus.MustNewConstMetric(uc.pending, prometheus.GaugeValue, um[u].pending, u) ch <- prometheus.MustNewConstMetric(uc.running, prometheus.GaugeValue, um[u].running, u) + ch <- prometheus.MustNewConstMetric(uc.suspended, prometheus.GaugeValue, 
um[u].suspended, u) } } From 08026801d69bfe642dd1f66037a4279339bba713 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Sat, 10 Oct 2020 21:04:13 +0200 Subject: [PATCH 24/82] README: readjust the development section --- README.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 18614d5..dae873d 100644 --- a/README.md +++ b/README.md @@ -74,11 +74,12 @@ counted with this parameter almost always indicates three issues: ## Installation -Read [DEVELOPMENT.md](DEVELOPMENT.md) in order to build the Prometheus Slurm -Exporter. After a successful build copy the executable -`bin/prometheus-slurm-exporte` to a node with access to the Slurm command-line -interface. A [Systemd Unit][sdu] file to run the executable as service is -available in [lib/systemd/prometheus-slurm-exporter.service](lib/systemd/prometheus-slurm-exporter.service). +* Read [DEVELOPMENT.md](DEVELOPMENT.md) in order to build the Prometheus Slurm Exporter. After a successful build copy the executable +`bin/prometheus-slurm-exporter` to a node with access to the Slurm command-line interface. + +* A [Systemd Unit][sdu] file to run the executable as service is available in [lib/systemd/prometheus-slurm-exporter.service](lib/systemd/prometheus-slurm-exporter.service). + +* (**optional**) Distribute the exporter as a Snap package: consult the [following document](packages/snap/README.md). **NOTE**: this method requires the use of [Snap](https://snapcraft.io), which is built by [Canonical](https://canonical.com). 
[sdu]: https://www.freedesktop.org/software/systemd/man/systemd.service.html From 2bc6ba639a2adb388b2b25dd9cbcdc3c083f0e99 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Sun, 11 Oct 2020 13:14:07 +0200 Subject: [PATCH 25/82] README/DEVELOPMENT docs updated --- DEVELOPMENT.md | 19 ++++++++++--------- README.md | 7 +++++++ 2 files changed, 17 insertions(+), 9 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 1f0c138..98cf503 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -2,16 +2,16 @@ Setup the development environment on a node with access to the Slurm user command-line interface, in particular with the `sinfo`, `squeue`, and `sdiag` commands. -Install Go from source: +### Install Go from source ```bash -export VERSION=1.13 OS=linux ARCH=amd64 +export VERSION=1.15 OS=linux ARCH=amd64 wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz tar -xzvf go$VERSION.$OS-$ARCH.tar.gz export PATH=$PWD/go/bin:$PATH ``` -_Alternatively install Go from a package of your Linux distribution._ +_Alternatively install Go using the packaging system of your Linux distribution._ Use Git to clone the source code of the exporter, and download all Go dependency modules: @@ -30,7 +30,13 @@ go mod download Build the exporter: ```bash -go build -o bin/prometheus-slurm-exporter {main,cpus,nodes,queue,scheduler}.go +go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,nodes,queue,scheduler,users}.go +``` + +Run all tests included in `_test.go` files: + +```bash +go test -v *.go ``` Start the exporter (foreground), and query all metrics: @@ -51,8 +57,3 @@ References: * [Writing Exporters](https://prometheus.io/docs/instrumenting/writing_exporters/) * [Available Exporters](https://prometheus.io/docs/instrumenting/exporters/) -Run all tests included in `_test.go` files: - -```bash -go test -v *.go -``` diff --git a/README.md b/README.md index dae873d..16a81f0 100644 --- a/README.md +++ b/README.md @@ -48,6 +48,13 @@ Prometheus collector and exporter for 
metrics extracted from the [Slurm](https:/ [Information extracted from the SLURM **squeue** command](https://slurm.schedmd.com/squeue.html) +### Jobs information per Account and UserID + +The following information about jobs are also extracted via [squeue](https://slurm.schedmd.com/squeue.html): + +* **Running/Pending/Suspended** jobs per SLURM Account. +* **Running/Pending/Suspended** jobs per SLURM User. + ### Scheduler Information * **Server Thread count**: The number of current active ``slurmctld`` threads. From a9dfd3cabfc0c0ed6bc7f9a2e65b76d7d14e5ec3 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Mon, 12 Oct 2020 09:09:57 +0200 Subject: [PATCH 26/82] add the number of cpus used by running jobs per account --- accounts.go | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/accounts.go b/accounts.go index c71a934..d662355 100644 --- a/accounts.go +++ b/accounts.go @@ -20,12 +20,13 @@ import ( "os/exec" "log" "strings" + "strconv" "regexp" "github.com/prometheus/client_golang/prometheus" ) func AccountsData() []byte { - cmd := exec.Command("squeue", "-h", "-o '%A|%a|%T'") + cmd := exec.Command("squeue", "-h", "-o %A|%a|%T|%C") stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) @@ -43,6 +44,7 @@ func AccountsData() []byte { type JobMetrics struct { pending float64 running float64 + running_cpus float64 suspended float64 } @@ -54,10 +56,11 @@ func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { account := strings.Split(line,"|")[1] _,key := accounts[account] if !key { - accounts[account] = &JobMetrics{0,0,0} + accounts[account] = &JobMetrics{0,0,0,0} } state := strings.Split(line,"|")[2] state = strings.ToLower(state) + cpus,_ := strconv.ParseFloat(strings.Split(line,"|")[3],64) pending := regexp.MustCompile(`^pending`) running := regexp.MustCompile(`^running`) suspended := regexp.MustCompile(`^suspended`) @@ -66,6 +69,7 @@ func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { 
accounts[account].pending++ case running.MatchString(state) == true: accounts[account].running++ + accounts[account].running_cpus += cpus case suspended.MatchString(state) == true: accounts[account].suspended++ } @@ -77,14 +81,16 @@ func ParseAccountsMetrics(input []byte) map[string]*JobMetrics { type AccountsCollector struct { pending *prometheus.Desc running *prometheus.Desc + running_cpus *prometheus.Desc suspended *prometheus.Desc } func NewAccountsCollector() *AccountsCollector { labels := []string{"account"} return &AccountsCollector{ - running: prometheus.NewDesc("slurm_account_jobs_running", "Running jobs for account", labels, nil), pending: prometheus.NewDesc("slurm_account_jobs_pending", "Pending jobs for account", labels, nil), + running: prometheus.NewDesc("slurm_account_jobs_running", "Running jobs for account", labels, nil), + running_cpus: prometheus.NewDesc("slurm_account_cpus_running", "Running cpus for account", labels, nil), suspended: prometheus.NewDesc("slurm_account_jobs_suspended", "Suspended jobs for account", labels, nil), } } @@ -92,6 +98,7 @@ func NewAccountsCollector() *AccountsCollector { func (ac *AccountsCollector) Describe(ch chan<- *prometheus.Desc) { ch <- ac.pending ch <- ac.running + ch <- ac.running_cpus ch <- ac.suspended } @@ -100,6 +107,7 @@ func (ac *AccountsCollector) Collect(ch chan<- prometheus.Metric) { for a := range am { ch <- prometheus.MustNewConstMetric(ac.pending, prometheus.GaugeValue, am[a].pending, a) ch <- prometheus.MustNewConstMetric(ac.running, prometheus.GaugeValue, am[a].running, a) + ch <- prometheus.MustNewConstMetric(ac.running_cpus, prometheus.GaugeValue, am[a].running_cpus, a) ch <- prometheus.MustNewConstMetric(ac.suspended, prometheus.GaugeValue, am[a].suspended, a) } } From 4e023361ff424e0d9fd9c6a1ddc28c8c64590d94 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Mon, 12 Oct 2020 09:21:40 +0200 Subject: [PATCH 27/82] add the number of cpus used by running jobs per user --- users.go | 12 
++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/users.go b/users.go index 0ca9a97..b6e4b2c 100644 --- a/users.go +++ b/users.go @@ -20,12 +20,13 @@ import ( "os/exec" "log" "strings" + "strconv" "regexp" "github.com/prometheus/client_golang/prometheus" ) func UsersData() []byte { - cmd := exec.Command("squeue", "-h", "-o '%A|%u|%T|%C'") + cmd := exec.Command("squeue", "-h", "-o %A|%u|%T|%C") stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) @@ -43,6 +44,7 @@ func UsersData() []byte { type UserJobMetrics struct { pending float64 running float64 + running_cpus float64 suspended float64 } @@ -54,10 +56,11 @@ func ParseUsersMetrics(input []byte) map[string]*UserJobMetrics { user := strings.Split(line,"|")[1] _,key := users[user] if !key { - users[user] = &UserJobMetrics{0,0,0} + users[user] = &UserJobMetrics{0,0,0,0} } state := strings.Split(line,"|")[2] state = strings.ToLower(state) + cpus,_ := strconv.ParseFloat(strings.Split(line,"|")[3],64) pending := regexp.MustCompile(`^pending`) running := regexp.MustCompile(`^running`) suspended := regexp.MustCompile(`^suspended`) @@ -66,6 +69,7 @@ func ParseUsersMetrics(input []byte) map[string]*UserJobMetrics { users[user].pending++ case running.MatchString(state) == true: users[user].running++ + users[user].running_cpus += cpus case suspended.MatchString(state) == true: users[user].suspended++ } @@ -77,6 +81,7 @@ func ParseUsersMetrics(input []byte) map[string]*UserJobMetrics { type UsersCollector struct { pending *prometheus.Desc running *prometheus.Desc + running_cpus *prometheus.Desc suspended *prometheus.Desc } @@ -85,6 +90,7 @@ func NewUsersCollector() *UsersCollector { return &UsersCollector { pending: prometheus.NewDesc("slurm_user_jobs_pending", "Pending jobs for user", labels, nil), running: prometheus.NewDesc("slurm_user_jobs_running", "Running jobs for user", labels, nil), + running_cpus: prometheus.NewDesc("slurm_user_cpus_running", "Running cpus for user", labels, nil), 
suspended: prometheus.NewDesc("slurm_user_jobs_suspended", "Suspended jobs for user", labels, nil), } } @@ -92,6 +98,7 @@ func NewUsersCollector() *UsersCollector { func (uc *UsersCollector) Describe(ch chan<- *prometheus.Desc) { ch <- uc.pending ch <- uc.running + ch <- uc.running_cpus ch <- uc.suspended } @@ -100,6 +107,7 @@ func (uc *UsersCollector) Collect(ch chan<- prometheus.Metric) { for u := range um { ch <- prometheus.MustNewConstMetric(uc.pending, prometheus.GaugeValue, um[u].pending, u) ch <- prometheus.MustNewConstMetric(uc.running, prometheus.GaugeValue, um[u].running, u) + ch <- prometheus.MustNewConstMetric(uc.running_cpus, prometheus.GaugeValue, um[u].running_cpus, u) ch <- prometheus.MustNewConstMetric(uc.suspended, prometheus.GaugeValue, um[u].suspended, u) } } From 635088a10bb26d2aec2d0c563f459c8da6b17da8 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Mon, 12 Oct 2020 10:31:36 +0200 Subject: [PATCH 28/82] remove zero values from the export... --- accounts.go | 16 ++++++++++++---- users.go | 16 ++++++++++++---- 2 files changed, 24 insertions(+), 8 deletions(-) diff --git a/accounts.go b/accounts.go index d662355..6e18abc 100644 --- a/accounts.go +++ b/accounts.go @@ -105,9 +105,17 @@ func (ac *AccountsCollector) Describe(ch chan<- *prometheus.Desc) { func (ac *AccountsCollector) Collect(ch chan<- prometheus.Metric) { am := ParseAccountsMetrics(AccountsData()) for a := range am { - ch <- prometheus.MustNewConstMetric(ac.pending, prometheus.GaugeValue, am[a].pending, a) - ch <- prometheus.MustNewConstMetric(ac.running, prometheus.GaugeValue, am[a].running, a) - ch <- prometheus.MustNewConstMetric(ac.running_cpus, prometheus.GaugeValue, am[a].running_cpus, a) - ch <- prometheus.MustNewConstMetric(ac.suspended, prometheus.GaugeValue, am[a].suspended, a) + if am[a].pending > 0 { + ch <- prometheus.MustNewConstMetric(ac.pending, prometheus.GaugeValue, am[a].pending, a) + } + if am[a].running > 0 { + ch <- prometheus.MustNewConstMetric(ac.running, 
prometheus.GaugeValue, am[a].running, a) + } + if am[a].running_cpus > 0 { + ch <- prometheus.MustNewConstMetric(ac.running_cpus, prometheus.GaugeValue, am[a].running_cpus, a) + } + if am[a].suspended > 0 { + ch <- prometheus.MustNewConstMetric(ac.suspended, prometheus.GaugeValue, am[a].suspended, a) + } } } diff --git a/users.go b/users.go index b6e4b2c..a7e38b4 100644 --- a/users.go +++ b/users.go @@ -105,10 +105,18 @@ func (uc *UsersCollector) Describe(ch chan<- *prometheus.Desc) { func (uc *UsersCollector) Collect(ch chan<- prometheus.Metric) { um := ParseUsersMetrics(UsersData()) for u := range um { - ch <- prometheus.MustNewConstMetric(uc.pending, prometheus.GaugeValue, um[u].pending, u) - ch <- prometheus.MustNewConstMetric(uc.running, prometheus.GaugeValue, um[u].running, u) - ch <- prometheus.MustNewConstMetric(uc.running_cpus, prometheus.GaugeValue, um[u].running_cpus, u) - ch <- prometheus.MustNewConstMetric(uc.suspended, prometheus.GaugeValue, um[u].suspended, u) + if um[u].pending > 0 { + ch <- prometheus.MustNewConstMetric(uc.pending, prometheus.GaugeValue, um[u].pending, u) + } + if um[u].running > 0 { + ch <- prometheus.MustNewConstMetric(uc.running, prometheus.GaugeValue, um[u].running, u) + } + if um[u].running_cpus > 0 { + ch <- prometheus.MustNewConstMetric(uc.running_cpus, prometheus.GaugeValue, um[u].running_cpus, u) + } + if um[u].suspended > 0 { + ch <- prometheus.MustNewConstMetric(uc.suspended, prometheus.GaugeValue, um[u].suspended, u) + } } } From 0941f6cf597df76f3e3d355359a7c44a3468b1c0 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Mon, 12 Oct 2020 10:35:30 +0200 Subject: [PATCH 29/82] remove test code for accounts module... 
--- accounts_test.go | 25 ------------------------- 1 file changed, 25 deletions(-) delete mode 100644 accounts_test.go diff --git a/accounts_test.go b/accounts_test.go deleted file mode 100644 index 0413d03..0000000 --- a/accounts_test.go +++ /dev/null @@ -1,25 +0,0 @@ -/* Copyright 2020 Victor Penso - -This program is free software: you can redistribute it and/or modify -it under the terms of the GNU General Public License as published by -the Free Software Foundation, either version 3 of the License, or -(at your option) any later version. - -This program is distributed in the hope that it will be useful, -but WITHOUT ANY WARRANTY; without even the implied warranty of -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -GNU General Public License for more details. - -You should have received a copy of the GNU General Public License -along with this program. If not, see . */ - -package main - -import ( - "testing" -) - -func TestParseAccountsMetrics(t *testing.T) { - t.Logf("%+v", ParseAccountsMetrics(AccountsData())) -} - From 7548b3a8fc0ad6b8979364f6df3de6f03c217342 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Mon, 12 Oct 2020 11:16:48 +0200 Subject: [PATCH 30/82] new collector for partition metrics added --- Makefile | 2 +- main.go | 13 ++++---- partitions.go | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 99 insertions(+), 7 deletions(-) create mode 100644 partitions.go diff --git a/Makefile b/Makefile index d52bbcd..8ca3a74 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ PROJECT_NAME = prometheus-slurm-exporter ifndef GOPATH GOPATH=$(shell pwd):/usr/share/gocode endif -GOFILES=accounts.go cpus.go main.go nodes.go queue.go scheduler.go users.go +GOFILES=accounts.go cpus.go main.go nodes.go partitions.go queue.go scheduler.go users.go GOBIN=bin/$(PROJECT_NAME) build: diff --git a/main.go b/main.go index d1fd9ec..9b820eb 100644 --- a/main.go +++ b/main.go @@ -25,12 +25,13 @@ import ( func init() { // Metrics have to 
be registered to be exposed - prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go - prometheus.MustRegister(NewQueueCollector()) // from queue.go - prometheus.MustRegister(NewNodesCollector()) // from nodes.go - prometheus.MustRegister(NewCPUsCollector()) // from cpus.go - prometheus.MustRegister(NewAccountsCollector()) // from accounts.go - prometheus.MustRegister(NewUsersCollector()) // from users.go + prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go + prometheus.MustRegister(NewQueueCollector()) // from queue.go + prometheus.MustRegister(NewNodesCollector()) // from nodes.go + prometheus.MustRegister(NewCPUsCollector()) // from cpus.go + prometheus.MustRegister(NewAccountsCollector()) // from accounts.go + prometheus.MustRegister(NewUsersCollector()) // from users.go + prometheus.MustRegister(NewPartitionsCollector()) // from partitions.go } var listenAddress = flag.String( diff --git a/partitions.go b/partitions.go new file mode 100644 index 0000000..3d3767a --- /dev/null +++ b/partitions.go @@ -0,0 +1,91 @@ +/* Copyright 2020 Victor Penso + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see . 
*/ + +package main + +import ( + "io/ioutil" + "os/exec" + "log" + "strings" + "strconv" + "github.com/prometheus/client_golang/prometheus" +) + +func PartitionsData() []byte { + cmd := exec.Command("sinfo", "-h", "-o%R,%C") + stdout, err := cmd.StdoutPipe() + if err != nil { + log.Fatal(err) + } + if err := cmd.Start(); err != nil { + log.Fatal(err) + } + out, _ := ioutil.ReadAll(stdout) + if err := cmd.Wait(); err != nil { + log.Fatal(err) + } + return out +} + +type PartitionMetrics struct { + allocated float64 + idle float64 + other float64 + total float64 +} + +func ParsePartitionsMetrics(input []byte) map[string]*PartitionMetrics { + partitions := make(map[string]*PartitionMetrics) + lines := strings.Split(string(input), "\n") + for _, line := range lines { + if strings.Contains(line,",") { + // name of a partition + partition := strings.Split(line,",")[0] + _,key := partitions[partition] + if !key { + partitions[partition] = &PartitionMetrics{0,0,0,0} + } + states := strings.Split(line,",")[1] + allocated,_ := strconv.ParseFloat(strings.Split(states,"/")[0],64) + partitions[partition].allocated = allocated + } + } + return partitions +} + +type PartitionsCollector struct { + allocated *prometheus.Desc +} + +func NewPartitionsCollector() *PartitionsCollector { + labels := []string{"partition"} + return &PartitionsCollector{ + allocated: prometheus.NewDesc("slurm_partition_cpus_allocated", "Allocated CPUs for partition", labels,nil), + } +} + +func (pc *PartitionsCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- pc.allocated +} + +func (pc *PartitionsCollector) Collect(ch chan<- prometheus.Metric) { + pm := ParsePartitionsMetrics(PartitionsData()) + for p := range pm { + if pm[p].allocated > 0 { + ch <- prometheus.MustNewConstMetric(pc.allocated, prometheus.GaugeValue, pm[p].allocated, p) + } + } +} From 7f340d6345693fcfc76d89a75b674856c842d6b7 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Mon, 12 Oct 2020 11:29:59 +0200 Subject: [PATCH 31/82] add 
the other metric for partitions --- partitions.go | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/partitions.go b/partitions.go index 3d3767a..9cddd42 100644 --- a/partitions.go +++ b/partitions.go @@ -20,7 +20,7 @@ import ( "os/exec" "log" "strings" - "strconv" + "strconv" "github.com/prometheus/client_golang/prometheus" ) @@ -59,8 +59,14 @@ func ParsePartitionsMetrics(input []byte) map[string]*PartitionMetrics { partitions[partition] = &PartitionMetrics{0,0,0,0} } states := strings.Split(line,",")[1] - allocated,_ := strconv.ParseFloat(strings.Split(states,"/")[0],64) + allocated,_ := strconv.ParseFloat(strings.Split(states,"/")[0],64) + idle,_ := strconv.ParseFloat(strings.Split(states,"/")[1],64) + other,_ := strconv.ParseFloat(strings.Split(states,"/")[2],64) + total,_ := strconv.ParseFloat(strings.Split(states,"/")[3],64) partitions[partition].allocated = allocated + partitions[partition].idle = idle + partitions[partition].other = other + partitions[partition].total = total } } return partitions @@ -68,17 +74,26 @@ func ParsePartitionsMetrics(input []byte) map[string]*PartitionMetrics { type PartitionsCollector struct { allocated *prometheus.Desc + idle *prometheus.Desc + other *prometheus.Desc + total *prometheus.Desc } func NewPartitionsCollector() *PartitionsCollector { labels := []string{"partition"} return &PartitionsCollector{ allocated: prometheus.NewDesc("slurm_partition_cpus_allocated", "Allocated CPUs for partition", labels,nil), + idle: prometheus.NewDesc("slurm_partition_cpus_idle", "Idle CPUs for partition", labels,nil), + other: prometheus.NewDesc("slurm_partition_cpus_other", "Other CPUs for partition", labels,nil), + total: prometheus.NewDesc("slurm_partition_cpus_total", "Total CPUs for partition", labels,nil), } } func (pc *PartitionsCollector) Describe(ch chan<- *prometheus.Desc) { ch <- pc.allocated + ch <- pc.idle + ch <- pc.other + ch <- pc.total } func (pc *PartitionsCollector) Collect(ch 
chan<- prometheus.Metric) { @@ -87,5 +102,14 @@ func (pc *PartitionsCollector) Collect(ch chan<- prometheus.Metric) { if pm[p].allocated > 0 { ch <- prometheus.MustNewConstMetric(pc.allocated, prometheus.GaugeValue, pm[p].allocated, p) } + if pm[p].idle > 0 { + ch <- prometheus.MustNewConstMetric(pc.idle, prometheus.GaugeValue, pm[p].idle, p) + } + if pm[p].other > 0 { + ch <- prometheus.MustNewConstMetric(pc.other, prometheus.GaugeValue, pm[p].other, p) + } + if pm[p].total > 0 { + ch <- prometheus.MustNewConstMetric(pc.total, prometheus.GaugeValue, pm[p].total, p) + } } } From 99d8fde14da4b35be98846a3500c847d88e8e9b2 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 14 Oct 2020 14:02:53 +0200 Subject: [PATCH 32/82] add a new metrics for pending jobs by partition --- partitions.go | 42 ++++++++++++++++++++++++++++++++++++++---- 1 file changed, 38 insertions(+), 4 deletions(-) diff --git a/partitions.go b/partitions.go index 9cddd42..16c6b36 100644 --- a/partitions.go +++ b/partitions.go @@ -40,23 +40,40 @@ func PartitionsData() []byte { return out } +func PartitionsPendingJobsData() []byte { + cmd := exec.Command("squeue","-a","-r","-h","-o%P","--states=PENDING") + stdout, err := cmd.StdoutPipe() + if err != nil { + log.Fatal(err) + } + if err := cmd.Start(); err != nil { + log.Fatal(err) + } + out, _ := ioutil.ReadAll(stdout) + if err := cmd.Wait(); err != nil { + log.Fatal(err) + } + return out +} + type PartitionMetrics struct { allocated float64 idle float64 other float64 + pending float64 total float64 } -func ParsePartitionsMetrics(input []byte) map[string]*PartitionMetrics { +func ParsePartitionsMetrics() map[string]*PartitionMetrics { partitions := make(map[string]*PartitionMetrics) - lines := strings.Split(string(input), "\n") + lines := strings.Split(string(PartitionsData()), "\n") for _, line := range lines { if strings.Contains(line,",") { // name of a partition partition := strings.Split(line,",")[0] _,key := partitions[partition] if !key { - 
partitions[partition] = &PartitionMetrics{0,0,0,0} + partitions[partition] = &PartitionMetrics{0,0,0,0,0} } states := strings.Split(line,",")[1] allocated,_ := strconv.ParseFloat(strings.Split(states,"/")[0],64) @@ -69,6 +86,17 @@ func ParsePartitionsMetrics(input []byte) map[string]*PartitionMetrics { partitions[partition].total = total } } + // get list of pending jobs by partition name + list := strings.Split(string(PartitionsPendingJobsData()),"\n") + for _,partition := range list { + // accumulate the number of pending jobs + _,key := partitions[partition] + if key { + partitions[partition].pending += 1 + } + } + + return partitions } @@ -76,6 +104,7 @@ type PartitionsCollector struct { allocated *prometheus.Desc idle *prometheus.Desc other *prometheus.Desc + pending *prometheus.Desc total *prometheus.Desc } @@ -85,6 +114,7 @@ func NewPartitionsCollector() *PartitionsCollector { allocated: prometheus.NewDesc("slurm_partition_cpus_allocated", "Allocated CPUs for partition", labels,nil), idle: prometheus.NewDesc("slurm_partition_cpus_idle", "Idle CPUs for partition", labels,nil), other: prometheus.NewDesc("slurm_partition_cpus_other", "Other CPUs for partition", labels,nil), + pending: prometheus.NewDesc("slurm_partition_jobs_pending", "Pending jobs for partition", labels,nil), total: prometheus.NewDesc("slurm_partition_cpus_total", "Total CPUs for partition", labels,nil), } } @@ -93,11 +123,12 @@ func (pc *PartitionsCollector) Describe(ch chan<- *prometheus.Desc) { ch <- pc.allocated ch <- pc.idle ch <- pc.other + ch <- pc.pending ch <- pc.total } func (pc *PartitionsCollector) Collect(ch chan<- prometheus.Metric) { - pm := ParsePartitionsMetrics(PartitionsData()) + pm := ParsePartitionsMetrics() for p := range pm { if pm[p].allocated > 0 { ch <- prometheus.MustNewConstMetric(pc.allocated, prometheus.GaugeValue, pm[p].allocated, p) @@ -108,6 +139,9 @@ func (pc *PartitionsCollector) Collect(ch chan<- prometheus.Metric) { if pm[p].other > 0 { ch <- 
prometheus.MustNewConstMetric(pc.other, prometheus.GaugeValue, pm[p].other, p) } + if pm[p].pending > 0 { + ch <- prometheus.MustNewConstMetric(pc.pending, prometheus.GaugeValue, pm[p].pending, p) + } if pm[p].total > 0 { ch <- prometheus.MustNewConstMetric(pc.total, prometheus.GaugeValue, pm[p].total, p) } From fbefc7292d9f929c335ccbe37223c364bd0f4ad2 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Wed, 14 Oct 2020 14:45:06 +0200 Subject: [PATCH 33/82] add expansion of array jobs when collecting job metrics for accounts/users --- accounts.go | 2 +- users.go | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/accounts.go b/accounts.go index 6e18abc..2bc4660 100644 --- a/accounts.go +++ b/accounts.go @@ -26,7 +26,7 @@ import ( ) func AccountsData() []byte { - cmd := exec.Command("squeue", "-h", "-o %A|%a|%T|%C") + cmd := exec.Command("squeue","-a","-r","-h","-o %A|%a|%T|%C") stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) diff --git a/users.go b/users.go index a7e38b4..2b0e85e 100644 --- a/users.go +++ b/users.go @@ -26,7 +26,7 @@ import ( ) func UsersData() []byte { - cmd := exec.Command("squeue", "-h", "-o %A|%u|%T|%C") + cmd := exec.Command("squeue","-a","-r","-h","-o %A|%u|%T|%C") stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) From 110ddc03a30cf0011ff804ff580fc86a561518b0 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 16:31:13 +0200 Subject: [PATCH 34/82] Prepare extraction of GPU statistics through SlurmMD --- gpus.go | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 gpus.go diff --git a/gpus.go b/gpus.go new file mode 100644 index 0000000..74f456c --- /dev/null +++ b/gpus.go @@ -0,0 +1,102 @@ +/* Copyright 2020 Joeri Hermans, Victor Penso, Matteo Dessalvi + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, 
either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +package main + +import ( + "github.com/prometheus/client_golang/prometheus" + "io/ioutil" + "log" + "os/exec" + "strconv" + "strings" +) + +type GPUsMetrics struct { + alloc float64 + idle float64 + other float64 + total float64 +} + +func GPUsGetMetrics() *GPUsMetrics { + return ParseGPUsMetrics(GPUsData()) +} + +func ParseGPUsMetrics(input []byte) *GPUsMetrics { + var gm GPUsMetrics + if strings.Contains(string(input), "/") { + splitted := strings.Split(strings.TrimSpace(string(input)), "/") + gm.alloc, _ = strconv.ParseFloat(splitted[0], 64) + gm.idle, _ = strconv.ParseFloat(splitted[1], 64) + gm.other, _ = strconv.ParseFloat(splitted[2], 64) + gm.total, _ = strconv.ParseFloat(splitted[3], 64) + } + return &gm +} + +// Execute the sinfo command and return its output +func GPUsData() []byte { + cmd := exec.Command("sinfo", "-h", "-o %C") + stdout, err := cmd.StdoutPipe() + if err != nil { + log.Fatal(err) + } + if err := cmd.Start(); err != nil { + log.Fatal(err) + } + out, _ := ioutil.ReadAll(stdout) + if err := cmd.Wait(); err != nil { + log.Fatal(err) + } + return out +} + +/* + * Implement the Prometheus Collector interface and feed the + * Slurm scheduler metrics into it.
+ * https://godoc.org/github.com/prometheus/client_golang/prometheus#Collector + */ + +func NewGPUsCollector() *GPUsCollector { + return &GPUsCollector{ + alloc: prometheus.NewDesc("slurm_gpus_alloc", "Allocated GPUs", nil, nil), + idle: prometheus.NewDesc("slurm_gpus_idle", "Idle GPUs", nil, nil), + other: prometheus.NewDesc("slurm_gpus_other", "Mix GPUs", nil, nil), + total: prometheus.NewDesc("slurm_gpus_total", "Total GPUs", nil, nil), + } +} + +type GPUsCollector struct { + alloc *prometheus.Desc + idle *prometheus.Desc + other *prometheus.Desc + total *prometheus.Desc +} + +// Send all metric descriptions +func (cc *GPUsCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- cc.alloc + ch <- cc.idle + ch <- cc.other + ch <- cc.total +} +func (cc *GPUsCollector) Collect(ch chan<- prometheus.Metric) { + cm := GPUsGetMetrics() + ch <- prometheus.MustNewConstMetric(cc.alloc, prometheus.GaugeValue, cm.alloc) + ch <- prometheus.MustNewConstMetric(cc.idle, prometheus.GaugeValue, cm.idle) + ch <- prometheus.MustNewConstMetric(cc.other, prometheus.GaugeValue, cm.other) + ch <- prometheus.MustNewConstMetric(cc.total, prometheus.GaugeValue, cm.total) +} From 30836407e096dcd668739404625e9aafae73274c Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 16:33:40 +0200 Subject: [PATCH 35/82] Update README --- README.md | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 16a81f0..aa3ee6d 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,16 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ - [Information extracted from the SLURM **sinfo** command](https://slurm.schedmd.com/sinfo.html) - [Slurm CPU Management User and Administrator Guide](https://slurm.schedmd.com/cpu_management.html) +### State of the GPUs + +* **Allocated**: GPUs which have been allocated to a job. +* **Idle**: GPUs not allocated to a job and thus available for use. 
+* **Other**: GPUs which are unavailable for use at the moment. +* **Total**: total number of GPUs. + +- [Information extracted from the SLURM **sinfo** command](https://slurm.schedmd.com/sinfo.html) +- [Slurm GRES scheduling](https://slurm.schedmd.com/gres.html) + ### State of the Nodes * **Allocated**: nodes which has been allocated to one or more jobs. @@ -57,7 +67,7 @@ The following information about jobs are also extracted via [squeue](https://slu ### Scheduler Information -* **Server Thread count**: The number of current active ``slurmctld`` threads. +* **Server Thread count**: The number of current active ``slurmctld`` threads. * **Queue size**: The length of the scheduler queue. * **DBD Agent queue size**: The length of the message queue for _SlurmDBD_. * **Last cycle**: Time in microseconds for last scheduling cycle. @@ -74,7 +84,7 @@ The following information about jobs are also extracted via [squeue](https://slu *DBD Agent queue size*: it is particularly important to keep track of it, since an increasing number of messages counted with this parameter almost always indicates three issues: -* the _SlurmDBD_ daemon is down; +* the _SlurmDBD_ daemon is down; * the database is either down or unreachable; * the status of the Slurm accounting DB may be inconsistent (e.g. ``sreport`` missing data, weird utilization of the cluster, etc.). @@ -82,7 +92,7 @@ counted with this parameter almost always indicates three issues: ## Installation * Read [DEVELOPMENT.md](DEVELOPMENT.md) in order to build the Prometheus Slurm Exporter. After a successful build copy the executable -`bin/prometheus-slurm-exporter` to a node with access to the Slurm command-line interface. +`bin/prometheus-slurm-exporter` to a node with access to the Slurm command-line interface. * A [Systemd Unit][sdu] file to run the executable as service is available in [lib/systemd/prometheus-slurm-exporter.service](lib/systemd/prometheus-slurm-exporter.service). 
@@ -99,7 +109,7 @@ scrape_configs: # # SLURM resource manager: -# +# - job_name: 'my_slurm_exporter' scrape_interval: 30s @@ -146,5 +156,3 @@ This is free software: you can redistribute it and/or modify it under the terms This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/. - - From 9a711d7c4da4ef78f5fa2e1a7c67a68c14d862ea Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:14:56 +0200 Subject: [PATCH 36/82] Add debug statement --- gpus.go | 39 ++++++++++++++++++++++++++++----------- 1 file changed, 28 insertions(+), 11 deletions(-) diff --git a/gpus.go b/gpus.go index 74f456c..a4695da 100644 --- a/gpus.go +++ b/gpus.go @@ -32,24 +32,41 @@ type GPUsMetrics struct { } func GPUsGetMetrics() *GPUsMetrics { - return ParseGPUsMetrics(GPUsData()) + return ParseGPUsMetrics() } -func ParseGPUsMetrics(input []byte) *GPUsMetrics { +func ParseAllocatedGPUs() float64 { + return 0.0 // TODO Implement +} + +func ParseIdleGPUs() float64 { + return 0.0 // TOOD Implement +} + +func ParseOtherGPUs() float64 { + return 0.0 // TODO Implement +} + +func ParseTotalGPUs() float64 { + args := []string{"sinfo", "-h", "-o \"%n %G\""} + output := Execute(args) + log.Info(output) + + return 10.0 // TODO Implement +} + +func ParseGPUsMetrics() *GPUsMetrics { var gm GPUsMetrics - if strings.Contains(string(input), "/") { - splitted := strings.Split(strings.TrimSpace(string(input)), "/") - gm.alloc, _ = strconv.ParseFloat(splitted[0], 64) - gm.idle, _ = strconv.ParseFloat(splitted[1], 64) - gm.other, _ = strconv.ParseFloat(splitted[2], 64) - gm.total, _ = strconv.ParseFloat(splitted[3], 64) - } + gm.alloc, _ = ParseAllocatedGPUs() + gm.idle, _ = ParseIdleGPUs() + 
gm.other, _ = ParseOtherGPUs() + gm.total, _ = ParseTotalGPUs() return &gm } // Execute the sinfo command and return its output -func GPUsData() []byte { - cmd := exec.Command("sinfo", "-h", "-o %C") +func Execute(arguments []string) []byte { + cmd := exec.Command(arguments...) stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) From be69ca32748c3ccaa0d3ae47a6697f7e9026596e Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:18:38 +0200 Subject: [PATCH 37/82] Register GPUs collector --- main.go | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/main.go b/main.go index 9b820eb..776cc53 100644 --- a/main.go +++ b/main.go @@ -25,13 +25,14 @@ import ( func init() { // Metrics have to be registered to be exposed - prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go - prometheus.MustRegister(NewQueueCollector()) // from queue.go - prometheus.MustRegister(NewNodesCollector()) // from nodes.go - prometheus.MustRegister(NewCPUsCollector()) // from cpus.go prometheus.MustRegister(NewAccountsCollector()) // from accounts.go - prometheus.MustRegister(NewUsersCollector()) // from users.go + prometheus.MustRegister(NewCPUsCollector()) // from cpus.go + prometheus.MustRegister(NewGPUsCollector()) // from gpus.go + prometheus.MustRegister(NewNodesCollector()) // from nodes.go prometheus.MustRegister(NewPartitionsCollector()) // from partitions.go + prometheus.MustRegister(NewQueueCollector()) // from queue.go + prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go + prometheus.MustRegister(NewUsersCollector()) // from users.go } var listenAddress = flag.String( From f79222e07facb486797c5572e402033d7c816ba5 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:20:03 +0200 Subject: [PATCH 38/82] Update comments --- main.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/main.go b/main.go index 776cc53..4b4182a 100644 --- a/main.go +++ b/main.go @@ -1,4 +1,4 
@@ -/* Copyright 2017-2020 Victor Penso, Matteo Dessalvi +/* Copyright 2017-2020 Victor Penso, Matteo Dessalvi, Joeri Hermans This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by From da689b24eb8afa3e28e8ddc62a9fb3c71deab660 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:21:23 +0200 Subject: [PATCH 39/82] Update development.md --- DEVELOPMENT.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 98cf503..4eaa2d4 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -30,7 +30,7 @@ go mod download Build the exporter: ```bash -go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,nodes,queue,scheduler,users}.go +go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,nodes,queue,scheduler,users}.go ``` Run all tests included in `_test.go` files: @@ -56,4 +56,3 @@ References: * [Metric Types](https://prometheus.io/docs/concepts/metric_types/) * [Writing Exporters](https://prometheus.io/docs/instrumenting/writing_exporters/) * [Available Exporters](https://prometheus.io/docs/instrumenting/exporters/) - From 039af668f999bf63affa396e424ac5bf30b3cc2e Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:23:33 +0200 Subject: [PATCH 40/82] Add partitions to DEVELOPMENT.md --- DEVELOPMENT.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 4eaa2d4..2cb4672 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -30,7 +30,7 @@ go mod download Build the exporter: ```bash -go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,nodes,queue,scheduler,users}.go +go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,users}.go ``` Run all tests included in `_test.go` files: From ef02bb43356bc68983346bf2f64d40d983628c95 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 
2020 17:32:09 +0200 Subject: [PATCH 41/82] Fix program execution --- gpus.go | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/gpus.go b/gpus.go index a4695da..5620899 100644 --- a/gpus.go +++ b/gpus.go @@ -20,8 +20,6 @@ import ( "io/ioutil" "log" "os/exec" - "strconv" - "strings" ) type GPUsMetrics struct { @@ -48,25 +46,25 @@ func ParseOtherGPUs() float64 { } func ParseTotalGPUs() float64 { - args := []string{"sinfo", "-h", "-o \"%n %G\""} - output := Execute(args) - log.Info(output) + args := []string{"-h", "-o \"%n %G\""} + output := Execute("sinfo", args) + log.Fatal(output) return 10.0 // TODO Implement } func ParseGPUsMetrics() *GPUsMetrics { var gm GPUsMetrics - gm.alloc, _ = ParseAllocatedGPUs() - gm.idle, _ = ParseIdleGPUs() - gm.other, _ = ParseOtherGPUs() - gm.total, _ = ParseTotalGPUs() + gm.alloc = ParseAllocatedGPUs() + gm.idle = ParseIdleGPUs() + gm.other = ParseOtherGPUs() + gm.total = ParseTotalGPUs() return &gm } // Execute the sinfo command and return its output -func Execute(arguments []string) []byte { - cmd := exec.Command(arguments...) +func Execute(command string, arguments []string) []byte { + cmd := exec.Command(command, arguments...) 
stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) From 5950bfe4e1082317987176b14b708ccaddd873d0 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:35:25 +0200 Subject: [PATCH 42/82] Update --- gpus.go | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/gpus.go b/gpus.go index 5620899..584fa39 100644 --- a/gpus.go +++ b/gpus.go @@ -47,8 +47,7 @@ func ParseOtherGPUs() float64 { func ParseTotalGPUs() float64 { args := []string{"-h", "-o \"%n %G\""} - output := Execute("sinfo", args) - log.Fatal(output) + output := string(Execute("sinfo", args)) return 10.0 // TODO Implement } From 1508f063d7f87e4907dada4246634e8a0eeb3641 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:40:29 +0200 Subject: [PATCH 43/82] Update GPUs exporter --- gpus.go | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/gpus.go b/gpus.go index 584fa39..2994e6c 100644 --- a/gpus.go +++ b/gpus.go @@ -20,6 +20,7 @@ import ( "io/ioutil" "log" "os/exec" + "strings" ) type GPUsMetrics struct { @@ -48,8 +49,14 @@ func ParseOtherGPUs() float64 { func ParseTotalGPUs() float64 { args := []string{"-h", "-o \"%n %G\""} output := string(Execute("sinfo", args)) + if len(output) > 0 { + for _, line := range strings.Split(output, "\n") { + descriptor := strings.Split(line, " ")[0] + log.Fatal(descriptor) + } + } - return 10.0 // TODO Implement + return 0.0 } func ParseGPUsMetrics() *GPUsMetrics { From 97bdbf476958997c2dc7508fd074f4356fe2515e Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:43:54 +0200 Subject: [PATCH 44/82] Update GPUs exporter --- gpus.go | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/gpus.go b/gpus.go index 2994e6c..437bb40 100644 --- a/gpus.go +++ b/gpus.go @@ -17,8 +17,8 @@ package main import ( "github.com/prometheus/client_golang/prometheus" + "github.com/prometheus/common/log" "io/ioutil" - "log" "os/exec" "strings" ) @@ -51,8 +51,9 @@ func ParseTotalGPUs() 
float64 { output := string(Execute("sinfo", args)) if len(output) > 0 { for _, line := range strings.Split(output, "\n") { + log.Infof("Line %s: ", line) descriptor := strings.Split(line, " ")[0] - log.Fatal(descriptor) + log.Infof("Descriptor %s: ", descriptor) } } From 1b452ee0ace9fa2f10068e1035f8c50a891f8f2d Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 17:47:07 +0200 Subject: [PATCH 45/82] Update GPUs exporter --- gpus.go | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/gpus.go b/gpus.go index 437bb40..5a1d547 100644 --- a/gpus.go +++ b/gpus.go @@ -51,9 +51,11 @@ func ParseTotalGPUs() float64 { output := string(Execute("sinfo", args)) if len(output) > 0 { for _, line := range strings.Split(output, "\n") { - log.Infof("Line %s: ", line) - descriptor := strings.Split(line, " ")[0] - log.Infof("Descriptor %s: ", descriptor) + if len(line) > 0 { + log.Infof("Line %s: ", line) + descriptor := strings.Split(line, " ")[0] + log.Infof("Descriptor %s: ", descriptor) + } } } From a5d5a991c5448630146569f6e072b2437b1053e4 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:30:34 +0200 Subject: [PATCH 46/82] Update retrieval of total number of GPUs --- gpus.go | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/gpus.go b/gpus.go index 5a1d547..0c9b595 100644 --- a/gpus.go +++ b/gpus.go @@ -21,6 +21,7 @@ import ( "io/ioutil" "os/exec" "strings" + "strconv" ) type GPUsMetrics struct { @@ -47,19 +48,26 @@ func ParseOtherGPUs() float64 { } func ParseTotalGPUs() float64 { + var num_gpus = 0.0 + args := []string{"-h", "-o \"%n %G\""} output := string(Execute("sinfo", args)) if len(output) > 0 { for _, line := range strings.Split(output, "\n") { if len(line) > 0 { - log.Infof("Line %s: ", line) - descriptor := strings.Split(line, " ")[0] - log.Infof("Descriptor %s: ", descriptor) + line = strings.Trim(line, "\"") + descriptor := strings.Fields(line)[1] + descriptor = 
strings.TrimPrefix(descriptor, "gpu:") + descriptor = strings.Split(descriptor, "(")[0] + node_gpus, err := strconv.ParseFloat(descriptor, 64) + if err != nil { + num_gpus += node_gpus + } } } } - return 0.0 + return num_gpus } func ParseGPUsMetrics() *GPUsMetrics { From 32560157980cda37334980bfe368f73da4f72362 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:42:29 +0200 Subject: [PATCH 47/82] Add all GPU exporter features --- gpus.go | 55 +++++++++++++++++++++++++++++++++---------------------- 1 file changed, 33 insertions(+), 22 deletions(-) diff --git a/gpus.go b/gpus.go index 0c9b595..0173c36 100644 --- a/gpus.go +++ b/gpus.go @@ -25,10 +25,10 @@ import ( ) type GPUsMetrics struct { - alloc float64 - idle float64 - other float64 - total float64 + alloc float64 + idle float64 + total float64 + utilization float64 } func GPUsGetMetrics() *GPUsMetrics { @@ -36,15 +36,24 @@ func GPUsGetMetrics() *GPUsMetrics { } func ParseAllocatedGPUs() float64 { - return 0.0 // TODO Implement -} + var num_gpus = 0.0 -func ParseIdleGPUs() float64 { - return 0.0 // TOOD Implement -} + args := []string{"-a", "-X", "--format=Allocgres", "--state=RUNNING", "--noheader", "--parsable2"} + output := string(Execute("sacct", args)) + if len(output) > 0 { + for _, line := range strings.Split(output, "\n") { + if len(line) > 0 { + line = strings.Trim(line, "\"") + descriptor := strings.TrimPrefix(line, "gpu:") + job_gpus, err := strconv.ParseFloat(descriptor, 64) + if err != nil { + num_gpus += job_gpus + } + } + } + } -func ParseOtherGPUs() float64 { - return 0.0 // TODO Implement + return num_gpus } func ParseTotalGPUs() float64 { @@ -72,10 +81,12 @@ func ParseTotalGPUs() float64 { func ParseGPUsMetrics() *GPUsMetrics { var gm GPUsMetrics - gm.alloc = ParseAllocatedGPUs() - gm.idle = ParseIdleGPUs() - gm.other = ParseOtherGPUs() - gm.total = ParseTotalGPUs() + total_gpus := ParseTotalGPUs() + allocated_gpus := ParseAllocatedGPUs() + gm.alloc = allocated_gpus + gm.idle 
= total_gpus - allocated_gpus + gm.total = total_gpus + gm.utilization = allocated_gpus / total_gpus return &gm } @@ -106,29 +117,29 @@ func NewGPUsCollector() *GPUsCollector { return &GPUsCollector{ alloc: prometheus.NewDesc("slurm_gpus_alloc", "Allocated GPUs", nil, nil), idle: prometheus.NewDesc("slurm_gpus_idle", "Idle GPUs", nil, nil), - other: prometheus.NewDesc("slurm_gpus_other", "Mix GPUs", nil, nil), total: prometheus.NewDesc("slurm_gpus_total", "Total GPUs", nil, nil), + utilization: prometheus.NewDesc("slurm_gpus_utilization", "Total GPU utilization", nil, nil), } } type GPUsCollector struct { - alloc *prometheus.Desc - idle *prometheus.Desc - other *prometheus.Desc - total *prometheus.Desc + alloc *prometheus.Desc + idle *prometheus.Desc + total *prometheus.Desc + utilization *prometheus.Desc } // Send all metric descriptions func (cc *GPUsCollector) Describe(ch chan<- *prometheus.Desc) { ch <- cc.alloc ch <- cc.idle - ch <- cc.other ch <- cc.total + ch <- cc.utilization } func (cc *GPUsCollector) Collect(ch chan<- prometheus.Metric) { cm := GPUsGetMetrics() ch <- prometheus.MustNewConstMetric(cc.alloc, prometheus.GaugeValue, cm.alloc) ch <- prometheus.MustNewConstMetric(cc.idle, prometheus.GaugeValue, cm.idle) - ch <- prometheus.MustNewConstMetric(cc.other, prometheus.GaugeValue, cm.other) ch <- prometheus.MustNewConstMetric(cc.total, prometheus.GaugeValue, cm.total) + ch <- prometheus.MustNewConstMetric(cc.utilization, prometheus.GaugeValue, cm.utilization) } From 72364e30f42080fd3ee1d80f3f32092c4b09595a Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:44:26 +0200 Subject: [PATCH 48/82] Update README --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index aa3ee6d..5d4c023 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,7 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ * **Other**: CPUs which are unavailable for use at the moment. 
* **Total**: total number of CPUs. -- [Information extracted from the SLURM **sinfo** command](https://slurm.schedmd.com/sinfo.html) +- Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) command. - [Slurm CPU Management User and Administrator Guide](https://slurm.schedmd.com/cpu_management.html) ### State of the GPUs @@ -21,7 +21,7 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ * **Other**: GPUs which are unavailable for use at the moment. * **Total**: total number of GPUs. -- [Information extracted from the SLURM **sinfo** command](https://slurm.schedmd.com/sinfo.html) +- Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) and [**sacct**](https://slurm.schedmd.com/sacct.html) commands. - [Slurm GRES scheduling](https://slurm.schedmd.com/gres.html) ### State of the Nodes From afdbaf1db273ad4f8138829959836d419a8160d8 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:45:09 +0200 Subject: [PATCH 49/82] Update README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5d4c023..cb459ec 100644 --- a/README.md +++ b/README.md @@ -17,9 +17,9 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ ### State of the GPUs * **Allocated**: GPUs which have been allocated to a job. -* **Idle**: GPUs not allocated to a job and thus available for use. * **Other**: GPUs which are unavailable for use at the moment. * **Total**: total number of GPUs. +* **Utilization**: total GPU utilization on the cluster. - Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) and [**sacct**](https://slurm.schedmd.com/sacct.html) commands.
- [Slurm GRES scheduling](https://slurm.schedmd.com/gres.html) From f0351a3d9fa05cca48b03c45575264e748b0c43a Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:46:25 +0200 Subject: [PATCH 50/82] Update README --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index cb459ec..39b3262 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ * **Mixed**: nodes which have some of their CPUs ALLOCATED while others are IDLE. * **Resv**: these nodes are in an advanced reservation and not generally available. -[Information extracted from the SLURM **sinfo** command](https://slurm.schedmd.com/sinfo.html) +- Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) command. ### Status of the Jobs @@ -56,7 +56,7 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ * **PREEMPTED**: Jobs terminated due to preemption. * **NODE_FAIL**: Jobs terminated due to failure of one or more allocated nodes. -[Information extracted from the SLURM **squeue** command](https://slurm.schedmd.com/squeue.html) +- Information extracted from the SLURM [**squeue**](https://slurm.schedmd.com/squeue.html) command. ### Jobs information per Account and UserID @@ -80,7 +80,7 @@ The following information about jobs are also extracted via [squeue](https://slu * **(Backfill) Total Backfilled Jobs** (since last stats cycle start): number of jobs started thanks to backfilling since last time stats where reset. * **(Backfill) Total backfilled heterogeneous Job components**: number of heterogeneous job components started thanks to backfilling since last Slurm start. -[Information extracted from the SLURM **sdiag** command](https://slurm.schedmd.com/sdiag.html) +- Information extracted from the SLURM [**sdiag**](https://slurm.schedmd.com/sdiag.html) command. 
*DBD Agent queue size*: it is particularly important to keep track of it, since an increasing number of messages counted with this parameter almost always indicates three issues: From de241a60f2ef34ef06e40dc80c51a5cb79e10f0a Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:47:10 +0200 Subject: [PATCH 51/82] Update README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 39b3262..e21ce67 100644 --- a/README.md +++ b/README.md @@ -149,7 +149,7 @@ visualize the exported metrics through [Grafana](https://grafana.com): ## License -Copyright 2017-2020 Victor Penso, Matteo Dessalvi +Copyright 2017-2020 Victor Penso, Matteo Dessalvi, Joeri Hermans This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. From e8e8444a6e9555bd5a0f767173fb926e590b6b96 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:48:40 +0200 Subject: [PATCH 52/82] Add debugging --- gpus.go | 1 + 1 file changed, 1 insertion(+) diff --git a/gpus.go b/gpus.go index 0173c36..da8098f 100644 --- a/gpus.go +++ b/gpus.go @@ -69,6 +69,7 @@ func ParseTotalGPUs() float64 { descriptor = strings.TrimPrefix(descriptor, "gpu:") descriptor = strings.Split(descriptor, "(")[0] node_gpus, err := strconv.ParseFloat(descriptor, 64) + log.Infof("Number of GPUs %f", node_gpus) if err != nil { num_gpus += node_gpus } From 065e42fb2fe162f57ff9e3f9e877689704b46fe8 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:49:50 +0200 Subject: [PATCH 53/82] Add debugging --- gpus.go | 1 + 1 file changed, 1 insertion(+) diff --git a/gpus.go b/gpus.go index da8098f..6a848d1 100644 --- a/gpus.go +++ b/gpus.go @@ -71,6 +71,7 @@ func ParseTotalGPUs() float64 { node_gpus, err := strconv.ParseFloat(descriptor, 64) log.Infof("Number of GPUs %f", node_gpus) 
if err != nil { + log.Infof("Adding GPUs %f", node_gpus) num_gpus += node_gpus } } From d61533965bf204f716a13f1051f5407a0ef47e99 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:52:47 +0200 Subject: [PATCH 54/82] Update README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e21ce67..39b3262 100644 --- a/README.md +++ b/README.md @@ -149,7 +149,7 @@ visualize the exported metrics through [Grafana](https://grafana.com): ## License -Copyright 2017-2020 Victor Penso, Matteo Dessalvi, Joeri Hermans +Copyright 2017-2020 Victor Penso, Matteo Dessalvi This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. From f267b8421af159b87e9bcd1420446fc593ef3427 Mon Sep 17 00:00:00 2001 From: Joeri Hermans Date: Thu, 15 Oct 2020 18:53:58 +0200 Subject: [PATCH 55/82] Update README --- gpus.go | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/gpus.go b/gpus.go index 6a848d1..ca3bcaf 100644 --- a/gpus.go +++ b/gpus.go @@ -45,10 +45,8 @@ func ParseAllocatedGPUs() float64 { if len(line) > 0 { line = strings.Trim(line, "\"") descriptor := strings.TrimPrefix(line, "gpu:") - job_gpus, err := strconv.ParseFloat(descriptor, 64) - if err != nil { - num_gpus += job_gpus - } + job_gpus, _ := strconv.ParseFloat(descriptor, 64) + num_gpus += job_gpus } } } @@ -68,12 +66,8 @@ func ParseTotalGPUs() float64 { descriptor := strings.Fields(line)[1] descriptor = strings.TrimPrefix(descriptor, "gpu:") descriptor = strings.Split(descriptor, "(")[0] - node_gpus, err := strconv.ParseFloat(descriptor, 64) - log.Infof("Number of GPUs %f", node_gpus) - if err != nil { - log.Infof("Adding GPUs %f", node_gpus) - num_gpus += node_gpus - } + node_gpus, _ := strconv.ParseFloat(descriptor, 64) + num_gpus += node_gpus } } } From 
e9f7aefb9f1ef59a04ea1daee780bf6c31bbcc8d Mon Sep 17 00:00:00 2001 From: Gaetan Dumortier Date: Thu, 15 Oct 2020 18:57:30 +0200 Subject: [PATCH 56/82] Add info for custom port --- DEVELOPMENT.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 2cb4672..34eb78a 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -44,6 +44,13 @@ Start the exporter (foreground), and query all metrics: ```bash bin/prometheus-slurm-exporter ... + +If you wish to run the exporter on a different port, or the default port (8080) is already in use, run with the following argument: + +```bash +bin/prometheus-slurm-exporter --listen-address="0.0.0.0:" +... + # query all metrics (default port) curl http://localhost:8080/metrics ``` From 2d1136a45c055797d9db69230c034bb830b666f2 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Tue, 20 Oct 2020 07:07:06 +0200 Subject: [PATCH 57/82] add details about partition metrics --- README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 16a81f0..db447c0 100644 --- a/README.md +++ b/README.md @@ -48,7 +48,12 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ [Information extracted from the SLURM **squeue** command](https://slurm.schedmd.com/squeue.html) -### Jobs information per Account and UserID +### State of the Partitions + +* Running/suspended Jobs per partitions, divided between Slurm accounts and users. +* CPUs total/allocated/idle per partition plus used CPU per user ID. 
+ +### Jobs information per Account and User The following information about jobs are also extracted via [squeue](https://slurm.schedmd.com/squeue.html): From 00a7dee2196a5d9e1734cff02c1971489b1c8396 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 23 Oct 2020 20:22:10 +0200 Subject: [PATCH 58/82] DEVELOPMENT: fix typo in build command --- DEVELOPMENT.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 98cf503..300d347 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -30,7 +30,7 @@ go mod download Build the exporter: ```bash -go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,nodes,queue,scheduler,users}.go +go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,nodes,partitions,queue,scheduler,users}.go ``` Run all tests included in `_test.go` files: From ca161caa10edfa05b792d51bc2058707c39c7a12 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Wed, 3 Feb 2021 14:00:59 +0100 Subject: [PATCH 59/82] Makefile: include new gpus module --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 8ca3a74..a6765d8 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ PROJECT_NAME = prometheus-slurm-exporter ifndef GOPATH GOPATH=$(shell pwd):/usr/share/gocode endif -GOFILES=accounts.go cpus.go main.go nodes.go partitions.go queue.go scheduler.go users.go +GOFILES=accounts.go cpus.go gpus.go main.go nodes.go partitions.go queue.go scheduler.go users.go GOBIN=bin/$(PROJECT_NAME) build: From ff97e6714abdf571237567dfa5f8880582dac875 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 4 Feb 2021 10:36:20 +0100 Subject: [PATCH 60/82] fix closing code segement --- DEVELOPMENT.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 34eb78a..3071f8a 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -43,7 +43,7 @@ Start the exporter (foreground), and query all metrics: ```bash 
bin/prometheus-slurm-exporter -... +``` If you wish to run the exporter on a different port, or the default port (8080) is already in use, run with the following argument: From 4440740becd92654e6694234e7191f387f733a5d Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 4 Feb 2021 10:54:38 +0100 Subject: [PATCH 61/82] add new collector for fairshare data --- DEVELOPMENT.md | 2 +- Makefile | 2 +- sshare.go | 38 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 40 insertions(+), 2 deletions(-) create mode 100644 sshare.go diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 3071f8a..4b7f4ab 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -30,7 +30,7 @@ go mod download Build the exporter: ```bash -go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,users}.go +go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,sshare,users}.go ``` Run all tests included in `_test.go` files: diff --git a/Makefile b/Makefile index a6765d8..58974ee 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ PROJECT_NAME = prometheus-slurm-exporter ifndef GOPATH GOPATH=$(shell pwd):/usr/share/gocode endif -GOFILES=accounts.go cpus.go gpus.go main.go nodes.go partitions.go queue.go scheduler.go users.go +GOFILES=accounts.go cpus.go gpus.go main.go nodes.go partitions.go queue.go scheduler.go sshare.go users.go GOBIN=bin/$(PROJECT_NAME) build: diff --git a/sshare.go b/sshare.go new file mode 100644 index 0000000..b9528a2 --- /dev/null +++ b/sshare.go @@ -0,0 +1,38 @@ +/* Copyright 2021 Victor Penso + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. 
+ +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see . */ + +package main + +import ( + "io/ioutil" + "os/exec" + "log" +) + +func FairShareData() []byte { + cmd := exec.Command("sshare", "-o account,fairshare") + stdout, err := cmd.StdoutPipe() + if err != nil { + log.Fatal(err) + } + if err := cmd.Start(); err != nil { + log.Fatal(err) + } + out, _ := ioutil.ReadAll(stdout) + if err := cmd.Wait(); err != nil { + log.Fatal(err) + } + return out +} From c7e9b138e57c2f855fc364910f83c3ece6a63158 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 4 Feb 2021 11:21:55 +0100 Subject: [PATCH 62/82] first fairshare collector prototype --- sshare.go | 48 +++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/sshare.go b/sshare.go index b9528a2..d4ad660 100644 --- a/sshare.go +++ b/sshare.go @@ -19,10 +19,13 @@ import ( "io/ioutil" "os/exec" "log" + "strings" + "strconv" + "github.com/prometheus/client_golang/prometheus" ) func FairShareData() []byte { - cmd := exec.Command("sshare", "-o account,fairshare") + cmd := exec.Command("sshare", "-nPo account,fairshare", "|", "grep '^ [a-z]'", "|", "tr -d ' '" ) stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) @@ -36,3 +39,46 @@ func FairShareData() []byte { } return out } + +type FairShareMetrics struct { + fairshare float64 +} + +func ParseFairShareMetrics() map[string]*FairShareMetrics { + accounts := make(map[string]*FairShareMetrics) + lines := strings.Split(string(FairShareData()), "\n") + for _, line := range lines { + if strings.Contains(line,"|") { + account := strings.Split(line,"|")[0] + _,key := accounts[account] + if !key { + accounts[account] = 
&FairShareMetrics{0} + } + fairshare,_ := strconv.ParseFloat(strings.Split(line,"|")[1],64) + accounts[account].fairshare = fairshare + } + } + return accounts +} + +type FairShareCollector struct { + fairshare *prometheus.Desc +} + +func NewFairShareCollector() *FairShareCollector { + labels := []string{"account"} + return &FairShareCollector{ + fairshare: prometheus.NewDesc("slurm_account_fairshare","FairShare for account" , labels,nil), + } +} + +func (fsc *FairShareCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- fsc.fairshare +} + +func (fsc *FairShareCollector) Collect(ch chan<- prometheus.Metric) { + fsm := ParseFairShareMetrics() + for f := range fsm { + ch <- prometheus.MustNewConstMetric(fsc.fairshare, prometheus.GaugeValue, fsm[f].fairshare, f) + } +} From 88b636dcde7d7a486eb8654134c0a686797cb9eb Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 4 Feb 2021 12:01:35 +0100 Subject: [PATCH 63/82] working on pipelines... --- main.go | 1 + sshare.go | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/main.go b/main.go index 4b4182a..617c01e 100644 --- a/main.go +++ b/main.go @@ -32,6 +32,7 @@ func init() { prometheus.MustRegister(NewPartitionsCollector()) // from partitions.go prometheus.MustRegister(NewQueueCollector()) // from queue.go prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go + prometheus.MustRegister(NewFairShareCollector()) // from sshare.go prometheus.MustRegister(NewUsersCollector()) // from users.go } diff --git a/sshare.go b/sshare.go index d4ad660..3ef34a9 100644 --- a/sshare.go +++ b/sshare.go @@ -22,10 +22,11 @@ import ( "strings" "strconv" "github.com/prometheus/client_golang/prometheus" + "fmt" ) func FairShareData() []byte { - cmd := exec.Command("sshare", "-nPo account,fairshare", "|", "grep '^ [a-z]'", "|", "tr -d ' '" ) + cmd := exec.Command( "sshare", "-n", "-P", "-o", "account,fairshare", "|", "grep '^ [a-z]'" ) stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) @@ 
-48,6 +49,7 @@ func ParseFairShareMetrics() map[string]*FairShareMetrics { accounts := make(map[string]*FairShareMetrics) lines := strings.Split(string(FairShareData()), "\n") for _, line := range lines { + fmt.Printf(line) if strings.Contains(line,"|") { account := strings.Split(line,"|")[0] _,key := accounts[account] From 51ab47866b809c62548594abad4c9fb6b1012016 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 4 Feb 2021 12:31:32 +0100 Subject: [PATCH 64/82] omit subaccounts --- sshare.go | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/sshare.go b/sshare.go index 3ef34a9..a6d4b79 100644 --- a/sshare.go +++ b/sshare.go @@ -22,11 +22,10 @@ import ( "strings" "strconv" "github.com/prometheus/client_golang/prometheus" - "fmt" ) func FairShareData() []byte { - cmd := exec.Command( "sshare", "-n", "-P", "-o", "account,fairshare", "|", "grep '^ [a-z]'" ) + cmd := exec.Command( "sshare", "-n", "-P", "-o", "account,fairshare" ) stdout, err := cmd.StdoutPipe() if err != nil { log.Fatal(err) @@ -49,15 +48,16 @@ func ParseFairShareMetrics() map[string]*FairShareMetrics { accounts := make(map[string]*FairShareMetrics) lines := strings.Split(string(FairShareData()), "\n") for _, line := range lines { - fmt.Printf(line) - if strings.Contains(line,"|") { - account := strings.Split(line,"|")[0] - _,key := accounts[account] - if !key { - accounts[account] = &FairShareMetrics{0} + if ! 
strings.HasPrefix(line," ") { + if strings.Contains(line,"|") { + account := strings.Split(line,"|")[0] + _,key := accounts[account] + if !key { + accounts[account] = &FairShareMetrics{0} + } + fairshare,_ := strconv.ParseFloat(strings.Split(line,"|")[1],64) + accounts[account].fairshare = fairshare } - fairshare,_ := strconv.ParseFloat(strings.Split(line,"|")[1],64) - accounts[account].fairshare = fairshare } } return accounts From 23ff781061ed26f40d700f0795b94a54161b4814 Mon Sep 17 00:00:00 2001 From: Victor Penso Date: Thu, 4 Feb 2021 12:33:20 +0100 Subject: [PATCH 65/82] trim account names --- sshare.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sshare.go b/sshare.go index a6d4b79..ecbaa69 100644 --- a/sshare.go +++ b/sshare.go @@ -50,7 +50,7 @@ func ParseFairShareMetrics() map[string]*FairShareMetrics { for _, line := range lines { if ! strings.HasPrefix(line," ") { if strings.Contains(line,"|") { - account := strings.Split(line,"|")[0] + account := strings.Trim(strings.Split(line,"|")[0]," ") _,key := accounts[account] if !key { accounts[account] = &FairShareMetrics{0} From 45f58f7df016602478cdc3d90178d3958b4f8b19 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Sat, 6 Feb 2021 18:15:42 +0100 Subject: [PATCH 66/82] README: mention information collected via sshare --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 97fdf58..2bd7203 100644 --- a/README.md +++ b/README.md @@ -93,6 +93,9 @@ counted with this parameter almost always indicates three issues: * the database is either down or unreachable; * the status of the Slurm accounting DB may be inconsistent (e.g. ``sreport`` missing data, weird utilization of the cluster, etc.). +### Share Information + +Collect _share_ statistics for every Slurm account. Refer to the [manpage of the sshare command](https://slurm.schedmd.com/sshare.html) to get more information. 
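The fair-share parsing introduced in patches 62 to 65 can be summarized with a minimal sketch. Note that `exec.Command` does not start a shell, so the `"|"` and `grep` arguments in the earlier revisions were handed to `sshare` as literal arguments. This is presumably why the later revision filters the rows in Go instead. The helper name `parseFairShare` and the sample rows in the usage note are assumptions made for illustration and are not part of the exporter. Unlike the patch, which ignores the `ParseFloat` error, this sketch drops rows whose value is not numeric.

```go
package main

import (
	"strconv"
	"strings"
)

// parseFairShare is a hypothetical helper (not part of the exporter).
// It mirrors the filtering in ParseFairShareMetrics: it takes the text
// produced by `sshare -n -P -o account,fairshare` and returns one
// fair-share value per non-indented account row.
func parseFairShare(output string) map[string]float64 {
	shares := make(map[string]float64)
	for _, line := range strings.Split(output, "\n") {
		// Rows that start with a space are skipped (the patch omits them
		// as sub-accounts); rows without a "|" separator carry no data.
		if strings.HasPrefix(line, " ") || !strings.Contains(line, "|") {
			continue
		}
		fields := strings.Split(line, "|")
		account := strings.TrimSpace(fields[0])
		value, err := strconv.ParseFloat(strings.TrimSpace(fields[1]), 64)
		if err != nil {
			// Assumption of this sketch: skip non-numeric values
			// instead of exporting them as 0.
			continue
		}
		shares[account] = value
	}
	return shares
}
```

Called with made-up input such as `"root|1.000000\n physics|0.500000\nchem |0.250000\n"`, the helper keeps `root` and `chem` and skips the indented `physics` row.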
## Installation From b94df4565bd9bfd83289d39a6c94dbe8bb183e11 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Thu, 18 Mar 2021 17:01:05 +0100 Subject: [PATCH 67/82] Enable GPUs accounting only via cmd line option (see #45) --- main.go | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/main.go b/main.go index 617c01e..e4c628d 100644 --- a/main.go +++ b/main.go @@ -27,7 +27,6 @@ func init() { // Metrics have to be registered to be exposed prometheus.MustRegister(NewAccountsCollector()) // from accounts.go prometheus.MustRegister(NewCPUsCollector()) // from cpus.go - prometheus.MustRegister(NewGPUsCollector()) // from gpus.go prometheus.MustRegister(NewNodesCollector()) // from nodes.go prometheus.MustRegister(NewPartitionsCollector()) // from partitions.go prometheus.MustRegister(NewQueueCollector()) // from queue.go @@ -41,8 +40,19 @@ var listenAddress = flag.String( ":8080", "The address to listen on for HTTP requests.") +var gpuAcct = flag.Bool( + "gpus-acct", + false, + "Enable GPUs accounting") + func main() { flag.Parse() + + // Turn on GPUs accounting only if the corresponding command line option is set to true. + if *gpuAcct { + prometheus.MustRegister(NewGPUsCollector()) // from gpus.go + } + // The Handler function provides a default handler to expose metrics // via an HTTP server. "/metrics" is the usual endpoint for that. 
log.Infof("Starting Server: %s", *listenAddress) From 3c868b1461ceeba873651f6c965b46e851d27c5f Mon Sep 17 00:00:00 2001 From: Chris Read Date: Fri, 19 Mar 2021 10:54:41 -0500 Subject: [PATCH 68/82] Some build improvements - All targets in the `Makefile` now work - Upated the documentation on getting started --- .gitignore | 1 + DEVELOPMENT.md | 33 ++++++++++++--------------------- Makefile | 36 ++++++++++++++++++++++-------------- 3 files changed, 35 insertions(+), 35 deletions(-) diff --git a/.gitignore b/.gitignore index 9a5346d..0fe0276 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,3 @@ bin/ +go/ *.snap diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 4b7f4ab..e6b3b93 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -1,8 +1,10 @@ +# Development + Setup the development environment on a node with access to the Slurm user command-line interface, in particular with the `sinfo`, `squeue`, and `sdiag` commands. -### Install Go from source +## Install Go from source ```bash export VERSION=1.15 OS=linux ARCH=amd64 @@ -13,51 +15,40 @@ export PATH=$PWD/go/bin:$PATH _Alternatively install Go using the packaging system of your Linux distribution._ -Use Git to clone the source code of the exporter, and download all Go dependency -modules: +## Clone this repository and build + +Use Git to clone the source code of the exporter, run all the tests and build the binary: ```bash # clone the source code git clone https://github.com/vpenso/prometheus-slurm-exporter.git cd prometheus-slurm-exporter -# download dependencies -export GOPATH=$PWD/go/modules -go mod download -``` - -### Build - -Build the exporter: - -```bash -go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,sshare,users}.go +make ``` -Run all tests included in `_test.go` files: +To just run the tests: ```bash -go test -v *.go +make test ``` Start the exporter (foreground), and query all metrics: ```bash -bin/prometheus-slurm-exporter 
+./bin/prometheus-slurm-exporter ``` If you wish to run the exporter on a different port, or the default port (8080) is already in use, run with the following argument: ```bash -bin/prometheus-slurm-exporter --listen-address="0.0.0.0:" +./bin/prometheus-slurm-exporter --listen-address="0.0.0.0:" ... # query all metrics (default port) curl http://localhost:8080/metrics ``` -### Development - -References: +## References * [GOlang Package Documentation](https://godoc.org/github.com/prometheus/client_golang/prometheus) * [Metric Types](https://prometheus.io/docs/concepts/metric_types/) diff --git a/Makefile b/Makefile index 58974ee..d7d24bc 100644 --- a/Makefile +++ b/Makefile @@ -1,20 +1,28 @@ PROJECT_NAME = prometheus-slurm-exporter -ifndef GOPATH - GOPATH=$(shell pwd):/usr/share/gocode -endif -GOFILES=accounts.go cpus.go gpus.go main.go nodes.go partitions.go queue.go scheduler.go sshare.go users.go -GOBIN=bin/$(PROJECT_NAME) +SHELL := $(shell which bash) -eu -o pipefail -build: - mkdir -p $(shell pwd)/bin - @echo "Build $(GOFILES) to $(GOBIN)" - @GOPATH=$(GOPATH) go build -o $(GOBIN) $(GOFILES) +GOPATH := $(shell pwd)/go/modules +GOBIN := bin/$(PROJECT_NAME) +GOFILES := $(shell ls *.go) -test: - @GOPATH=$(GOPATH) go test -v *.go +.PHONY: build +build: test $(GOBIN) -run: - @GOPATH=$(GOPATH) go run $(GOFILES) +$(GOBIN): go/modules/pkg/mod $(GOFILES) + mkdir -p bin + @echo "Building $(GOBIN)" + go build -v -o $(GOBIN) + +go/modules/pkg/mod: go.mod + go mod download + +.PHONY: test +test: go/modules/pkg/mod $(GOFILES) + go test -v + +run: $(GOBIN) + $(GOBIN) clean: - if [ -f ${GOBIN} ] ; then rm -f ${GOBIN} ; fi + go clean -modcache + rm -fr bin/ go/ From 6a34d8fe53ee5071c5940a6fc6d5806cd72773b5 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 11:38:46 +0200 Subject: [PATCH 69/82] Add modified code from Chris Read (check PR#47) --- node.go | 137 ++++++++++++++++++++++++++++++++++++++++ node_test.go | 57 +++++++++++++++++ test_data/sinfo_mem.txt | 
21 ++++++ 3 files changed, 215 insertions(+) create mode 100644 node.go create mode 100644 node_test.go create mode 100644 test_data/sinfo_mem.txt diff --git a/node.go b/node.go new file mode 100644 index 0000000..bf2f759 --- /dev/null +++ b/node.go @@ -0,0 +1,137 @@ +/* Copyright 2021 Chris Read + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see . */ + +package main + +import ( + "log" + "os/exec" + "sort" + "strconv" + "strings" + + "github.com/prometheus/client_golang/prometheus" +) + +// NodeMetrics stores metrics for each node +type NodeMetrics struct { + memAlloc uint64 + memTotal uint64 + cpuAlloc uint64 + cpuIdle uint64 + cpuOther uint64 + cpuTotal uint64 + nodeStatus string +} + +func NodeGetMetrics() map[string]*NodeMetrics { + return ParseNodeMetrics(NodeData()) +} + +// ParseNodeMetrics takes the output of sinfo with node data +// It returns a map of metrics per node +func ParseNodeMetrics(input []byte) map[string]*NodeMetrics { + nodes := make(map[string]*NodeMetrics) + lines := strings.Split(string(input), "\n") + + // Sort and remove all the duplicates from the 'sinfo' output + sort.Strings(lines) + linesUniq := RemoveDuplicates(lines) + + for _, line := range linesUniq { + node := strings.Fields(line) + nodeName := node[0] + nodeStatus := node[4] // mixed, allocated, etc. 
+ + nodes[nodeName] = &NodeMetrics{0, 0, 0, 0, 0, 0, ""} + + memAlloc, _ := strconv.ParseUint(node[1], 10, 64) + memTotal, _ := strconv.ParseUint(node[2], 10, 64) + + + cpuInfo := strings.Split(node[3], "/") + cpuAlloc, _ := strconv.ParseUint(cpuInfo[0], 10, 64) + cpuIdle, _ := strconv.ParseUint(cpuInfo[1], 10, 64) + cpuOther, _ := strconv.ParseUint(cpuInfo[2], 10, 64) + cpuTotal, _ := strconv.ParseUint(cpuInfo[3], 10, 64) + + nodes[nodeName].memAlloc = memAlloc + nodes[nodeName].memTotal = memTotal + nodes[nodeName].cpuAlloc = cpuAlloc + nodes[nodeName].cpuIdle = cpuIdle + nodes[nodeName].cpuOther = cpuOther + nodes[nodeName].cpuTotal = cpuTotal + nodes[nodeName].nodeStatus = nodeStatus + } + + return nodes +} + +// NodeData executes the sinfo command to get data for each node +// It returns the output of the sinfo command +func NodeData() []byte { + cmd := exec.Command("sinfo", "-h", "-N", "-O", "NodeList,AllocMem,Memory,CPUsState,StateLong") + out, err := cmd.Output() + if err != nil { + log.Fatal(err) + } + return out +} + +type NodeCollector struct { + cpuAlloc *prometheus.Desc + cpuIdle *prometheus.Desc + cpuOther *prometheus.Desc + cpuTotal *prometheus.Desc + memAlloc *prometheus.Desc + memTotal *prometheus.Desc +} + +// NewNodeCollector creates a Prometheus collector to keep all our stats in +// It returns a set of collections for consumption +func NewNodeCollector() *NodeCollector { + labels := []string{"node","status"} + + return &NodeCollector{ + cpuAlloc: prometheus.NewDesc("slurm_node_cpu_alloc", "Allocated CPUs per node", labels, nil), + cpuIdle: prometheus.NewDesc("slurm_node_cpu_idle", "Idle CPUs per node", labels, nil), + cpuOther: prometheus.NewDesc("slurm_node_cpu_other", "Other CPUs per node", labels, nil), + cpuTotal: prometheus.NewDesc("slurm_node_cpu_total", "Total CPUs per node", labels, nil), + memAlloc: prometheus.NewDesc("slurm_node_mem_alloc", "Allocated memory per node", labels, nil), + memTotal: 
prometheus.NewDesc("slurm_node_mem_total", "Total memory per node", labels, nil), + } +} + +// Send all metric descriptions +func (nc *NodeCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- nc.cpuAlloc + ch <- nc.cpuIdle + ch <- nc.cpuOther + ch <- nc.cpuTotal + ch <- nc.memAlloc + ch <- nc.memTotal +} + +func (nc *NodeCollector) Collect(ch chan<- prometheus.Metric) { + nodes := NodeGetMetrics() + for node := range nodes { + ch <- prometheus.MustNewConstMetric(nc.cpuAlloc, prometheus.GaugeValue, float64(nodes[node].cpuAlloc), node, nodes[node].nodeStatus) + ch <- prometheus.MustNewConstMetric(nc.cpuIdle, prometheus.GaugeValue, float64(nodes[node].cpuIdle), node, nodes[node].nodeStatus) + ch <- prometheus.MustNewConstMetric(nc.cpuOther, prometheus.GaugeValue, float64(nodes[node].cpuOther), node, nodes[node].nodeStatus) + ch <- prometheus.MustNewConstMetric(nc.cpuTotal, prometheus.GaugeValue, float64(nodes[node].cpuTotal), node, nodes[node].nodeStatus) + ch <- prometheus.MustNewConstMetric(nc.memAlloc, prometheus.GaugeValue, float64(nodes[node].memAlloc), node, nodes[node].nodeStatus) + ch <- prometheus.MustNewConstMetric(nc.memTotal, prometheus.GaugeValue, float64(nodes[node].memTotal), node, nodes[node].nodeStatus) + } +} diff --git a/node_test.go b/node_test.go new file mode 100644 index 0000000..b554ddc --- /dev/null +++ b/node_test.go @@ -0,0 +1,57 @@ +/* Copyright 2021 Chris Read + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. 
If not, see . */ + +package main + +import ( + "io/ioutil" + "testing" + + "github.com/stretchr/testify/assert" +) + +/* +For this example data line: + +a048,79384,193000,3/13/0/16,mix + +We want output that looks like: + +slurm_node_cpus_allocated{name="a048",status="mix"} 3 +slurm_node_cpus_idle{name="a048",status="mix"} 13 +slurm_node_cpus_other{name="a048",status="mix"} 0 +slurm_node_cpus_total{name="a048",status="mix"} 16 +slurm_node_mem_allocated{name="a048",status="mix"} 79384 +slurm_node_mem_total{name="a048",status="mix"} 193000 + +*/ + +func TestNodeMetrics(t *testing.T) { + // Read the input data from a file + data, err := ioutil.ReadFile("test_data/sinfo_mem.txt") + if err != nil { + t.Fatalf("Can not open test data: %v", err) + } + metrics := ParseNodeMetrics(data) + t.Logf("%+v", metrics) + + assert.Contains(t, metrics, "b001") + assert.Equal(t, uint64(327680), metrics["b001"].memAlloc) + assert.Equal(t, uint64(386000), metrics["b001"].memTotal) + assert.Equal(t, uint64(32), metrics["b001"].cpuAlloc) + assert.Equal(t, uint64(0), metrics["b001"].cpuIdle) + assert.Equal(t, uint64(0), metrics["b001"].cpuOther) + assert.Equal(t, uint64(32), metrics["b001"].cpuTotal) +} diff --git a/test_data/sinfo_mem.txt b/test_data/sinfo_mem.txt new file mode 100644 index 0000000..88f170c --- /dev/null +++ b/test_data/sinfo_mem.txt @@ -0,0 +1,21 @@ +a048 163840 193000 16/0/0/16 mixed +a048 163840 193000 16/0/0/16 mixed +a048 163840 193000 16/0/0/16 idle +a048 163840 193000 16/0/0/16 idle +a049 163840 193000 16/0/0/16 idle +a049 163840 193000 16/0/0/16 idle +a049 163840 193000 16/0/0/16 idle +a049 163840 193000 16/0/0/16 idle +a050 163840 193000 16/0/0/16 idle +a050 163840 193000 16/0/0/16 idle +a050 163840 193000 16/0/0/16 idle +a051 163840 193000 16/0/0/16 idle +a051 163840 193000 16/0/0/16 idle +a051 163840 193000 16/0/0/16 idle +a052 0 193000 0/16/0/16 idle +b001 327680 386000 32/0/0/32 down +b001 327680 386000 32/0/0/32 down +b002 327680 386000 32/0/0/32 down +b002 
327680 386000 32/0/0/32 idle +b003 296960 386000 29/3/0/32 down +b003 296960 386000 29/3/0/32 idle From c96a2ecbab74a6a10e8a12669290c4e444f38b75 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 11:42:49 +0200 Subject: [PATCH 70/82] Add call to collector for node.go --- main.go | 1 + 1 file changed, 1 insertion(+) diff --git a/main.go b/main.go index 617c01e..8570602 100644 --- a/main.go +++ b/main.go @@ -29,6 +29,7 @@ func init() { prometheus.MustRegister(NewCPUsCollector()) // from cpus.go prometheus.MustRegister(NewGPUsCollector()) // from gpus.go prometheus.MustRegister(NewNodesCollector()) // from nodes.go + prometheus.MustRegister(NewNodeCollector()) // from node.go prometheus.MustRegister(NewPartitionsCollector()) // from partitions.go prometheus.MustRegister(NewQueueCollector()) // from queue.go prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go From 9cd06542649b2520d2817f1116ba2efdd842d384 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 11:43:19 +0200 Subject: [PATCH 71/82] Makefile: add node.go --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 58974ee..ee04441 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ PROJECT_NAME = prometheus-slurm-exporter ifndef GOPATH GOPATH=$(shell pwd):/usr/share/gocode endif -GOFILES=accounts.go cpus.go gpus.go main.go nodes.go partitions.go queue.go scheduler.go sshare.go users.go +GOFILES=accounts.go cpus.go gpus.go main.go node.go nodes.go partitions.go queue.go scheduler.go sshare.go users.go GOBIN=bin/$(PROJECT_NAME) build: From 987e677f4a76131b42821af923e00b3235c8edf3 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 11:47:55 +0200 Subject: [PATCH 72/82] RemoveDuplicates: additional check on length --- nodes.go | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/nodes.go b/nodes.go index f8001eb..0a88a9a 100644 --- a/nodes.go +++ b/nodes.go @@ -50,8 +50,10 @@ 
func RemoveDuplicates(s []string) []string { // Walk through the slice 's' and for each value we haven't seen so far, append it to 't'. for _, v := range s { if _, seen := m[v]; !seen { - t = append(t, v) - m[v] = true + if len(v) > 0 { + t = append(t, v) + m[v] = true + } } } From b50fdb768965ca349d2b8c25a9ee82e1a4e16dfe Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 12:17:59 +0200 Subject: [PATCH 73/82] README: add info about node usage data --- README.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/README.md b/README.md index 2bd7203..aa2d498 100644 --- a/README.md +++ b/README.md @@ -41,6 +41,16 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ - Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) command. +#### Additional info about node usage + +Since version **0.18**, the following information are also extracted and exported for **every** node known by Slurm: + +* CPUs: how many are _allocated_, _idle_, _other_ and in _total_. +* Memory: _allocated_ and in _total_. +* Labels: hostname and its Slurm status (e.g. _idle_, _mix_, _allocated_, _draining_, etc.). + +See the related [test data](https://github.com/vpenso/prometheus-slurm-exporter/blob/master/test_data/sinfo_mem.txt) to check the format of the information extracted from Slurm. + ### Status of the Jobs * **PENDING**: Jobs awaiting for resource allocation. From 0ef4ae5464abaaed066762c354ba251dcefd155e Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 16:09:41 +0200 Subject: [PATCH 74/82] Print info about GPU accounting status --- main.go | 1 + 1 file changed, 1 insertion(+) diff --git a/main.go b/main.go index 37f178b..48291fc 100644 --- a/main.go +++ b/main.go @@ -57,6 +57,7 @@ func main() { // The Handler function provides a default handler to expose metrics // via an HTTP server. "/metrics" is the usual endpoint for that. 
log.Infof("Starting Server: %s", *listenAddress) + log.Infof("GPUs Accounting: %t", *gpuAcct) http.Handle("/metrics", promhttp.Handler()) log.Fatal(http.ListenAndServe(*listenAddress, nil)) } From e1d78c9ca6b22f6575a55f9eccb67bf1d89ce2cf Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 16:24:08 +0200 Subject: [PATCH 75/82] README: add info about the cmd line option for gpus accounting --- README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/README.md b/README.md index aa2d498..485ca19 100644 --- a/README.md +++ b/README.md @@ -24,6 +24,13 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/ - Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) and [**sacct**](https://slurm.schedmd.com/sacct.html) command. - [Slurm GRES scheduling](https://slurm.schedmd.com/gres.html) +**NOTE**: since version **0.19**, GPU accounting has to be **explicitly** enabled adding the _-gpu-acct_ option to the command line otherwise it will not be activated. + +Be aware that: + +* According to issue #38, users reported that newer version of Slurm provides slightly different output and thus GPUs accounting may not work properly. +* Users who do not have GPUs and/or do not have accounting activated may want to keep GPUs accounting **off** (see issue #45). + ### State of the Nodes * **Allocated**: nodes which has been allocated to one or more jobs. 
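The README note in patch 75 spells the option `-gpu-acct`, while `main.go` (patch 67) registers the flag as `gpus-acct`. Go's `flag` package reports an error for flags that were never defined, so the spelling matters. The sketch below uses a private `FlagSet` to show this without touching the process-wide flags. The helper name `gpuAcctEnabled` is an assumption made for illustration, and the exporter itself uses the package-level `flag.Bool` and `flag.Parse` instead.

```go
package main

import (
	"flag"
	"io"
)

// gpuAcctEnabled is a hypothetical helper (not part of the exporter).
// It declares the same boolean option as main.go, "gpus-acct" with a
// default of false, on a private FlagSet and reports whether the given
// arguments switch it on.
func gpuAcctEnabled(args []string) (bool, error) {
	// ContinueOnError makes Parse return the error instead of exiting.
	fs := flag.NewFlagSet("prometheus-slurm-exporter", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress usage output in this sketch
	gpuAcct := fs.Bool("gpus-acct", false, "Enable GPUs accounting")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *gpuAcct, nil
}
```

With this helper, `gpuAcctEnabled([]string{"-gpus-acct"})` reports the option as enabled, an empty argument list leaves it disabled, and `-gpu-acct` comes back as an error. With the package-level flag set used in `main.go`, the same misspelling would print the usage text and terminate the program.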
From 76fff8ef8d9d9b641b57ebd34665a282c476c9e4 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 16:31:35 +0200 Subject: [PATCH 76/82] Add a bare bones changelog --- CHANGELOG.md | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 64 insertions(+) create mode 100644 CHANGELOG.md diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..be1b81d --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,64 @@ +## Changelog + +Full commit history per tag: https://github.com/vpenso/prometheus-slurm-exporter/commits/{tag number} + +* **0.19** +- Merge PR#50 + +* **0.18** +- Add CPU/Memory info per node (see PR#47) + +* **0.17** +- Add fair share collector + +* **0.16** +- Export more data per account/partition, fix squeue for pending jobs +- Merge PR#34 + +* **0.15** +- CPU allocation status per partition + +* **0.14** +- add stats about jobs per account/per user + +* **0.13** +- Merge pull request #32 from pdtpartners/faster-node-metrics + +* **0.12** +- Merge pull request #30 from omnivector-solutions/add_snap_packaging + +* **0.11** +- Merge PR#29 +- Add more backfill stats (see PR#27) + +* **0.10** +- Scheduler: keep track of the DBD agent queue size + +* **0.9** +- README: update to fix build problem raised with issue #26 + +* **0.8** +- Merge pull request #21 from cleargray/command-paths + +* **0.7** +- Update scheduler.go (fix issue #18) + +* **0.6** +- Merge pull request #13 from rug-cit-hpc/master + +* **0.5** +- [BUG]: count all job states (issue #9) + +* **0.4** +- Merge pull request #8 from MatMaul/pending-dep + +* **0.3** +- Fix issue #4 + +* **0.2** +- Fix issue #3 + +* **0.1** +- Basic prototype +- Merge PR#2 +- Add Grafana dashboard From 77d84f1a7cef8cdcead7a7e61f600b25eb4289d1 Mon Sep 17 00:00:00 2001 From: Matteo Dessalvi Date: Fri, 16 Apr 2021 16:35:30 +0200 Subject: [PATCH 77/82] Changelog: adjust markdown --- CHANGELOG.md | 46 +++++++++++++++++++++++----------------------- 1 file changed, 23 
insertions(+), 23 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index be1b81d..78aab39 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,62 +3,62 @@
 Full commit history per tag: https://github.com/vpenso/prometheus-slurm-exporter/commits/{tag number}
 * **0.19**
-- Merge PR#50
+ - Merge PR#50
 * **0.18**
-- Add CPU/Memory info per node (see PR#47)
+ - Add CPU/Memory info per node (see PR#47)
 * **0.17**
-- Add fair share collector
+ - Add fair share collector
 * **0.16**
-- Export more data per account/partition, fix squeue for pending jobs
-- Merge PR#34
+ - Export more data per account/partition, fix squeue for pending jobs
+ - Merge PR#34
 * **0.15**
-- CPU allocation status per partition
+ - CPU allocation status per partition
 * **0.14**
-- add stats about jobs per account/per user
+ - add stats about jobs per account/per user
 * **0.13**
-- Merge pull request #32 from pdtpartners/faster-node-metrics
+ - Merge pull request #32 from pdtpartners/faster-node-metrics
 * **0.12**
-- Merge pull request #30 from omnivector-solutions/add_snap_packaging
+ - Merge pull request #30 from omnivector-solutions/add_snap_packaging
 * **0.11**
-- Merge PR#29
-- Add more backfill stats (see PR#27)
+ - Merge PR#29
+ - Add more backfill stats (see PR#27)
 * **0.10**
-- Scheduler: keep track of the DBD agent queue size
+ - Scheduler: keep track of the DBD agent queue size
 * **0.9**
-- README: update to fix build problem raised with issue #26
+ - README: update to fix build problem raised with issue #26
 * **0.8**
-- Merge pull request #21 from cleargray/command-paths
+ - Merge pull request #21 from cleargray/command-paths
 * **0.7**
-- Update scheduler.go (fix issue #18)
+ - Update scheduler.go (fix issue #18)
 * **0.6**
-- Merge pull request #13 from rug-cit-hpc/master
+ - Merge pull request #13 from rug-cit-hpc/master
 * **0.5**
-- [BUG]: count all job states (issue #9)
+ - [BUG]: count all job states (issue #9)
 * **0.4**
-- Merge pull request #8 from MatMaul/pending-dep
+ - Merge
pull request #8 from MatMaul/pending-dep
 * **0.3**
-- Fix issue #4
+ - Fix issue #4
 * **0.2**
-- Fix issue #3
+ - Fix issue #3
 * **0.1**
-- Basic prototype
-- Merge PR#2
-- Add Grafana dashboard
+ - Basic prototype
+ - Merge PR#2
+ - Add Grafana dashboard

From fe8deb8c0af8835173194626677c4d78680f6150 Mon Sep 17 00:00:00 2001
From: Matteo Dessalvi
Date: Sun, 18 Apr 2021 13:48:51 +0200
Subject: [PATCH 78/82] DEVELOPMENT.md: fix typo (issue#51)
---
 DEVELOPMENT.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
index 4b7f4ab..34f1915 100644
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -30,7 +30,7 @@ go mod download
 Build the exporter:
 ```bash
-go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,sshare,users}.go
+go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,node,nodes,queue,scheduler,sshare,users}.go
 ```
 Run all tests included in `_test.go` files:

From 6276362e59a35c84350ff75003c099503cf8604e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jakub=20Klinkovsk=C3=BD?=
Date: Sun, 10 Oct 2021 08:31:28 +0200
Subject: [PATCH 79/82] Fixed typo in the README
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 485ca19..e9735ce 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/
 - Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) and [**sacct**](https://slurm.schedmd.com/sacct.html) command.
 - [Slurm GRES scheduling](https://slurm.schedmd.com/gres.html)
-**NOTE**: since version **0.19**, GPU accounting has to be **explicitly** enabled adding the _-gpu-acct_ option to the command line otherwise it will not be activated.
+**NOTE**: since version **0.19**, GPU accounting has to be **explicitly** enabled adding the _-gpus-acct_ option to the command line otherwise it will not be activated.
 Be aware that:

From 6fd4e0c8408137eee26ef9e0272fcc21bfc7fc8d Mon Sep 17 00:00:00 2001
From: Markus Opolka
Date: Tue, 1 Feb 2022 16:12:06 +0100
Subject: [PATCH 80/82] Fix typos in README.md
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 485ca19..277f84d 100644
--- a/README.md
+++ b/README.md
@@ -61,7 +61,7 @@ See the related [test data](https://github.com/vpenso/prometheus-slurm-exporter/
 ### Status of the Jobs
 * **PENDING**: Jobs awaiting for resource allocation.
-* **PENDING_DEPENDENCY**: Jobs awaiting because of a unexecuted job dependency.
+* **PENDING_DEPENDENCY**: Jobs awaiting because of an unexecuted job dependency.
 * **RUNNING**: Jobs currently allocated.
 * **SUSPENDED**: Job has an allocation but execution has been suspended and CPUs have been released for other jobs.
 * **CANCELLED**: Jobs which were explicitly cancelled by the user or system administrator.
@@ -148,7 +148,7 @@ scrape_configs:
 * **scrape_interval**: a 30 seconds interval will avoid possible 'overloading' on the SLURM master due to frequent calls of sdiag/squeue/sinfo commands through the exporter.
 * **scrape_timeout**: on a busy SLURM master a too short scraping timeout will abort the communication from the Prometheus server toward the exporter, thus generating a ``context_deadline_exceeded`` error.
-The previous configuration file can be immediately used with a fresh installation of Promethues. At the same time, we highly recommend to include at least the ``global`` section into the configuration. Official documentation about __configuring Prometheus__ is [available here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
+The previous configuration file can be immediately used with a fresh installation of Prometheus.
At the same time, we highly recommend to include at least the ``global`` section into the configuration. Official documentation about __configuring Prometheus__ is [available here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
 **NOTE**: the Prometheus server is using __YAML__ as format for its configuration file, thus **indentation** is really important. Before reloading the Prometheus server it would be better to check the syntax:

From e601bf4342367e2ad67a6ce695e8bdcd6d19041c Mon Sep 17 00:00:00 2001
From: Matteo Dessalvi
Date: Tue, 8 Mar 2022 14:56:58 +0000
Subject: [PATCH 81/82] Add stretchr/testify as required module for tests
---
 go.mod | 1 +
 1 file changed, 1 insertion(+)
diff --git a/go.mod b/go.mod
index a0210a2..fb8cd67 100644
--- a/go.mod
+++ b/go.mod
@@ -5,4 +5,5 @@ go 1.12
 require (
 	github.com/prometheus/client_golang v1.2.1
 	github.com/prometheus/common v0.7.0
+	github.com/stretchr/testify v1.3.0
 )

From d99972384ea317e59ba9f73d7f994659d6276fec Mon Sep 17 00:00:00 2001
From: Jean-Baptiste Denis
Date: Wed, 16 Mar 2022 17:30:37 +0100
Subject: [PATCH 82/82] Correct spec file path
---
 packages/rpm/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/packages/rpm/README.md b/packages/rpm/README.md
index 5db8e6e..a14d323 100644
--- a/packages/rpm/README.md
+++ b/packages/rpm/README.md
@@ -31,7 +31,7 @@ cp lib/systemd/prometheus-slurm-exporter.service ~/rpmbuild/SOURCES
 6. Copy the SPEC file in the proper directory:
 ```bash
 cd prometheus-slurm-exporter
-cp packaging/rpm/*.spec ~/rpmbuild/SPECS
+cp packages/rpm/*.spec ~/rpmbuild/SPECS
 ```
 ### Build the RPM package