The project contains a collectd CUDA exec plugin for collecting NVIDIA GPU metrics. The plugin works well with both single- and multi-GPU machines.
First, make sure that the collectd exec plugin is loaded. Uncomment or add the following line to your collectd.conf:
LoadPlugin exec
Then, add the path to collectd_cuda.sh to the exec plugin configuration:

<Plugin exec>
  Exec some_user "/path/to/collectd_cuda.sh"
</Plugin>

The plugin's configuration is kept in a separate file, plugins_config.sh, which is required by the main script.
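A minimal sketch of how the main script could pull that configuration in, assuming plugins_config.sh sits next to collectd_cuda.sh (the actual collectd_cuda.sh may locate the file differently):

# Hypothetical: source the separate configuration file from the main script.
# Assumes plugins_config.sh lives in the same directory as this script.
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$script_dir/plugins_config.sh"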
Depending on the metrics selected, the plugin returns PUTVAL messages in collectd's plain text protocol. Below is sample output from a server with four Titan X cards.
PUTVAL server.fqdn/cuda-0000:02:00.0/percent-fan_speed interval=10 N:23
PUTVAL server.fqdn/cuda-0000:02:00.0/memory-memory_free interval=10 N:11172
PUTVAL server.fqdn/cuda-0000:02:00.0/temperature-temperature_gpu interval=10 N:32
PUTVAL server.fqdn/cuda-0000:02:00.0/power-power_draw interval=10 N:16.87
PUTVAL server.fqdn/cuda-0000:02:00.0/memory-memory_used interval=10 N:0
PUTVAL server.fqdn/cuda-0000:03:00.0/percent-fan_speed interval=10 N:23
PUTVAL server.fqdn/cuda-0000:03:00.0/memory-memory_free interval=10 N:11172
PUTVAL server.fqdn/cuda-0000:03:00.0/temperature-temperature_gpu interval=10 N:36
PUTVAL server.fqdn/cuda-0000:03:00.0/power-power_draw interval=10 N:17.08
PUTVAL server.fqdn/cuda-0000:03:00.0/memory-memory_used interval=10 N:0
PUTVAL server.fqdn/cuda-0000:83:00.0/percent-fan_speed interval=10 N:23
PUTVAL server.fqdn/cuda-0000:83:00.0/memory-memory_free interval=10 N:11172
PUTVAL server.fqdn/cuda-0000:83:00.0/temperature-temperature_gpu interval=10 N:35
PUTVAL server.fqdn/cuda-0000:83:00.0/power-power_draw interval=10 N:16.88
PUTVAL server.fqdn/cuda-0000:83:00.0/memory-memory_used interval=10 N:0
PUTVAL server.fqdn/cuda-0000:84:00.0/percent-fan_speed interval=10 N:23
PUTVAL server.fqdn/cuda-0000:84:00.0/memory-memory_free interval=10 N:11172
PUTVAL server.fqdn/cuda-0000:84:00.0/temperature-temperature_gpu interval=10 N:42
PUTVAL server.fqdn/cuda-0000:84:00.0/power-power_draw interval=10 N:17.37
PUTVAL server.fqdn/cuda-0000:84:00.0/memory-memory_used interval=10 N:0
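Each line follows collectd's plain text protocol: an identifier of the form host/plugin-plugin_instance/type-type_instance, the reporting interval, and a value timestamped with N (now). A minimal sketch of how such a line could be emitted from a shell script; the variable names are illustrative and not taken from collectd_cuda.sh:

#!/bin/bash
# Hypothetical sketch of emitting one PUTVAL line in the format shown above.
# collectd's exec plugin exports COLLECTD_HOSTNAME and COLLECTD_INTERVAL to
# the script's environment; fall back to sensible defaults when run by hand.
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-10}"

bus_id="0000:02:00.0"      # PCI bus ID of the GPU, as reported by nvidia-smi
value_type="temperature"   # collectd type (temperature, percent, memory, power)
metric="temperature_gpu"   # metric name, as used in the config array
value=32                   # value read from nvidia-smi

echo "PUTVAL $HOSTNAME/cuda-$bus_id/$value_type-$metric interval=$INTERVAL N:$value"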
Metrics can be added to or removed from the config array.
declare -A config=(
  ["temperature_gpu"]=temperature
  ["fan_speed"]=percent
  ["memory_used"]=memory
  ["memory_free"]=memory
  ["utilization_gpu"]=percent
  ["utilization_memory"]=percent
  ["power_draw"]=power
)
Each entry should be in the following format:

["metric_name"]=value_type

Any query string from nvidia-smi can be a metric_name, but each dot (.) has to be replaced with an underscore (_). For example, temperature.gpu becomes temperature_gpu.
The full list of query options can be obtained with the following command:
nvidia-smi --help-query-gpu
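Since the config array uses underscores while nvidia-smi expects dots, the script has to translate the names back before querying. A minimal sketch of that translation and of reading one metric for every GPU; the variable names are illustrative and not taken from collectd_cuda.sh:

# Hypothetical: turn a config key back into an nvidia-smi query field
# and read it for all GPUs in one call.
metric="temperature_gpu"
query="${metric//_/.}"     # temperature_gpu -> temperature.gpu

# Prints one "<pci.bus_id>, <value>" line per GPU.
nvidia-smi --query-gpu=pci.bus_id,"$query" --format=csv,noheader,nounits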
I store my metrics in InfluxDB and visualize them with Grafana. Below is a sample dashboard from one of the servers I administer.