This check monitors metrics exposed by the NVIDIA Management Library (NVML) through the Datadog Agent and can correlate them with the exposed Kubernetes devices.
This package is NOT included in the Datadog Agent package.
If you are using Agent v6.8+, follow the instructions below to install the check on your host. See the dedicated Agent guide for installing community integrations to install checks with an Agent prior to v6.8 or with the Docker Agent:
- Run the following command to install the integrations wheel with the Agent:

    datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
    # You may also need to install dependencies since those aren't packaged into the wheel
    sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml

If you are using Docker, there is an example Dockerfile in the NVML repository.

    docker build --build-arg=DD_AGENT_VERSION=7.18.0 .

- Configure your integration like any other packaged integration.
- If you're using Docker and Kubernetes, you need to expose the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. See the included Dockerfile for an example.
- To correlate reserved Kubernetes NVIDIA devices with the Kubernetes pod using the device, mount the Unix domain socket /var/lib/kubelet/pod-resources/kubelet.sock into your Agent's configuration. More information about this socket is on the Kubernetes website. Note that this socket is in beta support as of Kubernetes version 1.15.
- Edit the nvml.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your NVML performance data. See the sample nvml.d/conf.yaml for all available configuration options.
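
As a rough illustration of the configuration step, the shell sketch below writes a minimal nvml.d/conf.yaml into a conf.d directory passed as an argument. The helper name write_nvml_conf is hypothetical, and the file contents are an assumption based on the layout that Agent checks generally share (an init_config section plus a list of instances, here with only the common min_collection_interval option). The sample nvml.d/conf.yaml remains the reference for what the check actually supports.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal NVML check config under a conf.d directory.
# Example call (path is the usual Linux default, assumed here):
#   write_nvml_conf /etc/datadog-agent/conf.d
write_nvml_conf() {
    conf_dir="$1"
    # Create the check's folder if it does not exist yet.
    mkdir -p "$conf_dir/nvml.d" || return 1
    # Quoted heredoc delimiter: the YAML is written verbatim, with no expansion.
    cat > "$conf_dir/nvml.d/conf.yaml" <<'EOF'
init_config:

instances:
  # Assumed minimal instance: only the common collection interval (seconds).
  - min_collection_interval: 15
EOF
}
```

Writing to the real configuration directory usually requires elevated permissions, and the Agent needs a restart before it picks up the new file.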
Run the Agent's status subcommand and look for nvml under the Checks section.
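
Before looking at the status output, it can help to confirm that the prerequisites described above are in place. The following is a sketch only: nvml_preflight is a hypothetical helper, not part of the Agent or the integration. It checks that the two NVIDIA environment variables are set and that the kubelet pod-resources socket is visible at the given path. The environment variables only matter in Docker and Kubernetes setups, and the socket only matters if you want pod correlation, so a "missing" line is not necessarily an error.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: prints one line per prerequisite and
# returns non-zero if any of them is missing.
nvml_preflight() {
    # Default socket path taken from the setup notes; override with $1.
    sock="${1:-/var/lib/kubelet/pod-resources/kubelet.sock}"
    missing=0
    for var in NVIDIA_VISIBLE_DEVICES NVIDIA_DRIVER_CAPABILITIES; do
        # Indirect lookup of the variable named in $var (POSIX-compatible).
        eval "val=\${$var:-}"
        if [ -n "$val" ]; then
            echo "ok: $var=$val"
        else
            echo "missing: $var"
            missing=1
        fi
    done
    # -S is true only if the path exists and is a Unix domain socket.
    if [ -S "$sock" ]; then
        echo "ok: socket $sock"
    else
        echo "missing: socket $sock"
        missing=1
    fi
    return "$missing"
}

# Possible usage inside the Agent's environment:
#   nvml_preflight && datadog-agent status
```

Running the helper in the same container or host environment as the Agent matters, since that is where the variables and the mounted socket need to be visible.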
See metadata.csv for a list of metrics provided by this check. The authoritative metric documentation is on the NVIDIA website.
Where possible, metric names match those used by NVIDIA's Data Center GPU Manager (DCGM) exporter.
NVML does not include any service checks.
NVML does not include any events.
Need help? Contact Datadog support.