Agent Check: Nvidia NVML

Overview

This check monitors NVIDIA Management Library (NVML) exposed metrics through the Datadog Agent and can correlate them with the exposed Kubernetes devices.

Setup

The NVML check is not included in the Datadog Agent package, so you need to install it separately.

Installation

To install checks with an Agent prior to v6.8 or with the Docker Agent, see the dedicated Agent guide for installing community integrations. If you are using Agent v6.8+, follow the instructions below to install the check on your host:

  1. Download and launch the Datadog Agent.

  2. Run the following command to install the integrations wheel with the Agent:

    datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
    # You may also need to install dependencies since those aren't packaged into the wheel
    sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml

If you are using Docker, there is an example Dockerfile in the NVML repository.

    docker build --build-arg=DD_AGENT_VERSION=7.18.0 .

  3. Configure your integration like any other packaged integration.

  4. If you're using Docker and Kubernetes, expose the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. See the included Dockerfile for an example.

  5. To correlate reserved Kubernetes NVIDIA devices with the Kubernetes pod using the device, mount the Unix domain socket /var/lib/kubelet/pod-resources/kubelet.sock into your Agent's container. More information about this socket is on the Kubernetes website. Note that this socket is in beta as of Kubernetes 1.15.
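Before restarting the Agent, it can help to confirm that the container actually sees the two NVIDIA variables and the kubelet socket described above. The following is a minimal sketch, not part of the integration: the function names `check_pod_resources_socket` and `check_nvidia_env` are hypothetical, and the script assumes a POSIX shell with `printenv` available inside the Agent container.

```shell
# Hypothetical helper, not part of the integration: report whether the
# kubelet pod-resources socket is visible at the given path.
check_pod_resources_socket() {
    sock="${1:-/var/lib/kubelet/pod-resources/kubelet.sock}"
    if [ -S "$sock" ]; then
        echo "found: $sock"
    else
        echo "missing: $sock"
        return 1
    fi
}

# Hypothetical helper: report whether both NVIDIA variables are exported
# in the current environment.
check_nvidia_env() {
    missing=0
    for name in NVIDIA_VISIBLE_DEVICES NVIDIA_DRIVER_CAPABILITIES; do
        if value=$(printenv "$name"); then
            echo "$name=$value"
        else
            echo "$name is not set"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_nvidia_env && check_pod_resources_socket` inside the Agent container should exit with status 0 only when both variables are exported and the socket is mounted at the default path.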

Configuration

  1. Edit the nvml.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory to start collecting your NVML performance data. See the sample nvml.d/conf.yaml for all available configuration options.

  2. Restart the Agent.
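As a rough illustration of the two steps above, the sketch below writes a minimal configuration file. It assumes a Linux host where the Agent's configuration directory is /etc/datadog-agent. The function name `write_nvml_conf` is hypothetical, and the single instance shown uses only the generic `min_collection_interval` option. Consult the sample nvml.d/conf.yaml for the options this check actually supports.

```shell
# Hypothetical helper: write a minimal nvml.d/conf.yaml under the given
# conf.d directory and print the path of the file that was written.
write_nvml_conf() {
    conf_dir="${1:-/etc/datadog-agent/conf.d}"
    mkdir -p "$conf_dir/nvml.d"
    # One instance with the generic collection interval, in seconds.
    cat > "$conf_dir/nvml.d/conf.yaml" <<'EOF'
init_config:

instances:
  - min_collection_interval: 15
EOF
    echo "$conf_dir/nvml.d/conf.yaml"
}
```

On a systemd-based host, calling `write_nvml_conf` with sufficient permissions and then running `sudo systemctl restart datadog-agent` would cover both steps. Other platforms restart the Agent differently.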

Validation

Run the Agent's status subcommand and look for nvml under the Checks section.
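The status check can be scripted along these lines. This is a sketch only: the helper name `nvml_check_listed` is hypothetical, and it matches the string `nvml` anywhere in the status output rather than parsing the Checks section specifically.

```shell
# Hypothetical helper: succeed if "nvml" appears in Agent status text
# read from standard input.
nvml_check_listed() {
    grep -q 'nvml'
}

# Example usage on a host install (may require elevated privileges):
#   sudo datadog-agent status | nvml_check_listed && echo "nvml check is loaded"
```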

Data Collected

Metrics

See metadata.csv for a list of metrics provided by this check. The authoritative metric documentation is on the NVIDIA website.

Where possible, metric names match those used by NVIDIA's Data Center GPU Manager (DCGM) exporter.

Service Checks

NVML does not include any service checks.

Events

NVML does not include any events.

Troubleshooting

Need help? Contact Datadog support.