gpu-monitoring-exporter

License: MIT

Prometheus exporter for GPU process metrics.

Metrics

nvidia_smi_process

Information about processes currently running on the GPU, including Docker container information.

labels

name             description                                                                        data source
command          command with all its arguments as a string                                         ps
container_name   container name of the compute application                                          docker
cpu              CPU utilization of the process                                                      ps
gpu              index of the GPU                                                                    nvidia_smi
image_name       image name of the compute application                                              docker
mem              ratio of the process's resident set size to the physical memory on the machine     ps
pid              process id of the compute application                                              nvidia_smi
process_name     process name                                                                       nvidia_smi
user             effective user name                                                                ps
uuid             globally unique immutable alphanumeric identifier of the GPU                       nvidia_smi

Output example

# HELP nvidia_smi_process Process Info
# TYPE nvidia_smi_process gauge
nvidia_smi_process{command="python /work/cnn_mnist.py",container_name="boring_spence",cpu="133",gpu="0",image_name="tensorflow/tensorflow:latest-gpu",mem="1.0",pid="36301",process_name="python",user="root",uuid="GPU-74c4d80d-b8ae-d50a-d48d-4d7fe273c206"} 1
nvidia_smi_process{command="python /work/cnn_mnist.py",container_name="boring_spence",cpu="134",gpu="1",image_name="tensorflow/tensorflow:latest-gpu",mem="1.0",pid="36301",process_name="python",user="root",uuid="GPU-4c7d1647-2a51-77df-c347-a152b885e29d"} 1
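
Each sample is assembled by joining the three data sources on the process id. A minimal sketch of that join in shell (an illustration only, not the exporter's actual script; the cgroup lookup and the variable names are assumptions):

# Sketch: list GPU compute processes, then enrich each pid with ps and docker data.
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv,noheader \
| while IFS=', ' read -r uuid pid pname gmem; do
    # user, CPU% and MEM% of the process, plus its full command line
    read -r user cpu mem command <<< "$(ps -o user=,%cpu=,%mem=,args= -p "$pid")"
    # container id, if the pid belongs to a Docker container (64-hex id in its cgroup path)
    cid=$(grep -oE '[0-9a-f]{64}' "/proc/$pid/cgroup" | head -n 1)
    if [ -n "$cid" ]; then
      container_name=$(docker inspect --format '{{.Name}}' "$cid" | sed 's#^/##')
      image_name=$(docker inspect --format '{{.Config.Image}}' "$cid")
    else
      container_name="BMS"; image_name="BMS"   # bare-metal process, no container
    fi
    echo "uuid=$uuid pid=$pid process_name=$pname user=$user cpu=$cpu mem=$mem container_name=$container_name image_name=$image_name"
  done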

Prerequisites

This exporter parses the output of nvidia-smi. The following must be available on the host:

  • NVIDIA GPU drivers installed
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
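
A quick way to confirm both prerequisites (the CUDA image tag below is only an example; on older nvidia-docker 2 setups, --runtime=nvidia may be needed instead of --gpus all):

nvidia-smi                                                                   # driver installed, GPUs visible
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi    # containers can see the GPUs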

Installation

Clone gpu-monitoring-exporter from the master branch of the GitHub repository.

git clone https://github.com/yahoojapan/gpu-monitoring-exporter.git

Using Docker

cd docker
docker-compose up

The bundled docker-compose starts node_exporter. The metrics can be confirmed as follows:

% curl -s localhost:9100/metrics | grep nvidia_smi_

# HELP nvidia_smi_cpu Process Info
# TYPE nvidia_smi_cpu gauge
nvidia_smi_cpu{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 0
nvidia_smi_cpu{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 0.3
# HELP nvidia_smi_gmem_used Process Info
# TYPE nvidia_smi_gmem_used gauge
nvidia_smi_gmem_used{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 6504
nvidia_smi_gmem_used{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 384
# HELP nvidia_smi_gmem_util Process Info
# TYPE nvidia_smi_gmem_util gauge
nvidia_smi_gmem_util{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 0
nvidia_smi_gmem_util{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 0
# HELP nvidia_smi_gpu_util Process Info
# TYPE nvidia_smi_gpu_util gauge
nvidia_smi_gpu_util{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 0
nvidia_smi_gpu_util{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 0
# HELP nvidia_smi_max_memory_usage Process Info
# TYPE nvidia_smi_max_memory_usage gauge
nvidia_smi_max_memory_usage{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 6504
nvidia_smi_max_memory_usage{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 384
# HELP nvidia_smi_mem Process Info
# TYPE nvidia_smi_mem gauge
nvidia_smi_mem{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 0.5
nvidia_smi_mem{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 0.4
# HELP nvidia_smi_process Process Info
# TYPE nvidia_smi_process gauge
nvidia_smi_process{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 1
nvidia_smi_process{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 1
# HELP nvidia_smi_run_time Process Info
# TYPE nvidia_smi_run_time gauge
nvidia_smi_run_time{command="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server --model ",container_name="BMS",gpu="1",image_name="BMS",pid="757516",process_name="/tmp/ollama1896338632/runners/cuda_v11/ollama_llama_server",user="997",uuid="GPU-305e55aa-680a-a1dd-96f9-0a3ab516e0a8"} 0
nvidia_smi_run_time{command="python3 ./ComfyUI/main.py ",container_name="comfyui",gpu="0",image_name="comfyui-docker:latest",pid="3922759",process_name="python3",user="1000",uuid="GPU-a0029be2-eb4d-2fcf-047c-1279d3a58352"} 0
# Note: the label value "BMS" (container_name="BMS", image_name="BMS") means the
# process is running directly on the bare-metal server, not inside a container.

Metrics are output only while a process is running on the GPU.
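
The example above scrapes node_exporter on localhost:9100, so any Prometheus server configured to scrape that target will pick the metrics up. They can then be queried like any other metric; for example, summing GPU memory used per container through the Prometheus HTTP API (the Prometheus address is an assumption):

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (container_name, gpu) (nvidia_smi_gmem_used)'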
