Collects GPU capacity, reserved, free metrics from Kubernetes nodes to provide KubeVirt VMs

vikipranata/gpu-capacity-exporter

GPU Capacity Exporter

Overview

Collects GPU capacity, reserved, and free metrics from Kubernetes nodes and KubeVirt VMIs.

Disclaimer

This repository is provided for hobby and educational purposes only. If you plan to use it in a production environment, it will most likely require significant customization, testing, and security audits. Please fork and build your own application.

Features

  • Dynamic GPU type detection via node labels
  • Metrics for total capacity, reserved, and free GPUs per node
  • Flexible GPU device matching (the mapping can be adjusted)

Node Labels

Add the following labels to GPU nodes:

A node labeled gpu-workload=false is reported with a value of 0.0.

Example GPU NVIDIA node

kubectl label node node01 gpu-workload=true
kubectl label node node01 nvidia.com/NVIDIA-H200-SXM=8

Example GPU AMD node

kubectl label node node02 gpu-workload=true
kubectl label node node02 amd.com/INSTINCT-MI300X=8

Example GPU INTEL node

kubectl label node node03 gpu-workload=true
kubectl label node node03 intel.com/DCGPU-MAX1550=8
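
The label scheme above can be read as a simple lookup: any label whose key starts with a known vendor prefix names a GPU type, and its value is the GPU count. The following is a minimal sketch of that idea, not the exporter's actual code. The function name `gpu_capacity_from_labels`, the prefix tuple, and the choice to treat a missing gpu-workload label like gpu-workload=false are all assumptions made for illustration.

```python
# Hypothetical sketch: names and the prefix list are assumptions,
# not taken from the exporter's source.
GPU_LABEL_PREFIXES = ("nvidia.com/", "amd.com/", "intel.com/")


def gpu_capacity_from_labels(labels):
    """Return {gpu_type: capacity} derived from one node's labels."""
    # Assumption: a node without the gpu-workload label is treated
    # the same as gpu-workload=false.
    enabled = labels.get("gpu-workload") == "true"
    capacity = {}
    for key, value in labels.items():
        if key.startswith(GPU_LABEL_PREFIXES):
            # Disabled nodes keep their GPU type but report 0.0.
            capacity[key] = float(value) if enabled else 0.0
    return capacity


node01 = {"gpu-workload": "true", "nvidia.com/NVIDIA-H200-SXM": "8"}
print(gpu_capacity_from_labels(node01))
# {'nvidia.com/NVIDIA-H200-SXM': 8.0}
```

Adding another vendor in this sketch only requires extending the prefix tuple, which is one way the "adjustable mapping" feature could be realized.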

Metrics

1. GPU Node Capacity

kubevirt_gpu_capacity{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01"} 8.0
kubevirt_gpu_capacity{gpu_type="amd.com/INSTINCT-MI300X",node="node02"} 8.0
kubevirt_gpu_capacity{gpu_type="intel.com/DCGPU-MAX1550",node="node03"} 8.0

2. GPU Total Cluster Capacity

kubevirt_gpu_total_cluster_capacity{gpu_type="nvidia.com/NVIDIA-H200-SXM"} 8.0
kubevirt_gpu_total_cluster_capacity{gpu_type="amd.com/INSTINCT-MI300X"} 8.0
kubevirt_gpu_total_cluster_capacity{gpu_type="intel.com/DCGPU-MAX1550"} 8.0
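
The cluster-wide figure is presumably the per-node capacity summed per GPU type. A minimal sketch of that aggregation follows. The function name `total_cluster_capacity` and the tuple layout are hypothetical, and the exporter may compute the total differently.

```python
from collections import defaultdict


def total_cluster_capacity(per_node):
    """Sum capacity per gpu_type.

    per_node is an iterable of (node, gpu_type, capacity) tuples;
    this layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for _node, gpu_type, capacity in per_node:
        totals[gpu_type] += capacity
    return dict(totals)


samples = [
    ("node01", "nvidia.com/NVIDIA-H200-SXM", 8.0),
    ("node02", "amd.com/INSTINCT-MI300X", 8.0),
    ("node03", "intel.com/DCGPU-MAX1550", 8.0),
]
for gpu_type, total in total_cluster_capacity(samples).items():
    print(gpu_type, total)
```

With one node per GPU type, as in the sample output above, each total equals that node's capacity of 8.0.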

3. GPU Reserved by KubeVirt VMI

kubevirt_gpu_reserved{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01"} 4.0
kubevirt_gpu_reserved{gpu_type="amd.com/INSTINCT-MI300X",node="node02"} 2.0
kubevirt_gpu_reserved{gpu_type="intel.com/DCGPU-MAX1550",node="node03"} 1.0

4. GPU Used by KubeVirt VMI

kubevirt_gpu_instance{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01",namespace="production",instance="vm-gpu01",address="10.244.2.88"} 4.0
kubevirt_gpu_instance{gpu_type="amd.com/INSTINCT-MI300X",node="node02",namespace="development",instance="vm-gpu02",address="10.244.2.99"} 2.0
kubevirt_gpu_instance{gpu_type="intel.com/DCGPU-MAX1550",node="node03",namespace="testing",instance="vm-gpu03",address="10.244.2.66"} 1.0
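
For ad hoc checks outside Prometheus, a sample line like the ones above can be split into its name, labels, and value. The sketch below uses two regular expressions and handles only simple lines of the form `name{labels} value`. It does not unescape label values or handle comments and timestamps, and `parse_sample` is a hypothetical helper, not part of the exporter.

```python
import re

# Matches: metric_name{label="value",...} 1.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# Matches one key="value" pair; escaped characters stay escaped.
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric_name, labels_dict, value) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labeled sample line: {line!r}")
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels"))
    }
    return match.group("name"), labels, float(match.group("value"))


line = (
    'kubevirt_gpu_instance{gpu_type="amd.com/INSTINCT-MI300X",node="node02",'
    'namespace="development",instance="vm-gpu02",address="10.244.2.99"} 2.0'
)
name, labels, value = parse_sample(line)
print(name, labels["namespace"], labels["instance"], value)
# kubevirt_gpu_instance development vm-gpu02 2.0
```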

5. GPU Free Usage

kubevirt_gpu_free{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01"} 4.0
kubevirt_gpu_free{gpu_type="amd.com/INSTINCT-MI300X",node="node02"} 6.0
kubevirt_gpu_free{gpu_type="intel.com/DCGPU-MAX1550",node="node03"} 7.0
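
One plausible reading of the free metric is capacity minus reserved GPUs per node. The README does not state this formula, so the sketch below is an assumption. The helper name `gpu_free` and the clamp at zero are also illustrative choices. The inputs are the capacity and reserved sample values shown earlier.

```python
def gpu_free(capacity, reserved):
    """Assumed formula: free = capacity - reserved, never below zero."""
    return max(capacity - reserved, 0.0)


# (capacity, reserved) per node, taken from the sample metrics above.
samples = {
    "node01": (8.0, 4.0),
    "node02": (8.0, 2.0),
    "node03": (8.0, 1.0),
}
for node, (capacity, reserved) in samples.items():
    print(node, gpu_free(capacity, reserved))
# node01 4.0
# node02 6.0
# node03 7.0
```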

Running the Exporter

Kubernetes deploy

kubectl apply -f manifests/rbac.yaml
kubectl apply -f manifests/service.yaml
kubectl apply -f manifests/deployment.yaml
kubectl apply -f manifests/serviceMonitor.yaml

Expose port 9100 to Prometheus.
