GPU Capacity Exporter

Overview

Collects GPU capacity, reserved, and free metrics from Kubernetes nodes and KubeVirt VMIs.

Disclaimer

This repository is provided for hobby and educational purposes only. If you plan to use it in a production environment, it will most likely require significant customization, testing, and security audits. Please fork and build your own application.

Features

Dynamic GPU type detection via node labels
Metrics for total capacity, reserved, and free GPUs per node
Flexible GPU device matching (can adjust mapping)

Node Labels

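Add the following labels to each GPU node: gpu-workload=true, plus a label whose key is the GPU type and whose value is the number of GPUs on that node.

When a node is labeled with gpu-workload=false, it is reported with a value of 0.0.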

Example GPU NVIDIA node

kubectl label node node01 gpu-workload=true
kubectl label node node01 nvidia.com/NVIDIA-H200-SXM=8

Example GPU AMD node

kubectl label node node02 gpu-workload=true
kubectl label node node02 amd.com/INSTINCT-MI300X=8

Example GPU INTEL node

kubectl label node node03 gpu-workload=true
kubectl label node node03 intel.com/DCGPU-MAX1550=8
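
The labeling convention above can be read back as a per-node capacity map. The following is a minimal sketch under stated assumptions, not the exporter's actual code: the helper name `capacity_from_labels`, the vendor-prefix tuple, and the handling of a missing gpu-workload label (treated as enabled here) are all illustrative choices that the README does not confirm.

```python
from typing import Dict

# Assumed vendor prefixes, taken from the example labels in this README.
GPU_LABEL_PREFIXES = ("nvidia.com/", "amd.com/", "intel.com/")


def capacity_from_labels(labels: Dict[str, str]) -> Dict[str, float]:
    """Return {gpu_type: capacity} for one node's labels (hypothetical helper)."""
    # The README only describes gpu-workload=false; treating a missing
    # label as enabled is an assumption of this sketch.
    disabled = labels.get("gpu-workload") == "false"
    capacity: Dict[str, float] = {}
    for key, value in labels.items():
        if not key.startswith(GPU_LABEL_PREFIXES):
            continue  # not a GPU type label
        try:
            count = float(value)
        except ValueError:
            continue  # label value is not a number, so it is not a GPU count
        # Nodes labeled gpu-workload=false are reported as 0.0.
        capacity[key] = 0.0 if disabled else count
    return capacity
```

Each key of the returned dictionary would correspond to the gpu_type label on the metrics shown in the next section.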

Metrics

1. GPU Node Capacity

kubevirt_gpu_capacity{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01"} 8.0
kubevirt_gpu_capacity{gpu_type="amd.com/INSTINCT-MI300X",node="node02"} 8.0
kubevirt_gpu_capacity{gpu_type="intel.com/DCGPU-MAX1550",node="node03"} 8.0

2. GPU Total Cluster Capacity

kubevirt_gpu_total_cluster_capacity{gpu_type="nvidia.com/NVIDIA-H200-SXM"} 8.0
kubevirt_gpu_total_cluster_capacity{gpu_type="amd.com/INSTINCT-MI300X"} 8.0
kubevirt_gpu_total_cluster_capacity{gpu_type="intel.com/DCGPU-MAX1550"} 8.0
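
The cluster-wide figure is presumably the sum of the per-node capacities for each GPU type. A minimal sketch of that aggregation, assuming per-node capacities are already available as nested dictionaries (the function name and data shape are hypothetical, not taken from the exporter):

```python
from collections import defaultdict
from typing import Dict


def total_cluster_capacity(per_node: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum {node: {gpu_type: capacity}} into {gpu_type: total capacity}."""
    totals: Dict[str, float] = defaultdict(float)
    for node_capacity in per_node.values():
        for gpu_type, count in node_capacity.items():
            totals[gpu_type] += count
    return dict(totals)
```

With one 8-GPU node per type, as in the examples above, each total equals 8.0; a second H200 node with 8 GPUs would raise that type's total to 16.0.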

3. GPU Reserved by KubeVirt VMI

kubevirt_gpu_reserved{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01"} 4.0
kubevirt_gpu_reserved{gpu_type="amd.com/INSTINCT-MI300X",node="node02"} 2.0
kubevirt_gpu_reserved{gpu_type="intel.com/DCGPU-MAX1550",node="node03"} 1.0
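
One plausible way to derive the reserved count is to tally the GPU devices requested by each scheduled VMI. The sketch below assumes VMI objects are available as plain dictionaries (for example, parsed from the Kubernetes API) and reads the KubeVirt fields `spec.domain.devices.gpus[].deviceName` and `status.nodeName`. The function name is hypothetical, and whether the exporter also counts `hostDevices` is not stated in this README.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def reserved_by_node(vmis: Iterable[Dict[str, Any]]) -> Dict[Tuple[str, str], float]:
    """Count GPUs requested by VMIs, keyed by (node, gpu_type)."""
    reserved: Counter = Counter()
    for vmi in vmis:
        node = vmi.get("status", {}).get("nodeName")
        if not node:
            continue  # VMI is not scheduled on a node yet
        devices = vmi.get("spec", {}).get("domain", {}).get("devices", {})
        for gpu in devices.get("gpus") or []:
            device_name = gpu.get("deviceName")
            if device_name:
                reserved[(node, device_name)] += 1
    return {key: float(count) for key, count in reserved.items()}
```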

4. GPU Used by KubeVirt VMI

kubevirt_gpu_instance{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01",namespace="production",instance="vm-gpu01",address="10.244.2.88"} 4.0
kubevirt_gpu_instance{gpu_type="amd.com/INSTINCT-MI300X",node="node02",namespace="development",instance="vm-gpu02",address="10.244.2.99"} 2.0
kubevirt_gpu_instance{gpu_type="intel.com/DCGPU-MAX1550",node="node03",namespace="testing",instance="vm-gpu03",address="10.244.2.66"} 1.0

5. GPU Free Usage

kubevirt_gpu_free{gpu_type="nvidia.com/NVIDIA-H200-SXM",node="node01"} 4.0
kubevirt_gpu_free{gpu_type="amd.com/INSTINCT-MI300X",node="node02"} 6.0
kubevirt_gpu_free{gpu_type="intel.com/DCGPU-MAX1550",node="node03"} 7.0
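
This README does not state how the free value is computed. A reasonable assumption is capacity minus reserved, clamped at zero. The sketch below illustrates that assumption only; the function name and the (node, gpu_type) key shape are hypothetical.

```python
from typing import Dict, Tuple

NodeGpu = Tuple[str, str]  # (node, gpu_type)


def free_by_node(
    capacity: Dict[NodeGpu, float],
    reserved: Dict[NodeGpu, float],
) -> Dict[NodeGpu, float]:
    """Assumed formula: free = capacity - reserved, never below 0.0."""
    return {
        key: max(total - reserved.get(key, 0.0), 0.0)
        for key, total in capacity.items()
    }
```

Under this assumption, a node with a capacity of 8.0 and 4.0 reserved GPUs would report 4.0 free.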

Running the Exporter

Kubernetes deploy

kubectl apply -f manifests/rbac.yaml
kubectl apply -f manifests/service.yaml
kubectl apply -f manifests/deployment.yaml
kubectl apply -f manifests/serviceMonitor.yaml

Expose port 9100 to Prometheus.
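
To check that the exporter is reachable before wiring up Prometheus, the endpoint can be fetched and parsed by hand. This is a rough sketch using only the Python standard library. The URL `http://localhost:9100/metrics` is an assumption (port 9100 comes from this README; the `/metrics` path is the usual Prometheus convention and is not confirmed here). The parser is deliberately simplistic: it skips lines with timestamps and does not handle escaped quotes in label values.

```python
import re
from typing import Dict, List, Tuple
from urllib.request import urlopen

Sample = Tuple[str, Dict[str, str], float]

# name{label="value",...} value
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str, prefix: str = "kubevirt_gpu_") -> List[Sample]:
    """Parse labeled samples whose metric name starts with `prefix`."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def fetch_samples(url: str = "http://localhost:9100/metrics") -> List[Sample]:
    """Fetch the (assumed) metrics endpoint and parse it."""
    with urlopen(url, timeout=5) as response:
        return parse_samples(response.read().decode("utf-8"))
```

With a port-forward to the exporter in place, `fetch_samples()` should return one tuple per kubevirt_gpu_* sample.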

vikipranata/gpu-capacity-exporter
