The metagpu device plugin (`mgdp`) allows you to share one or more Nvidia GPUs between different K8s workloads.
K8s doesn't provide support for GPU sharing, which means a user must allocate an entire GPU to a workload even when actual GPU usage is well below 100%. This project improves GPU utilization by allowing GPU sharing between multiple K8s workloads.
`mgdp` is based on the Nvidia Container Runtime and on go-nvml. One of the features the Nvidia Container Runtime provides is the ability to specify the visible GPU device IDs by using the `NVIDIA_VISIBLE_DEVICES` environment variable.
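For example, with Docker configured to use the Nvidia runtime, restricting a container to a single GPU looks like this (the CUDA image tag is only an example):

```sh
# make only GPU index 0 visible inside the container;
# NVIDIA_VISIBLE_DEVICES also accepts device UUIDs or "all"
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi -L
```

`mgdp` relies on this same mechanism to expose only the real device that backs the allocated meta-gpus.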
In short, the `mgdp` logic is:

- `mgdp` detects all the real GPU device IDs
- From the real GPU device IDs, it generates meta-device IDs
- `mgdp` advertises these meta-device IDs to K8s
- When a user requests a GPU fraction, for example 0.5 GPU, `mgdp` allocates 50 meta-device IDs
- The 50 meta-gpus are bound to 1 real device ID, and that real device ID is injected into the container (see the check below)
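Once the plugin is running, the advertised meta-devices show up as a node resource. As a quick sanity check (output is illustrative; it assumes a node with 2 physical GPUs and the default of 100 meta-gpus per device):

```sh
kubectl describe node <gpu-node-name> | grep cnvrg.io/metagpu
#   cnvrg.io/metagpu:  200
```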
In addition, each metagpu container will include the `mgctl` binary. `mgctl` is an alternative to `nvidia-smi`; it improves security and provides better K8s integration.
By default, `mgdp` shares each of your GPU devices as 100 meta-gpus. For example, if you have a machine with 2 GPUs, `mgdp` will generate 200 metagpus. Requesting 50 metagpus gives you 0.5 GPU; requesting 150 metagpus gives you 1.5 GPUs.
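In a pod spec, this request is expressed with the `cnvrg.io/metagpu` resource (the same resource used by the test pod below):

```yaml
resources:
  limits:
    cnvrg.io/metagpu: "50" # 50 out of 100 meta-gpus = 0.5 of one physical GPU
```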
To install `mgdp`:

- clone the repo
- use the helm chart to install, or render the manifests with `helm template` and apply them manually
```sh
# cd into the cloned directory and run
# (for OpenShift, set ocp=true; "metagpu" is the Helm release name)
helm install metagpu chart --set ocp=false -ncnvrg
```
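To confirm the release was installed (release name `metagpu` as used above):

```sh
helm list -ncnvrg
```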
```sh
# cd into the cloned directory and run
# (for OpenShift, set ocp=true)
helm template chart --set ocp=false -ncnvrg > metagpu.yaml
kubectl apply -f metagpu.yaml
```
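With either method, verify the plugin pods are up (the chart was installed into the `cnvrg` namespace above):

```sh
kubectl get pods -ncnvrg
```

Then deploy a test workload that requests a GPU fraction (30 metagpus = 0.3 GPU):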
```sh
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: metagpu-test
  namespace: cnvrg
spec:
  tolerations:
    - operator: "Exists"
  containers:
    - name: gpu-test-with-gpu
      image: tensorflow/tensorflow:latest-gpu
      command:
        - /usr/local/bin/python
        - -c
        - |
          import tensorflow as tf
          tf.get_logger().setLevel('INFO')
          gpus = tf.config.list_physical_devices('GPU')
          if gpus:
              # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
              try:
                  tf.config.set_logical_device_configuration(gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
                  logical_gpus = tf.config.list_logical_devices('GPU')
                  print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
              except RuntimeError as e:
                  # Virtual devices must be set before GPUs have been initialized
                  print(e)
          print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
          while True:
              print(tf.reduce_sum(tf.random.normal([1000, 1000])))
      resources:
        limits:
          cnvrg.io/metagpu: "30"
EOF
```
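Once the pod is scheduled, its logs should report the visible GPUs:

```sh
kubectl logs -f metagpu-test -ncnvrg
```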