Set up NVIDIA GPUs on OpenShift with ease. This repo is intended as a foundation for GPU workloads on OpenShift.

Initially, `bootstrap.sh` configures GPU time-slicing, which allows two workloads to share a single GPU.
- Try out GPUs in OpenShift Dev Spaces via this `devfile.yaml`
- Run Jupyter notebooks with PyTorch or TensorFlow
The `components` folder is intended for reuse with Argo CD or OpenShift GitOps. Familiarity with Kustomize will be helpful. This folder contains various secret recipes for `oc apply -k`.
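To reuse a piece of this repo from your own GitOps setup, reference one of its overlays as a Kustomize resource. A minimal sketch (the file name and layout of your repo are assumptions; the path below is the time-slicing overlay applied later in this README):

```yaml
# hypothetical kustomization.yaml in your own GitOps repository,
# pulling in an overlay from this repo as a resource
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - components/operators/gpu-operator-certified/instance/overlays/time-sliced-2
```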
## Prerequisites

- NVIDIA GPU hardware or a cloud provider with GPU instances
- OpenShift 4.11+ with cluster-admin access
- Internet access
- AWS (autoscaling, optional)
- OpenShift Dev Spaces 3.8.0+ (optional)
## Red Hat Demo Platform Options (Tested)

- AWS with OpenShift Open Environment
  - 1 x Control Plane - `m5.4xlarge`
  - 0 x Workers - `m5.2xlarge`
  - 1 x Control Plane -
- MLOps Demo: Data Science & Edge Practice
## Setup cluster GPU operators

```sh
scripts/bootstrap.sh
```
## AWS autoscaling with OpenShift Dev Spaces

NOTE: GPU nodes may take 10 - 15 minutes to become available.

```sh
# aws gpu - load functions
. scripts/bootstrap.sh

# aws gpu - basic gpu autoscaling
ocp_aws_cluster_autoscaling

# deploy devspaces
setup_operator_devspaces
```
## Deploy a GPU test pod

```sh
oc apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/tests/gpu-pod.yaml
```
## Setup time-slicing (2x)

```sh
oc apply -k components/operators/gpu-operator-certified/instance/overlays/time-sliced-2
```
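Under the hood, time-slicing is driven by a device-plugin ConfigMap that the GPU operator's ClusterPolicy references via `devicePlugin.config.name`. A sketch along the lines of the NVIDIA GPU operator docs (names here are illustrative; the actual manifests live in the overlay above):

```yaml
# illustrative time-slicing config for the NVIDIA GPU operator;
# the real manifests are in the overlay applied above
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config        # hypothetical name
  namespace: nvidia-gpu-operator
data:
  device-plugin-config: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2            # 2 workloads share each physical GPU
```

With `replicas: 2`, each physical GPU is advertised as two allocatable `nvidia.com/gpu` resources.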
## Request / test a GPU workload of 6 GPUs

```sh
oc apply -k components/demos/nvidia-gpu-verification/overlays/toleration-replicas-6

# check the number of pods
oc -n nvidia-gpu-verification get pods
```
## Get GPU nodes

```sh
oc get nodes -l node-role.kubernetes.io/gpu

oc get nodes \
  -l node-role.kubernetes.io/gpu \
  -o jsonpath='{.items[*].status.allocatable}' | jq . | grep nvidia
```
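If you want just the GPU count instead of grepping, `jq` can pull the `nvidia.com/gpu` key out of an allocatable object directly. A sketch against sample output (the values below are made up for illustration):

```sh
# sample allocatable object, as returned per node by the jsonpath query above
# (hypothetical values for illustration)
echo '{"cpu":"7500m","memory":"30Gi","nvidia.com/gpu":"1"}' \
  | jq -r '."nvidia.com/gpu"'
# prints: 1
```

Keys containing dots, like `nvidia.com/gpu`, must be quoted inside the jq filter.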
## Watch cluster autoscaler logs

```sh
oc -n openshift-machine-api logs -f deploy/cluster-autoscaler-default
```
## Manually label nodes as GPU (optional)

```sh
NODE=worker1.ocp.run
oc label node/${NODE} --overwrite "node-role.kubernetes.io/gpu="
```
- Nvidia Multi Instance GPU (MIG) on OpenShift
- Additional Notes
- Docs - AWS GPU Instances
- Docs - Nvidia GPU Operator on OpenShift
- Docs - Nvidia GPU admin dashboard
- Docs - Multi Instance GPU (MIG) in OCP
- Docs - Time Slicing in OCP
- Docs - KB GPU Autoscaling
- Blog - RH Nvidia GPUs on OpenShift
- Demo - bkoz GPU DevSpaces
- GPU Operator default config map
## udi-cuda

Images from HERE are based on official NVIDIA CUDA images. Please be aware of the associated terms and conditions.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.