redhat-na-ssa/demo-ocp-gpu

Demo GPUs on OpenShift

Set up Nvidia GPUs on OpenShift with ease. This repo is intended as a foundation for GPU workloads on OpenShift.

Initially, bootstrap.sh configures GPU time-slicing, which allows two workloads to share a single GPU.

In addition, the components folder is intended for reuse with ArgoCD or OpenShift GitOps; familiarity with Kustomize will be helpful. This folder contains various recipes for use with oc apply -k.

Prerequisites

  • Nvidia GPU hardware or a cloud provider with GPU instances
  • OpenShift 4.11+ with cluster-admin access
  • Internet access
  • AWS (optional, for autoscaling)
  • OpenShift Dev Spaces 3.8.0+ (optional)

Red Hat Demo Platform Options (Tested)

Quickstart

Setup cluster GPU operators

scripts/bootstrap.sh

Various Commands

AWS autoscaling w/ OpenShift Dev Spaces

NOTE: GPU nodes may take 10 to 15 minutes to become available

# aws gpu - load functions
. scripts/bootstrap.sh

# aws gpu - basic gpu autoscaling
ocp_aws_cluster_autoscaling

# deploy devspaces
setup_operator_devspaces
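While waiting for GPU nodes to appear, the Machine API shows provisioning progress. One way to summarize machine phases is a small helper like the sketch below (machine_phases is a hypothetical function, not part of this repo):

```shell
# machine_phases: tally machines by phase (Provisioning, Provisioned, Running, ...).
# Hypothetical helper; reads `oc -n openshift-machine-api get machines -o json` on stdin.
machine_phases() {
  jq -r '[.items[] | .status.phase // "unknown"] | group_by(.) | map("\(.[0]) \(length)") | .[]'
}

# usage (against a live cluster):
# oc -n openshift-machine-api get machines -o json | machine_phases
```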

Deploy GPU test pod

oc apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/tests/gpu-pod.yaml
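The upstream test pod runs the CUDA vectorAdd sample, which prints "Test PASSED" on success. A quick way to check the result is to scan the pod logs; the sketch below assumes the pod name from the upstream manifest (confirm with oc get pods):

```shell
# check_gpu_test: scan pod logs for the vectorAdd success marker.
# Hypothetical helper, not part of this repo; pipe `oc logs <pod>` into it.
check_gpu_test() {
  if grep -q "Test PASSED"; then echo "GPU test OK"; else echo "GPU test FAILED"; fi
}

# usage (pod name is an assumption from the upstream manifest):
# oc logs cuda-vectoradd | check_gpu_test
```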

Setup Time Slicing (2x)

oc apply -k components/operators/gpu-operator-certified/instance/overlays/time-sliced-2
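With the time-sliced-2 overlay applied, each physical GPU should advertise 2 allocatable nvidia.com/gpu resources. A small sketch (gpu_allocatable is a hypothetical helper, not part of this repo) for tabulating allocatable GPU counts per node:

```shell
# gpu_allocatable: print each node's allocatable nvidia.com/gpu count.
# Hypothetical helper; reads `oc get nodes -o json` on stdin.
gpu_allocatable() {
  jq -r '.items[] | [.metadata.name, (.status.allocatable["nvidia.com/gpu"] // "0")] | @tsv'
}

# usage (against a live cluster):
# oc get nodes -l node-role.kubernetes.io/gpu -o json | gpu_allocatable
```

A node with a single physical GPU should report 2 after time-slicing takes effect.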

Request / test a GPU workload with 6 replicas

oc apply -k components/demos/nvidia-gpu-verification/overlays/toleration-replicas-6

# check the number of pods
oc -n nvidia-gpu-verification get pods
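If the cluster has fewer than 6 allocatable GPUs, some replicas will stay Pending (and, with autoscaling enabled, trigger new GPU nodes). A sketch for counting the replicas that are actually Running (running_count is a hypothetical helper, not part of this repo):

```shell
# running_count: count pods in phase Running.
# Hypothetical helper; reads `oc get pods -o json` on stdin.
running_count() {
  jq '[.items[] | select(.status.phase == "Running")] | length'
}

# usage (against a live cluster):
# oc -n nvidia-gpu-verification get pods -o json | running_count
```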

Get GPU nodes

oc get nodes -l node-role.kubernetes.io/gpu

oc get nodes \
  -l node-role.kubernetes.io/gpu \
  -o jsonpath='{.items[*].status.allocatable}' | jq . | grep nvidia

Watch cluster autoscaler logs

oc -n openshift-machine-api logs -f deploy/cluster-autoscaler-default

Manually label nodes as GPU (optional)

NODE=worker1.ocp.run
oc label node/${NODE} --overwrite "node-role.kubernetes.io/gpu="

Other Instructions

Nvidia Multi Instance GPU (MIG) on OpenShift

Links

Container License

udi-cuda images from HERE are based on official NVIDIA CUDA images.

Please be aware of the associated terms and conditions.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.

By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.