Automate provisioning of NVIDIA GPU support on Linux nodes and make GPUs available to Kubernetes clusters.
- Optionally installs NVIDIA drivers via package manager (Ubuntu)
- Installs and configures `nvidia-container-toolkit` for containerd/Docker (the manual equivalent is sketched below)
- Deploys a systemd helper that ensures GPU devices are available after boot and refreshes the device-plugin
- Provides CI/CD scaffolding (GitHub Actions) for lint, test, and packaging
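For reference, the toolkit-configuration step is conceptually equivalent to the following manual commands, a sketch of what the role automates using the NVIDIA Container Toolkit CLI:

```bash
# Manual equivalent of the toolkit-configuration step (the role automates this).
sudo nvidia-ctk runtime configure --runtime=containerd   # or --runtime=docker
sudo systemctl restart containerd                        # or: sudo systemctl restart docker
```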
```
gpu-deploy/
├── ansible/
│   ├── playbook.yaml            # Example playbook
│   └── roles/gpu/               # Ansible role for GPU setup
│       ├── defaults/            # Role variables
│       ├── files/               # nvidia-device-ready.sh script
│       ├── handlers/            # systemd reload handler
│       ├── tasks/               # Role tasks (driver, toolkit, systemd)
│       └── templates/           # nvidia-device-ready.service unit
├── .github/workflows/ci.yml     # CI pipeline
├── tests/e2e/smoke.sh           # Smoke test for GPU presence
├── Makefile                     # Shortcuts for lint/test/package
├── CHANGELOG.md
└── README.md
```
- Ubuntu-based nodes (18.04, 20.04, 22.04, or 24.04)
- Ansible 2.9+ on the control node
- SSH access and sudo on target nodes (see the pre-flight check below)
- (Optional) GPU hardware and an NVIDIA-compatible kernel
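A quick pre-flight check from the control node confirms the SSH and sudo requirements. This is a sketch and assumes your inventory defines a `gpu_nodes` group, matching the example playbook below:

```bash
# Verify SSH reachability and privilege escalation for the gpu_nodes group.
ansible -i <your-inventory> gpu_nodes -m ping
ansible -i <your-inventory> gpu_nodes -b -m command -a "whoami"   # should print 'root'
```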
- Clone this repo:

  ```bash
  git clone <repo-url> gpu-deploy
  cd gpu-deploy
  ```

- Edit `ansible/playbook.yaml` and set variables:

  ```yaml
  - hosts: gpu_nodes
    become: true
    roles:
      - role: gpu
        vars:
          gpu_install_driver: true                # set false if driver already installed
          gpu_nvidia_ctk_runtime: containerd      # or 'docker'
          gpu_reboot_after_driver_install: true
  ```

- Run the playbook:

  ```bash
  ansible-playbook -i <your-inventory> ansible/playbook.yaml
  ```

- Verify:

  ```bash
  ssh <target-host> nvidia-smi
  ssh <target-host> systemctl status nvidia-device-ready.service
  ```
| Variable | Default | Description |
|---|---|---|
| `gpu_install_driver` | `false` | Install NVIDIA driver via `ubuntu-drivers autoinstall` |
| `gpu_driver_package` | `ubuntu-drivers-common` | Package to install for driver management |
| `gpu_nvidia_ctk_runtime` | `containerd` | Container runtime (`containerd` or `docker`) |
| `gpu_reboot_after_driver_install` | `true` | Reboot if driver installed and `/var/run/reboot-required` exists |
| `gpu_kubeconfig_path` | `/etc/rancher/k3s/k3s.yaml` | Path to kubeconfig for device-plugin refresh |
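Defaults can also be overridden at run time with `--extra-vars`; a sketch with illustrative values (the JSON form keeps booleans typed):

```bash
# Override role defaults without editing the playbook (values are illustrative).
ansible-playbook -i <your-inventory> ansible/playbook.yaml \
  -e '{"gpu_install_driver": true, "gpu_nvidia_ctk_runtime": "docker", "gpu_kubeconfig_path": "/etc/kubernetes/admin.conf"}'
```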
Run the smoke test on a GPU node:

```bash
./tests/e2e/smoke.sh
```

Expected: `nvidia-smi` succeeds and `/dev/nvidia0` is present.
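For orientation, the checks amount to roughly the following (a sketch; see `tests/e2e/smoke.sh` for the actual script):

```bash
# Approximate logic of the smoke test (sketch only).
set -euo pipefail
command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found" >&2; exit 1; }
nvidia-smi >/dev/null            || { echo "nvidia-smi failed" >&2; exit 1; }
[ -e /dev/nvidia0 ]              || { echo "/dev/nvidia0 missing" >&2; exit 1; }
echo "GPU smoke test passed"
```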
The GitHub Actions workflow (`.github/workflows/ci.yml`) runs:
- Lint: `ansible-lint`, `shellcheck`
- Test: smoke test (requires a self-hosted GPU runner for full validation)
- Package: creates a tarball artifact on push to `main`
For production, add a self-hosted runner with GPU access and configure secrets as needed.
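The lint stage can be reproduced locally before pushing; the exact file targets below are assumptions based on the repository layout above:

```bash
# Roughly what the CI lint job runs (paths are assumptions; `make lint` wraps the same tools).
ansible-lint ansible/
shellcheck tests/e2e/smoke.sh ansible/roles/gpu/files/*.sh
```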
The role installs a systemd unit that:
- Binds to `dev-nvidia0.device` (triggered when `/dev/nvidia0` appears)
- Runs `/usr/local/bin/nvidia-device-ready.sh`, which:
  - Waits for `/dev/nvidia0` (with a timeout)
  - Restarts the container runtime (containerd/docker)
  - Deletes device-plugin pods to force re-registration with the kubelet
This ensures GPU resources remain available after reboots or driver updates.
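For orientation, the helper's logic looks roughly like the sketch below. The shipped `files/nvidia-device-ready.sh` may differ in detail, and the device-plugin namespace and label selector here are assumptions to adjust for your deployment.

```bash
#!/usr/bin/env bash
# Simplified sketch of /usr/local/bin/nvidia-device-ready.sh (illustrative only).
set -euo pipefail

RUNTIME="${1:-containerd}"                      # containerd or docker
KUBECONFIG_PATH="/etc/rancher/k3s/k3s.yaml"     # role default (gpu_kubeconfig_path)

# 1. Wait for the GPU device node, with a timeout.
for _ in $(seq 1 60); do
  [ -e /dev/nvidia0 ] && break
  sleep 2
done
[ -e /dev/nvidia0 ] || { echo "/dev/nvidia0 never appeared" >&2; exit 1; }

# 2. Restart the container runtime so it picks up the NVIDIA runtime hooks.
systemctl restart "${RUNTIME}"

# 3. Delete device-plugin pods so they re-register GPUs with the kubelet.
#    Namespace and label selector are assumptions; match your deployment.
kubectl --kubeconfig "${KUBECONFIG_PATH}" -n kube-system \
  delete pod -l name=nvidia-device-plugin-ds --ignore-not-found
```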
Check status:

```bash
systemctl status nvidia-device-ready.service
journalctl -u nvidia-device-ready.service -b
```

After running the role on GPU nodes:
- Deploy the NVIDIA device-plugin DaemonSet (see NVIDIA device-plugin)
- Verify node allocatable: `kubectl get node <node> -o json | jq '.status.allocatable["nvidia.com/gpu"]'`
- Run a GPU workload (e.g., the `cuda-vectoradd` sample; see the sketch below)
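A minimal GPU workload for the last step might look like the following sketch; the sample image tag is an assumption, so check NVIDIA's registry (nvcr.io) for a current CUDA vectorAdd sample:

```bash
# Run a one-off CUDA vectorAdd pod that requests one GPU (image tag is an assumption).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Once the pod completes, its logs should report a successful vector addition.
kubectl logs cuda-vectoradd
```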
For full automation, consider the NVIDIA GPU Operator (Helm chart), which handles the driver, toolkit, device-plugin, and monitoring.
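If you go that route, installation is typically a couple of Helm commands; a sketch following NVIDIA's published instructions (verify the repo URL and chart options against current docs):

```bash
# Install the NVIDIA GPU Operator via Helm (sketch; check NVIDIA docs for current options).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```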
```bash
# Lint
make lint

# Test
make test

# Package
make package
```

MIT (see LICENSE).
PRs are welcome. Please run `make lint` and `make test` before submitting.