gpu-deploy — GPU node automation

Automate provisioning of NVIDIA GPU support on Linux nodes and make GPUs available to Kubernetes clusters.

What this does

  • Optionally installs NVIDIA drivers via package manager (Ubuntu)
  • Installs and configures nvidia-container-toolkit for containerd/Docker (see the manual sketch after this list)
  • Deploys a systemd helper that ensures GPU devices are available after boot and refreshes the device-plugin
  • Provides CI/CD scaffolding (GitHub Actions) for lint, test, and packaging
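
On Ubuntu, the driver and toolkit items are roughly equivalent to the following manual steps. This is a sketch, not the role's exact task list, and it assumes the NVIDIA container toolkit apt repository is already configured.

# Manual equivalent of driver + toolkit setup (sketch only; the role's tasks may differ)
sudo apt-get update
sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall                       # install a recommended NVIDIA driver

sudo apt-get install -y nvidia-container-toolkit      # assumes the NVIDIA apt repo is configured
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd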

Repository structure

gpu-deploy/
├── ansible/
│   ├── playbook.yaml           # Example playbook
│   └── roles/gpu/              # Ansible role for GPU setup
│       ├── defaults/           # Role variables
│       ├── files/              # nvidia-device-ready.sh script
│       ├── handlers/           # systemd reload handler
│       ├── tasks/              # Role tasks (driver, toolkit, systemd)
│       └── templates/          # nvidia-device-ready.service unit
├── .github/workflows/ci.yml    # CI pipeline
├── tests/e2e/smoke.sh          # Smoke test for GPU presence
├── Makefile                    # Shortcuts for lint/test/package
├── CHANGELOG.md
└── README.md

Quick start

Prerequisites

  • Ubuntu nodes (18.04, 20.04, 22.04, or 24.04)
  • Ansible 2.9 or newer on the control node
  • SSH access and sudo privileges on the target nodes
  • (Optional) GPU hardware and a kernel compatible with the NVIDIA driver

Install on a single node

  1. Clone this repo:

    git clone <repo-url> gpu-deploy
    cd gpu-deploy
  2. Edit ansible/playbook.yaml and set variables:

    - hosts: gpu_nodes
      become: true
      roles:
        - role: gpu
          vars:
            gpu_install_driver: true           # set false if driver already installed
            gpu_nvidia_ctk_runtime: containerd # or 'docker'
            gpu_reboot_after_driver_install: true
  3. Run the playbook (an example inventory is sketched after these steps):

    ansible-playbook -i <your-inventory> ansible/playbook.yaml
  4. Verify:

    ssh <target-host> nvidia-smi
    ssh <target-host> systemctl status nvidia-device-ready.service
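
The -i argument expects an Ansible inventory that defines the gpu_nodes group. A minimal INI-style inventory might look like this; hostnames, addresses, and the remote user are placeholders.

[gpu_nodes]
gpu-node-1 ansible_host=10.0.10.11 ansible_user=ubuntu
gpu-node-2 ansible_host=10.0.10.12 ansible_user=ubuntu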

Role variables (defaults in ansible/roles/gpu/defaults/main.yml)

Variable                          Default                      Description
gpu_install_driver                false                        Install the NVIDIA driver via ubuntu-drivers autoinstall
gpu_driver_package                ubuntu-drivers-common        Package installed for driver management
gpu_nvidia_ctk_runtime            containerd                   Container runtime to configure (containerd or docker)
gpu_reboot_after_driver_install   true                         Reboot if a driver was installed and /var/run/reboot-required exists
gpu_kubeconfig_path               /etc/rancher/k3s/k3s.yaml    Path to the kubeconfig used for the device-plugin refresh
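
Based on the table above, the defaults file presumably looks something like the sketch below; the file shipped in ansible/roles/gpu/defaults/main.yml is authoritative.

# ansible/roles/gpu/defaults/main.yml (sketch reconstructed from the table above)
gpu_install_driver: false
gpu_driver_package: ubuntu-drivers-common
gpu_nvidia_ctk_runtime: containerd
gpu_reboot_after_driver_install: true
gpu_kubeconfig_path: /etc/rancher/k3s/k3s.yaml

Override any of these per host or group, or inline under the role's vars as in the quick-start playbook.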

Testing

Run the smoke test on a GPU node:

./tests/e2e/smoke.sh

Expected: nvidia-smi runs successfully and /dev/nvidia0 is present.
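
The shipped script is tests/e2e/smoke.sh; a minimal version of that kind of check could look like the sketch below.

#!/usr/bin/env bash
# Minimal GPU smoke check (sketch; the repository's tests/e2e/smoke.sh may do more)
set -euo pipefail

command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found" >&2; exit 1; }
nvidia-smi >/dev/null            || { echo "nvidia-smi failed" >&2; exit 1; }
[ -e /dev/nvidia0 ]              || { echo "/dev/nvidia0 missing" >&2; exit 1; }

echo "GPU smoke test passed"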

CI/CD

GitHub Actions workflow (.github/workflows/ci.yml) runs:

  • Lint: ansible-lint, shellcheck
  • Test: smoke test (requires self-hosted GPU runner for full validation)
  • Package: creates a tarball artifact on push to main

For production, add a self-hosted runner with GPU access and configure secrets as needed.
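
As a sketch, a workflow implementing that pipeline could be shaped like the following; job and step names are illustrative, not the repository's actual ci.yml.

# Sketch of a possible .github/workflows/ci.yml; the real pipeline may differ
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pipx install ansible-lint && ansible-lint ansible/
      - run: sudo apt-get update && sudo apt-get install -y shellcheck
      - run: shellcheck tests/e2e/*.sh ansible/roles/gpu/files/*.sh

  package:
    if: github.ref == 'refs/heads/main'
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: tar czf gpu-deploy.tar.gz ansible/ tests/ Makefile README.md
      - uses: actions/upload-artifact@v4
        with:
          name: gpu-deploy
          path: gpu-deploy.tar.gz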

Systemd helper (nvidia-device-ready.service)

The role installs a systemd unit that:

  • Binds to dev-nvidia0.device (triggered when /dev/nvidia0 appears)
  • Runs /usr/local/bin/nvidia-device-ready.sh which:
    • Waits for /dev/nvidia0 (with timeout)
    • Restarts container runtime (containerd/docker)
    • Deletes device-plugin pods to force re-registration with kubelet

This ensures GPU resources remain available after reboots or driver updates.
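
In outline, the helper script behaves like the sketch below; the timeout, runtime handling, and device-plugin label selector here are assumptions, and the shipped files/nvidia-device-ready.sh is authoritative.

#!/usr/bin/env bash
# Sketch of nvidia-device-ready.sh behaviour; timeout and label selector are assumptions
set -euo pipefail

RUNTIME="${RUNTIME:-containerd}"                        # or: docker
KUBECONFIG="${KUBECONFIG:-/etc/rancher/k3s/k3s.yaml}"   # matches the gpu_kubeconfig_path default

# 1. Wait (up to ~2 minutes) for the GPU device node to appear
for _ in $(seq 1 60); do
  [ -e /dev/nvidia0 ] && break
  sleep 2
done
[ -e /dev/nvidia0 ] || { echo "/dev/nvidia0 never appeared" >&2; exit 1; }

# 2. Restart the container runtime so it picks up the NVIDIA runtime configuration
systemctl restart "${RUNTIME}"

# 3. Delete device-plugin pods so they re-register GPUs with the kubelet
kubectl --kubeconfig "${KUBECONFIG}" -n kube-system \
  delete pod -l app.kubernetes.io/name=nvidia-device-plugin --ignore-not-found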

Check status:

systemctl status nvidia-device-ready.service
journalctl -u nvidia-device-ready.service -b

Kubernetes integration

After running the role on GPU nodes:

  1. Deploy the NVIDIA device-plugin DaemonSet (see the NVIDIA k8s-device-plugin project)
  2. Verify node allocatable: kubectl get node <node> -o json | jq '.status.allocatable["nvidia.com/gpu"]'
  3. Run a GPU workload (e.g., the cuda-vectoradd sample; an example manifest follows this list)
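
A minimal test pod along these lines requests one GPU; the image tag is an example, so check NVIDIA's registry for a current cuda-sample tag.

# Example GPU test pod; the image tag is illustrative
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1

Apply it with kubectl apply -f and check the pod logs for a passing result.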

For full automation, consider the NVIDIA GPU Operator (Helm chart), which manages the driver, container toolkit, device-plugin, and monitoring.

Development

# Lint
make lint

# Test
make test

# Package
make package
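
These targets presumably wrap the same tools the CI pipeline uses; a sketch (not the shipped Makefile) could look like the following. Recipe lines must be indented with tabs.

# Sketch of possible Makefile targets, inferred from the CI description
.PHONY: lint test package

lint:
	ansible-lint ansible/
	shellcheck tests/e2e/*.sh ansible/roles/gpu/files/*.sh

test:
	./tests/e2e/smoke.sh

package:
	tar czf gpu-deploy.tar.gz ansible/ tests/ Makefile README.md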

License

MIT — see LICENSE

Contributing

PRs welcome. Please run make lint and make test before submitting.
