gpu-deploy — GPU node automation

Automate provisioning of NVIDIA GPU support on Linux nodes and make GPUs available to Kubernetes clusters.

What this does

  • Optionally installs NVIDIA drivers via package manager (Ubuntu)
  • Installs and configures nvidia-container-toolkit for containerd/Docker (see the manual sketch after this list)
  • Deploys a systemd helper that ensures GPU devices are available after boot and refreshes the device-plugin
  • Provides CI/CD scaffolding (GitHub Actions) for lint, test, and packaging
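
On Ubuntu, the driver and toolkit items are roughly equivalent to the following manual steps. This is a sketch, not the role's exact task list, and it assumes the NVIDIA container toolkit apt repository is already configured.

# Manual equivalent of driver + toolkit setup (sketch only; the role's tasks may differ)
sudo apt-get update
sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall                       # install a recommended NVIDIA driver

sudo apt-get install -y nvidia-container-toolkit      # assumes the NVIDIA apt repo is configured
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd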

Repository structure

gpu-deploy/
├── ansible/
│   ├── playbook.yaml           # Example playbook
│   └── roles/gpu/              # Ansible role for GPU setup
│       ├── defaults/           # Role variables
│       ├── files/              # nvidia-device-ready.sh script
│       ├── handlers/           # systemd reload handler
│       ├── tasks/              # Role tasks (driver, toolkit, systemd)
│       └── templates/          # nvidia-device-ready.service unit
├── .github/workflows/ci.yml    # CI pipeline
├── tests/e2e/smoke.sh          # Smoke test for GPU presence
├── Makefile                    # Shortcuts for lint/test/package
├── CHANGELOG.md
└── README.md

Quick start

Prerequisites

  • Ubuntu nodes (18.04, 20.04, 22.04, or 24.04)
  • Ansible 2.9 or newer on the control node
  • SSH access and sudo privileges on the target nodes
  • (Optional) GPU hardware and a kernel compatible with the NVIDIA driver

Install on a single node

  1. Clone this repo:

    git clone <repo-url> gpu-deploy
    cd gpu-deploy
  2. Edit ansible/playbook.yaml and set variables:

    - hosts: gpu_nodes
      become: true
      roles:
        - role: gpu
          vars:
            gpu_install_driver: true           # set false if driver already installed
            gpu_nvidia_ctk_runtime: containerd # or 'docker'
            gpu_reboot_after_driver_install: true
  3. Run the playbook (an example inventory is sketched after these steps):

    ansible-playbook -i <your-inventory> ansible/playbook.yaml
  4. Verify:

    ssh <target-host> nvidia-smi
    ssh <target-host> systemctl status nvidia-device-ready.service
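
The -i argument expects an Ansible inventory that defines the gpu_nodes group. A minimal INI-style inventory might look like this; hostnames, addresses, and the remote user are placeholders.

[gpu_nodes]
gpu-node-1 ansible_host=10.0.10.11 ansible_user=ubuntu
gpu-node-2 ansible_host=10.0.10.12 ansible_user=ubuntu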

Role variables (defaults in ansible/roles/gpu/defaults/main.yml)

Variable                          Default                      Description
gpu_install_driver                false                        Install the NVIDIA driver via ubuntu-drivers autoinstall
gpu_driver_package                ubuntu-drivers-common        Package installed for driver management
gpu_nvidia_ctk_runtime            containerd                   Container runtime to configure (containerd or docker)
gpu_reboot_after_driver_install   true                         Reboot if a driver was installed and /var/run/reboot-required exists
gpu_kubeconfig_path               /etc/rancher/k3s/k3s.yaml    Path to the kubeconfig used for the device-plugin refresh
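
Based on the table above, the defaults file presumably looks something like the sketch below; the file shipped in ansible/roles/gpu/defaults/main.yml is authoritative.

# ansible/roles/gpu/defaults/main.yml (sketch reconstructed from the table above)
gpu_install_driver: false
gpu_driver_package: ubuntu-drivers-common
gpu_nvidia_ctk_runtime: containerd
gpu_reboot_after_driver_install: true
gpu_kubeconfig_path: /etc/rancher/k3s/k3s.yaml

Override any of these per host or group, or inline under the role's vars as in the quick-start playbook.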

Testing

Run the smoke test on a GPU node:

./tests/e2e/smoke.sh

Expected: nvidia-smi runs successfully and /dev/nvidia0 is present.
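
The shipped script is tests/e2e/smoke.sh; a minimal version of that kind of check could look like the sketch below.

#!/usr/bin/env bash
# Minimal GPU smoke check (sketch; the repository's tests/e2e/smoke.sh may do more)
set -euo pipefail

command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found" >&2; exit 1; }
nvidia-smi >/dev/null            || { echo "nvidia-smi failed" >&2; exit 1; }
[ -e /dev/nvidia0 ]              || { echo "/dev/nvidia0 missing" >&2; exit 1; }

echo "GPU smoke test passed"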

CI/CD

GitHub Actions workflow (.github/workflows/ci.yml) runs:

  • Lint: ansible-lint, shellcheck
  • Test: smoke test (requires self-hosted GPU runner for full validation)
  • Package: creates a tarball artifact on push to main

For production, add a self-hosted runner with GPU access and configure secrets as needed.
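
As a sketch, a workflow implementing that pipeline could be shaped like the following; job and step names are illustrative, not the repository's actual ci.yml.

# Sketch of a possible .github/workflows/ci.yml; the real pipeline may differ
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pipx install ansible-lint && ansible-lint ansible/
      - run: sudo apt-get update && sudo apt-get install -y shellcheck
      - run: shellcheck tests/e2e/*.sh ansible/roles/gpu/files/*.sh

  package:
    if: github.ref == 'refs/heads/main'
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: tar czf gpu-deploy.tar.gz ansible/ tests/ Makefile README.md
      - uses: actions/upload-artifact@v4
        with:
          name: gpu-deploy
          path: gpu-deploy.tar.gz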

Systemd helper (nvidia-device-ready.service)

The role installs a systemd unit that:

  • Binds to dev-nvidia0.device (triggered when /dev/nvidia0 appears)
  • Runs /usr/local/bin/nvidia-device-ready.sh which:
    • Waits for /dev/nvidia0 (with timeout)
    • Restarts container runtime (containerd/docker)
    • Deletes device-plugin pods to force re-registration with kubelet

This ensures GPU resources remain available after reboots or driver updates.
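
In outline, the helper script behaves like the sketch below; the timeout, runtime handling, and device-plugin label selector here are assumptions, and the shipped files/nvidia-device-ready.sh is authoritative.

#!/usr/bin/env bash
# Sketch of nvidia-device-ready.sh behaviour; timeout and label selector are assumptions
set -euo pipefail

RUNTIME="${RUNTIME:-containerd}"                        # or: docker
KUBECONFIG="${KUBECONFIG:-/etc/rancher/k3s/k3s.yaml}"   # matches the gpu_kubeconfig_path default

# 1. Wait (up to ~2 minutes) for the GPU device node to appear
for _ in $(seq 1 60); do
  [ -e /dev/nvidia0 ] && break
  sleep 2
done
[ -e /dev/nvidia0 ] || { echo "/dev/nvidia0 never appeared" >&2; exit 1; }

# 2. Restart the container runtime so it picks up the NVIDIA runtime configuration
systemctl restart "${RUNTIME}"

# 3. Delete device-plugin pods so they re-register GPUs with the kubelet
kubectl --kubeconfig "${KUBECONFIG}" -n kube-system \
  delete pod -l app.kubernetes.io/name=nvidia-device-plugin --ignore-not-found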

Check status:

systemctl status nvidia-device-ready.service
journalctl -u nvidia-device-ready.service -b

Kubernetes integration

After running the role on GPU nodes:

  1. Deploy the NVIDIA device-plugin DaemonSet (see the NVIDIA k8s-device-plugin project)
  2. Verify node allocatable: kubectl get node <node> -o json | jq '.status.allocatable["nvidia.com/gpu"]'
  3. Run a GPU workload (e.g., the cuda-vectoradd sample; an example manifest follows this list)
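
A minimal test pod along these lines requests one GPU; the image tag is an example, so check NVIDIA's registry for a current cuda-sample tag.

# Example GPU test pod; the image tag is illustrative
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1

Apply it with kubectl apply -f and check the pod logs for a passing result.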

For full automation, consider the NVIDIA GPU Operator (Helm chart), which manages the driver, container toolkit, device-plugin, and monitoring.

Development

# Lint
make lint

# Test
make test

# Package
make package
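
These targets presumably wrap the same tools the CI pipeline uses; a sketch (not the shipped Makefile) could look like the following. Recipe lines must be indented with tabs.

# Sketch of possible Makefile targets, inferred from the CI description
.PHONY: lint test package

lint:
	ansible-lint ansible/
	shellcheck tests/e2e/*.sh ansible/roles/gpu/files/*.sh

test:
	./tests/e2e/smoke.sh

package:
	tar czf gpu-deploy.tar.gz ansible/ tests/ Makefile README.md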

License

MIT — see LICENSE

Contributing

PRs welcome. Please run make lint and make test before submitting.
