Plugin GPU Checker

Dwi Elfianto edited this page Dec 6, 2025 · 3 revisions

Docker CLI plugin for verifying NVIDIA GPU availability and configuration for Docker containers.

Overview

Plugin: docker-smi (System Management Interface)
Location: ~/.docker/cli-plugins/docker-smi

docker smi checks:

  1. Docker daemon accessibility
  2. NVIDIA Container Toolkit installation
  3. GPU availability by running nvidia-smi in a test container
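
The three checks above can be sketched as a short script. This is a hypothetical outline, not the actual docker-smi source: the function names are illustrative, and it assumes the toolkit registers a runtime named "nvidia" that shows up in `docker info`.

```shell
#!/bin/bash
# Hypothetical sketch of the three checks; not the docker-smi implementation.

check_daemon() {
    # `docker info` exits non-zero when the daemon cannot be reached.
    docker info > /dev/null 2>&1
}

check_toolkit() {
    # Assumption: the toolkit registers a runtime named "nvidia" with Docker.
    docker info --format '{{json .Runtimes}}' 2> /dev/null | grep -q '"nvidia"'
}

check_gpu() {
    # Same test container command as listed under "GPU Test Container".
    docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi > /dev/null 2>&1
}

run_checks() {
    # Stop at the first failing check; the return code says which one failed.
    check_daemon  || { echo "daemon check failed";  return 1; }
    check_toolkit || { echo "toolkit check failed"; return 2; }
    check_gpu     || { echo "gpu check failed";     return 3; }
    echo "all checks passed"
}
```

Calling `run_checks` runs the checks in order, so a later failure implies the earlier checks passed.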

Installation

mkdir -p ~/.docker/cli-plugins
cp /srv/compose/docker/cli-plugins/docker-smi ~/.docker/cli-plugins/
chmod +x ~/.docker/cli-plugins/docker-smi
docker smi --help

Usage

# Run GPU check
docker smi

# With options (list them with: docker smi --help)
docker smi [options]

Output

Successful Check

Checking Docker GPU configuration...

✓ Docker daemon is accessible
✓ NVIDIA Container Toolkit is installed
✓ Testing GPU access...

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-...)
  Driver Version: 535.154.05
  CUDA Version: 12.2

✓ GPU is accessible from Docker containers

Failed Check

Checking Docker GPU configuration...

✓ Docker daemon is accessible
✗ NVIDIA Container Toolkit is NOT installed or not configured

Error: nvidia-container-runtime not found in Docker
Please install NVIDIA Container Toolkit:
  https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/

Checks Performed

1. Docker Daemon Access

Verifies that the Docker daemon is running and accessible.

Pass: Docker commands execute successfully
Fail: Connection refused, permission denied
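
The two failure modes call for different fixes, so a script may want to tell them apart. A minimal sketch, assuming the Docker CLI's error text contains the phrases matched below; `classify_docker_error` is a hypothetical helper, not part of docker-smi:

```shell
#!/bin/bash
# Hypothetical helper: map a Docker CLI error message (read from stdin)
# to a short label. The matched phrases are assumptions about the CLI's wording.
classify_docker_error() {
    local msg
    msg=$(cat)
    case "$msg" in
        *"permission denied"*)
            echo "permission" ;;
        *"Cannot connect to the Docker daemon"* | *"connection refused"*)
            echo "not-running" ;;
        *)
            echo "unknown" ;;
    esac
}

# Example use, only after `docker info` has failed:
#   if ! err=$(docker info 2>&1 > /dev/null); then
#       echo "$err" | classify_docker_error
#   fi
```

A "permission" result points at docker group membership; "not-running" points at the Docker service itself.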

2. NVIDIA Container Toolkit

Checks whether the NVIDIA Container Toolkit is installed and registered with Docker.

Pass: nvidia-container-runtime exists
Fail: Runtime not found or not configured

3. GPU Test Container

Runs a test container with GPU access:

docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

Pass: nvidia-smi output shows GPU
Fail: No GPU detected, permission issues
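
To turn the pass criterion into a number a script can act on, one option is to count the devices the container reports. A minimal sketch, assuming `nvidia-smi -L` prints one `GPU <index>: ...` line per device (the same shape as the GPU line under Successful Check); `count_gpus` is a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: count "GPU <n>: ..." lines read from stdin.
count_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so `|| true` keeps the function's exit status clean.
    grep -c -E '^GPU [0-9]+:' || true
}

# Example use (requires Docker and a working GPU setup):
#   docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi -L | count_gpus
```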

Use Cases

Initial Setup Verification

# After installing NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify it works
docker smi

Troubleshooting GPU Issues

# Service can't access GPU
composectl logs genai-ollama | grep -i gpu

# Check Docker GPU config
docker smi

# If it fails, check nvidia-smi on the host
nvidia-smi

Pre-Flight Check

# Before starting GPU services
docker smi

# If successful, start services
sudo composectl start genai-ollama genai-swarmui

CI/CD Pipeline

#!/bin/bash
# Ensure GPU is available before deployment
if docker smi; then
    echo "GPU check passed, deploying AI services"
    sudo composectl start genai-ollama genai-openwebui
else
    echo "GPU check failed, skipping AI services"
    exit 1
fi
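
In a pipeline the Docker daemon may still be starting when the check runs, so a single failed attempt is not always conclusive. A minimal sketch of a retry wrapper; `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary:

```shell
#!/bin/bash
# Hypothetical helper: retry a command a few times before giving up.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        # Return as soon as the command succeeds.
        "$@" && return 0
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example: retry 3 10 docker smi
```

Using `retry 3 10 docker smi` in place of the bare `docker smi` condition keeps the rest of the script unchanged.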

Troubleshooting

Plugin Not Found

# Check installation
ls -la ~/.docker/cli-plugins/docker-smi

# Make executable
chmod +x ~/.docker/cli-plugins/docker-smi

# Verify Docker sees it
docker --help | grep smi
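
The first two steps can be folded into one helper that says which case applies. A minimal sketch; `plugin_status` is a hypothetical name, and the default path is the plugin location given in the Overview:

```shell
#!/bin/bash
# Hypothetical helper: report whether a CLI plugin file exists and is executable.
plugin_status() {
    local path=${1:-$HOME/.docker/cli-plugins/docker-smi}
    if [ ! -e "$path" ]; then
        echo "missing"
    elif [ ! -x "$path" ]; then
        echo "not-executable"
    else
        echo "ok"
    fi
}

# Example: plugin_status
```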

Docker Daemon Not Accessible

✗ Docker daemon is not accessible

Solutions:

# Check Docker is running
sudo systemctl status docker

# Start Docker
sudo systemctl start docker

# Check user in docker group
groups | grep docker

# Add user to docker group
sudo usermod -aG docker "$USER"
# Log out and back in (or run: newgrp docker) for the change to take effect

NVIDIA Toolkit Not Found

✗ NVIDIA Container Toolkit is NOT installed

Solutions:

# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

# Test again
docker smi

GPU Not Detected in Container

✗ GPU test failed: no GPU detected

Solutions:

# Check GPU on host
nvidia-smi

# If GPU works on host, check Docker config
cat /etc/docker/daemon.json

# Should include an nvidia runtime entry ("default-runtime" is optional when containers use --gpus):
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

# Restart Docker after config change
sudo systemctl restart docker

# Test again
docker smi
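
For scripted setups, the daemon.json inspection can be approximated with a text match. A minimal sketch, assuming the runtime is registered under the key "nvidia" as in the example config; `has_nvidia_runtime` is a hypothetical helper and does not parse JSON, so treat a match as a hint only:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given daemon.json has an "nvidia" key.
# This is a plain text match, not a JSON parser.
has_nvidia_runtime() {
    local file=${1:-/etc/docker/daemon.json}
    # Match "nvidia" used as a key (followed by a colon), not as a value.
    [ -r "$file" ] && grep -q '"nvidia"[[:space:]]*:' "$file"
}

# Example:
#   has_nvidia_runtime || echo "nvidia runtime not found in daemon.json"
```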

Common Issues

Outdated NVIDIA Drivers

Error: CUDA version mismatch

Solution: Update the host NVIDIA driver. Version 450 or newer is the baseline for container support; CUDA 12.x images generally need 525 or newer.
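
A script can compare the installed driver against a minimum before running the test container. A minimal sketch that compares only the major version; `driver_at_least` is a hypothetical helper, and the threshold is whatever your images require:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the driver's major version is at least MIN.
# Usage: driver_at_least <driver_version> <min_major>
driver_at_least() {
    # Strip everything from the first dot onward: 535.154.05 -> 535
    local major=${1%%.*}
    [ "$major" -ge "$2" ]
}

# Example (on a host with the NVIDIA driver installed):
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_at_least "$v" 450 || echo "driver $v is older than 450"
```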

Docker Runtime Not Configured

Error: unknown runtime specified nvidia

Solution: Run sudo nvidia-ctk runtime configure --runtime=docker

Permission Denied

Error: permission denied while trying to connect

Solution: Add your user to the docker group (then log out and back in), or run with sudo

Best Practices

  1. Run after driver updates - Verify GPU still works with Docker
  2. Use in setup scripts - Automated verification
  3. Check before GPU services - Prevent startup failures
  4. Test after Docker changes - Config or daemon.json updates
  5. Include in documentation - Help users verify GPU setup
  6. Use in health checks - CI/CD pipeline verification

Integration Examples

Startup Script

#!/bin/bash
# Safe GPU service startup

if ! docker smi; then
    echo "GPU not available, cannot start AI services"
    exit 1
fi

echo "GPU available, starting services..."
sudo composectl start genai-ollama genai-swarmui genai-embedding

Health Check Script

#!/bin/bash
# Periodic GPU health check

if docker smi > /dev/null 2>&1; then
    echo "GPU OK"
else
    echo "GPU PROBLEM - Restarting Docker"
    sudo systemctl restart docker
    sleep 5
    docker smi
fi

Service Wrapper

#!/bin/bash
# Wrap service start with GPU check

# Abort with a usage message when no service name is given
service=${1:?"usage: $0 <service>"}

if [[ "$service" == genai-* ]] || [[ "$service" == "swarmui" ]]; then
    if ! docker smi > /dev/null 2>&1; then
        echo "Warning: GPU not detected, service may not function correctly"
        read -p "Continue anyway? (y/n) " -n 1 -r
        echo
        [[ ! $REPLY =~ ^[Yy]$ ]] && exit 1
    fi
fi

sudo composectl start "$service"

Quick Reference

# Check GPU availability
docker smi

# Compare with host GPU
nvidia-smi

# Use in scripts
if docker smi; then
    echo "GPU OK"
else
    echo "GPU NOT OK"
fi

# Check specific output
docker smi | grep "Driver Version"
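
When only the version number is needed, the grep above can be replaced by a helper that strips the label. A minimal sketch, assuming the output contains a `Driver Version: <number>` field as in the Successful Check example; `driver_version` is a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: print the first "Driver Version" value found on stdin.
driver_version() {
    # Capture the run of digits and dots that follows the label.
    sed -n 's/.*Driver Version: *\([0-9.][0-9.]*\).*/\1/p' | head -n 1
}

# Example: docker smi | driver_version
```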

Next: Troubleshooting - Common issues and solutions →
