**Quick Setup:** For an easy setup script and quick start instructions, please refer to the README in `mnl_factory`.
A comprehensive solution for setting up and managing GPU nodes with automated deployment using Ansible.
## Table of Contents

- [Prerequisites](#prerequisites)
- [Initial Setup](#initial-setup)
- [Directory Structure](#directory-structure)
- [Configuration](#configuration)
- [GPU Setup Process](#gpu-setup-process)
- [Usage](#usage)
- [Troubleshooting](#troubleshooting)
- [Post-Installation](#post-installation)
- [Notes](#notes)
- [Deployment Instructions](#deployment-instructions)
## Prerequisites

- Ansible installed on your control node
- SSH access to target nodes
- Sudo privileges on target nodes
- NVIDIA GPU(s) on target nodes (the playbook will automatically skip GPU setup if no GPU is detected)
- Internet access for package downloads
- Secure Boot disabled in BIOS (required for NVIDIA driver installation)
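You can confirm the Secure Boot state up front with `mokutil`, the same tool the playbook uses for its own check:

```bash
# Prints "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state
```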
## Initial Setup

1. Install the required Ansible collections:

   ```bash
   ansible-galaxy collection install -r requirements.yml
   ```

2. Set up secrets:
   - Copy `group_vars/vault.yml.example` to `group_vars/vault.yml`
   - Edit `vault.yml` with your actual credentials
   - (Optional) Encrypt the vault file:

     ```bash
     ansible-vault encrypt group_vars/vault.yml
     ```
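The exact keys in `vault.yml` depend on the template shipped in `vault.yml.example`; a minimal sketch with hypothetical variable names might look like:

```yaml
# group_vars/vault.yml -- hypothetical keys for illustration only;
# use the names defined in your vault.yml.example
vault_docker_registry_user: "registry-user"
vault_docker_registry_password: "change-me"
```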
## Directory Structure

```
mnl_factory/
├── group_vars/
│   ├── all.yml              # Common variables
│   ├── vault.yml            # Encrypted secrets
│   └── vault.yml.example    # Example secrets template
├── inventory/
│   └── hosts.yml            # Inventory file
├── playbooks/
│   └── site.yml             # Main playbook
├── roles/
│   ├── prerequisites/       # System prerequisites
│   ├── nvidia_gpu/          # NVIDIA GPU setup and driver installation
│   ├── docker/              # Docker installation
│   └── setup/               # Final configuration
├── requirements.yml         # Ansible Galaxy requirements
└── .gitignore               # Git ignore patterns
```
## Configuration

1. Edit `inventory/hosts.yml` to add your target nodes:

   ```yaml
   all:
     children:
       gpu_nodes:
         hosts:
           your-gpu-node:
             ansible_host: 192.168.1.100
             ansible_user: your-user
             ansible_ssh_private_key_file: ~/.ssh/id_rsa
   ```

2. Adjust the variables in `group_vars/all.yml` (see the sketch after this list):
   - `docker_compose_version`: Docker Compose version
   - `nvidia_driver_version`: NVIDIA driver version (default: "535")
   - `cuda_version`: CUDA version (default: "12.2")
   - `docker_image_name`: Docker image name
   - `docker_registry`: Docker registry URL (if needed)

3. Configure secrets in `group_vars/vault.yml`
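Based on the variables listed above, a `group_vars/all.yml` might look like the following; the values shown are illustrative assumptions, so take the real keys and defaults from the file in the repository:

```yaml
# group_vars/all.yml -- illustrative values only
docker_compose_version: "2.24.0"   # assumption: any recent Compose release
nvidia_driver_version: "535"       # default stated in this README
cuda_version: "12.2"               # default stated in this README
docker_image_name: "your-image"    # replace with your image
docker_registry: ""                # set only if you use a private registry
```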
## GPU Setup Process

The playbook performs the following steps for GPU setup (a sketch of the detection pattern follows the list):

1. **GPU Detection**:
   - Checks for NVIDIA GPU presence using `lspci`
   - Skips all GPU-related tasks if no GPU is found

2. **Driver Status Check**:
   - Verifies whether NVIDIA drivers are already installed via `nvidia-smi`
   - Proceeds with installation only if drivers are missing or need an update

3. **Secure Boot Check**:
   - Verifies Secure Boot status using `mokutil`
   - Fails with a clear message if Secure Boot is enabled

4. **Driver Installation**:
   - Removes any existing NVIDIA drivers
   - Updates package lists
   - Installs the specified NVIDIA driver version
   - Holds the driver package to prevent automatic updates
   - Installs nvtop for GPU monitoring

5. **Verification**:
   - Verifies driver installation with `nvidia-smi`
   - Checks driver version and GPU information
   - Confirms package hold status
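The repository's actual tasks may differ, but the detect-then-skip pattern described above could be sketched like this (hypothetical task names; `lspci`, `nvidia-smi`, and `mokutil` usage as listed):

```yaml
# Hypothetical sketch of the detection pattern; not the repository's actual tasks.
- name: Detect NVIDIA GPU
  ansible.builtin.shell: lspci | grep -i nvidia
  register: gpu_check
  failed_when: false
  changed_when: false

- name: Check for an existing NVIDIA driver
  ansible.builtin.command: nvidia-smi
  register: driver_check
  failed_when: false
  changed_when: false
  when: gpu_check.rc == 0

- name: Fail if Secure Boot is enabled
  ansible.builtin.command: mokutil --sb-state
  register: sb_state
  changed_when: false
  failed_when: "'SecureBoot enabled' in sb_state.stdout"
  when: gpu_check.rc == 0 and driver_check.rc != 0
```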
## Usage

1. Test the connection to your nodes:

   ```bash
   ansible all -i inventory/hosts.yml -m ping
   ```

2. Run the playbook:

   ```bash
   ansible-playbook -i inventory/hosts.yml playbooks/site.yml
   ```

   If you encrypted the vault file:

   ```bash
   ansible-playbook -i inventory/hosts.yml playbooks/site.yml --ask-vault-pass
   ```
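While iterating, Ansible's standard `--limit` flag narrows a run to a single host; shown here with the example node name from the inventory above:

```bash
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit your-gpu-node
```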
## Troubleshooting

1. **Secure Boot Error**:
   - Message: "Secure Boot is enabled"
   - Solution: Disable Secure Boot in your BIOS settings

2. **Driver Installation Fails**:
   - Check `/var/log/dpkg.log` for package installation errors
   - Verify internet connectivity
   - Ensure a compatible driver version is specified

3. **No GPU Detected**:
   - Verify the GPU is properly seated
   - Check the output of `lspci | grep -i nvidia`
   - Ensure the GPU is supported and powered correctly

4. **Package Lock Issues**:
   - The playbook automatically handles apt/dpkg locks (see the sketch after this list)
   - Retries operations up to 5 times with delays
   - Manual fix: remove lock files if needed (not recommended)
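The retry behaviour described above maps onto Ansible's standard `retries`/`delay`/`until` pattern; a hypothetical sketch, not the repository's actual task:

```yaml
# Hypothetical illustration of the retry pattern around apt locks
- name: Install NVIDIA driver package
  ansible.builtin.apt:
    name: "nvidia-driver-{{ nvidia_driver_version }}"
    state: present
    update_cache: true
  register: apt_result
  retries: 5
  delay: 10
  until: apt_result is succeeded
```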
## Post-Installation

After a successful installation:

1. Verify GPU status:

   ```bash
   nvidia-smi
   ```

2. Monitor GPU usage:

   ```bash
   nvtop
   ```

3. Check the driver version:

   ```bash
   nvidia-smi --query-gpu=driver_version --format=csv,noheader
   ```

## Notes

- The playbook is idempotent and can be run multiple times safely
- GPU setup is skipped automatically on non-GPU nodes
- Driver installation requires a system reboot
- The playbook includes automatic retry mechanisms for package operations
- Keep your `vault.yml` file secure and never commit it to version control
- Ensure adequate cooling and power for GPU operations
- Consider using the NVIDIA Container Toolkit for Docker GPU support (see the sketch below)
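If the NVIDIA Container Toolkit is installed and configured, GPU access from containers can be verified with Docker's standard `--gpus` flag; the CUDA image tag below is an assumption, so pick one matching your driver:

```bash
# Assumes the NVIDIA Container Toolkit is installed and configured
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```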
## Deployment Instructions

1. **Prepare the Collection**: Ensure you have the correct directory structure and all required files (a sketch of a possible `galaxy.yml` follows this list):

   ```
   mnl_factory/
   ├── galaxy.yml       # Collection metadata
   ├── README.md
   ├── plugins/
   ├── playbooks/
   └── roles/
   ```

2. **Build the Collection**: From the root directory of the collection, run:

   ```bash
   ansible-galaxy collection build
   ```

   This creates a tarball such as `ratio1-multi_node_launcher-1.0.0.tar.gz`.

3. **Install the Collection**: You can install the collection locally using:

   ```bash
   ansible-galaxy collection install ratio1-multi_node_launcher-1.0.0.tar.gz -p ./collections
   ```

4. **Install Dependencies**:

   ```bash
   ansible-galaxy collection install -r requirements.yml
   ```

5. **Configure Your Environment**:
   - Copy and edit the inventory file:

     ```bash
     cp inventory/hosts.yml.example inventory/hosts.yml
     ```

   - Update the hosts file with your target nodes
   - Copy and edit the vault file:

     ```bash
     cp group_vars/vault.yml.example group_vars/vault.yml
     ```

6. **Run the Deployment**:

   ```bash
   ansible-playbook -i inventory/hosts.yml playbooks/site.yml
   ```

   If using vault encryption:

   ```bash
   ansible-playbook -i inventory/hosts.yml playbooks/site.yml --ask-vault-pass
   ```
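The tarball name above implies the collection namespace `ratio1`, name `multi_node_launcher`, and version `1.0.0`, which suggests a `galaxy.yml` along these lines; fields beyond those three are placeholders:

```yaml
# galaxy.yml -- sketch inferred from the tarball name above;
# everything except namespace/name/version is a placeholder.
namespace: ratio1
name: multi_node_launcher
version: 1.0.0
readme: README.md
authors:
  - Andrei Damian
  - Vitalii Toderian
license:
  - MIT
```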
## Publishing to Ansible Galaxy

To publish the collection to Ansible Galaxy:

1. **Create an Account**: Sign up at galaxy.ansible.com
2. **Get an API Token**: Generate an API token from your Galaxy profile
3. **Publish**:

   ```bash
   ansible-galaxy collection publish ./ratio1-multi_node_launcher-1.0.0.tar.gz --api-key=your_api_token
   ```
## License

MIT
## Authors

- Andrei Damian
- Vitalii Toderian
## Quick Setup Script

For the one-line setup mentioned at the top of this README:

```bash
curl -sSL https://raw.githubusercontent.com/YourUsername/r1setup/main/install.sh | sudo bash
```