Simulate a cluster to help you test various Slurm settings
To use this project, you first have to install and initialize Pulumi; see ./doc/bootstrap.md.
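For reference, the bootstrap boils down to something like this (a sketch, assuming the local Pulumi backend; ./doc/bootstrap.md is authoritative):

```bash
pulumi login --local        # backend choice is an assumption; use your preferred one
cd pulumi
pulumi stack init test-cluster
cd ..
```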
Then configure your cluster in `pulumi/__main__.py`.
```bash
cd pulumi
# Start the cluster
pulumi up
# `pulumi up` does not wait for cloud-init on Rocky (if you know why, please open an issue);
# workaround:
sleep 60s
# Check that every node has an IP (replace "test-cluster" with the name of the
# cluster you defined in the bootstrap phase with "pulumi stack init test-cluster")
virsh net-dhcp-leases test-cluster-admin
```
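If you would rather not rely on a fixed delay, you can poll the leases until every node shows up (a sketch; replace the node count and network name with your own):

```bash
# Wait until the admin network has a DHCP lease for each node
# (4 is a placeholder for your node count)
until [ "$(virsh net-dhcp-leases test-cluster-admin | grep -c ipv4)" -ge 4 ]; do
  sleep 5
done
```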
```bash
# Back to the root of the project
cd ..
```

This project provides compiled Slurm packages for:
- RHEL (Rocky) 8.10
- Ubuntu 24.04
If you need to support other node OSes, see doc/compiling-slurm.md.
Run the Ansible playbook:

```bash
ansible-playbook playbook.yml
```
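If you later change a setting and only need to reconfigure part of the cluster, `--limit` is the stock ansible-playbook flag for that (the group name below assumes the same groups the pdsh examples use further down):

```bash
# Rerun the playbook on the compute nodes only
ansible-playbook playbook.yml --limit compute
```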
At the end of the playbook, you have a cluster configured with:
- Slurm:
  - accounting: slurmdbd + mariadb,
  - cgroup v2.
- pdsh:
  - `pdsh -g all uname -r | dshbak -c`: get the kernel version on all nodes,
  - `pdsh -g compute uname -r | dshbak -c`: get the kernel version on compute nodes only.
- NFS for the home folder and the Slurm configuration.
Every node should be up, and every compute node should be IDLE in the `all` partition.
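You can confirm this from a login node with the standard Slurm client commands:

```bash
# Compute nodes should report STATE "idle" in the "all" partition
sinfo
# If a node is down or drained, this prints the reason
sinfo -R
```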
You can now experiment with various Slurm settings to solve your issue.
```bash
cd pulumi
# Stop the cluster and destroy the VMs
pulumi down
```

I sometimes have to set up very complex Slurm configurations and plugins for my clients.
I created a private Pulumi project for this purpose and used it to test various solutions to my issues without requiring a bare-metal cluster.
I have now decided to clean up the code and make it public for any sysadmin who might need it.
Please open an issue explaining what is missing.
You can have multiple simulators running:
- create a branch for each simulator: `git checkout -b sim1`,
- create and configure a Pulumi stack for the branch: `pulumi stack init sim1` (see "Create the pulumi stack" in ./doc/bootstrap.md),
- continue with "Starting the cluster"; to switch between simulators later, see the sketch below.
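When switching between simulators, change both the Git branch and the Pulumi stack (`pulumi stack select` is the stock Pulumi command for this):

```bash
git checkout sim1
cd pulumi
pulumi stack select sim1
cd ..
```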
This is not the purpose of this project:
- it might be dangerous (for example, the database password is public),
- the inventory is generated dynamically from the Pulumi stack,
- ...
However, feel free to derive proper Ansible roles and an inventory from it for your needs.
My first Slurm simulator was built using Terraform (OpenTofu wasn't born yet).
The main advantage I see in Pulumi is that I can use Python as glue to:
- easily define the cluster,
- integrate the cluster as an Ansible inventory.
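Since the inventory is generated dynamically from the Pulumi stack, you can inspect what Ansible will see with the stock tooling (assuming the repository's Ansible configuration points at the dynamic inventory):

```bash
# Dump all hosts and groups as JSON
ansible-inventory --list
# Or print the group tree
ansible-inventory --graph
```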
The first tag should include:
- Configured slurmrestd.
- Create one Slurm account per group (see the `sacctmgr` sketch after this list).
- Create one Slurm association per user/group membership.
- Use of the storage network for sharing /home.
- Configured OpenMPI over the fabric network, with `srun` using it.
  - This one is hard; manual tests show that on a cluster with a mix of Ubuntu and Rocky nodes:
    - login nodes can only submit MPI jobs to compute nodes running the same OS,
    - MPI jobs do not work across OSes.
  - Maybe limit this to an Ansible task that ensures a single OS for the login and compute nodes.
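For the two accounting items, the underlying Slurm commands would look roughly like this (a sketch using stock `sacctmgr`; the group "research" and user "alice" are placeholders):

```bash
# One Slurm account per group
sacctmgr -i add account research
# One association per user/group membership
sacctmgr -i add user alice account=research
```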