
Advanced Use Cases


After going through the tutorial, you should be familiar and comfortable enough with OpenCHAMI to make changes to the deployment process and configuration. We're going to cover some of the more common use cases that an OpenCHAMI user would want to pursue.

At this point, we can use what we have learned so far in the tutorial to customize our nodes in various ways, such as changing how we serve images, deriving new images, updating our cloud-init config, and running MPI jobs. This section explores some of the ways you can adapt OpenCHAMI to fit your own needs.

Some of the use cases include:

  1. Adding SLURM and MPI to the Compute Node
  2. Serving the Root Filesystem with NFS
  3. Enable WireGuard Security for the cloud-init-server
  4. Using Image Layers to Customize Boot Image with a Common Base
  5. Using kexec to Reboot Nodes For an Upgrade or Specialized Kernel
  6. Discovering Nodes Dynamically with Redfish

Note

This guide generally assumes that you have completed the tutorial and already have a working OpenCHAMI deployment.

Adding SLURM and MPI to the Compute Node

After getting our nodes to boot using our compute images, let's try to install SLURM and run a test MPI job. We can do this in at least two ways:

  • Create a new compute-slurm image similar to the compute-debug image, using the compute-base image as a base. You do not have to rebuild the parent images unless you want to make changes to them, but keep in mind that you will then have to rebuild any derivative images as well. See the Building Into the Image section for this method.

  • Change the cloud-init config to install SLURM and OpenMPI (or any other MPI package of choice) on boot. See the Installing via Cloud-Init section for this method.

One thing to note here: we need to install and start munge and share the munge key before we start SLURM on our nodes. Since we want to protect the key, we will use WireGuard with cloud-init to share it across the compute nodes.

Prepare the Head Node

Before we install SLURM, we need to install munge and set up the munge keys across our cluster. Let's download and install the latest release.

curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
tar xJf munge-0.5.16.tar.xz
cd munge-0.5.16
./configure \
    --prefix=/usr \
    --sysconfdir=/etc \
    --localstatedir=/var \
    --runstatedir=/run
make
make check
sudo make install

This will install munge on the head node. Then, create a munge key stored in /etc/munge/munge.key as a non-root user.

sudo -u munge /usr/sbin/mungekey --verbose

Then we can enable and start the munge service.

sudo systemctl enable --now munge.service

Warning

The clock must be synced across all of your nodes for munge to work!
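Optionally, we can verify that munge is working by encoding and decoding a credential locally.

munge -n | unmunge

If everything is set up correctly, unmunge reports a STATUS of Success (0).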

Let's now build and install the SLURM packages using the recommended method for production: building RPMs from the release tarball.

curl -fsSLO https://download.schedmd.com/slurm/slurm-25.05.1.tar.bz2
rpmbuild -ta slurm-25.05.1.tar.bz2
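rpmbuild only builds the packages; it does not install them. A minimal sketch for installing the result, assuming the default ~/rpmbuild output location and an x86_64 build:

# Install the freshly built SLURM RPMs (path assumes rpmbuild defaults)
sudo dnf install -y ~/rpmbuild/RPMS/x86_64/slurm-*.rpm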

We can go ahead and enable and start the slurmctld service on the head node (aka the "controller" node) since the munge service is already running.

systemctl enable --now slurmctld
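Note that slurmctld will not start without a slurm.conf describing the cluster. A minimal sketch is shown below; the cluster name, controller hostname, node names, and resources are assumptions and must be adjusted to match your deployment (the same file also needs to reach the compute nodes, for example via the image or cloud-init).

sudo mkdir -p /etc/slurm
sudo tee /etc/slurm/slurm.conf > /dev/null <<'EOF'
# Minimal example configuration -- adjust hostnames, node list, and resources
ClusterName=demo
SlurmctldHost=head
AuthType=auth/munge
NodeName=compute[1-2] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF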

Prepare the Compute Nodes

We need to set up the compute nodes similar to the head node with munge and SLURM. Like before, we need to do two things:

  1. Propagate the /etc/munge/munge.key created on the head node
  2. Install SLURM and start the slurmd service

As mentioned before, we're going to do this in cloud-init to pass around our secrets securely to the nodes.

Building Into the Image

We can use the image-builder tool to build a new image with the SLURM and OpenMPI packages directly in the image. Since the new image will be for the compute nodes, we'll base our new image on the compute-base image definition from the tutorial.

You should already have a directory at /opt/workdir/images. Make sure you already have a base compute image with s3cmd ls.

# TODO: put the output of `s3cmd ls` here with the compute-base image

Tip

If you do not already have the compute-base image, go back to this step from the tutorial, build the image, and push it to S3. Once you have done that, proceed to the next step.

Now, edit a new file at path /opt/workdir/images/compute-slurm-rocky9.yaml and copy the contents below into it.

options:
  layer_type: 'base'
  name: 'compute-slurm'
  publish_tags:
    - 'rocky9'
  pkg_manager: 'dnf'
  parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
  registry_opts_pull:
    - '--tls-verify=false'

  # Publish SquashFS image to local S3
  publish_s3: 'http://demo.openchami.cluster:9000'
  s3_prefix: 'compute/base/'
  s3_bucket: 'boot-images'

  # Publish OCI image to container registry
  #
  # This is the only way to be able to re-use this image as
  # a parent for another image layer.
  publish_registry: 'demo.openchami.cluster:5000/demo'
  registry_opts_push:
    - '--tls-verify=false'

repos:
  - alias: 'Epel9'
    url: 'https://dl.fedoraproject.org/pub/epel/9/Everything/x86_64/'
    gpg: 'https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-9'

packages:
  - slurm
  - openmpi

cmds:
  # Add 'slurm' and 'munge' users to run 'slurmd' and 'munge' respectively
  - cmd: "useradd -mG wheel slurm"
  - cmd: "useradd -mG wheel munge"

  # Install munge like on the head node
  - cmd: "curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz"
  - cmd: "tar xJf munge-0.5.16.tar.xz"
  - cmd: "cd munge-0.5.16 && ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --runstatedir=/run && make && make check && make install"

Notice the changes in the new image definition: we changed options.name and added the packages and cmds sections. Since we're basing this image on another image, we only need to list the packages we want to add on top of it. We can now build the image and push it to S3.

podman run --rm --device /dev/fuse --network host -e S3_ACCESS=admin -e S3_SECRET=admin123 -v /opt/workdir/images/compute-slurm-rocky9.yaml:/home/builder/config.yaml ghcr.io/openchami/image-build:latest image-build --config config.yaml --log-level DEBUG

Wait until the build finishes and check the S3 bucket to confirm that it is there with s3cmd ls again. Add a new boot script to /opt/workdir/boot/boot-compute-slurm.yaml which we will use to boot our compute nodes.

kernel: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/vmlinuz-5.14.0-570.21.1.el9_6.x86_64'
initrd: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/initramfs-5.14.0-570.21.1.el9_6.x86_64.img'
params: 'nomodeset ro root=live:http://172.16.0.254:9000/boot-images/compute/debug/rocky9.6-compute-slurm-rocky9 ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init'
macs:
  - 52:54:00:be:ef:01
  - 52:54:00:be:ef:02
  - 52:54:00:be:ef:03
  - 52:54:00:be:ef:04
  - 52:54:00:be:ef:05

Set the boot parameters and confirm that they have been applied correctly.

ochami bss boot params set -f yaml -d @/opt/workdir/boot/boot-compute-slurm.yaml
ochami bss boot params get -F yaml

Installing via Cloud-Init

Alternatively, we can install the necessary SLURM and OpenMPI packages after booting by adding packages to our cloud-init config and running a couple of commands to configure them. This also gives us an opportunity to install and configure munge in one go instead of installing it into the image and then setting it up with cloud-init.

Let's start by making changes to the cloud-init config file in /opt/workdir/cloud-init/computes.yaml that we used previously. Note that we are using pre-built RPMs to install SLURM and OpenMPI from the Rocky 9 repos.

- name: compute
  description: "compute config"
  file:
    encoding: plain
    content: |
      ## template: jinja
      #cloud-config
      merge_how:
      - name: list
        settings: [append]
      - name: dict
        settings: [no_replace, recurse_list]
      users:
        - name: root
          ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
      disable_root: false
      packages:
        - slurm
        - openmpi
      runcmd:
        - curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
        - tar xJf munge-0.5.16.tar.xz
        - cd munge-0.5.16
        - ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --runstatedir=/run
        - make
        - make check
        - make install

We added the packages section to tell cloud-init to install the slurm and openmpi packages after booting the compute node. Then, the runcmd section installs munge just like we did before on the head node.

TODO: add section about sharing the munge key using cloud-init wireguard

Run a Sample MPI Job Across Two VMs

Finally, once we have everything set up, we can boot the compute nodes.

sudo virt-install \
  --name compute1 \
  --memory 4096 \
  --vcpus 1 \
  --disk none \
  --pxe \
  --os-variant centos-stream9 \
  --network network=openchami-net,model=virtio,mac=52:54:00:be:ef:01 \
  --graphics none \
  --console pty,target_type=serial \
  --boot network,hd \
  --boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
  --virt-type kvm

Your compute node should start up with iPXE output. If your node does not boot, check the troubleshooting sections for common issues. Both SLURM and OpenMPI should be installed too, but we don't want to start the services yet since we have not set up munge on the node. Start another compute node and call it compute2 using the MAC address specified below.

sudo virt-install \
  --name compute2 \
  --memory 4096 \
  --vcpus 1 \
  --disk none \
  --pxe \
  --os-variant centos-stream9 \
  --network network=openchami-net,model=virtio,mac=52:54:00:be:ef:02 \
  --graphics none \
  --console pty,target_type=serial \
  --boot network,hd \
  --boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
  --virt-type kvm

After we have installed both SLURM and OpenMPI on the compute nodes, let's try to launch a "hello world" MPI job. To do so, we will need three things:

  1. Source code for the MPI program
  2. A compiled MPI executable
  3. A SLURM job script

We'll write the MPI program in C. First, create a new directory to store our source code. Then, edit the /opt/workdir/apps/mpi/hello/hello.c file.

mkdir -p /opt/workdir/apps/mpi/hello
# edit /opt/workdir/apps/mpi/hello/hello.c

Now copy the contents below into the hello.c file.

/* The Parallel Hello World Program */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int node;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &node);

   printf("Hello World from Node %d\n", node);

   MPI_Finalize();
   return 0;
}

Compile the program.

cd /opt/workdir/apps/mpi/hello
mpicc hello.c -o hello

You should now have a hello executable in the /opt/workdir/apps/mpi/hello directory. We can use this binary with SLURM to launch processes in parallel.
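Optionally, we can sanity-check the binary directly with mpirun before going through SLURM (this assumes OpenMPI is installed where you run it; add --allow-run-as-root if running as root).

cd /opt/workdir/apps/mpi/hello
mpirun -np 2 ./hello

You should see two "Hello World" lines, one per rank.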

Let's create a job script to launch the executable we just created. Create a new directory to hold our SLURM job script. Then, edit a new file called launch-hello.sh in the new /opt/workdir/jobscripts directory.

mkdir -p /opt/workdir/jobscripts
cd /opt/workdir/jobscripts
# edit launch-hello.sh

Copy the contents below into the launch-hello.sh job script.

Note

The contents of your job script may vary significantly depending on your cluster. Refer to the documentation for your institution and adjust the script accordingly to your needs.

#!/bin/bash 

#SBATCH --job-name=hello
#SBATCH --account=account_name
#SBATCH --partition=partition_name 
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:00:30
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun /opt/workdir/apps/mpi/hello/hello

We should now have everything we need to test our MPI job with our compute node(s). Launch the job with the sbatch command.

sbatch /opt/workdir/jobscripts/launch-hello.sh

We can confirm the job is running with the squeue command.

squeue

You should see a job named hello, the name we gave it in the launch-hello.sh job script.

# TODO: add output of squeue above

If you saw the output above, you should now be able to inspect the output of the job when it completes.

# TODO: add output of MPI job (should be something like hello.o and/or hello.e)

And that's it! You have successfully launched an MPI job with SLURM from an OpenCHAMI deployed system.

Serving the Root Filesystem with NFS

For the tutorial, we served images via HTTP with a local S3 bucket using MinIO and an OCI registry. We could instead serve our images by network-mounting the directories that hold them with NFS. We can spin up an NFS server on the head node, include the NFS tools in our base image, and configure our nodes to mount the images.

Configure NFS to serve your SquashFS nfsroot with as much performance as possible.

For NFS, we need to create the directory to export, update the /etc/exports file, and then reload the kernel NFS daemon.

Create the /srv/nfs directory to serve our images.

sudo mkdir -p /srv/nfs
sudo chown rocky: /srv/nfs

Create the /etc/exports file with the following contents to export the /srv/nfs directory for use by our compute nodes.

/srv/nfs *(ro,no_root_squash,no_subtree_check,noatime,async,fsid=0)

Reload the NFS daemon to apply the changes.

sudo modprobe -r nfsd && sudo modprobe nfsd
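If the head node is not already running an NFS server, the stock nfs-utils package provides one; a quick sketch using the standard Rocky Linux package and service names:

# Install and start the NFS server
sudo dnf install -y nfs-utils
sudo systemctl enable --now nfs-server

# Re-export after editing /etc/exports and verify the share
sudo exportfs -ra
sudo exportfs -v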

Webserver for Boot Artifacts

We expose our NFS directory over HTTP as well to make it easy to serve boot artifacts.

# nginx.container
[Unit]
Description=Serve /srv/nfs over HTTP
After=network-online.target
Wants=network-online.target

[Container]
ContainerName=nginx
Image=docker.io/library/nginx:1.28-alpine
Volume=/srv/nfs:/usr/share/nginx/html:Z
PublishPort=80:80

[Service]
TimeoutStartSec=0
Restart=always
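This is a Podman Quadlet unit. Dropping it into /etc/containers/systemd/ and reloading systemd generates an nginx.service that can then be started (the copy below assumes you saved the file as nginx.container in your current directory):

sudo install -m 0644 nginx.container /etc/containers/systemd/nginx.container
sudo systemctl daemon-reload
sudo systemctl start nginx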

Import Images from OCI to Share with NFS

Import-image Script

Enable WireGuard Security for the cloud-init-server

When nodes boot in OpenCHAMI, they make a request out to the cloud-init-server to retrieve a cloud-init config. The request is not encrypted and can be intercepted and modified.

Using WireGuard with Cloud-Init

The OpenCHAMI cloud-init metadata server includes a feature to enable a WireGuard tunnel before running cloud-init.

TODO: Add more content on how to do this

Create a systemd override file for cloud-init

[Service]
PassEnvironment=ochami_wg_ip
ExecStartPre=/usr/local/bin/ochami-ci-setup.sh
ExecPostStop=/bin/bash -c "ip link delete wg0"

Create a Script to Activate WireGuard

#!/bin/bash
set -e -o pipefail

# As configured in systemd, we expect to inherit the "ochami_wg_ip" cmdline
# parameter as an env var. Exit if this is not the case.
if [ -z "${ochami_wg_ip}" ];
then
    echo "ERROR: Failed to find the 'ochami_wg_ip' environment variable."
    echo "It should be specified on the kernel cmdline, and will be inherited from there."
    if [ -f "/etc/cloud/cloud.cfg.d/ochami.cfg" ];
    then
        echo "Removing ochami-specific cloud-config; cloud-init will use other defaults"
        rm /etc/cloud/cloud.cfg.d/ochami.cfg
    else
        echo "Not writing ochami-specific cloud-config; cloud-init will use other defaults"
    fi
    exit 0
fi
echo "Found OpenCHAMI cloud-init URL '${ochami_wg_ip}'"
echo "!!!!Starting pre cloud-init config!!!!"

echo "Loading WireGuard kernel mod"
modprobe wireguard

echo "Generating WireGuard keys"
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key

echo "Making Request to configure wireguard tunnel"
PUBLIC_KEY=$(cat /etc/wireguard/public.key)
PAYLOAD="{ \"public_key\": \"${PUBLIC_KEY}\" }"
WG_PAYLOAD=$(curl -s -X POST -d "${PAYLOAD}" http://${ochami_wg_ip}:27777/cloud-init/wg-init)

echo $WG_PAYLOAD | jq

CLIENT_IP=$(echo $WG_PAYLOAD | jq -r '."client-vpn-ip"')
SERVER_IP=$(echo $WG_PAYLOAD | jq -r '."server-ip"' | awk -F'/' '{print $1}')
SERVER_PORT=$(echo $WG_PAYLOAD | jq -r '."server-port"')
SERVER_KEY=$(echo $WG_PAYLOAD | jq -r '."server-public-key"')

echo "Setting up local wireguard interface"
echo "Adding wg0 link"
ip link add dev wg0 type wireguard
echo "Adding ip address ${CLIENT_IP}/32"
ip address add dev wg0 ${CLIENT_IP}/32
echo "Setting the private key"
wg set wg0 private-key /etc/wireguard/private.key
echo "Bringing up the wg0 link"
ip link set wg0 up
echo "Setting up the peer with the server"
wg set wg0 peer ${SERVER_KEY} allowed-ips ${SERVER_IP}/32 endpoint ${ochami_wg_ip}:$SERVER_PORT
rm /etc/wireguard/private.key
rm /etc/wireguard/public.key

Add the Scripts to Your Image

copyfiles:
  - src: '/opt/workdir/images/files/cloud-init-override.conf'
    dest: '/etc/systemd/system/cloud-init.service.d/override.conf'
  - src: '/opt/workdir/images/files/ochami-ci-setup.sh'
    dest: '/usr/local/bin/ochami-ci-setup.sh'

Finally, restart the cloud-init-server with WireGuard enabled.

Using Image Layers to Customize Boot Image with a Common Base

Often, we want to allocate nodes for different purposes using different images. Let's take the base image we created before and build another layer on top of it called kubernetes-worker. We would then need to modify the boot script to use this new Kubernetes image and update cloud-init to set up the nodes. A sketch of such a layer definition is shown below.
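This sketch mirrors the compute-slurm example above; the parent reference, S3 prefix, and package names are assumptions (the Kubernetes packages in particular require adding the appropriate upstream repositories under a repos: section) and should be adjusted to your environment.

cat > /opt/workdir/images/kubernetes-worker-rocky9.yaml <<'EOF'
options:
  layer_type: 'base'
  name: 'kubernetes-worker'
  publish_tags:
    - 'rocky9'
  pkg_manager: 'dnf'
  parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
  registry_opts_pull:
    - '--tls-verify=false'

  publish_s3: 'http://demo.openchami.cluster:9000'
  s3_prefix: 'compute/k8s/'
  s3_bucket: 'boot-images'

  publish_registry: 'demo.openchami.cluster:5000/demo'
  registry_opts_push:
    - '--tls-verify=false'

packages:
  # Package names assume a configured Kubernetes package repository
  - kubelet
  - kubeadm
  - kubectl
EOF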

Using kexec to Reboot Nodes For an Upgrade or Specialized Kernel

The kexec-load.sh script below stages the currently running kernel and initramfs with kexec, skipping nodes that are low on memory or have GPUs attached.

kexec-load.sh

#!/usr/bin/env sh

# Skip nodes with less than ~512 MB of RAM (MemTotal is reported in kB; appending 000 approximates bytes)
if [ 512000000 -gt $(cat /proc/meminfo | grep -F 'MemTotal' | grep -oE '[0-9]+' | tr -d '\n'; echo 000) ]; then
    echo 'Not enough memory to safely load the kernel' >&2
    exit 0
fi

if lspci 2>/dev/null  | grep -qi '3D controller'; then
    echo 'GPUs detected. Not loading kernel to prevent system instability' >&2
    exit 0
fi


# Might need to tweak this if the kernel is in a different spot
exec kexec -l "/boot/vmlinuz-$(uname -r)" --initrd="/boot/initramfs-$(uname -r).img" --reuse-cmdline

The kexec-update.sh script below fetches the node's current boot script from BSS and stages the referenced kernel, initrd, and kernel parameters with kexec.

kexec-update.sh

#!/bin/bash

#set -x

set -e

# This whole script is a bit heavy on the heuristics.
# It would be much better to patch BSS to do JSON output.

# This gets the MAC address of the first interface with an IP address
MAC="$(ip addr | grep -A10 'state UP' | grep -oP -m1 '(?<=link/ether )[a-f0-9:]+')"

# This gets the bss IP address from the kernel commandline
BSS_IP="$(grep -oP '(?<=bss=)[^:/ ]+' /proc/cmdline | tail -n1)"

# When I use the NID it just returns a script that chains into the MAC address one
echo 'Getting boot script...'
BOOT_SCRIPT="$(curl -s "http://$BSS_IP:8081/apis/bss/boot/v1/bootscript?mac=$MAC&json=1")"

if [ -z "$BOOT_SCRIPT" ]; then
    echo 'Empty boot script! Aborting...'
    exit 1
fi

# Note: jq -r prints the string 'null' for missing keys, so check for that as well
INITRD="$(echo "$BOOT_SCRIPT" | jq -r .initrd.path)"
if [ -z "$INITRD" ] || [ "$INITRD" = 'null' ]; then
    echo 'No initrd URL. Aborting...'
    exit 2
fi

KERNEL="$(echo "$BOOT_SCRIPT" | jq -r .kernel.path)"
if [ -z "$KERNEL" ] || [ "$KERNEL" = 'null' ]; then
    echo 'No kernel URL. Aborting...'
    exit 3
fi

PARAMS="$(echo "$BOOT_SCRIPT" | jq -r .params)"
if [ -z "$PARAMS" ] || [ "$PARAMS" = 'null' ]; then
    echo 'No kernel params. Aborting...'
    exit 4
fi

TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT

echo 'Getting kernel...'
curl -so "$TMP/kernel" "$KERNEL"
echo 'Getting initrd...'
curl -so "$TMP/initrd" "$INITRD"

kexec -l "$TMP/kernel" --initrd "$TMP/initrd" --command-line "$PARAMS"
echo 'All done!'
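Both scripts only stage a kernel; the actual switch happens when the node kexecs into it, for example:

# Cleanly stop services and jump into the staged kernel
sudo systemctl kexec

# ...or switch immediately without a clean shutdown
# sudo kexec -e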

Discovering Nodes Dynamically with Redfish

In the tutorial, we used static discovery to populate our inventory in SMD instead of dynamically discovering nodes on our network. Static discovery is good when we know the MAC address, IP address, xname, and/or node ID of our nodes beforehand, and it guarantees deterministic behavior. However, sometimes we might not know these properties, or we may want to check the current state of our hardware, say after a failure. In these scenarios, we can probe our hardware dynamically using the scanning feature from magellan and then update the state of our inventory.

For this demonstration, we have two prerequisites before we get started:

  1. Emulate baseboard management controllers (BMCs) running Redfish services
  2. Have a running instance of SMD or a full running deployment of the OpenCHAMI services

The magellan repository includes an emulator that we can use for quick-and-dirty testing. This is useful if we want to try out the capabilities of the tool without having to put too much time and effort into setting up an environment. However, we want to use multiple BMCs to show how magellan can distinguish between Redfish and non-Redfish services.

TODO: Add content setting up multiple emulated BMCs with Redfish services (the quickstart in the deployment-recipes has this already).

Performing a Scan

A scan sends out requests to all devices on the network specified with the --subnet flag. If a device responds, it is added to a cache database that we'll need for the next section.

Let's do a scan and see what we can find on our network. We should be able to find all of our emulated BMCs without having to worry too much about any other services.

magellan scan --subnet 172.16.0.100/24 --cache ./assets.db

This command should not produce any output if it runs successfully. By default, the cache is stored in a tiny SQLite 3 database at /tmp/$USER/magellan/assets.db; here, we store the cache locally with the --cache flag instead.
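A quick way to confirm the cache was written where we expect:

ls -lh ./assets.db
file ./assets.db   # should be reported as an SQLite 3.x database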

We can see what BMCs with Redfish were found with the list command.

magellan list

You should see the emulated BMCs.

# TODO: add list of emulated BMCs from `magellan list` output

Now that we know the IP addresses of the BMCs, let's collect inventory data using the collect command.

Collecting and Updating Hardware Inventory

We can use the cache to pull in inventory data from the BMCs. If the BMCs require a username and password, we can set them using the secrets store before we run collect.

TEMP_KEY=$(magellan secrets generatekey)  # ...or whatever you want to use for your key
export MASTER_KEY=$TEMP_KEY
magellan secrets store default $default_bmc_username:$default_bmc_password

This stores a default BMC username and password to use across all BMC nodes that do not have credentials specified. If we want to add specific credentials, we just need to change default to the host.

magellan secrets store https://172.16.0.101 $bmc01_username:$bmc01_password

The credentials will be used automatically when collect or crawl are run. Additionally, when running collect, we have to add the -v flag to see the output and -o to save it to a file.

magellan collect -v -F yaml -o nodes.yaml

There should be a nodes.yaml file in the current directory. The file can be edited to use different values before uploading to SMD. Once done editing, send it off with the send command.

magellan send -F yaml -d @nodes.yaml https://demo.openchami.cluster:8443

This will store the inventory data in SMD, as before, using the information found during the scan.