Advanced Use Cases
After going through the tutorial, you should be familiar and comfortable enough with OpenCHAMI to make changes to the deployment process and configuration. We're going to cover some of the more common use cases that an OpenCHAMI user may want to pursue.
At this point, we can use what we have learned so far to customize our nodes in various ways, such as changing how we serve images, deriving new images, updating our cloud-init config, and running MPI jobs. This section explores some of the use cases you may want to pursue to make OpenCHAMI fit your own needs.
Some of the use cases include:
- Adding SLURM and MPI to the Compute Node
- Serving the Root Filesystem with NFS
- Enabling WireGuard Security for the cloud-init-server
- Using Image Layers to Customize a Boot Image with a Common Base
- Using kexec to Reboot Nodes for a Kernel Upgrade or Specialized Kernel
- Discovering Nodes Dynamically with Redfish
Note
This guide generally assumes that you have completed the tutorial and already have a working OpenCHAMI deployment.
After getting our nodes to boot using our compute images, let's try to install SLURM and run a test MPI job. We can do this in at least two ways:
- Create a new compute-slurm image similar to the compute-debug image, using the compute-base image as a base. You do not have to rebuild the parent images unless you want to make changes to them, but keep in mind that you will then also have to rebuild any derivative images. See the Building Into the Image section for this method.
- Change the cloud-init config to install SLURM and OpenMPI (or any other MPI package of choice) on boot. See the Installing via Cloud-Init section for this method.
One thing to note here: we need to install and start munge and share the munge key before we start SLURM on our nodes. Since we want to protect the key, we will use WireGuard with cloud-init to share the key across compute nodes.
Before we install SLURM, we need to install munge and set up the munge keys across our cluster. Let's download and install the latest release.
curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
tar xJf munge-0.5.16.tar.xz
cd munge-0.5.16
./configure \
--prefix=/usr \
--sysconfdir=/etc \
--localstatedir=/var \
--runstatedir=/run
make
make check
sudo make install
This will install munge on the head node. Then, create a munge key stored in /etc/munge/munge.key as a non-root user.
sudo -u munge /usr/sbin/mungekey --verbose
Then we can enable and start the munge service.
sudo systemctl enable --now munge.service
Warning
The clock must be synced across all of your nodes for munge to work!
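To sanity-check the munge installation, you can encode and decode a test credential locally (munge and unmunge ship with the package; this only exercises the local daemon):

```bash
# Create a credential and immediately decode it with the local munged
munge -n | unmunge
```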
Let's now install the SLURM server using the recommended method for production.
curl -fsSLO https://download.schedmd.com/slurm/slurm-25.05.1.tar.bz2
rpmbuild -ta slurm-25.05.1.tar.bz2
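rpmbuild -ta writes the resulting packages under ~/rpmbuild/RPMS by default. A sketch of installing them on the head node (exact package file names depend on the SLURM version and build options):

```bash
# Install the freshly built SLURM RPMs; adjust the glob if you only want a subset
sudo dnf install -y ~/rpmbuild/RPMS/x86_64/slurm-*.rpm
```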
We can go ahead and enable and start the slurmctld service on the head node (aka the "controller" node) since the munge service is already running.
systemctl enable --now slurmctld
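Note that slurmctld also needs a slurm.conf describing the cluster before it will start cleanly. A minimal sketch, assuming the two compute nodes from this guide (compute1 and compute2) and the head node as the controller; every value below is an assumption to adapt to your cluster:

```bash
# Minimal example slurm.conf; adjust names, CPU counts, and memory for your cluster
sudo mkdir -p /etc/slurm /var/spool/slurmctld
sudo chown slurm: /var/spool/slurmctld
sudo tee /etc/slurm/slurm.conf > /dev/null <<'EOF'
ClusterName=demo
SlurmctldHost=demo.openchami.cluster
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
NodeName=compute[1-2] CPUs=1 RealMemory=3500 State=UNKNOWN
PartitionName=compute Nodes=compute[1-2] Default=YES MaxTime=INFINITE State=UP
EOF
sudo systemctl restart slurmctld
```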
We need to set up the compute nodes similar to the head node with munge and SLURM. Like before, we need to do two things:
- Propagate the /etc/munge/munge.key created on the head node
- Install SLURM and start the slurmd service
As mentioned before, we're going to do this in cloud-init to pass around our secrets securely to the nodes.
We can use the image-builder tool to build a new image with the SLURM and OpenMPI packages directly in the image. Since the new image will be for the compute nodes, we'll base our new image on the compute-base image definition from the tutorial.
You should already have a directory at /opt/workdir/images. Make sure you already have a base compute image with s3cmd ls.
# TODO: put the output of `s3cmd ls` here with the compute-base image
Tip
If you do not already have the compute-base image, go back to this step from the tutorial, build the image, and push it to S3. Once you have done that, proceed to the next step.
Now, edit a new file at path /opt/workdir/images/compute-slurm-rocky9.yaml
and copy the contents below into it.
options:
  layer_type: 'base'
  name: 'compute-slurm'
  publish_tags:
    - 'rocky9'
  pkg_manager: 'dnf'
  parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
  registry_opts_pull:
    - '--tls-verify=false'
  # Publish SquashFS image to local S3
  publish_s3: 'http://demo.openchami.cluster:9000'
  s3_prefix: 'compute/base/'
  s3_bucket: 'boot-images'
  # Publish OCI image to container registry
  #
  # This is the only way to be able to re-use this image as
  # a parent for another image layer.
  publish_registry: 'demo.openchami.cluster:5000/demo'
  registry_opts_push:
    - '--tls-verify=false'
repos:
  - alias: 'Epel9'
    url: 'https://dl.fedoraproject.org/pub/epel/9/Everything/x86_64/'
    gpg: 'https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-9'
packages:
  - slurm
  - openmpi
cmds:
  # Add 'slurm' and 'munge' users to run 'slurmd' and 'munge' respectively
  - cmd: "useradd -mG wheel slurm"
  - cmd: "useradd -mG wheel munge"
  # Build and install munge like on the head node. The build steps are chained
  # so that the working directory persists between commands.
  - cmd: "curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz"
  - cmd: "tar xJf munge-0.5.16.tar.xz"
  - cmd: "cd munge-0.5.16 && ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --runstatedir=/run && make && make check && make install"
Notice the changes to the new image definition: we changed options.name and added the packages and cmds sections. Since we're basing this image on another image, we only need to list the packages we want to add on top of it. We can build the image and push it to S3 now.
podman run --rm --device /dev/fuse --network host \
  -e S3_ACCESS=admin -e S3_SECRET=admin123 \
  -v /opt/workdir/images/compute-slurm-rocky9.yaml:/home/builder/config.yaml \
  ghcr.io/openchami/image-build:latest \
  image-build --config config.yaml --log-level DEBUG
Wait until the build finishes and check the S3 bucket again with s3cmd ls to confirm that the new image is there. Then add a new boot script at /opt/workdir/boot/boot-compute-slurm.yaml, which we will use to boot our compute nodes.
kernel: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/vmlinuz-5.14.0-570.21.1.el9_6.x86_64'
initrd: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/initramfs-5.14.0-570.21.1.el9_6.x86_64.img'
params: 'nomodeset ro root=live:http://172.16.0.254:9000/boot-images/compute/debug/rocky9.6-compute-slurm-rocky9 ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init'
macs:
- 52:54:00:be:ef:01
- 52:54:00:be:ef:02
- 52:54:00:be:ef:03
- 52:54:00:be:ef:04
- 52:54:00:be:ef:05
Set and confirm that the boot parameters have been set correctly.
ochami bss boot params set -f yaml -d @/opt/workdir/boot/boot-compute-slurm.yaml
ochami bss boot params get -F yaml
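You can also spot-check exactly what BSS will hand a node by requesting the boot script for one of the MAC addresses (this assumes BSS is reachable on the head node at port 8081, as in the kexec example later in this guide):

```bash
curl "http://172.16.0.254:8081/apis/bss/boot/v1/bootscript?mac=52:54:00:be:ef:01"
```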
Alternatively, we can install the necessary SLURM and OpenMPI packages after booting by adding packages to our cloud-init config and running a couple of commands to configure them. This also gives us an opportunity to install and configure munge in one go instead of installing into the image and then setting up with cloud-init.
Let's start by making changes to the cloud-init config file at /opt/workdir/cloud-init/computes.yaml that we used previously. Note that we are using pre-built RPMs to install SLURM and OpenMPI from the Rocky 9 repos.
- name: compute
  description: "compute config"
  file:
    encoding: plain
    content: |
      ## template: jinja
      #cloud-config
      merge_how:
        - name: list
          settings: [append]
        - name: dict
          settings: [no_replace, recurse_list]
      users:
        - name: root
          ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
      disable_root: false
      packages:
        - slurm
        - openmpi
      runcmd:
        - curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
        - tar xJf munge-0.5.16.tar.xz
        - cd munge-0.5.16
        - ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --runstatedir=/run
        - make
        - make check
        - make install
We added the packages section to tell cloud-init to install the slurm and openmpi packages after the compute node boots, and a runcmd section that builds and installs munge just like we did on the head node.
TODO: add section about sharing the munge key using cloud-init wireguard
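Once a compute node has booted with this config, a quick way to verify it was applied is to log in and check that cloud-init finished and the packages landed (these are standard cloud-init and rpm commands, nothing OpenCHAMI-specific):

```bash
# Wait for cloud-init to finish, then confirm the packages were installed
cloud-init status --wait
rpm -q slurm openmpi
command -v munged   # munge was built from source by runcmd, so check the binary exists
```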
Finally, once we have everything set up, we can boot the compute nodes.
sudo virt-install \
--name compute1 \
--memory 4096 \
--vcpus 1 \
--disk none \
--pxe \
--os-variant centos-stream9 \
--network network=openchami-net,model=virtio,mac=52:54:00:be:ef:01 \
--graphics none \
--console pty,target_type=serial \
--boot network,hd \
--boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
--virt-type kvm
Your compute node should start up with iPXE output. If your node does not boot, check the troubleshooting sections for common issues. Both SLURM and OpenMPI should be installed too, but we don't want to start the services yet since we have not set up munge on the node. Start another compute node and call it compute2 using the MAC address specified below.
sudo virt-install \
--name compute2 \
--memory 4096 \
--vcpus 1 \
--disk none \
--pxe \
--os-variant centos-stream9 \
--network network=openchami-net,model=virtio,mac=52:54:00:be:ef:02 \
--graphics none \
--console pty,target_type=serial \
--boot network,hd \
--boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
--virt-type kvm
After we have installed both SLURM and OpenMPI on the compute node, let's try to launch a "hello world" MPI job. To do so, we will need three things:
- Source code for the MPI program
- A compiled MPI executable binary
- A SLURM job script
We create the MPI program in C. First, create a new directory to store our source code. Then, edit the /opt/workdir/apps/mpi/hello/hello.c file.
mkdir -p /opt/workdir/apps/mpi/hello
# edit /opt/workdir/apps/mpi/hello/hello.c
Now copy the contents below into the hello.c file.
/* The Parallel Hello World Program */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int node;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);
    printf("Hello World from Node %d\n", node);
    MPI_Finalize();
    return 0;
}
Compile the program.
cd /opt/workdir/apps/mpi/hello
mpicc hello.c -o hello
You should have a hello executable in the /opt/workdir/apps/mpi/hello directory now. We can use this executable with SLURM to launch processes in parallel.
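Before going through SLURM, you can sanity-check the binary directly with OpenMPI's mpirun on the head node. On Rocky, the openmpi package is exposed through an environment module, so you may need to load it first (the module name below is the usual EL9 default and is an assumption):

```bash
module load mpi/openmpi-x86_64   # skip if mpirun is already on your PATH
mpirun -np 2 /opt/workdir/apps/mpi/hello/hello
# Expect two "Hello World from Node ..." lines, one per rank
```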
Let's create a job script to launch the executable we just created. Create a new directory to hold our SLURM job scripts. Then, edit a new file called launch-hello.sh in the new /opt/workdir/jobscripts directory.
mkdir -p /opt/workdir/jobscripts
cd /opt/workdir/jobscripts
# edit launch-hello.sh
Copy the contents below into the launch-hello.sh job script.
Note
The contents of your job script may vary significantly depending on your cluster. Refer to the documentation for your institution and adjust the script accordingly to your needs.
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --account=account_name
#SBATCH --partition=partition_name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:00:30
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun /opt/workdir/apps/mpi/hello/hello
We should now have everything we need to test our MPI job with our compute node(s). Launch the job with the sbatch command.
sbatch /opt/workdir/jobscripts/launch-hello.sh
We can confirm the job is running with the squeue command.
squeue
You should see a list containing a job named hello, the name given in the launch-hello.sh job script.
# TODO: add output of squeue above
If you saw the output above, you should now be able to inspect the output of the job when it completes.
# TODO: add output of MPI job (should be something like hello.o and/or hello.e)
And that's it! You have successfully launched an MPI job with SLURM from an OpenCHAMI deployed system.
For the tutorial, we served images via HTTP with a local S3 bucket using MinIO and an OCI registry. We could instead serve our images by network-mounting the directories that hold them with NFS. We can spin up an NFS server on the head node, include the NFS client tools in our base image, and configure our nodes to mount the images.
Configure NFS to serve your SquashFS nfsroot with as much performance as possible. We need to create a directory to serve, export it in /etc/exports, and then reload the kernel NFS daemon.
- Create /srv/nfs to hold the images we want to serve:
  sudo mkdir -p /srv/nfs
  sudo chown rocky: /srv/nfs
- Create the /etc/exports file with the following contents to export the /srv/nfs directory for use by our compute nodes:
  /srv/nfs *(ro,no_root_squash,no_subtree_check,noatime,async,fsid=0)
- Reload the NFS kernel daemon to apply the changes:
  sudo modprobe -r nfsd && sudo modprobe nfsd
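Before pointing compute nodes at the share, you can confirm the export is active from the head node itself (exportfs and showmount come with nfs-utils):

```bash
sudo exportfs -v          # lists the currently exported directories and their options
showmount -e localhost    # what an NFS client would see
```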
We expose our NFS directory over HTTP as well to make it easy to serve boot artifacts.
# nginx.container
[Unit]
Description=Serve /srv/nfs over HTTP
After=network-online.target
Wants=network-online.target
[Container]
ContainerName=nginx
Image=docker.io/library/nginx:1.28-alpine
Volume=/srv/nfs:/usr/share/nginx/html:Z
PublishPort=80:80
[Service]
TimeoutStartSec=0
Restart=always
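This is a Podman Quadlet unit. A sketch of wiring it up on the head node, assuming you saved the file as nginx.container (Quadlet units live under /etc/containers/systemd/ and the generator exposes this one as nginx.service):

```bash
sudo cp nginx.container /etc/containers/systemd/nginx.container
sudo systemctl daemon-reload          # regenerate units from Quadlet files
sudo systemctl start nginx.service
curl -I http://localhost/             # should answer with nginx headers
```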
When nodes boot in OpenCHAMI, they make a request to the cloud-init-server to retrieve a cloud-init config. This request is not encrypted and can be intercepted and modified.
The OpenCHAMI cloud-init metadata server includes a feature to enable a WireGuard tunnel before running cloud-init.
TODO: Add more content on how to do this
The cloud-init-override.conf drop-in below runs the WireGuard setup script before cloud-init starts and removes the wg0 interface when cloud-init stops:
[Service]
PassEnvironment=ochami_wg_ip
ExecStartPre=/usr/local/bin/ochami-ci-setup.sh
ExecStopPost=/bin/bash -c "ip link delete wg0"
The ochami-ci-setup.sh script generates a WireGuard key pair, registers the node's public key with the cloud-init server, and brings up the wg0 tunnel:
#!/bin/sh
set -e -o pipefail
# As configured in systemd, we expect to inherit the "ochami_wg_ip" cmdline
# parameter as an env var. Exit if this is not the case.
if [ -z "${ochami_wg_ip}" ];
then
echo "ERROR: Failed to find the 'ochami_wg_ip' environment variable."
echo "It should be specified on the kernel cmdline, and will be inherited from there."
if [ -f "/etc/cloud/cloud.cfg.d/ochami.cfg" ];
then
echo "Removing ochami-specific cloud-config; cloud-init will use other defaults"
rm /etc/cloud/cloud.cfg.d/ochami.cfg
else
echo "Not writing ochami-specific cloud-config; cloud-init will use other defaults"
fi
exit 0
fi
echo "Found OpenCHAMI cloud-init URL '${ochami_wg_ip}'"
echo "!!!!Starting pre cloud-init config!!!!"
echo "Loading WireGuard kernel mod"
modprobe wireguard
echo "Generating WireGuard keys"
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
echo "Making Request to configure wireguard tunnel"
PUBLIC_KEY=$(cat /etc/wireguard/public.key)
PAYLOAD="{ \"public_key\": \"${PUBLIC_KEY}\" }"
WG_PAYLOAD=$(curl -s -X POST -d "${PAYLOAD}" http://${ochami_wg_ip}:27777/cloud-init/wg-init)
echo $WG_PAYLOAD | jq
CLIENT_IP=$(echo $WG_PAYLOAD | jq -r '."client-vpn-ip"')
SERVER_IP=$(echo $WG_PAYLOAD | jq -r '."server-ip"' | awk -F'/' '{print $1}')
SERVER_PORT=$(echo $WG_PAYLOAD | jq -r '."server-port"')
SERVER_KEY=$(echo $WG_PAYLOAD | jq -r '."server-public-key"')
echo "Setting up local wireguard interface"
echo "Adding wg0 link"
ip link add dev wg0 type wireguard
echo "Adding ip address ${CLIENT_IP}/32"
ip address add dev wg0 ${CLIENT_IP}/32
echo "Setting the private key"
wg set wg0 private-key /etc/wireguard/private.key
echo "Bringing up the wg0 link"
ip link set wg0 up
echo "Setting up the peer with the server"
wg set wg0 peer ${SERVER_KEY} allowed-ips ${SERVER_IP}/32 endpoint ${ochami_wg_ip}:$SERVER_PORT
rm /etc/wireguard/private.key
rm /etc/wireguard/public.key
Add the drop-in and the setup script to the image with copyfiles in the image definition:
copyfiles:
  - src: '/opt/workdir/images/files/cloud-init-override.conf'
    dest: '/etc/systemd/system/cloud-init.service.d/override.conf'
  - src: '/opt/workdir/images/files/ochami-ci-setup.sh'
    dest: '/usr/local/bin/ochami-ci-setup.sh'
Restart cloud-init-server with WireGuard enabled.
Often, we want to allocate nodes for different purposes using different images. Let's take the base image that we created before and build another Kubernetes layer called kubernetes-worker on top of it. We would then modify the boot script to use this new Kubernetes image and update cloud-init to set up the nodes.
The kexec-load.sh script below stages the currently installed kernel and initrd with kexec -l, reusing the running kernel's command line, so a node can reboot into it without going back through firmware and PXE. It skips loading on low-memory nodes and on nodes with GPUs.
kexec-load.sh
#!/usr/bin/env sh
# Require roughly 512 MB of RAM before staging a kernel (MemTotal is in kB; appending 000 approximates bytes)
if [ 512000000 -gt $(cat /proc/meminfo | grep -F 'MemTotal' | grep -oE '[0-9]+' | tr -d '\n'; echo 000) ]; then
echo 'Not enough memory to safely load the kernel' >&2
exit 0
fi
if lspci 2>/dev/null | grep -qi '3D controller'; then
echo 'GPUs detected. Not loading kernel to prevent system instability' >&2
exit 0
fi
# Might need to tweak this if the kernel is in a different spot
exec kexec -l "/boot/vmlinuz-$(uname -r)" --initrd="/boot/initramfs-$(uname -r).img" --reuse-cmdline
The kexec-update.sh script asks BSS for the node's current boot script, downloads the referenced kernel and initrd, and stages them with the boot parameters via kexec -l, so the node can jump straight into whatever image BSS currently serves.
kexec-update.sh
#!/bin/bash
#set -x
set -e
# This whole script is a bit heavy on the heuristics.
# It would be much better to patch BSS to do JSON output.
# This gets the MAC address of the first interface with an IP address
MAC="$(ip addr | grep -A10 'state UP' | grep -oP -m1 '(?<=link/ether )[a-f0-9:]+')"
# This gets the bss IP address from the kernel commandline
BSS_IP="$(grep -oP '(?<=bss=)[^:/ ]+' /proc/cmdline | tail -n1)"
# When I use the NID it just returns a script that chains into the MAC address one
echo 'Getting boot script...'
BOOT_SCRIPT="$(curl -s "http://$BSS_IP:8081/apis/bss/boot/v1/bootscript?mac=$MAC&json=1")"
if [ -z "$BOOT_SCRIPT" ]; then
echo 'Empty boot script! Aborting...'
exit 1
fi
INITRD="$(echo $BOOT_SCRIPT | jq -r .initrd.path)"
# jq -r prints the literal string "null" for missing keys, so check for that too
if [ -z "$INITRD" ] || [ "$INITRD" = "null" ]; then
echo 'No initrd URL. Aborting...'
exit 2
fi
KERNEL="$(echo $BOOT_SCRIPT | jq -r .kernel.path)"
if [ -z "$KERNEL" ] || [ "$KERNEL" = "null" ]; then
echo 'No kernel URL. Aborting...'
exit 3
fi
PARAMS="$(echo $BOOT_SCRIPT | jq -r .params)"
if [ -z "$PARAMS" ] || [ "$PARAMS" = "null" ]; then
echo 'No kernel params. Aborting...'
exit 4
fi
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
echo 'Getting kernel...'
curl -so "$TMP/kernel" "$KERNEL"
echo 'Getting initrd...'
curl -so "$TMP/initrd" "$INITRD"
kexec -l "$TMP/kernel" --initrd "$TMP/initrd" --command-line "$PARAMS"
echo 'All done!'
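Both scripts only stage a kernel with kexec -l; actually rebooting into it is a separate step. A sketch of using them on a node:

```bash
sudo ./kexec-update.sh    # stage whatever kernel/initrd/params BSS currently serves
sudo systemctl kexec      # stop services cleanly, then jump into the staged kernel
# (or `sudo kexec -e` to jump immediately without a clean shutdown)
```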
In the tutorial, we used static discovery to populate our inventory in SMD instead of dynamically discovering nodes on our network. Static discovery is good when we know the MAC address, IP address, xname, and/or node ID of our nodes beforehand, and it guarantees deterministic behavior. However, sometimes we might not know these properties, or we may want to check the current state of our hardware, say after a failure. In these scenarios, we can probe our hardware dynamically using the scanning feature from magellan and then update the state of our inventory.
For this demonstration, we have two prerequisites before we get started:
- Emulate board management controllers (BMCs) with running Redfish services
- Have a running instance of SMD or a full running deployment of the OpenCHAMI services
The magellan repository includes an emulator that we can use for quick and dirty testing. This is useful if we want to try out the capabilities of the tool without having to put too much time and effort into setting up an environment. However, we want to use multiple BMCs to show how magellan can distinguish between Redfish and non-Redfish services.
TODO: Add content setting up multiple emulated BMCs with Redfish services (the quickstart in the deployment-recipes has this already).
A scan sends out requests to all devices on a network specified with the --subnet
flag. If the device responds, it is added to a cache database that we'll need for the next section.
Let's do a scan and see what we can find on our network. We should be able to find all of our emulated BMCs without having to worry too much about any other services.
magellan scan --subnet 172.16.0.0/24 --cache ./assets.db
This command should not produce any output if it runs successfully. By default, the cache is stored in a tiny SQLite 3 database at /tmp/$USER/magellan/assets.db. Here, we stored the cache locally instead with the --cache flag.
We can see which BMCs with Redfish were found with the list command.
magellan list
You should see the emulated BMCs.
# TODO: add list of emulated BMCs from `magellan list` output
Now that we know the IP addresses of the BMCs, let's collect inventory data using the collect command.
We can use the cache to pull in inventory data from the BMCs. If the BMCs require a username and password, we can set them using the secrets store before we run collect.
TEMP_KEY=$(magellan secrets generatekey) # ...or whatever you want to use for your key
export MASTER_KEY=$TEMP_KEY
magellan secrets store default $default_bmc_username:$default_bmc_password
This stores a default BMC username and password to use across all BMC nodes that do not have credentials specified. If we want to add specific credentials, we just need to change default to the host.
magellan secrets store https://172.16.0.101 $bmc01_username:$bmc01_password
The credentials will be used automatically when collect or crawl are run. Additionally, when running collect, we have to add the -v flag to see the output and -o to save it to a file.
magellan collect -v -F yaml -o nodes.yaml
There should be a nodes.yaml file in the current directory. The file can be edited to use different values before uploading to SMD. Once done editing, send it off with the send command.
magellan send -F yaml -d @nodes.yaml https://demo.openchami.cluster:8443
This will store the inventory data in SMD like before with the information found from the scan.
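To double-check that the collected nodes actually landed in SMD, you can list its components. The path below assumes SMD exposes the standard Hardware State Manager v2 API on the same endpoint; -k skips TLS verification for the demo's self-signed certificate:

```bash
curl -sk https://demo.openchami.cluster:8443/hsm/v2/State/Components | jq -r '.Components[].ID'
```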