Add pingpong test #4

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged (25 commits), Nov 20, 2020
43 changes: 39 additions & 4 deletions README.md
@@ -8,7 +8,7 @@ All demos use a terraform-deployed cluster with a single control/login node and

git clone git@github.com:stackhpc/openhpc-tests.git
cd openhpc-tests
virtualenv --python $(which python3) venv
virtualenv --system-site-packages --python $(which python3) venv
. venv/bin/activate
pip install -U pip
pip install -U setuptools
@@ -20,6 +20,8 @@ All demos use a terraform-deployed cluster with a single control/login node and
yum install terraform
terraform init

# TODO: need to add collection in here too.
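
A hedged sketch of what that missing step might look like, assuming the collection is the stackhpc/ansible_collection_slurm_openstack_tools repository referenced later in this README and that a git install is acceptable (neither is confirmed here):

    # assumption: install the collection straight from its git repository
    ansible-galaxy collection install git+https://github.com/stackhpc/ansible_collection_slurm_openstack_tools.git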

# Deploy nodes with Terraform

- Modify the keypair in `main.tf` and ensure the required Centos images are available on OpenStack.
@@ -44,12 +46,13 @@ Available playbooks are:
- `slurm-db.yml`: The basic slurm cluster plus slurmdbd backed by mariadb on the login/control node, which provides more detailed accounting.
- `monitoring-simple.yml`: Add basic monitoring, with prometheus and grafana on the login/control node providing graphical dashboards (over http) showing cpu/network/memory/etc usage for each cluster node. Run `slurm-simple.yml` first.
- `monitoring-db.yml`: Basic monitoring plus statistics and dashboards for Slurm jobs. Run `slurm-db.yml` first.
- `rebuild.yml`: Deploy scripts to enable reimaging of compute nodes controlled by Slurm's `scontrol` command.
- `rebuild.yml`: Deploy scripts to enable reimaging of compute nodes controlled by Slurm's `scontrol` command. Run `slurm-simple.yml` or `slurm-db.yml` first.
- `config-drive.yml` and `main.pkr.hcl`: Packer-based build of compute node images - see the separate section below.
- `test.yml`: Run a set of MPI-based tests on the cluster. Run `slurm-simple.yml` or `slurm-db.yml` first.

For additional details see sections below.
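
Each playbook is run against the Terraform-generated inventory in the same way, for example (illustrative only; substitute the playbook you need):

    # general invocation pattern for the playbooks listed above
    ansible-playbook -i inventory slurm-simple.yml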

# monitoring.yml
# monitoring-simple.yml

Run this using:

@@ -68,8 +71,40 @@ NB: if grafana's yum repos are down you will see `Errors during downloading meta
exit
ansible-playbook -i inventory monitoring.yml -e grafana_password=<password> --skip-tags grafana_install

# rebuild.yml

Enable the compute nodes of a Slurm-based OpenHPC cluster on OpenStack to be reimaged from Slurm.

For full details, including the Slurm commands to use, see the [role's README](https://github.com/stackhpc/ansible_collection_slurm_openstack_tools/blob/main/roles/rebuild/README.md).

Ensure you have `~/.config/openstack/clouds.yaml` defining authentication for a single OpenStack cloud (see the role README linked above to change this location).
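
As a quick sanity check that those credentials work, something like the following should list the cluster's instances (this assumes the `openstack` CLI is installed and that the cloud entry is named `openstack`; both are assumptions, not requirements stated here):

    export OS_CLOUD=openstack   # assumption: the clouds.yaml entry is named "openstack"
    openstack server list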

Then run:

ansible-playbook -i inventory rebuild.yml

Note this does not itself rebuild any nodes; it only deploys the tools to do so.
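
The actual rebuild is then triggered from Slurm itself. The authoritative commands are in the role README linked above; a rough, illustrative form (node name, image reference and the exact `reason=` format are assumptions here) is:

    # assumption: a reboot with a matching reason string triggers the reimage
    scontrol reboot ASAP reason='rebuild image:<image-name-or-id>' <nodename>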

# test.yml

This runs MPI-based tests on the cluster:
- `pingpong`: Runs Intel MPI Benchmarks' IMB-MPI1 pingpong between a pair of (scheduler-selected) nodes. Reports zero-size message latency and maximum bandwidth (a sketch of an equivalent sbatch job is shown after this list).
- `pingmatrix`: Runs a similar pingpong test but between all pairs of nodes. Reports zero-size message latency & maximum bandwidth.
- `hpl-solo`: Runs HPL **separately** on all nodes, using 80% of memory, reporting Gflops on each node.
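
For orientation, the inline sbatch script that this PR removes from `test.yml` (visible in the diff further down) submitted an equivalent IMB-MPI1 pingpong job, roughly as follows:

    #!/usr/bin/bash
    #SBATCH --ntasks=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --output=%x.out
    #SBATCH --error=%x.out

    module load gnu9/9.3.0
    module load openmpi4/4.0.4
    module load imb/2019.6

    srun --mpi=pmix_v3 IMB-MPI1 pingpong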

For full details see the [role's README](https://github.com/stackhpc/ansible_collection_slurm_openstack_tools/blob/main/roles/test/README.md).

First set `openhpc_tests_hpl_NB` in [test.yml](test.yml) to the appropriate HPL blocksize 'NB' for the compute node processor - for Intel CPUs see [here](https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/intel-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html).

Then run:

ansible-playbook -i inventory test.yml
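
If you prefer not to edit `test.yml`, the blocksize can presumably also be overridden on the command line in the usual Ansible way (untested here):

    ansible-playbook -i inventory test.yml -e openhpc_tests_hpl_NB=192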

Results are reported in the Ansible stdout - the pingmatrix test also writes an HTML results file to the Ansible host.


# Destroying the cluster

When finished, run:

terraform destroy
terraform destroy --auto-approve
22 changes: 22 additions & 0 deletions config-drive.yml
@@ -0,0 +1,22 @@
---
- hosts: localhost
vars:
ssh_public_key_path: ~/.ssh/id_rsa.pub
ansible_python_interpreter: /usr/bin/python # avoids lack of python3 bindings for yum
roles:
- role: ansible-role-configdrive # install from https://gitlab.com/davidblaisonneau-orange/ansible-role-configdrive
configdrive_os_family: "RedHat"
configdrive_uuid: "openhpc2"
configdrive_fqdn: "test.example.com"
configdrive_name: "openhpc2"
configdrive_ssh_public_key: "{{ lookup('file', ssh_public_key_path) }}"
configdrive_config_dir: "/tmp/configdrive/"
configdrive_volume_path: "{{ playbook_dir }}"
configdrive_config_dir_delete: True
configdrive_network_device_list: []
tasks:
- name: Ensure configdrive is decoded and decompressed
shell: >
base64 -d {{ playbook_dir }}/openhpc2.gz
| gunzip
> {{ playbook_dir }}/config-drive.iso
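
A plausible way to run this playbook, assuming the configdrive role is installed from the GitLab URL given in the comment above and that the implicit localhost inventory is sufficient (both assumptions):

    # assumption: a git install of the role works and names it ansible-role-configdrive
    ansible-galaxy install git+https://gitlab.com/davidblaisonneau-orange/ansible-role-configdrive.git
    ansible-playbook config-drive.yml
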
4 changes: 3 additions & 1 deletion inventory.tpl
@@ -1,7 +1,6 @@
[all:vars]
ansible_user=centos
ssh_proxy=${login.network[0].fixed_ip_v4}
ansible_ssh_common_args='-o ProxyCommand="ssh centos@${login.network[0].fixed_ip_v4} -W %h:%p"'

[${cluster_name}_login]
${login.name} ansible_host=${login.network[0].fixed_ip_v4} server_networks='${jsonencode({for net in login.network: net.name => [ net.fixed_ip_v4 ] })}'
@@ -11,6 +10,9 @@ ${login.name} ansible_host=${login.network[0].fixed_ip_v4} server_networks='${js
${compute.name} ansible_host=${compute.network[0].fixed_ip_v4} server_networks='${jsonencode({for net in compute.network: net.name => [ net.fixed_ip_v4 ] })}'
%{ endfor ~}

[${cluster_name}_compute:vars]
ansible_ssh_common_args='-o ProxyCommand="ssh centos@${login.network[0].fixed_ip_v4} -W %h:%p"'

[cluster:children]
${cluster_name}_login
${cluster_name}_compute
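
The template change above routes compute-node SSH through the login node. An equivalent manual connection, with placeholder addresses standing in for the Terraform-assigned ones, would look something like:

    # <login-ip> and <compute-ip> are placeholders, not values from this repo
    ssh -o ProxyCommand="ssh centos@<login-ip> -W %h:%p" centos@<compute-ip>
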
40 changes: 40 additions & 0 deletions main.pkr.hcl
@@ -0,0 +1,40 @@
# "timestamp" template function replacement:s
locals { timestamp = formatdate("YYMMDD-hhmm", timestamp())}

source "qemu" "openhpc2" {
iso_url = "https://cloud.centos.org/centos/8/x86_64/images/CentOS-8-GenericCloud-8.2.2004-20200611.2.x86_64.qcow2"
iso_checksum = "sha256:d8984b9baee57b127abce310def0f4c3c9d5b3cea7ea8451fc4ffcbc9935b640"
disk_image = true # as above is .qcow2 not .iso
disk_size = "20G" # needs to match compute VM
disk_compression = true
accelerator = "kvm" # default, if available
ssh_username = "centos"
ssh_timeout = "20m"
net_device = "virtio-net" # default
disk_interface = "virtio" # default
qemu_binary = "/usr/libexec/qemu-kvm" # fixes GLib-WARNING **: 13:48:38.600: gmem.c:489: custom memory allocation vtable not supported
headless = true
output_directory = "build"
ssh_private_key_file = "~/.ssh/id_rsa"
qemuargs = [
["-monitor", "unix:qemu-monitor.sock,server,nowait"],
["-serial", "pipe:/tmp/qemu-serial"], ["-m", "896M"],
["-cdrom", "config-drive.iso"]
]
vm_name = "openhpc2-${local.timestamp}.qcow2"
shutdown_command = "sudo shutdown -P now"
}

build {
sources = ["source.qemu.openhpc2"]
provisioner "ansible" {
playbook_file = "slurm-simple.yml"
host_alias = "builder"
groups = ["cluster", "cluster_compute"]
extra_arguments = ["-i", "inventory", "--limit", "builder",
"--extra-vars", "openhpc_slurm_service_started=false nfs_client_mnt_state=present", # crucial to avoid trying to start services
"-v"]
keep_inventory_file = true # for debugging
use_proxy = false # see https://www.packer.io/docs/provisioners/ansible#troubleshooting
}
}
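
A hedged sketch of how this template might be invoked, assuming Packer with QEMU/KVM support and the Ansible provisioner is installed and that `config-drive.iso` has already been produced by `config-drive.yml` (the invocation itself is not documented in this PR):

    # run from the repo root so inventory, slurm-simple.yml and config-drive.iso are found
    packer build main.pkr.hcl
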
11 changes: 8 additions & 3 deletions main.tf
@@ -19,16 +19,21 @@ variable "key_pair" {
default = "centos_at_sb-mol"
}

variable "node_image" {
variable "login_image" {
#default = "CentOS-7-x86_64-GenericCloud-2020-04-22"
default = "CentOS-8-GenericCloud-8.2.2004-20200611.2.x86_64"
#default = "CentOS7.8" #-OpenHPC"
}

variable "compute_image" {
#default = "CentOS-7-x86_64-GenericCloud-2020-04-22"
default = "CentOS-8-GenericCloud-8.2.2004-20200611.2.x86_64"
}

resource "openstack_compute_instance_v2" "login" {

name = "${var.cluster_name}-login-0"
image_name = var.node_image
image_name = var.login_image
flavor_name = "general.v1.small"
key_pair = var.key_pair
network {
@@ -42,7 +47,7 @@ resource "openstack_compute_instance_v2" "compute" {
for_each = toset(var.compute_names)

name = "${var.cluster_name}-${each.value}"
image_name = var.node_image
image_name = var.compute_image
flavor_name = "general.v1.small"
#flavor_name = "compute-A"
key_pair = var.key_pair
2 changes: 1 addition & 1 deletion slurm-simple.yml
@@ -2,7 +2,7 @@
become: yes
tasks:
- import_role:
name: stackhpc.nfs
name: ansible-role-cluster-nfs # need `nfs_client_mnt_state`
vars:
nfs_enable:
server: "{{ inventory_hostname in groups['cluster_login'] | first }}"
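
The role rename above means an NFS role must now be present under the name `ansible-role-cluster-nfs`. One way to get it, assuming it is the stackhpc role previously used as `stackhpc.nfs` (repository URL and install method are assumptions):

    # assumption: this repository is the source of the former stackhpc.nfs Galaxy role
    ansible-galaxy install git+https://github.com/stackhpc/ansible-role-cluster-nfs.git
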
68 changes: 25 additions & 43 deletions test.yml
@@ -1,48 +1,30 @@
# NB: this only works on centos 8 / ohpc v2 as we want UCX
- hosts: all
# TODO: add support for groups/partitions
# Currently need to run with something like:
# ANSIBLE_LIBRARY=. ansible-playbook -i inventory test.yml

- hosts: cluster
name: Export/mount /opt via NFS for ohpc and intel packages
become: yes
tasks:
- name: Install gnu 9 + openmpi (w/ ucx) + performance tools
yum:
name: ohpc-gnu9-openmpi4-perf-tools
state: present
- name: Make centos nfs share owner
file:
path: /mnt/nfs
owner: "{{ ansible_user }}"
group: "{{ ansible_user }}"
- name: Create sbatch script for IMB pingpong
copy:
dest: "/mnt/nfs/ping.sh"
content: |
#!/usr/bin/bash

#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.out
#SBATCH --error=%x.out

module load gnu9/9.3.0
module load openmpi4/4.0.4
module load imb/2019.6

srun --mpi=pmix_v3 IMB-MPI1 pingpong
run_once: True

- import_role:
name: ansible-role-cluster-nfs
tags: nfs
vars:
nfs_enable:
server: "{{ inventory_hostname in groups['cluster_login'] | first }}"
clients: "{{ inventory_hostname in groups['cluster_compute'] }}"
nfs_server: "{{ hostvars[groups['cluster_login'] | first ]['server_networks']['ilab'][0] }}"
nfs_export: "/opt"
nfs_client_mnt_point: "/opt"

- hosts: cluster_login[0]
tags: mpirun
name: Run tests
tags: test
tasks:
- name: Run pingpong
shell: sbatch --wait ping.sh
become: no
args:
chdir: "/mnt/nfs/"

- name: Slurp output
slurp:
src: /mnt/nfs/ping.sh.out
register: ping_out

- debug:
msg: "{{ ping_out['content'] | b64decode }}"

- import_role:
name: stackhpc.slurm_openstack_tools.test
vars:
openhpc_tests_rootdir: /mnt/nfs/ohcp-tests
openhpc_tests_hpl_NB: 192