Use Arcus for CI #182

Merged May 16, 2022
37 commits
18187c1
copy smslabs TF to cookiecutter skeleton
sjpb May 4, 2022
c22083e
use skeleton TF for arcus
sjpb May 4, 2022
79e0f8d
define ports in skeleton TF
sjpb May 4, 2022
a6f87f0
add direct-mode ports for arcus env
sjpb May 4, 2022
c0d016b
copy getfaults.py from smslabs to skeleton
sjpb May 4, 2022
c341256
getfaults takes TF state directory
sjpb May 4, 2022
05ab5ec
make security groups idempotent for TF apply
sjpb May 4, 2022
0326aaf
add arcus env config
sjpb May 4, 2022
8104cc2
add arcus CI workflow
sjpb May 4, 2022
ef5f59b
temporarily disable smslabs CI during arcus CI development
sjpb May 4, 2022
632099c
move arcus to same base image as smslabs
sjpb May 4, 2022
2e66f3d
bugfix reading TF state for provisioning errors
sjpb May 5, 2022
31f3b8f
automate slurm partition definition
sjpb May 5, 2022
fa58cce
fix arcus bastion definition for CI
sjpb May 5, 2022
f9d882c
try to fix ansible transfer mechanism failing after login rebuild
sjpb May 5, 2022
5d75fc7
Merge branch 'main' into ci/arcus-basic
sjpb May 5, 2022
a2301cc
use latest base image openhpc-220504-0904 in arcus
sjpb May 5, 2022
ad714e5
try to fix 'Connection timed out during banner exchange' after login …
sjpb May 9, 2022
803568e
use port security groups in skeleton TF
sjpb May 10, 2022
3e123a1
revert reset connection in CI
sjpb May 10, 2022
c7648ef
try to fix 'Connection timed out during banner exchange' after login …
sjpb May 10, 2022
a832a97
try to fix 'Connection timed out during banner exchange' after login …
sjpb May 10, 2022
d6a178b
try to fix 'wait for login' failing login rebuild
sjpb May 11, 2022
157d778
fix pingmatrix HTML format division by zero
sjpb May 11, 2022
922e4b2
move all github workflows to smslabs.yml
sjpb May 11, 2022
4bd48c5
move workflows to stackhpc.yml
sjpb May 11, 2022
dfc96ea
allow CI runs on clouds to continue if one fails
sjpb May 11, 2022
baea143
Merge branch 'fix/pingmatrix_div_by_zero' into ci/arcus-basic
sjpb May 11, 2022
4f3f6fc
move smslabs to use common terraform (as does arcus)
sjpb May 11, 2022
9e72d76
update README status badge to correct workflow
sjpb May 11, 2022
5465209
Merge branch 'main' into ci/arcus-basic
sjpb May 12, 2022
822bc65
remove unneeded builder exclusions from reimage test
sjpb May 12, 2022
4510bf3
allow supplying security group names in skeleton TF and use for smslabs
sjpb May 12, 2022
eb65d95
wait more for control node after reimage in CI
sjpb May 12, 2022
208bf4c
fix 4510bf331f373af4c8c9a05a659f573273af971e
sjpb May 12, 2022
f8b7972
try removing pause from CI reimage test
sjpb May 16, 2022
02f9475
fix divide-by-zero bug in pingmatrix properly
sjpb May 16, 2022
47 changes: 28 additions & 19 deletions .github/workflows/smslabs.yml → .github/workflows/stackhpc.yml
@@ -1,13 +1,20 @@

name: Test on SMS-Labs OpenStack in stackhpc-ci
name: Test deployment and image build on OpenStack
on:
push:
branches:
- main
pull_request:
concurrency: stackhpc-ci # openstack project
jobs:
smslabs:
openstack:
name: openstack-ci-${{ matrix.cloud }}
strategy:
matrix:
cloud:
- "smslabs" # SMS-Labs OpenStack in stackhpc-ci project
- "arcus" # Arcus OpenStack in rcp-cloud-portal-demo project, with RoCE
fail-fast: false # as want clouds to continue independently
concurrency: ${{ matrix.cloud }}
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v2
@@ -16,13 +23,14 @@ jobs:
run: |
set -x
mkdir ~/.ssh
echo "$SSH_KEY" > ~/.ssh/id_rsa
echo "${${{ matrix.cloud }}_SSH_KEY}" > ~/.ssh/id_rsa
chmod 0600 ~/.ssh/id_rsa
env:
SSH_KEY: ${{ secrets.SSH_KEY }}
smslabs_SSH_KEY: ${{ secrets.SSH_KEY }}
arcus_SSH_KEY: ${{ secrets.ARCUS_SSH_KEY }}

- name: Add bastion's ssh key to known_hosts
run: cat environments/smslabs/bastion_fingerprint >> ~/.ssh/known_hosts
run: cat environments/${{ matrix.cloud }}/bastion_fingerprint >> ~/.ssh/known_hosts
shell: bash

- name: Install ansible etc
@@ -33,21 +41,22 @@ jobs:

- name: Initialise terraform
run: terraform init
working-directory: ${{ github.workspace }}/environments/smslabs/terraform
working-directory: ${{ github.workspace }}/environments/${{ matrix.cloud }}/terraform

- name: Write clouds.yaml
run: |
mkdir -p ~/.config/openstack/
echo "$CLOUDS_YAML" > ~/.config/openstack/clouds.yaml
echo "${${{ matrix.cloud }}_CLOUDS_YAML}" > ~/.config/openstack/clouds.yaml
shell: bash
env:
CLOUDS_YAML: ${{ secrets.CLOUDS_YAML }}
smslabs_CLOUDS_YAML: ${{ secrets.CLOUDS_YAML }}
arcus_CLOUDS_YAML: ${{ secrets.ARCUS_CLOUDS_YAML }}

- name: Provision infrastructure
id: provision
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
terraform apply -auto-approve
env:
@@ -58,9 +67,9 @@ jobs:
id: provision_failure
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
echo "::set-output name=messages::$(./getfaults.py)"
echo "::set-output name=messages::$(../../skeleton/\{\{cookiecutter.environment\}\}/terraform/getfaults.py $PWD)"
env:
OS_CLOUD: openstack
TF_VAR_cluster_name: ci${{ github.run_id }}
@@ -69,7 +78,7 @@ jobs:
- name: Delete infrastructure if failed due to lack of hosts
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
terraform destroy -auto-approve
env:
@@ -81,7 +90,7 @@ jobs:
# see pre-hook for the image build
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
ansible all -m wait_for_connection
ansible-playbook ansible/adhoc/generate-passwords.yml
echo test_user_password: "$TEST_USER_PASSWORD" > $APPLIANCES_ENVIRONMENT_ROOT/inventory/group_vars/basic_users/defaults.yml
@@ -94,7 +103,7 @@ jobs:
- name: Confirm Open Ondemand is up (via SOCKS proxy)
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate

# load ansible variables into shell:
ansible-playbook ansible/ci/output_vars.yml \
@@ -126,7 +135,7 @@ jobs:
# TODO: test control node reimage
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
ansible all -m wait_for_connection
ansible-playbook -vv ansible/ci/test_reimage.yml
env:
@@ -136,7 +145,7 @@ jobs:
- name: Run MPI-based tests
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
ansible-playbook -vv ansible/adhoc/hpctests.yml
env:
ANSIBLE_FORCE_COLOR: True
@@ -145,7 +154,7 @@ jobs:
- name: Delete infrastructure
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
terraform destroy -auto-approve
env:
@@ -156,7 +165,7 @@ jobs:
- name: Delete images
run: |
. venv/bin/activate
. environments/smslabs/activate
. environments/${{ matrix.cloud }}/activate
ansible-playbook -vv ansible/ci/delete_images.yml
env:
OS_CLOUD: openstack
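
The per-cloud secret selection in this workflow relies on two expansion passes: the Actions runner substitutes `${{ matrix.cloud }}` into the script text first, and bash then expands the resulting `${arcus_SSH_KEY}`-style variable from the step's `env`. A minimal Python sketch of that two-step lookup follows. The helpers `render_step` and `expand_shell_var` and the key values are hypothetical, for illustration only; they are not part of the workflow.

```python
import re

def render_step(script: str, matrix: dict) -> str:
    """Pass 1: substitute ${{ matrix.<key> }} expressions before the shell runs."""
    return re.sub(
        r"\$\{\{\s*matrix\.(\w+)\s*\}\}",
        lambda m: matrix[m.group(1)],
        script,
    )

def expand_shell_var(script: str, env: dict) -> str:
    """Pass 2: expand ${NAME} from the step environment, as bash would."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env[m.group(1)], script)

step = 'echo "${${{ matrix.cloud }}_SSH_KEY}"'
env = {"smslabs_SSH_KEY": "key-for-smslabs", "arcus_SSH_KEY": "key-for-arcus"}  # made-up values

rendered = render_step(step, {"cloud": "arcus"})
print(rendered)                         # echo "${arcus_SSH_KEY}"
print(expand_shell_var(rendered, env))  # echo "key-for-arcus"
```

Because both secrets are exported under cloud-prefixed names, each matrix job picks its own key without any conditional logic in the script.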
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ config-drive.iso
venv
*.pyc
packer/openhpc2
.vscode
2 changes: 1 addition & 1 deletion README.md
@@ -1,4 +1,4 @@
[![Test on OpenStack via smslabs](https://github.com/stackhpc/ansible-slurm-appliance/actions/workflows/smslabs.yml/badge.svg)](https://github.com/stackhpc/ansible-slurm-appliance/actions/workflows/smslabs.yml)
[![Test deployment and image build on OpenStack](https://github.com/stackhpc/ansible-slurm-appliance/actions/workflows/stackhpc.yml/badge.svg)](https://github.com/stackhpc/ansible-slurm-appliance/actions/workflows/stackhpc.yml)

# StackHPC Slurm Appliance

33 changes: 24 additions & 9 deletions ansible/ci/test_reimage.yml
@@ -1,4 +1,4 @@
- hosts: all:!builder
- hosts: all
become: no
gather_facts: no
tags:
@@ -8,7 +8,7 @@
tasks:
- import_tasks: get_image_ids.yml

- hosts: login:!builder
- hosts: login
become: no
gather_facts: no
tags: reimage_login
@@ -23,19 +23,24 @@
cmd: openstack server show {{ inventory_hostname }} --format value -c image
register: openstack_login
delegate_to: localhost
retries: 5
delay: 30
retries: 36
delay: 5
until: login_build.artifact_id in openstack_login.stdout
changed_when: false

- name: Wait for login connection
wait_for_connection:
timeout: 800
delay: 5
register: login_connect
retries: 3
delay: 10
until: login_connect is not failed

- name: Check slurm up after reimaging login node
import_tasks: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/hooks/check_slurm.yml"

- hosts: login:!builder
- hosts: login
become: no
gather_facts: no
tags: reimage_compute
@@ -57,20 +62,25 @@
until: compute_build.artifact_id in openstack_compute.stdout
changed_when: false

- hosts: compute:!builder
- hosts: compute
become: no
gather_facts: no
tags: reimage_compute
tasks:
- name: Wait for compute connection
wait_for_connection:
timeout: 800
delay: 5
register: compute_connect
retries: 3
delay: 10
until: compute_connect is not failed

- name: Check slurm up after reimaging compute nodes
import_tasks: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/hooks/check_slurm.yml"
run_once: true

- hosts: control:!builder
- hosts: control
become: no
gather_facts: no
tags: reimage_control
@@ -93,12 +103,17 @@
- name: Wait for control connection
wait_for_connection:
timeout: 800

delay: 5
register: control_connect
retries: 3
delay: 10
until: control_connect is not failed

- name: Run slurm playbook again to add partition info
import_playbook: ../slurm.yml
tags: reimage_control

- hosts: control:!builder
- hosts: control
become: no
gather_facts: no
tags: reimage_control
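
The `retries`/`delay`/`until` settings in these plays bound how long each wait can take: 36 retries at 5 s intervals polls the server image for roughly three minutes, compared with roughly two and a half minutes for the previous 5 × 30 s, but notices the change much sooner. The polling pattern can be sketched in Python as below. `retry_until` is a hypothetical helper that approximates, rather than reproduces, Ansible's retry loop, and the image names are invented.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_until(action: Callable[[], T], done: Callable[[T], bool],
                retries: int, delay: float,
                sleep: Callable[[float], None] = time.sleep) -> T:
    """Run action up to `retries` times, sleeping `delay` seconds between attempts."""
    for attempt in range(1, retries + 1):
        result = action()
        if done(result):
            return result
        if attempt < retries:
            sleep(delay)
    raise TimeoutError(f"condition not met after {retries} attempts")

# Simulated `openstack server show` output: the new image appears on the third poll.
images = iter(["old-image", "old-image", "new-image"])
waits = []  # record sleeps instead of actually waiting
result = retry_until(lambda: next(images), lambda out: "new-image" in out,
                     retries=36, delay=5, sleep=waits.append)
print(result, sum(waits))  # new-image 10
```

A short delay with many retries keeps the worst-case budget similar while reducing the time lost after the condition becomes true.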
5 changes: 4 additions & 1 deletion ansible/roles/hpctests/library/plot_nxnlatbw.py
@@ -86,7 +86,10 @@ def html_rows(rankAs, rankBs, nodes, data):
for rankB in rankBs:
val = data.get((rankA, rankB))
if val is not None:
lightness = 50 + (50 - 50 * ((val - minv) / (maxv - minv))) # want value in range LOW = 100 (white) -> HIGH 50(red)
try:
lightness = 50 + (50 - 50 * ((val - minv) / (maxv - minv))) # want value in range LOW = 100 (white) -> HIGH 50(red)
except ZeroDivisionError: # no min-max spread
lightness = 100
outrow += ['<td style="background-color:hsl(0, 100%%, %i%%);">%.1f</td>' % (lightness, val)]
else:
outrow += ['<td>-</td>']
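
The guarded expression in this hunk can be exercised in isolation. The sketch below wraps it in a hypothetical standalone function `lightness` (the real code computes it inline inside `html_rows`) and assumes plain Python numbers; NumPy floats would produce `nan` with a warning instead of raising `ZeroDivisionError`.

```python
def lightness(val: float, minv: float, maxv: float) -> float:
    """Map val in [minv, maxv] to an HSL lightness: lowest -> 100 (white), highest -> 50 (red)."""
    try:
        return 50 + (50 - 50 * ((val - minv) / (maxv - minv)))
    except ZeroDivisionError:  # all values equal, so there is no spread to scale by
        return 100

print(lightness(1.0, 1.0, 3.0))  # 100.0 (minimum value, white)
print(lightness(3.0, 1.0, 3.0))  # 50.0  (maximum value, red)
print(lightness(2.0, 2.0, 2.0))  # 100   (no min-max spread, falls back to white)
```

The last case is the one that previously failed: when every measured value is identical, `maxv - minv` is zero.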
3 changes: 3 additions & 0 deletions environments/arcus/.gitignore
@@ -0,0 +1,3 @@
partitions.yml
secrets.yml
hosts
23 changes: 23 additions & 0 deletions environments/arcus/activate
@@ -0,0 +1,23 @@
export APPLIANCES_ENVIRONMENT_ROOT=$(dirname $(realpath ${BASH_SOURCE[0]:-${(%):-%x}}))
echo "Setting APPLIANCES_ENVIRONMENT_ROOT to $APPLIANCES_ENVIRONMENT_ROOT"

APPLIANCES_ENVIRONMENT_NAME=$(basename $APPLIANCES_ENVIRONMENT_ROOT)
export PS1="${APPLIANCES_ENVIRONMENT_NAME}/ ${PS1}"

export APPLIANCES_REPO_ROOT=$(realpath "$APPLIANCES_ENVIRONMENT_ROOT/../..")
echo "Setting APPLIANCES_REPO_ROOT to $APPLIANCES_REPO_ROOT"

export TF_VAR_environment_root=$(realpath "$APPLIANCES_ENVIRONMENT_ROOT")
echo "Setting TF_VAR_environment_root to $TF_VAR_environment_root"

export PKR_VAR_environment_root=$(realpath "$APPLIANCES_ENVIRONMENT_ROOT")
echo "Setting PKR_VAR_environment_root to $PKR_VAR_environment_root"

export PKR_VAR_repo_root=$(realpath "$APPLIANCES_REPO_ROOT")
echo "Setting PKR_VAR_repo_root to $PKR_VAR_repo_root"

if [ -f "$APPLIANCES_ENVIRONMENT_ROOT/ansible.cfg" ]; then
export ANSIBLE_CONFIG=$APPLIANCES_ENVIRONMENT_ROOT/ansible.cfg
fi
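
The `activate` script derives every exported variable from its own location: the environment root is the directory containing the script, and the repository root is two levels above that. A rough Python equivalent, useful for checking what the script would export, is sketched below. The function name `environment_vars` is hypothetical, and the sketch omits the `PS1` prompt change.

```python
from pathlib import Path

def environment_vars(activate_path: str) -> dict:
    """Compute the variables the activate script exports for a given script path."""
    env_root = Path(activate_path).resolve().parent   # dirname of the realpath of the script
    repo_root = (env_root / ".." / "..").resolve()    # environments/<name>/ -> repository root
    variables = {
        "APPLIANCES_ENVIRONMENT_ROOT": str(env_root),
        "APPLIANCES_REPO_ROOT": str(repo_root),
        "TF_VAR_environment_root": str(env_root),
        "PKR_VAR_environment_root": str(env_root),
        "PKR_VAR_repo_root": str(repo_root),
    }
    cfg = env_root / "ansible.cfg"
    if cfg.is_file():  # ANSIBLE_CONFIG is only set when the environment ships its own config
        variables["ANSIBLE_CONFIG"] = str(cfg)
    return variables

for name, value in environment_vars("environments/arcus/activate").items():
    print(f"{name}={value}")
```

Since the Terraform and Packer variables are derived from the same root, sourcing a different environment's `activate` is enough to retarget all three tools.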


14 changes: 14 additions & 0 deletions environments/arcus/ansible.cfg
@@ -0,0 +1,14 @@
[defaults]
any_errors_fatal = True
stdout_callback = debug
stderr_callback = debug
gathering = smart
forks = 30
host_key_checking = False
inventory = ../common/inventory,inventory
collections_path = ../../ansible/collections
roles_path = ../../ansible/roles

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=240s -o PreferredAuthentications=publickey -o UserKnownHostsFile=/dev/null
pipelining = True
3 changes: 3 additions & 0 deletions environments/arcus/bastion_fingerprint
@@ -0,0 +1,3 @@
|1|BwhEZQPqvZcdf9Phmh2mTPmIivU=|bHi1Nf8dYI8z1C+qsqQFPAty1xA= ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQChxwhZggdwj55gNzfDBzah0G8IeTPQjgMZrpboxp2BO4J+o1iZSwDj+2fqyhBGTE43vCJR13uEygz49XIy+t17qBNwHz4fVVR7jdMNymtbZoOsq9oAoBdGEICHrMzQsYZmT9+Wt74ZP2PKOOn+a+f2vg7YdeSy1UhT08iJlbXwCx56fCQnMJMOnZM9MXVLd4NUFN1TeOCIBQHwRiMJyJ7S7CdUKpyUqHOG85peKiPJ07C0RZ/W5HkYKqltwtvPGQd262p5eLC9j3nhOYSG2meRV8yTxYz3lDIPDx0+189CZ5NaxFSPCgqSYA24zavhPVLQqoct7nd7fcEw9JiTs+abZC6GckCONSHDLM+iRtWC/i5u21ZZDLxM9SIqPI96cYFszGeqyZoXxS5qPaIDHbQNAEqJp9ygNXgh9vuBo7E+aWYbFDTG0RuvW02fbmFfZw2/yXIr37+cQX+GPOnkfIRuHE3Hx5eN8C04v+BMrAfK2minawhG3A2ONJs9LI6QoeE=
|1|whGSPLhKW4xt/7PWOZ1treg3PtA=|F5gwV8j0JYWDzjb6DvHHaqO+sxs= ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCpCG881Gt3dr+nuVIC2uGEQkeVwG6WDdS1WcCoxXC7AG+Oi5bfdqtf4IfeLpWmeuEaAaSFH48ODFr76ViygSjU=
|1|0V6eQ1FKO5NMKaHZeNFbw62mrJs=|H1vuGTbbtZD2MEgZxQf1PXPk+yU= ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEnOtYByM3s2qvRT8SS1sn5z5sbwjzb1alm0B3emPcHJ
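
The entries in this file use OpenSSH's hashed `known_hosts` format: the host field is `|1|<base64 salt>|<base64 HMAC-SHA1 of the hostname keyed with the salt>`, so the bastion address is not readable from the file. A sketch of how such a field can be checked against a candidate hostname is given below. The helper names and the example hostname and salt are hypothetical; the sketch does not claim which address the entries above correspond to.

```python
import base64
import hashlib
import hmac

def hash_host(hostname: str, salt: bytes) -> str:
    """Build a hashed host field of the form |1|<salt>|<hash>."""
    digest = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return "|1|{}|{}".format(base64.b64encode(salt).decode(),
                             base64.b64encode(digest).decode())

def hashed_host_matches(host_field: str, hostname: str) -> bool:
    """Return True if a hashed known_hosts host field was generated for `hostname`."""
    _, magic, salt_b64, hash_b64 = host_field.split("|")
    if magic != "1":
        raise ValueError("not an HMAC-SHA1 hashed host field")
    salt = base64.b64decode(salt_b64)
    expected = base64.b64decode(hash_b64)
    actual = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return hmac.compare_digest(actual, expected)

field = hash_host("bastion.example.org", salt=b"0123456789abcdefghij")  # made-up host and salt
print(hashed_host_matches(field, "bastion.example.org"))
print(hashed_host_matches(field, "other.example.org"))
```

Each key type in the file carries its own salt, which is why the three host fields differ even though they describe the same host.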
7 changes: 7 additions & 0 deletions environments/arcus/builder.pkrvars.hcl
@@ -0,0 +1,7 @@
flavor = "vm.alaska.cpu.general.small"
networks = ["a262aabd-e6bf-4440-a155-13dbc1b5db0e"] # WCDC-iLab-60
source_image_name = "openhpc-220413-1545.qcow2"
ssh_keypair_name = "slurm-app-ci"
security_groups = ["default", "SSH"]
ssh_bastion_host = "128.232.222.183"
ssh_bastion_username = "slurm-app-ci"
20 changes: 20 additions & 0 deletions environments/arcus/hooks/check_slurm.yml
@@ -0,0 +1,20 @@
- name: Run sinfo
shell: 'sinfo --noheader --format="%N %P %a %l %D %t" | sort' # using --format ensures we control whitespace: node_name,partition,partition_state,max_jobtime,num_nodes,node_state
register: sinfo
changed_when: false
until: "'boot' not in sinfo.stdout_lines"
retries: 5
delay: 10
- name: Check nodes have expected slurm state
assert:
that: sinfo.stdout_lines == expected_sinfo
fail_msg: |
sinfo output not as expected:
actual:
{{ sinfo.stdout_lines }}
expected:
{{ expected_sinfo }}
<end>
vars:
expected_sinfo:
- "{{ openhpc_cluster_name }}-compute-[0-1] {{ openhpc_slurm_partitions[0].name }}* up 60-00:00:00 2 idle"
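
The check works because `--format` fixes both the field order and the whitespace, so the expected output can be built as an exact string. The sketch below shows the comparison in Python, assuming a two-node cluster as in the hook. `expected_sinfo` and `check_sinfo` are hypothetical helpers, and the cluster and partition names are invented.

```python
def expected_sinfo(cluster_name: str, partition: str) -> list:
    """Build the single line the hook expects, in %N %P %a %l %D %t order."""
    nodelist = f"{cluster_name}-compute-[0-1]"        # %N: two compute nodes
    return [f"{nodelist} {partition}* up 60-00:00:00 2 idle"]  # '*' marks the default partition

def check_sinfo(stdout: str, expected: list) -> bool:
    """Compare sinfo output line by line, like `sinfo.stdout_lines == expected_sinfo`."""
    return stdout.splitlines() == expected

good = "ci123-compute-[0-1] small* up 60-00:00:00 2 idle\n"
booting = "ci123-compute-[0-1] small* up 60-00:00:00 2 boot\n"
print(check_sinfo(good, expected_sinfo("ci123", "small")))
print(check_sinfo(booting, expected_sinfo("ci123", "small")))
```

An exact match is strict by design: any node that is down, draining, or still booting changes the state column or splits the node list across lines, and the assertion fails.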
19 changes: 19 additions & 0 deletions environments/arcus/hooks/post.yml
@@ -0,0 +1,19 @@
- hosts: login:!builder # won't have a slurm control daemon when in build
become: no
gather_facts: false
tasks:
- name: Check slurm up after direct deploy
import_tasks: check_slurm.yml

- hosts: localhost
become: false
tags: build
tasks:
- name: Check Packer build finished
async_status:
jid: "{{ packer_run.ansible_job_id }}"
register: packer_result
until: packer_result.finished
retries: 30 # allow 15 mins
delay: 30
when: packer_run is defined # allows rerunning post.yml
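
The Packer build is started in `pre.yml` with `async` and `poll: 0`, so it runs in the background during the deployment, and this hook collects the result afterwards by polling up to 30 times at 30 s intervals (15 minutes). The same start-now, collect-later pattern is sketched below in Python with a thread pool. `fake_build`, `wait_finished`, and the image id are hypothetical stand-ins, and the delays are shortened so the example runs quickly.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

def fake_build() -> str:
    """Stand-in for the long-running `packer build`; returns a made-up image id."""
    time.sleep(0.02)
    return "image-1234"

def wait_finished(job: Future, retries: int, delay: float) -> str:
    """Poll a background job, similar to async_status with until/retries/delay."""
    for _ in range(retries):
        if job.done():
            return job.result()
        time.sleep(delay)
    raise TimeoutError(f"job not finished after {retries} polls")

with ThreadPoolExecutor(max_workers=1) as pool:
    job = pool.submit(fake_build)  # like async + poll: 0, returns immediately
    # ... the rest of the deployment would run here ...
    image = wait_finished(job, retries=30, delay=0.05)

print(image)
print(30 * 30 / 60)  # polling budget in minutes for retries: 30, delay: 30
```

The `when: packer_run is defined` guard corresponds to only waiting when a job handle exists, which is what allows `post.yml` to be rerun on its own.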
32 changes: 32 additions & 0 deletions environments/arcus/hooks/pre.yml
@@ -0,0 +1,32 @@
- hosts: localhost
become: false
tags: build
tasks:
- name: Ensure secrets generated
include_role:
name: passwords

- name: Build packer images
shell:
cmd: |
cd packer
PACKER_LOG=1 packer build -on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
chdir: "{{ lookup('env', 'APPLIANCES_REPO_ROOT') }}"
when: "'builder' not in group_names" # avoid recursion!
register: packer_run
async: 2700 # 45 minutes
poll: 0

- hosts: all
become: true
tags: etc_hosts
tasks:
- name: Create /etc/hosts for all nodes as DNS doesn't work
blockinfile:
path: /etc/hosts
create: yes
state: present
block: |
{% for hostname in groups['all'] %}
{{ hostvars[hostname]['ansible_host'] }} {{ hostname }}
{% endfor %}
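
The Jinja loop in the `blockinfile` task writes one `<address> <hostname>` line per inventory host, which gives every node a static name lookup for the rest of the cluster. A small Python sketch of the rendered block follows. `render_hosts_block` is a hypothetical helper and the hostnames and addresses are invented; the exact whitespace Ansible produces may differ slightly.

```python
def render_hosts_block(hostvars: dict) -> str:
    """Emit one '<ansible_host> <hostname>' line per host, in inventory order."""
    return "".join(f"{v['ansible_host']} {name}\n" for name, v in hostvars.items())

hostvars = {  # made-up inventory for illustration
    "ci123-login-0": {"ansible_host": "10.0.0.11"},
    "ci123-control": {"ansible_host": "10.0.0.12"},
    "ci123-compute-0": {"ansible_host": "10.0.0.21"},
}
print(render_hosts_block(hostvars), end="")
```

Because the block is generated from `groups['all']`, adding a node to the inventory and rerunning the hook updates `/etc/hosts` on every host.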
1 change: 1 addition & 0 deletions environments/arcus/inventory/everything
6 changes: 6 additions & 0 deletions environments/arcus/inventory/extra_groups
@@ -0,0 +1,6 @@
[basic_users:children]
cluster

[rebuild:children]
control
compute
Empty file.