
Fix job completion logfile existence #103

Merged
merged 5 commits on Apr 13, 2021
7 changes: 6 additions & 1 deletion .github/workflows/ci.yml
@@ -32,6 +32,8 @@ jobs:
          - test8
          - test9
          - test10
          - test11
          - test12

        exclude:
          - image: 'centos:7'
@@ -46,7 +48,10 @@
            scenario: test9
          - image: 'centos:7'
            scenario: test10
          - image: 'centos:7'
            scenario: test11
          - image: 'centos:7'
            scenario: test12

    steps:
      - name: Check out the codebase.
10 changes: 5 additions & 5 deletions README.md
@@ -65,7 +65,7 @@ package in the image.

#### Accounting

By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`<sup id="accounting_storage">[1](#slurm_ver_footnote)</sup>. Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd`. To enable this:
By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`<sup id="accounting_storage">[1](#slurm_ver_footnote)</sup>. Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd` (although job completion may be a limited alternative, see [below](#job-accounting)). To enable accounting:

* Configure a mariadb or mysql server as described in the slurm accounting [documentation](https://slurm.schedmd.com/accounting.html) on one of the nodes in your inventory and set `openhpc_enable.database` to `true` for this node.
* Set `openhpc_slurm_accounting_storage_type` to `accounting_storage/slurmdbd`.
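For example, a minimal sketch of `group_vars` for the database node might look like the following (the mariadb/mysql configuration itself is not shown, and other `openhpc_enable` flags depend on what else the node runs):

    openhpc_enable:
      database: true
      runtime: true
    openhpc_slurm_accounting_storage_type: accounting_storage/slurmdbd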
@@ -86,16 +86,16 @@ For more advanced customisation or to configure another storage type, you might
#### Job accounting

This is largely redundant if you are using the accounting plugin above, but will give you basic
accounting data such as start and end times.
accounting data such as start and end times. By default no job accounting is configured.

`openhpc_slurm_job_comp_type`: Logging mechanism for job accounting. Can be one of
`jobcomp/filetxt`, `jobcomp/none`, `jobcomp/elasticsearch`.

`openhpc_slurm_job_acct_gather_type`: Mechanism for collecting job accounting data. Can be one
of `jobacct_gather/linux`, `jobacct_gather/cgroup` or `jobacct_gather/none`.

`openhpc_slurm_job_acct_gather_frequency`: Sampling period for job accounting (seconds)

`openhpc_slurm_job_comp_loc`: Location to store the job accounting records. Depends on value of
`openhpc_slurm_job_comp_type`, e.g. for `jobcomp/filetxt` it is a path on disk.
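
For example, a minimal file-based job accounting setup might use the following (the log path is illustrative rather than a role default):

    openhpc_slurm_job_comp_type: jobcomp/filetxt
    openhpc_slurm_job_comp_loc: /var/log/slurm_jobcomp.log
    openhpc_slurm_job_acct_gather_type: jobacct_gather/linux
    openhpc_slurm_job_acct_gather_frequency: 30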

1 change: 1 addition & 0 deletions molecule/README.md
@@ -19,6 +19,7 @@ test8 | 1 | N | 2x compute node, 2x login-only
test9 | 1 | N | As test8 but uses `--limit=testohpc-control,testohpc-compute-0` and checks login nodes still end up in slurm.conf
test10 | 1 | N | As for #5 but then tries to add an additional node
test11 | 1 | N | As for #5 but then deletes a node (actually changes the partition due to molecule/ansible limitations)
test12 | 1 | N | As for #5 but enabling job completion and testing `sacct -c`

# Local Installation & Running

18 changes: 18 additions & 0 deletions molecule/test12/converge.yml
@@ -0,0 +1,18 @@
---
- name: Converge
  hosts: all
  tasks:
    - name: "Include ansible-role-openhpc"
      include_role:
        name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}"
      vars:
        openhpc_enable:
          control: "{{ inventory_hostname in groups['testohpc_login'] }}"
          batch: "{{ inventory_hostname in groups['testohpc_compute'] }}"
          runtime: true
        openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}"
        openhpc_slurm_partitions:
          - name: "compute"
        openhpc_cluster_name: testohpc
        openhpc_slurm_configless: true
        openhpc_slurm_job_comp_type: jobcomp/filetxt
48 changes: 48 additions & 0 deletions molecule/test12/molecule.yml
@@ -0,0 +1,48 @@
---
name: single partition, group is partition
driver:
  name: docker
platforms:
  - name: testohpc-login-0
    image: ${MOLECULE_IMAGE}
    pre_build_image: true
    groups:
      - testohpc_login
    command: /sbin/init
    tmpfs:
      - /run
      - /tmp
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    networks:
      - name: net1
  - name: testohpc-compute-0
    image: ${MOLECULE_IMAGE}
    pre_build_image: true
    groups:
      - testohpc_compute
    command: /sbin/init
    tmpfs:
      - /run
      - /tmp
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    networks:
      - name: net1
  - name: testohpc-compute-1
    image: ${MOLECULE_IMAGE}
    pre_build_image: true
    groups:
      - testohpc_compute
    command: /sbin/init
    tmpfs:
      - /run
      - /tmp
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    networks:
      - name: net1
provisioner:
  name: ansible
verifier:
  name: ansible
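
As a usage sketch, this scenario should be runnable locally with something like the following, assuming molecule and its docker driver are installed (`centos:8` is an assumption, based on the CI matrix excluding `centos:7` for test12):

    MOLECULE_IMAGE=centos:8 molecule test -s test12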
29 changes: 29 additions & 0 deletions molecule/test12/verify.yml
@@ -0,0 +1,29 @@
---

- name: Check slurm hostlist
  hosts: testohpc_login
  tasks:
    - name: Get slurm partition info
      command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace
      register: sinfo
      changed_when: false
    - name: Assert slurm running ok
      assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
        that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']"
        fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}"
    - name: Run a slurm job
      command:
        cmd: "sbatch -N2 --wrap 'srun hostname'"
      register: sbatch
    - name: Set fact for slurm jobid
      set_fact:
        jobid: "{{ sbatch.stdout.split()[-1] }}"
    - name: Get job completion info
      command:
        cmd: "sacct --completion --noheader --parsable2"
      changed_when: false
      register: sacct
    - assert:
        that: "(jobid + '|0|wrap|compute|2|testohpc-compute-[0-1]|COMPLETED') in sacct.stdout"
        fail_msg: "Didn't find expected output for {{ jobid }} in sacct output: {{ sacct.stdout }}"

10 changes: 10 additions & 0 deletions tasks/runtime.yml
@@ -43,6 +43,16 @@
  notify:
    - Restart Munge service

- name: Ensure JobComp logfile exists
  file:
    path: "{{ openhpc_slurm_job_comp_loc }}"
    state: touch
    owner: slurm
    group: slurm
    access_time: preserve
    modification_time: preserve
  when: openhpc_slurm_job_comp_type == 'jobcomp/filetxt'
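  # NB (sketch of intent): with jobcomp/filetxt the rendered slurm.conf is
  # expected to contain JobCompType=jobcomp/filetxt and
  # JobCompLoc={{ openhpc_slurm_job_comp_loc }}, so the logfile must exist and
  # be writable by the slurm user before slurmctld starts. The exact template
  # output is an assumption; JobCompType/JobCompLoc are the standard slurm.conf
  # parameters involved.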

- name: Template slurmdbd.conf
  template:
    src: slurmdbd.conf.j2