Commit de7a904

sjpb and MoteHue authored
Add support for upgrading database (#186)
* add support for upgrading database
* fix db upgrade check for initial deploy
* add mysql tool to support checking db status
* make slurm db service consistent
* make backup optional
* fix ansible-lint whinging
* fix upgrade code for mariadb in CI
* fix mariadb connections
* fix upgrade logic
* Fix readme typo

  Co-authored-by: Matt Crees <matthew.crees1@gmail.com>
* fix upgrade logic
* explain upgrade logic

---------

Co-authored-by: Matt Crees <matthew.crees1@gmail.com>
1 parent ed717da commit de7a904

7 files changed: +139 -13 lines changed

7 files changed

+139
-13
lines changed

README.md

Lines changed: 29 additions & 3 deletions
@@ -121,10 +121,12 @@ accounting data such as start and end times. By default no job accounting is con
 `openhpc_slurm_job_comp_loc`: Location to store the job accounting records. Depends on value of
 `openhpc_slurm_job_comp_type`, e.g. for `jobcomp/filetxt` represents a path on disk.
 
-### slurmdbd.conf
+### slurmdbd
 
-The following options affect `slurmdbd.conf`. Please see the slurm [documentation](https://slurm.schedmd.com/slurmdbd.conf.html) for more details.
-You will need to configure these variables if you have set `openhpc_enable.database` to `true`.
+When the slurm database daemon (`slurmdbd`) is enabled by setting
+`openhpc_enable.database` to `true` the following options must be configured.
+See documentation for [slurmdbd.conf](https://slurm.schedmd.com/slurmdbd.conf.html)
+for more details.
 
 `openhpc_slurmdbd_port`: Port for slurmdb to listen on, defaults to `6819`.
 
@@ -136,6 +138,30 @@ You will need to configure these variables if you have set `openhpc_enable.datab
 
 `openhpc_slurmdbd_mysql_username`: Username for authenticating with the database, defaults to `slurm`.
 
+Before starting `slurmdbd`, the role will check if a database upgrade is
+required due to a Slurm major version upgrade and carry it out if so.
+Slurm versions before 24.11 do not support this check and so no upgrade will
+occur. The following variables control behaviour during this upgrade:
+
+`openhpc_slurm_accounting_storage_client_package`: Optional. String giving the
+name of the database client package to install, e.g. `mariadb`. Default `mysql`.
+
+`openhpc_slurm_accounting_storage_backup_cmd`: Optional. String (possibly
+multi-line) giving a command for `ansible.builtin.shell` to run a backup of the
+Slurm database before performing the database upgrade. Default is the empty
+string which performs no backup.
+
+`openhpc_slurm_accounting_storage_backup_host`: Optional. Inventory hostname
+defining host to run the backup command. Default is `openhpc_slurm_accounting_storage_host`.
+
+`openhpc_slurm_accounting_storage_backup_become`: Optional. Whether to run the
+backup command as root. Default `true`.
+
+`openhpc_slurm_accounting_storage_service`: Optional. Name of systemd service
+for the accounting storage database, e.g. `mysql`. If this is defined this
+service is stopped before the backup and restarted after, to allow for physical
+backups. Default is the empty string, which does not stop/restart any service.
+
 ## Facts
 
 This role creates local facts from the live Slurm configuration, which can be
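
As an illustration of the README text in this hunk, the variables for a physical backup might be set as in the sketch below. Only the variable names come from the commit. The `mariadb` unit name, the `/var/lib/mysql` data directory and the archive path are assumptions about a typical MariaDB install on the same host as `slurmdbd`, not values taken from the role.

```yaml
# Hypothetical inventory variables for a physical (file-level) backup.
# Assumes MariaDB runs on the slurmdbd host under a systemd unit named
# "mariadb" and keeps its data in /var/lib/mysql.
openhpc_slurm_accounting_storage_client_package: mariadb
# Defining the service makes the role stop it before the backup and start it
# again afterwards, so the data directory is not changing while it is copied.
openhpc_slurm_accounting_storage_service: mariadb
openhpc_slurm_accounting_storage_backup_cmd: |
  tar -czf /root/slurm-db-backup.tar.gz -C /var/lib mysql
# Default is already true; shown here because reading the data directory
# normally needs root.
openhpc_slurm_accounting_storage_backup_become: true
```

With these values the backup only runs when the role decides that an upgrade is required, since both the service stop and the backup task are conditional on that decision.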

defaults/main.yml

Lines changed: 7 additions & 0 deletions
@@ -101,3 +101,10 @@ openhpc_module_system_install: true
 
 # Auto detection
 openhpc_ram_multiplier: 0.95
+
+# Database upgrade
+openhpc_slurm_accounting_storage_service: ''
+openhpc_slurm_accounting_storage_backup_cmd: ''
+openhpc_slurm_accounting_storage_backup_host: "{{ openhpc_slurm_accounting_storage_host }}"
+openhpc_slurm_accounting_storage_backup_become: true
+openhpc_slurm_accounting_storage_client_package: mysql
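
With these defaults no service is stopped and no backup is taken. A logical backup on a separate database host could be sketched by overriding three of them, as below. The hostname `db-server`, the database name `slurm_acct_db`, the dump path and the assumption that root can run `mysqldump` without a password are all illustrative, not taken from the commit.

```yaml
# Hypothetical overrides for a logical backup taken on a separate DB host.
# The storage service variable is left at its default ('') because mysqldump
# needs the database server to be running.
openhpc_slurm_accounting_storage_backup_host: db-server  # assumed inventory hostname
openhpc_slurm_accounting_storage_backup_become: true
# Assumes the Slurm database is named slurm_acct_db and that root can
# authenticate to the database server without a password.
openhpc_slurm_accounting_storage_backup_cmd: |
  mysqldump --single-transaction slurm_acct_db > /var/backups/slurm_acct_db.sql
```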

handlers/main.yml

Lines changed: 0 additions & 6 deletions
@@ -1,10 +1,4 @@
 ---
-# NOTE: We need this running before slurmdbd
-- name: Restart Munge service
-  service:
-    name: "munge"
-    state: restarted
-  when: openhpc_slurm_service_started | bool
 
 # NOTE: we need this running before slurmctld start
 - name: Issue slurmdbd restart command

molecule/test4/converge.yml

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@
     openhpc_slurm_partitions:
       - name: "compute"
     openhpc_cluster_name: testohpc
+    openhpc_slurm_accounting_storage_client_package: mariadb
   tasks:
     - name: "Include ansible-role-openhpc"
       include_role:

tasks/install.yml

Lines changed: 2 additions & 2 deletions
@@ -49,8 +49,8 @@
     install_weak_deps: false # avoids getting recommended packages
   when: openhpc_slurm_pkglist | default(false, true)
 
-- name: Install packages from openhpc_packages variable
+- name: Install other packages
   yum:
-    name: "{{ openhpc_packages }}"
+    name: "{{ openhpc_packages + [openhpc_slurm_accounting_storage_client_package] }}"
 
 ...

tasks/runtime.yml

Lines changed: 19 additions & 2 deletions
@@ -56,8 +56,7 @@
     owner: munge
     group: munge
     mode: 0400
-  notify:
-    - Restart Munge service
+  register: _openhpc_munge_key_copy
 
 - name: Ensure JobComp logfile exists
   file:
@@ -159,6 +158,24 @@
   changed_when: false # so molecule doesn't fail
   become: no
 
+- name: Ensure Munge service is running
+  service:
+    name: munge
+    state: "{{ 'restarted' if _openhpc_munge_key_copy.changed else 'started' }}"
+  when: openhpc_slurm_service_started | bool
+
+- name: Check slurmdbd state
+  command: systemctl is-active slurmdbd # noqa: command-instead-of-module
+  changed_when: false
+  failed_when: false # rc = 0 when active
+  register: _openhpc_slurmdbd_state
+
+- name: Ensure slurm database is upgraded if slurmdbd inactive
+  import_tasks: upgrade.yml # need import for conditional support
+  when:
+    - "_openhpc_slurmdbd_state.stdout == 'inactive'"
+    - openhpc_enable.database | default(false)
+
 - name: Notify handler for slurmd restart
   debug:
     msg: "notifying handlers" # meta: noop doesn't support 'when'

tasks/upgrade.yml

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
+- name: Check if slurm database has been initialised
+  # DB is initialised on the first slurmdbd startup (without -u option).
+  # If it is not initialised, `slurmdbd -u` errors with something like
+  # > Slurm Database is somehow higher than expected '4294967294' but I only
+  # > know as high as '16'. Conversion needed.
+  community.mysql.mysql_query:
+    login_db: "{{ openhpc_slurmdbd_mysql_database }}"
+    login_user: "{{ openhpc_slurmdbd_mysql_username }}"
+    login_password: "{{ openhpc_slurmdbd_mysql_password }}"
+    login_host: "{{ openhpc_slurmdbd_host }}"
+    query: SHOW TABLES
+    config_file: ''
+  register: _openhpc_slurmdb_tables
+
+- name: Check if slurm database requires an upgrade
+  ansible.builtin.command: slurmdbd -u
+  register: _openhpc_slurmdbd_check
+  changed_when: false
+  failed_when: >-
+    _openhpc_slurmdbd_check.rc > 1 or
+    'Slurm Database is somehow higher than expected' in _openhpc_slurmdbd_check.stdout
+  # from https://github.com/SchedMD/slurm/blob/master/src/plugins/accounting_storage/mysql/as_mysql_convert.c
+  when: _openhpc_slurmdb_tables.query_result | flatten | length > 0 # i.e. when db is initialised
+
+- name: Set fact for slurm database upgrade
+  # Explanation of ifs below:
+  # - `slurmdbd -u` rc == 0 then no conversion required (from manpage)
+  # - default of 0 on rc skips upgrade steps if check was skipped because
+  # db is not initialised
+  # - Usage message (and rc == 1) if -u option doesn't exist, in which case
+  # it can't be a major upgrade due to existing openhpc versions
+  set_fact:
+    _openhpc_slurmdb_upgrade: >-
+      {{ false
+        if (
+          ( _openhpc_slurmdbd_check.rc | default(0) == 0)
+          or
+          ( 'Usage: slurmdbd' in _openhpc_slurmdbd_check.stderr )
+        ) else
+        true
+      }}
+
+- name: Ensure Slurm database service stopped
+  ansible.builtin.systemd:
+    name: "{{ openhpc_slurm_accounting_storage_service }}"
+    state: stopped
+  register: _openhpc_slurmdb_state
+  when:
+    - _openhpc_slurmdb_upgrade
+    - openhpc_slurm_accounting_storage_service != ''
+
+- name: Backup Slurm database
+  ansible.builtin.shell: # noqa: command-instead-of-shell
+    cmd: "{{ openhpc_slurm_accounting_storage_backup_cmd }}"
+  delegate_to: "{{ openhpc_slurm_accounting_storage_backup_host }}"
+  become: "{{ openhpc_slurm_accounting_storage_backup_become }}"
+  changed_when: true
+  run_once: true
+  when:
+    - _openhpc_slurmdb_upgrade
+    - openhpc_slurm_accounting_storage_backup_cmd != ''
+
+- name: Ensure Slurm database service started
+  ansible.builtin.systemd:
+    name: "{{ openhpc_slurm_accounting_storage_service }}"
+    state: started
+  when:
+    - openhpc_slurm_accounting_storage_service != ''
+    - _openhpc_slurmdb_state.changed | default(false)
+
+- name: Run slurmdbd in foreground for upgrade
+  ansible.builtin.expect:
+    command: /usr/sbin/slurmdbd -D -vvv
+    responses:
+      (?i)Everything rolled up:
+        # See https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrade-slurmdbd
+        # and
+        # https://github.com/SchedMD/slurm/blob/0ce058c5adcf63001ec2ad211c65e67b0e7682a8/src/plugins/accounting_storage/mysql/as_mysql_usage.c#L1042
+  become: true
+  become_user: slurm
+  when: _openhpc_slurmdb_upgrade
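
The decision made by the "Set fact for slurm database upgrade" task can be checked in isolation with a small standalone play such as the sketch below. It evaluates an equivalent expression (`not (...)` in place of `false if (...) else true`) against hand-written stand-ins for the registered `slurmdbd -u` result. The `mock_results` list, its labels and the exact usage string are hypothetical and were not captured from a real run.

```yaml
# Standalone sketch: not part of the role. Each mock result imitates what
# `register: _openhpc_slurmdbd_check` might hold in one of the documented cases.
- name: Illustrate the database upgrade decision
  hosts: localhost
  gather_facts: false
  vars:
    mock_results:  # hypothetical values
      - label: check skipped because database is not initialised
        result: {skipped: true}
        expected: false
      - label: no conversion needed
        result: {rc: 0, stdout: '', stderr: ''}
        expected: false
      - label: slurmdbd has no -u option and prints usage
        result: {rc: 1, stdout: '', stderr: 'Usage: slurmdbd [OPTIONS]'}
        expected: false
      - label: conversion needed
        result: {rc: 1, stdout: '', stderr: ''}
        expected: true
  tasks:
    - name: Check each mock result gives the expected decision
      ansible.builtin.assert:
        that:
          # `or` short-circuits, so stderr is never read in the skipped case
          - >-
            (not ((item.result.rc | default(0) == 0)
            or ('Usage: slurmdbd' in item.result.stderr)))
            == item.expected
      loop: "{{ mock_results }}"
      loop_control:
        label: "{{ item.label }}"
```

Only the last case sets the upgrade flag, which is what gates the service stop, the backup and the foreground `slurmdbd` run in the tasks above.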
