README.md: 67 additions & 2 deletions
@@ -59,9 +59,11 @@ unique set of homogenous nodes:
  `free --mebi` total * `openhpc_ram_multiplier`.
* `ram_multiplier`: Optional. An override for the top-level definition
  `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
* `gres_autodetect`: Optional. The [auto-detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define `gres` (see below), but you only need to define the `conf` key. See the [GRES autodetection](#gres-autodetection) section below.
* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
  - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1), but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

  Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
* `node_params`: Optional. Mapping of additional parameters and values for
@@ -277,7 +279,20 @@ openhpc_nodegroups:
      - conf: gpu:A100:2
        file: /dev/nvidia[0-1]
```

or, if using the NVML `gres_autodetect` mechanism (NOTE: this requires recompiling the Slurm binaries to link against the [NVIDIA Management Library](#gres-autodetection)):

```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: gpu:A100:2
```

Now two partitions can be configured - a default one with a short timelimit and
no large memory nodes for testing jobs, and another with all hardware and longer
job runtime for "production" jobs:
@@ -309,4 +324,54 @@ openhpc_config:
    - gpu
```

## GRES autodetection

Some autodetection mechanisms require recompilation of the Slurm packages to
link against external libraries. Examples are shown in the sections below.

### Recompiling Slurm binaries against the [NVIDIA Management Library](https://developer.nvidia.com/management-library-nvml)

This will allow you to use `gres_autodetect: nvml` in your `nodegroup`
definitions.

First, [install the complete CUDA toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
You can then recompile the Slurm packages from the source RPMs as follows:
```sh
# Download the source RPM for the OpenHPC Slurm packages
dnf download --source slurm-slurmd-ohpc

# Unpack the source RPM into /root/rpmbuild
rpm -i slurm-ohpc-*.src.rpm

cd /root/rpmbuild/SPECS

# Install the build dependencies declared in the spec file
dnf builddep slurm.spec

# Rebuild the binary RPMs with NVML support, logging the build output
rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
```

NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).
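If in doubt, listing the installed CUDA target directories (a sketch assuming the default installation layout used in the command above) shows which version to substitute:

```sh
# Show which CUDA versions are installed and their target directories
ls -d /usr/local/cuda-*/targets/x86_64-linux/
```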

The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
each compute node is out of scope of this document. You can either use a custom package repository
or simply install them manually on each node with Ansible.
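For illustration only (this is not part of the role), a minimal Ansible sketch of the manual approach could look like the following. It assumes the rebuilt RPMs are available on the Ansible control host under `/root/rpmbuild/RPMS/x86_64/` and that the compute nodes are in a `compute` host group; adjust paths and group names to your environment.

```yaml
# Hypothetical playbook sketch: copy locally rebuilt Slurm RPMs to the compute
# nodes and install them. The host group, staging path and RPM glob are
# examples only and depend on your environment.
- hosts: compute
  become: true
  tasks:
    - name: Create a staging directory for the rebuilt RPMs
      ansible.builtin.file:
        path: /tmp/slurm-rpms
        state: directory
        mode: "0755"

    - name: Copy rebuilt Slurm RPMs from the control host
      ansible.builtin.copy:
        src: "{{ item }}"
        dest: /tmp/slurm-rpms/
      with_fileglob:
        - /root/rpmbuild/RPMS/x86_64/slurm-*.rpm

    - name: Find the copied RPMs on the node
      ansible.builtin.find:
        paths: /tmp/slurm-rpms
        patterns: "*.rpm"
      register: slurm_rpms

    - name: Install the rebuilt RPMs
      ansible.builtin.dnf:
        name: "{{ slurm_rpms.files | map(attribute='path') | list }}"
        state: present
        disable_gpg_check: true
```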

#### Configuration example

A configuration snippet is shown below:
```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: gpu:A100:2
```

For additional context, refer to the GPU example in [Multiple Nodegroups](#multiple-nodegroups).

<b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)
The `gres.conf` template line included in this change enforces that `file` is provided whenever `gres_autodetect` is not enabled:

NodeName={{ hostlist_string }} Name={{ gres_name }} Type={{ gres_type }} File={{ gres.file | mandatory('The gres configuration dictionary: ' ~ gres ~ ' is missing the file key, but gres_autodetect is set to off. The error occurred on node group: ' ~ nodegroup.name ~ '. Please add the file key or set gres_autodetect.') }}
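As an illustration (the hostlist is hypothetical), rendering that line for the earlier `gpu:A100:2` example with `gres_autodetect` unset would give a `gres.conf` entry along the lines of:

```
NodeName=cluster-gpu-[0-1] Name=gpu Type=A100 File=/dev/nvidia[0-1]
```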