Description
What happened:
(Reposting from SO)
I'm using Dask Jobqueue on a Slurm supercomputer (I'll note that this is also a Cray machine). My workload includes a mix of threaded (i.e. numpy) and pure-Python tasks, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                       processes=1,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)
which is entirely threaded. The tasks also seem to take longer than I would naively expect (much of the runtime is spent reading and writing files). Switching to purely processes, i.e.
cluster = SLURMCluster(cores=20,
                       processes=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
results in Slurm jobs that are killed by Slurm as soon as they are launched, with the only output being:
slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***
Choosing a balanced configuration (i.e. default)
cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
results in a strange intermediate behaviour. The work runs nearly to completion (e.g. 900/1000 tasks), then a number of the workers are killed and the progress drops back down to, say, 400/1000 tasks.
Further, I've found that using cluster.scale, rather than cluster.adapt, results in a successful run of the work. Perhaps the issue here is how adapt is trying to scale the number of jobs?
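For comparison, the fixed-size variant might look roughly like the sketch below. It is an illustration rather than the exact code that was run: the function name `start_fixed_size_cluster` is hypothetical, the site-specific options elided in the configurations above are left out, and `scale(jobs=...)` counts Slurm jobs whereas `adapt(maximum=20)` counts workers, so the two are not exactly equivalent when a job holds more than one worker process.

```python
def start_fixed_size_cluster(n_jobs=20):
    """Start a SLURMCluster with a fixed number of jobs instead of adaptive scaling."""
    # Imported inside the function so the sketch can be loaded on a machine
    # without dask-jobqueue installed.
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        cores=20,
        memory="60GB",
        walltime="12:00:00",
        # Site-specific options from the original configuration are omitted here.
    )
    # Request a fixed number of Slurm jobs up front. This replaces
    # cluster.adapt(minimum=0, maximum=20), which lets the scheduler
    # add and remove workers as the load changes.
    cluster.scale(jobs=n_jobs)
    return cluster, Client(cluster)
```

Calling `start_fixed_size_cluster()` on the login node would submit the jobs immediately and keep them until the walltime expires or the cluster is closed.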
What you expected to happen:
I would expect that changing the balance of processes / threads shouldn't change the lifetime of a worker.
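To make the three layouts concrete, here is a minimal sketch of what each configuration asks for per worker process. It assumes that dask-jobqueue divides a job's `cores` and `memory` evenly among its `processes` (threads per worker = cores // processes); the helper `worker_layout` is hypothetical and not part of dask-jobqueue, and `processes=5` is only an illustrative balanced value, not necessarily what the default resolves to.

```python
def worker_layout(cores, processes, memory_gb):
    """Return (threads, memory in GB) per worker process for one job."""
    if processes < 1 or processes > cores:
        raise ValueError("processes must be between 1 and cores")
    # Assumption: cores and memory are split evenly across the worker processes.
    threads = cores // processes
    memory_per_process = memory_gb / processes
    return threads, memory_per_process


# Threaded only, an illustrative balanced split, and processes only.
for processes in (1, 5, 20):
    threads, mem = worker_layout(cores=20, processes=processes, memory_gb=60)
    print(f"processes={processes}: {threads} threads and {mem:.0f} GB per worker process")
```

Under that assumption, the all-process configuration leaves each worker with 1 thread and about 3 GB of memory, while the threaded configuration gives a single worker all 20 threads and 60 GB.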
Anything else we need to know?:
Possibly related to #20 and #363
As an aside, the current configuration of processes / threads is confusing, and seems to conflict with how e.g. a LocalCluster is specified. Is there any progress on #231?
Environment:
- Dask version: 2021.4.1
- Python version: 3.8.8
- Operating System: SUSE Linux Enterprise Server 12 SP3
- Install method (conda, pip, source): conda