Adaptive scaling with dask-jobqueue goes into an endless loop when a job launches several worker processes (was: Different configs result in worker death) #498

@AlecThomson

Description

What happened:
(Reposting from Stack Overflow)

I'm using Dask-Jobqueue on a Slurm supercomputer (I'll note that this is also a Cray machine). My workload mixes threaded (i.e. NumPy) and pure-Python tasks, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                    processes=1,
                    memory="60GB",
                    walltime='12:00:00',
                    ...
                    )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)

which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of the work is file reading/writing). Switching to processes only, i.e.

cluster = SLURMCluster(cores=20,
                    processes=20,
                    memory="60GB",
                    walltime='12:00:00',
                    ...
                    )

results in Slurm jobs that are killed immediately after they launch, with the only output being something like:

slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***

Choosing a balanced configuration (i.e. the default)

cluster = SLURMCluster(cores=20,
                    memory="60GB",
                    walltime='12:00:00',
                    ...
                    )

results in strange intermediate behaviour. The work runs nearly to completion (e.g. 900/1000 tasks), then a number of the workers are killed and progress drops back down to, say, 400/1000 tasks.

Further, I've found that using cluster.scale, rather than cluster.adapt, results in a successful run of the work (a sketch of this workaround is below). Perhaps the issue here is how adapt is trying to scale the number of jobs?
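
For reference, a minimal sketch of the scale-based workaround I mean, using the same settings as the first block above (the job count of 20 is just an illustrative example):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                    processes=1,
                    memory="60GB",
                    walltime='12:00:00',
                    )
# Requesting a fixed number of jobs up front with scale()
# runs to completion for me, whereas adapt() does not.
cluster.scale(jobs=20)
client = Client(cluster)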

What you expected to happen:
I would expect that changing the balance of processes and threads should not change the lifetime of a worker.

Anything else we need to know?:
Possibly related to #20 and #363

As an aside, the current way of configuring processes / threads is confusing, and seems to conflict with how e.g. a LocalCluster is specified (see the sketch below). Is there any progress on #231?
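
To illustrate the mismatch I mean (the numbers here are just examples, not my real deployment): LocalCluster takes the worker and thread counts directly, whereas SLURMCluster takes the total cores per job plus the number of processes they are split across.

from dask.distributed import LocalCluster
from dask_jobqueue import SLURMCluster

# LocalCluster: worker processes and threads per worker are given directly.
local = LocalCluster(n_workers=4, threads_per_worker=5)

# SLURMCluster: `cores` is the total thread count per job, split across
# `processes` worker processes (here 4 workers with 5 threads each).
slurm = SLURMCluster(cores=20, processes=4, memory="60GB")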

Environment:

  • Dask version: 2021.4.1
  • Python version: 3.8.8
  • Operating System: SUSE Linux Enterprise Server 12 SP3
  • Install method (conda, pip, source): conda

Labels: bug (Something isn't working), usage question (Question about using jobqueue)
