Adaptive scaling with dask-jobqueue goes into an endless loop when a job launches several worker processes (was: Different configs result in worker death) #498

@AlecThomson

Description

What happened:
(Reposting from Stack Overflow)

I'm using Dask-Jobqueue on a Slurm supercomputer (I'll note that this is also a Cray machine). My workload mixes threaded (i.e. NumPy) and pure-Python tasks, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                    processes=1,
                    memory="60GB",
                    walltime='12:00:00',
                    ...
                    )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)

which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of the work is file reading/writing). Switching to processes only, i.e.

cluster = SLURMCluster(cores=20,
                    processes=20,
                    memory="60GB",
                    walltime='12:00:00',
                    ...
                    )

results in Slurm jobs that are killed immediately after they launch, with the only output being something like:

slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***

Choosing a balanced configuration (i.e. the default)

cluster = SLURMCluster(cores=20,
                    memory="60GB",
                    walltime='12:00:00',
                    ...
                    )

results in strange intermediate behaviour. The work runs nearly to completion (e.g. 900/1000 tasks), then a number of the workers are killed and progress drops back down to, say, 400/1000 tasks.

Further, I've found that using cluster.scale, rather than cluster.adapt, results in a successful run of the work (a sketch of this workaround is below). Perhaps the issue here is how adapt is trying to scale the number of jobs?
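
For reference, a minimal sketch of the scale-based workaround I mean, using the same settings as the first block above (the job count of 20 is just an illustrative example):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                    processes=1,
                    memory="60GB",
                    walltime='12:00:00',
                    )
# Requesting a fixed number of jobs up front with scale()
# runs to completion for me, whereas adapt() does not.
cluster.scale(jobs=20)
client = Client(cluster)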

What you expected to happen:
I would expect that changing the balance of processes and threads should not change the lifetime of a worker.

Anything else we need to know?:
Possibly related to #20 and #363

As an aside, the current way of configuring processes / threads is confusing, and seems to conflict with how e.g. a LocalCluster is specified (see the sketch below). Is there any progress on #231?
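
To illustrate the mismatch I mean (the numbers here are just examples, not my real deployment): LocalCluster takes the worker and thread counts directly, whereas SLURMCluster takes the total cores per job plus the number of processes they are split across.

from dask.distributed import LocalCluster
from dask_jobqueue import SLURMCluster

# LocalCluster: worker processes and threads per worker are given directly.
local = LocalCluster(n_workers=4, threads_per_worker=5)

# SLURMCluster: `cores` is the total thread count per job, split across
# `processes` worker processes (here 4 workers with 5 threads each).
slurm = SLURMCluster(cores=20, processes=4, memory="60GB")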

Environment:

  • Dask version: 2021.4.1
  • Python version: 3.8.8
  • Operating System: SUSE Linux Enterprise Server 12 SP3
  • Install method (conda, pip, source): conda

Labels: bug (Something isn't working), usage question (Question about using jobqueue)
