
flaky test_nanny_worker_port_range #6045

@crusaderky

Description

I recently overhauled test_nanny_worker_port_range in #5956, as it was hanging on Windows because the stdout/stderr pipes were too small for the output; I changed the test to have a thread read from them asynchronously.
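
For reference, a minimal sketch of that technique (not the actual test code): a background thread drains the child's combined stdout/stderr so the child can never block on a full pipe buffer. The child command here is just a stand-in.

```python
import subprocess
import sys
import threading


def drain(pipe, sink):
    # Read until the pipe closes, so the OS pipe buffer never fills up and
    # the child never blocks on a write to stdout/stderr.
    for line in iter(pipe.readline, b""):
        sink.append(line.decode())


# Stand-in child process; the real test spawns dask-worker instead.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello from the child')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
lines = []
reader = threading.Thread(target=drain, args=(proc.stdout, lines), daemon=True)
reader.start()
proc.wait()
reader.join()
print("".join(lines), end="")
```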

The change made the test stable on Windows, but it became fairly flaky on Ubuntu 3.9, and on Ubuntu 3.9 only. It seems stable on Ubuntu 3.8 (3.10 is too young to tell).

[Screenshot from 2022-04-01 10-26-47: graph of the test's recent CI failures]

Although the graph suggests the contrary, I don't think my change is causing the flakiness.
Rather, I'm convinced that the test is highlighting a genuine design flaw.

The workers connect successfully to the scheduler, successfully receive the Client.run request, and then hang. I have no way to figure out whether they all hang or only one does (the cluster dump tool could tell me, but by the time I reach it the nanny subprocess has already been killed).

The nanny stdout/stderr is captured correctly and is shown at the end of this post. I can spot a number of worker restarts that should not be there. In the last lines, I can see FOUR notices for Client.run, whereas I was expecting three. In the cluster dump, Scheduler.workers shows 3 workers.

What I think is happening is:

  1. The three Nanny objects all pass the raw string port="10000:11000" to Worker.__init__ (the type annotation in worker.py, which states port: int | None = None, is wrong).
  2. The three worker processes individually try to bind within the same port range. They try ports sequentially from left to right (I can see from Scheduler.workers that they use ports 10000, 10001, and 10002) and, unavoidably, end up fighting for the same ports. Unlike binding on port=0, which is thoroughly supported at OS level, this triggers a race condition (see the sketch right after this list).

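As an illustration of point 2, here is a minimal sketch, using plain sockets rather than distributed's actual listener code, of the difference between scanning an explicit port range and letting the OS pick with port=0:

```python
import socket


def bind_first_free(host, low, high):
    # Scan the range left to right; every process doing this starts from the
    # same port and relies on catching EADDRINUSE to move on, so concurrent
    # processes keep colliding on the low end of the range.
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock
        except OSError:
            sock.close()
    raise RuntimeError(f"no free port in {low}:{high}")


# By contrast, port=0 lets the kernel hand out a free port atomically.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
print(sock.getsockname()[1])
sock.close()
```
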
I don't know why Ubuntu 3.8 is unaffected. Both the 3.8 and 3.9 environments install tornado 6.1.

Proposed design

dask-worker should parse the port range and split it into equal, non-overlapping parts, one per worker. It should also detect and prevent overlaps between the nanny port range and the worker port range.
In other words,

dask-worker --nworkers 3 --worker-port 10000:11000 --nanny-port 11000:12000

must spawn

Nanny(port="11000:11333", worker_port="10000:10333")
Nanny(port="11334:11666", worker_port="10334:10666")
Nanny(port="11667:12000", worker_port="10667:10999")

Note that the last worker_port range ends at 10999, because 11000 is dedicated to the nannies.
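
A minimal sketch of the splitting logic (split_port_range and plan_ports are hypothetical helpers, not an existing dask-worker API; the exact boundaries come out one off from the example above):

```python
def split_port_range(low, high, n):
    # Split the inclusive range low..high into n contiguous sub-ranges.
    span = high - low + 1
    bounds = [low + span * i // n for i in range(n)] + [high + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(n)]


def plan_ports(worker_spec, nanny_spec, nworkers):
    w_low, w_high = map(int, worker_spec.split(":"))
    n_low, n_high = map(int, nanny_spec.split(":"))
    # Prevent overlaps: worker ports at or above the start of the nanny
    # range are dropped.
    if w_low <= n_low <= w_high:
        w_high = n_low - 1
    return list(zip(split_port_range(n_low, n_high, nworkers),
                    split_port_range(w_low, w_high, nworkers)))


for (np_lo, np_hi), (wp_lo, wp_hi) in plan_ports("10000:11000", "11000:12000", 3):
    print(f'Nanny(port="{np_lo}:{np_hi}", worker_port="{wp_lo}:{wp_hi}")')
```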

nanny/worker stdout/stderr

2022-03-28 18:06:11,115 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11000'
2022-03-28 18:06:11,125 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11001'
2022-03-28 18:06:11,135 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11002'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:13,398 - distributed.nanny - INFO - Worker process 3229 exited with status 1
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:16,551 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43139
2022-03-28 18:06:16,551 - distributed.worker - INFO - -------------------------------------------------
2022-03-28 18:06:16,554 - distributed.core - INFO - Starting established connection
2022-03-28 18:06:16,582 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,584 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,587 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,597 - distributed.worker - INFO - Run out-of-band function 'lambda'
