
flaky test_nanny_worker_port_range #6045

@crusaderky

Description

I recently overhauled test_nanny_worker_port_range in #5956, as it was hanging on Windows because the stdout/stderr pipes were too small for the output; I changed the test to have a thread read from them asynchronously.
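
For reference, a minimal sketch of that technique (not the actual test code): a background thread drains the child's combined stdout/stderr so the child can never block on a full pipe buffer. The child command here is just a stand-in.

```python
import subprocess
import sys
import threading


def drain(pipe, sink):
    # Read until the pipe closes, so the OS pipe buffer never fills up and
    # the child never blocks on a write to stdout/stderr.
    for line in iter(pipe.readline, b""):
        sink.append(line.decode())


# Stand-in child process; the real test spawns dask-worker instead.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello from the child')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
lines = []
reader = threading.Thread(target=drain, args=(proc.stdout, lines), daemon=True)
reader.start()
proc.wait()
reader.join()
print("".join(lines), end="")
```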

The change made the test stable on Windows, but it became fairly flaky on Ubuntu 3.9, and on Ubuntu 3.9 only. It seems stable on Ubuntu 3.8 (3.10 is too young to tell).

[Screenshot from 2022-04-01 10-26-47: graph of the test's recent CI failures]

Although the graph suggests the contrary, I don't think my change is causing the flakiness.
Rather, I'm convinced that the test is highlighting a genuine design flaw.

The workers connect successfully to the scheduler, successfully receive the Client.run request, and then hang. I have no way to figure out whether they all hang or only one does (the cluster dump tool could tell me, but by the time I reach it the nanny subprocess has already been killed).

The nanny stdout/stderr is captured correctly and is shown at the end of this post. I can spot a number of worker restarts that should not be there. In the last lines, I can see FOUR notices for Client.run, whereas I was expecting three. In the cluster dump, Scheduler.workers shows 3 workers.

What I think is happening is:

  1. The three Nanny objects all pass the raw string port="10000:11000" to Worker.__init__ (the type annotation in worker.py, which states port: int | None = None, is wrong).
  2. The three worker processes individually try to bind within the same port range. They try ports sequentially from left to right (I can see from Scheduler.workers that they use ports 10000, 10001, and 10002) and, unavoidably, end up fighting for the same ports. Unlike binding on port=0, which is thoroughly supported at OS level, this triggers a race condition (see the sketch right after this list).

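As an illustration of point 2, here is a minimal sketch, using plain sockets rather than distributed's actual listener code, of the difference between scanning an explicit port range and letting the OS pick with port=0:

```python
import socket


def bind_first_free(host, low, high):
    # Scan the range left to right; every process doing this starts from the
    # same port and relies on catching EADDRINUSE to move on, so concurrent
    # processes keep colliding on the low end of the range.
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock
        except OSError:
            sock.close()
    raise RuntimeError(f"no free port in {low}:{high}")


# By contrast, port=0 lets the kernel hand out a free port atomically.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
print(sock.getsockname()[1])
sock.close()
```
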
I don't know why Ubuntu 3.8 is unaffected. Both the 3.8 and 3.9 environments install tornado 6.1.

Proposed design

dask-worker should parse the port range and split it into equal, non-overlapping parts, one per worker. It should also detect and prevent overlaps between the nanny port range and the worker port range.
In other words,

dask-worker --nworkers 3 --worker-port 10000:11000 --nanny-port 11000:12000

must spawn

Nanny(port="11000:11333", worker_port="10000:10333")
Nanny(port="11334:11666", worker_port="10334:10666")
Nanny(port="11667:12000", worker_port="10667:10999")

Note that the last worker_port range ends at 10999, because 11000 is dedicated to the nannies.
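
A minimal sketch of the splitting logic (split_port_range and plan_ports are hypothetical helpers, not an existing dask-worker API; the exact boundaries come out one off from the example above):

```python
def split_port_range(low, high, n):
    # Split the inclusive range low..high into n contiguous sub-ranges.
    span = high - low + 1
    bounds = [low + span * i // n for i in range(n)] + [high + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(n)]


def plan_ports(worker_spec, nanny_spec, nworkers):
    w_low, w_high = map(int, worker_spec.split(":"))
    n_low, n_high = map(int, nanny_spec.split(":"))
    # Prevent overlaps: worker ports at or above the start of the nanny
    # range are dropped.
    if w_low <= n_low <= w_high:
        w_high = n_low - 1
    return list(zip(split_port_range(n_low, n_high, nworkers),
                    split_port_range(w_low, w_high, nworkers)))


for (np_lo, np_hi), (wp_lo, wp_hi) in plan_ports("10000:11000", "11000:12000", 3):
    print(f'Nanny(port="{np_lo}:{np_hi}", worker_port="{wp_lo}:{wp_hi}")')
```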

nanny/worker stdout/stderr

2022-03-28 18:06:11,115 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11000'
2022-03-28 18:06:11,125 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11001'
2022-03-28 18:06:11,135 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11002'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:13,398 - distributed.nanny - INFO - Worker process 3229 exited with status 1
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:16,551 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43139
2022-03-28 18:06:16,551 - distributed.worker - INFO - -------------------------------------------------
2022-03-28 18:06:16,554 - distributed.core - INFO - Starting established connection
2022-03-28 18:06:16,582 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,584 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,587 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,597 - distributed.worker - INFO - Run out-of-band function 'lambda'
