Description
I recently overhauled test_nanny_worker_port_range in #5956: it was hanging on Windows because the stdout/stderr pipes were too small for the output, so I changed the test to read from them asynchronously in a thread.
The change made it stable on Windows, but fairly flaky on Ubuntu 3.9, and Ubuntu 3.9 only. It seems stable on Ubuntu 3.8 (3.10 is too young to tell).
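Roughly, the fix looks like this (a minimal sketch, not the actual test code; the helper name is mine):

```python
import subprocess
import threading


def popen_with_drained_output(cmd: list[str]) -> tuple[subprocess.Popen, list[bytes]]:
    """Start a subprocess and drain its stdout/stderr from a background thread,
    so the OS pipe buffer never fills up and blocks the child."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    captured: list[bytes] = []

    def drain() -> None:
        assert proc.stdout is not None
        for line in proc.stdout:  # returns once the child closes its end of the pipe
            captured.append(line)

    threading.Thread(target=drain, daemon=True).start()
    return proc, captured
```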
Although the graph suggests otherwise, I don't think my change is causing the flakiness.
Rather, I'm convinced that the test is highlighting a genuine design flaw.
The workers connect successfully to the scheduler, successfully receive the Client.run request, and then hang. I have no way to tell whether they all hang or only one (the cluster dump tool could tell me, but by the time I reach it the nanny subprocess has already been killed).
The nanny stdout/stderr is captured properly and is shown at the end of this post. I can spot a bunch of worker restarts that should not be there. In the last lines, I can see FOUR notices for Client.run, whereas I was expecting three. In the cluster dump, scheduler.workers shows 3 workers.
What I think is happening is:
- The three Nanny objects are all passing the raw string port="10000:11000" to Worker.__init__ (the type annotation in worker.py, which states port: int | None = None, is wrong).
- The three worker processes are individually trying to bind on the same port range. They try ports sequentially from left to right (I can see from Scheduler.workers that they use ports 10000, 10001, and 10002) and unavoidably end up fighting for the same ports. Unlike binding on port=0, which is thoroughly supported at the OS level, this triggers a race condition (see the sketch after this list).
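A minimal sketch of the left-to-right scan the workers are effectively performing (illustrative only, not the actual distributed code):

```python
import errno
import socket


def bind_first_free(host: str, ports: range) -> socket.socket:
    """Scan a port range left to right and bind to the first free port.

    When several processes run this concurrently over the *same* range,
    they all start from the same end and collide on the same ports --
    unlike port=0, where the OS atomically hands each caller a unique free port.
    """
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise
    raise RuntimeError(f"no free port in {ports}")
```

Three processes scanning 10000:11000 from the same end is exactly the pattern that produces the 10000/10001/10002 assignment seen in Scheduler.workers.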
I don't know why Ubuntu 3.8 is unaffected. Both the 3.8 and 3.9 environments install tornado 6.1.
Proposed design
dask-worker should parse the port range and split it into equal parts. It should also spot and prevent overlaps between the nanny port range and the worker port range.
In other words,
dask-worker --nworkers 3 --worker-port 10000:11000 --nanny-port 11000:12000
must spawn
Nanny(port="11000:11333", worker_port="10000:10333")
Nanny(port="11334:11666", worker_port="10334:10666")
Nanny(port="11667:12000", worker_port="10667:10999")
Note the last worker_port ends at 10999, because 11000 is dedicated to the nannies.
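A minimal sketch of the splitting logic (the helpers trim_overlap and split_port_range are hypothetical, and the exact boundary arithmetic below differs by a port or two from the example above):

```python
def parse_range(spec: str) -> tuple[int, int]:
    """Parse an inclusive 'low:high' port range string."""
    low, high = map(int, spec.split(":"))
    return low, high


def trim_overlap(worker_spec: str, nanny_spec: str) -> str:
    """Drop worker ports that also fall inside the nanny range
    (handles the boundary overlap from the example above)."""
    wlo, whi = parse_range(worker_spec)
    nlo, nhi = parse_range(nanny_spec)
    if nlo <= whi <= nhi:
        whi = nlo - 1
    if nlo <= wlo <= nhi:
        wlo = nhi + 1
    if wlo > whi:
        raise ValueError("worker port range lies entirely inside the nanny port range")
    return f"{wlo}:{whi}"


def split_port_range(spec: str, n: int) -> list[str]:
    """Split an inclusive 'low:high' range into n roughly equal sub-ranges."""
    low, high = parse_range(spec)
    size = high - low + 1
    if size < n:
        raise ValueError(f"{spec} has fewer than {n} ports")
    step = size // n
    parts = []
    for i in range(n):
        start = low + i * step
        stop = high if i == n - 1 else start + step - 1
        parts.append(f"{start}:{stop}")
    return parts


nworkers = 3
worker_range = trim_overlap("10000:11000", "11000:12000")  # -> "10000:10999"
for nanny_ports, worker_ports in zip(
    split_port_range("11000:12000", nworkers),
    split_port_range(worker_range, nworkers),
):
    print(f"Nanny(port={nanny_ports!r}, worker_port={worker_ports!r})")
```

This only illustrates the arithmetic; in practice the parsing and validation would live in dask-worker's CLI handling.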
nanny/worker stdout/stderr
2022-03-28 18:06:11,115 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:11000'
2022-03-28 18:06:11,125 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:11001'
2022-03-28 18:06:11,135 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:11002'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:13,398 - distributed.nanny - INFO - Worker process 3229 exited with status 1
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:16,551 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43139
2022-03-28 18:06:16,551 - distributed.worker - INFO - -------------------------------------------------
2022-03-28 18:06:16,554 - distributed.core - INFO - Starting established connection
2022-03-28 18:06:16,582 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,584 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,587 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,597 - distributed.worker - INFO - Run out-of-band function 'lambda'