Description
I just reviewed our code around closing workers and noticed many, many timeouts that I doubt are actually useful. I would even go as far as to call them code smells or anti-patterns, and I believe that if they ever trigger we're in a potentially inconsistent state.
BaseWorker timeout
distributed/distributed/worker.py, Line 1529 in 5635bc0
BaseWorker is probably not the ideal class name for this. BaseWorker is a class that handles the async instructions emitted from the WorkerState, for instance signals like Execute or GatherDep. It implements a timeout for the specific case (which is also tested) where instruction handlers catch CancelledErrors and might block indefinitely. Two reasons why I believe this is not necessary:
- We control all instruction handlers. If we did something like that, we should fix it rather than raise a timeout error. This is clearly a bug.
- Even if we did this, shouldn't we cancel a task repeatedly until it is in fact closed?
BatchedSend close timeout
distributed/distributed/worker.py, Line 1595 in 5635bc0
which goes to
distributed/distributed/batched.py, Line 172 in 5635bc0
This timeout controls the BatchedSend background task. If the background task doesn't finish in the provided time, this will raise. However, for the background task to actually not finish, we'd need a comm.write to block for at least timeout seconds without raising an exception after the remote aborted the comm. I'm not sure this is realistic, at least not with TCP.
Threadpool timeout
distributed/distributed/worker.py, Line 1604 in 5635bc0
Last but not least, there is a threadpool timeout which is passed into Thread.join in our threadpool, see
distributed/distributed/threadpoolexecutor.py, Line 107 in 5635bc0
This timeout does not exist in the stdlib ThreadPoolExecutor. Joining a thread with a timeout does not actually raise; it just blocks for at most that many seconds.
This timeout more or less makes sense because we do not want to block the shutdown of a worker if it is still running user tasks. However, if this timeout is hit, we'd basically just leak the thread, and we rely on the threads being daemon threads (which is not the case for the stdlib pool) so that the Python process terminates and takes the threads with it.
A couple of questions here:
- Why do we wait at all? If the worker is closing there is no point in waiting for the user tasks to finish, is there?
- If we want to wait, why wait only for a bit? Is this one of those situations where we decide to wait a bit because in most cases we'll end up with a clean state?
My gut tells me right now that we can throw away all of these timeouts and the executor_wait kwarg and reduce complexity significantly.