Description
I just reviewed our code around closing workers and noticed many, many timeouts that I doubt are actually useful. I would even go as far as to call them code smells or anti-patterns, and I believe that if they ever trigger we're in a potentially inconsistent state.
BaseWorker timeout
distributed/distributed/worker.py, Line 1529 in 5635bc0
BaseWorker is probably not the ideal class name for this. BaseWorker is a class that handles the async instructions emitted from the WorkerState, for instance signals like Execute or GatherDep. It implements a timeout for the specific case (which is also tested) where instruction handlers catch CancelledErrors and might block indefinitely. Two reasons why I believe this is not necessary:
- We control all instruction handlers. If we did something like that, we should fix it rather than raise a timeout error. This is clearly a bug.
- Even if we did this, shouldn't we cancel a task repeatedly until it is in fact closed?
BatchedSend close timeout
distributed/distributed/worker.py, Line 1595 in 5635bc0
which goes to
distributed/distributed/batched.py, Line 172 in 5635bc0
This timeout controls the BatchedSend background task. If the background task doesn't finish in the provided time, this will raise. However, for the background task to actually not finish, we'd need a comm.write to block for at least timeout seconds without raising an exception after the remote aborted the comm. I'm not sure this is realistic, at least not with TCP.
Threadpool timeout
distributed/distributed/worker.py, Line 1604 in 5635bc0
Last but not least, there is a threadpool timeout which is passed into Thread.join in our threadpool, see
distributed/distributed/threadpoolexecutor.py, Line 107 in 5635bc0
This timeout does not exist in the stdlib ThreadPoolExecutor. Joining a thread with a timeout does not actually raise; it just blocks for at most that many seconds.
This timeout more or less makes sense because we do not want to block the shutdown of a worker if it is still running user tasks. However, if this timeout is hit, we'd basically just leak the thread, and we rely on the threads being daemon threads (which is not the case for the stdlib pool) so that the Python process terminates and takes the threads with it.
A couple of questions here:
- Why do we wait at all? If the worker is closing there is no point in waiting for the user tasks to finish, is there?
- If we want to wait, why wait only for a bit? Is this one of those situations where we decide to wait a bit because in most cases we'll end up with a clean state?
My gut tells me right now that we can throw away all of these timeouts and the executor_wait kwarg and reduce complexity significantly.