
Are timeouts in Worker.close antipatterns? #7318

Open
@fjetter

Description


I just reviewed our code around closing workers and noticed many, many timeouts that I doubt are actually useful. I would even go as far as to call them code smells or anti-patterns, and I believe that if they ever trigger, we're in a potentially inconsistent state.

BaseWorker timeout

await BaseWorker.close(self, timeout=timeout)

BaseWorker is probably not the ideal class name for this. BaseWorker is the class that handles the async instructions emitted by the WorkerState, e.g. signals like Execute or GatherDep. It implements a timeout for the specific case (which is also tested) where instruction handlers catch CancelledError and might block indefinitely. Two reasons why I believe this is not necessary:

  1. We control all instruction handlers. If one of them swallowed cancellation like that, we should fix it rather than raise a timeout error. This is clearly a bug.
  2. Even if a handler did this, shouldn't we cancel the task repeatedly until it is in fact done? (Sketched below.)
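
A minimal sketch of the "cancel repeatedly" idea, assuming a plain asyncio task; cancel_until_done is a hypothetical helper, not existing distributed API:

    import asyncio

    async def cancel_until_done(task: asyncio.Task, interval: float = 1.0) -> None:
        # Keep re-cancelling until the task has truly finished, instead of
        # giving up after a single timeout. asyncio.wait() does not raise
        # on timeout; it simply returns and the loop cancels again.
        while not task.done():
            task.cancel()
            await asyncio.wait({task}, timeout=interval)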

BatchedSend close timeout

await self.batched_stream.close(timedelta(seconds=timeout))

which goes to
self.please_stop = True

This timeout controls the BatchedSend background task. If the background task doesn't finish in the provided time, this raises. However, for the background task to actually not finish, a comm.write would basically have to block for at least timeout seconds without raising an exception after the remote aborted the comm. I'm not sure that is realistic, at least not with TCP.
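
The following sketch illustrates the shutdown pattern being discussed; BackgroundSender and its members are illustrative names and a deliberate simplification, not the actual BatchedSend implementation:

    import asyncio

    class BackgroundSender:
        # Simplified stand-in for BatchedSend: a background task runs
        # until asked to stop via the please_stop flag.
        def __init__(self) -> None:
            self.please_stop = False
            self.stopped = asyncio.Event()

        async def _background_send(self) -> None:
            while not self.please_stop:
                # Stand-in for comm.write(); close() below only times out
                # if this call blocks for the full timeout without raising.
                await asyncio.sleep(0.01)
            self.stopped.set()

        async def close(self, timeout: float = 5.0) -> None:
            self.please_stop = True
            # Raises asyncio.TimeoutError if the background task does not
            # observe please_stop within `timeout` seconds.
            await asyncio.wait_for(self.stopped.wait(), timeout=timeout)

    async def main() -> None:
        sender = BackgroundSender()
        task = asyncio.create_task(sender._background_send())
        await sender.close()  # returns promptly here; the loop yields often
        await task

    asyncio.run(main())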

Threadpool timeout

executor.shutdown(wait=wait, timeout=timeout)

Last but not least, there is a threadpool timeout that is passed to Thread.join in our threadpool, see

t.join(timeout=timeout2)

This timeout does not exist in the stdlib ThreadPoolExecutor. Joining a thread with a timeout does not raise; it just blocks for at most that many seconds.
This timeout more or less makes sense because we do not want to block the shutdown of a worker while it is still running user tasks. However, if this timeout is hit, we basically just leak the thread and rely on the threads being daemon threads (which is not the case for the stdlib pool) so that the Python process terminates and takes the threads with it. A short demonstration of these join semantics follows.
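
A self-contained demonstration of the non-raising join behaviour; the sleeping function merely stands in for a long-running user task:

    import threading
    import time

    def user_task() -> None:
        time.sleep(60)  # stands in for a long-running user task

    t = threading.Thread(target=user_task, daemon=True)
    t.start()

    # Thread.join with a timeout never raises; it simply returns after at
    # most `timeout` seconds, whether or not the thread has finished.
    t.join(timeout=0.1)
    if t.is_alive():
        # The thread is now effectively leaked; only daemon=True keeps it
        # from blocking interpreter shutdown.
        print("thread still running -- leaked until process exit")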

A couple of questions here

  1. Why do we wait at all? If the worker is closing, there is no point in waiting for the user tasks to finish, is there?
  2. If we want to wait, why wait only for a bit? Is this one of those situations where we decide to wait a bit because in most cases we'll end up with a clean state?

My gut feeling right now is that we can throw away all of these timeouts and the executor_wait kwarg, and reduce complexity significantly.

cc @gjoseph92 @crusaderky @graingert
