Eliminate partially-removed-worker state on scheduler (comms open, state removed) #6390

@gjoseph92

Scheduler.remove_worker removes state regarding the worker (self.workers[addr], self.stream_comms[addr], etc.), but does not close the actual network connections to the worker. This is even codified in the close=False option, which supports removing the worker state, but not telling the worker to shut down or to disconnect.

Keeping the network connections open (and listening to them) is essentially a half-removed state. The scheduler no longer knows about the worker, but if the worker sends it updates over the open connection, it will respond to them (potentially invoking handlers that assume the worker state is there).
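To make the failure mode concrete, here is a toy model of it. `ToyScheduler`, `handle_update`, and the `inbox` queues are hypothetical stand-ins invented for illustration and are not distributed's real classes or handlers; only the attribute names `workers` and `stream_comms` come from the description above.

```python
import asyncio


class ToyScheduler:
    """Hypothetical stand-in for the scheduler; not distributed's real class."""

    def __init__(self):
        self.workers = {}       # addr -> per-worker state
        self.stream_comms = {}  # addr -> outgoing message queue
        self.inbox = {}         # addr -> incoming messages (the "open comm")

    def add_worker(self, addr):
        self.workers[addr] = {"nbytes": 0}
        self.stream_comms[addr] = asyncio.Queue()
        self.inbox[addr] = asyncio.Queue()

    def remove_worker(self, addr):
        # State is dropped, but the inbox (the connection) stays open.
        del self.workers[addr]
        del self.stream_comms[addr]

    def handle_update(self, addr, nbytes):
        # Handler that assumes the worker state is still present.
        self.workers[addr]["nbytes"] += nbytes

    async def handle_worker(self, addr):
        # Keeps listening on the connection until it sees a None sentinel.
        errors = []
        while (msg := await self.inbox[addr].get()) is not None:
            try:
                self.handle_update(addr, msg)
            except KeyError as exc:
                errors.append(exc)
        return errors


async def demo():
    s = ToyScheduler()
    s.add_worker("tcp://w1")
    listener = asyncio.create_task(s.handle_worker("tcp://w1"))
    s.remove_worker("tcp://w1")          # state gone, comm still open
    await s.inbox["tcp://w1"].put(10)    # worker sends a late update
    await s.inbox["tcp://w1"].put(None)  # finally close the connection
    return await listener


errors = asyncio.run(demo())
```

In this toy, the late update reaches a handler after the state it relies on has been deleted, so the handler fails with a `KeyError`. The real scheduler's handlers may fail differently, or silently do the wrong thing.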

There are two things to figure out:

  • What does it mean for a worker to be "there" or "not there", from the scheduler's perspective?
    • i.e. is it only that self.workers[addr] exists? Or also self.stream_comms[addr], and other such fields? Is there a self.handle_worker coroutine running for that worker too?
    • Can there be a single point of truth for this? A single dict to check? Or method to call?
  • How can Scheduler.remove_worker ensure that:
    • after it returns, the worker is fully "not there"
    • if it yields control while it's running (via await), things are in a well-defined state (worker is either "there", or "not there", or maybe even in a "closing" state, but no half-removed state like we have currently)
    • if multiple remove_worker coroutines run concurrently, everything remains consistent
    • if multiple remove_worker coroutines run concurrently, the second one does not return until the worker is actually removed (i.e. the first coroutine has completed)
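One possible shape for these guarantees is sketched below, as a starting point for discussion rather than a proposed implementation. Everything in it (`WorkerRegistry`, `Status`, `FakeComm`, `is_there`) is hypothetical and not part of distributed. The assumptions are: a single dict is the only source of truth, an explicit "closing" status is set before the first `await`, and concurrent removers wait on a shared event.

```python
import asyncio
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    CLOSING = "closing"


class FakeComm:
    """Hypothetical connection whose close() yields control once."""

    def __init__(self):
        self.closed = False

    async def close(self):
        await asyncio.sleep(0)
        self.closed = True


class WorkerRegistry:
    """Hypothetical registry: one dict is the only source of truth."""

    def __init__(self):
        self._entries = {}  # addr -> {"status", "comm", "removed"}

    def add(self, addr, comm):
        self._entries[addr] = {
            "status": Status.RUNNING,
            "comm": comm,
            "removed": asyncio.Event(),
        }

    def is_there(self, addr):
        entry = self._entries.get(addr)
        return entry is not None and entry["status"] is Status.RUNNING

    async def remove(self, addr):
        entry = self._entries.get(addr)
        if entry is None:
            return False  # already fully removed
        if entry["status"] is Status.CLOSING:
            await entry["removed"].wait()  # wait for the first remover
            return False
        entry["status"] = Status.CLOSING  # set before the first await
        try:
            await entry["comm"].close()  # close the connection first
        finally:
            del self._entries[addr]  # then drop state, with no await between
            entry["removed"].set()
        return True


async def demo():
    reg = WorkerRegistry()
    comm = FakeComm()
    reg.add("tcp://w1", comm)
    results = await asyncio.gather(
        reg.remove("tcp://w1"), reg.remove("tcp://w1")
    )
    return results, comm.closed, reg.is_there("tcp://w1")


results, closed, there = asyncio.run(demo())
```

Here the second `remove` call does not return until the first has closed the comm and deleted the entry, and `is_there` never reports a worker whose connection is already being torn down. Whether a "closing" entry should stay visible to other handlers is left open, as in the question above.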

Addresses #6354
