Add back worker reconnection #6391

@gjoseph92

If the network connection between the worker and scheduler was broken, workers used to try to re-connect and negotiate their state with the scheduler.

It turned out that the logic around re-establishing the network connection (#5481), re-negotiating the state (#6341), and handling the disconnect on the scheduler side (#6354) was all buggy and a source of deadlocks. Though disruptive, for short-term stability, we opted to remove the reconnection option entirely (#6350).

However, in the long term, we do want workers to be resilient to temporary network failures. We'll want to add worker reconnection back in once contracts around BatchedSend and worker disconnection are tightened up.

Requires:

Note that I'm intentionally not tracking this in #6384, since those are only meant to be short-term tasks. This is likely not something we'll tackle for a bit.
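
For illustration only, here is a minimal sketch of the simplest piece of this: retrying a dropped connection with capped exponential backoff. It is not the removed implementation and does not use any `distributed` API. `connect_with_backoff` and the `connect` coroutine function it receives are hypothetical names, and the sketch deliberately leaves out the harder parts described above (re-negotiating state and scheduler-side disconnect handling).

```python
import asyncio
import random


async def connect_with_backoff(connect, *, retries=5, base_delay=0.1, max_delay=5.0):
    """Await the hypothetical coroutine function `connect` until it succeeds.

    Retries on OSError (ConnectionError is a subclass) with capped
    exponential backoff plus jitter. Re-raises the last error once
    `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return await connect()
        except OSError:
            if attempt == retries - 1:
                raise
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many workers from reconnecting in lockstep.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

A worker-like caller would wrap its own connection routine, e.g. `await connect_with_backoff(open_comm)` where `open_comm` is whatever coroutine function opens the link. Re-raising after the final attempt lets the caller decide whether to shut down rather than retry forever.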
