Description
If the network connection between the worker and scheduler was broken, workers used to try to re-connect and negotiate their state with the scheduler.
It turned out that the logic around re-establishing the network connection (#5481), re-negotiating the state (#6341), and handling the disconnect on the scheduler side (#6354) was all buggy and a source of deadlocks. Though disruptive, for short-term stability, we opted to remove the reconnection option entirely (#6350).
However, in the long term, we do want workers to be resilient to temporary network failures. We'll want to add worker reconnection back in once contracts around BatchedSend and worker disconnection are tightened up.
Requires:
- Add validation to BatchedSend and convert to asyncio #6389
- Eliminate partially-removed-worker state on scheduler (comms open, state removed) #6390
- and probably some other things
Note that I'm intentionally not tracking this in #6384, since those are only meant to be short-term tasks. This is likely not something we'll tackle for a bit.