Worker reconnection deadlock #5480
Comments
I'm not certain that it's the same problem, but I have some evidence that LightGBM's unit tests ran into this bug starting last night, triggered by a recent release. Documented it at microsoft/LightGBM#4771.

Turns out this was not the cause of the LightGBM deadlock; that's #5497.

With worker reconnection removed (#6361), this deadlock should no longer be possible.
I've experienced a deadlock around a worker's connection to the scheduler breaking and trying to restart. I'm opening this general issue to track all the problems and explain the situation; I'll open sub-issues for specific things to fix.
I believe the flow is something like this:

1. An exception occurs in a stream handler (`handle_compute_task` in this case).
2. `Server.handle_stream` terminates, and closes the comm as its final act.
3. Closing the comm causes the `BatchedSend` to abort itself, setting `self.please_stop = True`.
4. The error from `handle_stream` causes `handle_scheduler` to try to restart the connection to the scheduler.
5. But after reconnecting, the `_background_send` coroutine isn't actually running (because `please_stop` was never set back to `False`)!
6. `batched_stream.send` appears to succeed, but the messages actually just sit in the buffer forever.
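To make that flow concrete, here is a minimal, self-contained sketch of the pattern. It is a toy model, not the actual `distributed.batched.BatchedSend` code; the names only mirror the attributes mentioned above (`please_stop`, `buffer`, `start`, `abort`, `send`, `_background_send`):

```python
import asyncio


class ToyBatchedSend:
    """Toy model of a flag-guarded batched sender (NOT distributed's code)."""

    def __init__(self):
        self.comm = None            # assumed to expose .write() and .closed()
        self.buffer = []
        self.please_stop = False

    def start(self, comm):
        # Swap in the (possibly new) comm and launch the background sender.
        # Note: please_stop is never reset here, so calling start() again
        # after abort() launches a coroutine that exits immediately.
        self.comm = comm
        asyncio.create_task(self._background_send())  # needs a running event loop

    async def _background_send(self):
        while not self.please_stop:
            if self.buffer:
                payload, self.buffer = self.buffer, []
                try:
                    await self.comm.write(payload)
                except OSError:
                    # Comm broke: give up and (like step 3 above) drop state.
                    self.abort()
            await asyncio.sleep(0.01)

    def abort(self):
        self.please_stop = True
        self.buffer = []            # buffered messages are lost here

    def send(self, msg):
        # Only the comm is checked, not whether the sender itself is stopped,
        # so after a failed "restart" this quietly buffers forever (step 6).
        if self.comm is not None and self.comm.closed():
            raise RuntimeError("comm is closed")
        self.buffer.append(msg)
```

In this toy, calling `abort()` and then `start(new_comm)` relaunches a coroutine that exits immediately (step 5), and subsequent `send()` calls neither raise nor deliver anything (step 6).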
Full narrative of me trying to figure out what's going on:
The scheduler thinks a couple of workers are processing some tasks, but those workers aren't actually running any tasks and have nothing in `ready`. Additionally, both workers have 40-60 messages in their `BatchedSend` buffer but are not sending them. Those messages include things like `'task-finished'` and `'task-erred'`. So that's causing the deadlock: the info the scheduler needs to assign new tasks is stuck in the `BatchedSend` buffer.

Why? I did notice `Connection to scheduler broken. Reconnecting...` log messages on those workers. Reading through `Worker.handle_scheduler`, where that message prints, the error recovery seems to be scheduling a new `heartbeat` to run soon. I think the intent is that this will call into `_register_with_scheduler` once again, which will restart the comm and call `self.batched_stream.start` again with the new comm, restarting it?

However, I don't think a `BatchedSend` is actually restartable. When the `_background_send` coroutine hits an exception, it calls `self.abort()`. This sets `please_stop = True` (and drops the buffer containing potentially critical messages!).

If you call `BatchedSend.start` again after this, though, `please_stop` is still `True`, so the `_background_send` coroutine will terminate immediately without sending anything.

Any subsequent messages written to the `BatchedSend` will just sit there and never be sent. However, `send` won't actually raise an error, because `send` only checks whether the comm is closed, not whether the `BatchedSend` itself is closed.

I'm a little confused by this, because if I'm reading it right, how was reconnection ever supposed to work? The `BatchedSend` doesn't seem at all restartable, yet we're acting like it is.
As for why the connection broke in the first place? I see in the logs that the worker's `handle_compute_task` is trying to set `nbytes` for a dependency that has no task state yet. This seems broken, but I don't know why it can happen yet.
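The failure is presumably of this shape; the snippet below is a hypothetical, minimal reproduction with illustrative names, not the actual worker code:

```python
# Hypothetical reproduction: the worker's task-state dict has no entry for a
# dependency, so assigning its nbytes raises KeyError.
class TaskState:
    def __init__(self):
        self.nbytes = None

tasks = {"known-key": TaskState()}                       # "missing-key" was never registered
nbytes_updates = {"known-key": 1024, "missing-key": 2048}

for key, value in nbytes_updates.items():
    tasks[key].nbytes = value                            # KeyError: 'missing-key'
```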
But anyway, this exception propagates up to the core `Server.handle_stream` method. And interestingly, we don't try-except each call to stream handlers. If a stream handler ever fails, we abort stream handling, close the comm, and raise an error.
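Schematically, the dispatch loop behaves like the sketch below; this is a paraphrase for illustration, not the actual `Server.handle_stream` source, and the message format is simplified:

```python
import asyncio


async def handle_stream_sketch(comm, handlers):
    """Schematic stream-dispatch loop: one bad handler call ends everything."""
    try:
        while True:
            msgs = await comm.read()
            for msg in msgs:
                op = msg.pop("op")
                handler = handlers[op]
                # No per-handler try/except here, so a single bad message
                # (like the nbytes KeyError above) aborts stream handling for
                # this comm, including all later messages.
                result = handler(**msg)
                if asyncio.iscoroutine(result):
                    await result
    finally:
        # ...and the comm is closed on the way out, which is what trips the
        # BatchedSend abort described earlier.
        await comm.close()
```

Whether that teardown is the right behavior is exactly the question raised in #5483 below.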
And that error from `handle_stream` is what causes `handle_scheduler` to try to reconnect, which doesn't actually work because the `BatchedSend` is stopped.

There are two problems here:
1. Reconnection does not restart the `BatchedSend` (#5481). I believe that any unhandled error in a stream handler would put `Worker.batched_stream` into a broken state upon reconnection, where no `send`s actually get sent.
2. The error in `Worker.handle_compute_task` that causes the deadlock (#5482).

I also notice a couple other issues:
- `BatchedSend` is dropping all buffered messages when it hits an error. Those messages still could (and must!) be sent upon reconnection; dropping them will cause other deadlocks. #5457 ("Do not drop BatchedSend payload if worker reconnects") addresses this (see the sketch below).
- Should an error in a stream handler cause `Server.handle_stream` to close the comm (#5483)? Should we not have try-excepts around each `handler()` call?
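For the first point, the direction #5457 takes could look roughly like the sketch below. This continues the toy class from earlier in this issue and is not the actual patch:

```python
import asyncio


class RestartableToyBatchedSend(ToyBatchedSend):  # ToyBatchedSend from the sketch above
    def abort(self):
        # Stop the background sender but KEEP the buffer, so a later
        # reconnection can still deliver 'task-finished' / 'task-erred' etc.
        self.please_stop = True

    def start(self, comm):
        # A working restart has to (a) swap in the fresh comm, (b) clear the
        # stop flag, and (c) relaunch the background sender so the retained
        # buffer drains. In the toy above, (b) is the missing step.
        self.comm = comm
        self.please_stop = False
        asyncio.create_task(self._background_send())  # needs a running event loop
```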
As it currently stands, I think #5457 will fix this deadlock. However, I'm concerned about the overall design of `BatchedSend`. It has neither an interface nor tests to support restarting after it's been closed, yet we're using it this way. If we want it to support reconnection, we should refactor and test it to do so. It also does not seem to have clear invariants about what its internal state means, and does not validate parts of its state when it probably should. It's a very old piece of code, still using Tornado. Overall, it might be due for a refresh.

cc @fjetter @jcrist @jrbourbeau
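To illustrate the "clear invariants" point, one possible shape (purely a sketch, not a proposal for the actual API) is an explicit lifecycle state that `send` and `start` validate against:

```python
from enum import Enum, auto


class SendState(Enum):
    NEW = auto()      # constructed, never started
    RUNNING = auto()  # background sender active and tied to a live comm
    CLOSED = auto()   # stopped; sending is an error until start() runs again


# e.g. a send() that refuses to silently buffer while not RUNNING:
#
#     if self.state is not SendState.RUNNING:
#         raise RuntimeError("BatchedSend is not running; call start() first")
```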