[ty] Abort process if worker thread panics #18211
Conversation
```diff
  // Channel buffer capacity is between 2 and 4, depending on the pool size.
- let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::min(threads * 2, 4));
+ let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::max(threads * 2, 4));
```
The main motivation of the limit is to apply some form of back pressure. However, limiting the queue to 4 on, e.g., a 12-core system feels overly strict because it means we'll drop messages as soon as 4 out of 12 threads have one message queued. We should at least allow a backlog of 2 tasks per thread.
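For concreteness, here's how the two formulas compare on a 12-thread pool (an illustrative check, not code from the PR):

```rust
fn main() {
    let threads: usize = 12;
    // `min(threads * 2, 4)` caps the queue at 4 slots for any pool with 2 or more threads...
    assert_eq!(std::cmp::min(threads * 2, 4), 4);
    // ...while `max(threads * 2, 4)` allows 2 queued jobs per thread, with 4 as a floor.
    assert_eq!(std::cmp::max(threads * 2, 4), 24);
}
```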
For my own edification, can you say more about the relationship between the channel buffer size and dropping messages? Does that mean that if a channel send would block (i.e., there's no receiver ready and waiting to synchronize), then that message is dropped?
You're right, I was wrong here.
It's a crossbeam bounded channel that starts blocking the sender if the thread pool falls behind.
I thought that this wouldn't be the case because the version on main starts to fail with a "disconnected" sender error when all threads have panicked (which drops all channel receivers). Let me revert this change.
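For reference, a minimal standalone sketch of the bounded-channel semantics discussed here (plain crossbeam, not code from this PR): a full channel blocks the sender, and `send` only fails once every receiver has been dropped.

```rust
use crossbeam::channel::{bounded, TrySendError};

fn main() {
    let (sender, receiver) = bounded::<u32>(2);

    // Fill the buffer; a blocking `send` would now wait instead of dropping the message.
    sender.send(1).unwrap();
    sender.send(2).unwrap();
    assert!(matches!(sender.try_send(3), Err(TrySendError::Full(3))));

    // Draining the channel makes room for the sender again.
    assert_eq!(receiver.recv().unwrap(), 1);
    sender.send(3).unwrap();

    // Only once every receiver is gone does `send` fail ("disconnected"),
    // which is what happens when all worker threads have panicked.
    drop(receiver);
    assert!(sender.send(4).is_err());
}
```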
BurntSushi left a comment:
This makes sense to me!
dhruvmanila left a comment:
> Today, a panic in the worker thread pool gets logged (with tracing) but it tears down the worker thread on which the background task ran. The panic will only get surfaced once the thread pool shuts down (when `JoinHandle::join` is called). Eventually, `job_sender.send` panics because the thread pool ran out of worker threads and the job queue overflows.
I'm a bit unsure what this means in practice, specifically the "panic will only get surfaced once the thread pool shuts down" part. I tried adding a panic to the hover handler on main and it does surface the panic in the logs. Or am I misunderstanding?
> I added a `panic` to the hover request handler and it aborted the process (which VS Code then restarts up to 5 times).
I'm still not sure why we should abort the process if there's a panic in a specific handler. Wouldn't that degrade the user experience? Like, today even if there's a panic, the server keeps running and users can keep using other capabilities.
Is there a way to handle it gracefully? I might need to spend some time understanding the scheduler but I don't want to block this PR for that. Happy to go ahead with this.
Thanks to our global panic handler, it does surface the panic in the logs, but it also tears down the thread, and we'll eventually run out of worker threads. The error value of the panic will not be dropped until we join the threads (which can be problematic if it needs to release any resources). For that reason, I think aborting the process is the right decision.
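To illustrate the point about the panic's error value (a standalone std example, not ty code): the payload carried by a panic is only handed back, and therefore only dropped, when the thread's `JoinHandle` is joined.

```rust
use std::any::Any;
use std::thread;

fn main() {
    // Stand-in for a panic payload that might own resources needing cleanup.
    struct Payload(String);

    let handle = thread::spawn(|| {
        // The payload is boxed and stashed inside the thread's join handle...
        std::panic::panic_any(Payload("something went wrong".to_string()));
    });

    // ...and is only returned (and eventually dropped) once we join the thread.
    let payload: Box<dyn Any + Send> = handle.join().unwrap_err();
    let payload = payload.downcast::<Payload>().unwrap();
    println!("panic payload recovered at join time: {}", payload.0);
}
```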
Sort of. It works for as long as there are still enough worker threads; Ruff/ty will abort once all threads are used up. But I agree that the experience is worse. I plan to add specific […]
This is actually a more severe problem than I thought. A panicking thread means that the server never responds to that client request. The client might then decide not to send any new request for the same method and parameters because there's already a pending request. I think we should backport my changes to ruff.
Thank you for the explanation. I think that makes sense and we should do the same for Ruff as well; it might be useful to do that after your planned follow-up work?
I plan to backport all changes of this stack to ruff.
Force-pushed from 8fcdf96 to ccd14a5.
Summary
This PR changes how ty's LSP handles panics in background worker threads.
Today, a panic in the worker thread pool gets logged (with tracing) but it tears down the worker thread on which the background task ran.
The panic will only get surfaced once the thread pool shuts down (when `JoinHandle::join` is called). Eventually, `job_sender.send` panics because the thread pool ran out of worker threads and the job queue overflows.
This PR aligns the behavior with rayon by aborting the entire process when any background task unexpectedly panics.
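A minimal sketch of that behavior, assuming a crossbeam-based worker loop (the `worker_loop` name and `Job` type below are illustrative stand-ins, not the actual scheduler code):

```rust
use std::panic::{self, AssertUnwindSafe};

// Stand-in for whatever the scheduler actually sends to its workers.
type Job = Box<dyn FnOnce() + Send>;

fn worker_loop(job_receiver: crossbeam::channel::Receiver<Job>) {
    for job in job_receiver {
        // Run the job, catching any panic that unwinds out of it.
        let result = panic::catch_unwind(AssertUnwindSafe(|| job()));
        if result.is_err() {
            // The global panic hook has already logged the panic at this point;
            // abort the whole process instead of silently losing this worker.
            std::process::abort();
        }
    }
}
```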
My reasoning is that containing errors shouldn't be the responsibility of the thread pool. Instead, the request dispatching should be wrapped in a `catch_unwind`, and any potential recovery should be handled there. This also reveals that we don't have the same recovery for tasks running locally (on the main thread).
I plan on adding such recovery in the server dispatch logic as a follow up (which also adds retry logic).
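As a rough sketch of what catching panics at the dispatch boundary could look like (the `dispatch` helper below is hypothetical, not the planned implementation):

```rust
use std::panic::{self, UnwindSafe};

// Run a request handler, converting a panic into an error the server could
// report back to the client instead of crashing or losing the request.
fn dispatch<R>(handler: impl FnOnce() -> R + UnwindSafe) -> Result<R, String> {
    panic::catch_unwind(handler).map_err(|payload| {
        // Best-effort extraction of the panic message from the panic payload.
        payload
            .downcast_ref::<&str>()
            .map(|s| (*s).to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "request handler panicked".to_string())
    })
}
```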
Test Plan
I added a `panic` to the hover request handler and it aborted the process (which VS Code then restarts up to 5 times).