
Conversation

@MichaReiser (Member)

Summary

This PR changes how ty's LSP handles panics in background worker threads.

Today, a panic in the worker thread pool gets logged (with tracing), but it also tears down the worker thread on which the background task ran.
The panic only gets surfaced once the thread pool shuts down (when JoinHandle::join is called). Eventually, job_sender.send panics
because the thread pool has run out of worker threads and the job queue overflows.

This PR aligns the behavior with rayon by aborting the entire process when any background task unexpectedly panics.
My reasoning is that containing errors shouldn't be the responsibility of the thread pool. Instead, request
dispatching should be wrapped in a catch_unwind call, with any potential recovery handled there. This also
reveals that we don't have the same recovery for tasks running locally (on the main thread).

I plan on adding that recovery to the server dispatch logic as a follow-up (which will also add retry logic).
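
For illustration, a minimal sketch of the intended split (not the actual scheduler code in this PR; it assumes jobs are boxed closures pulled off a crossbeam channel):

use std::panic::{catch_unwind, AssertUnwindSafe};

type Job = Box<dyn FnOnce() + Send>;

// The worker no longer tries to contain panics: any panic that escapes a job is fatal,
// mirroring rayon's behavior.
fn worker_loop(job_receiver: crossbeam::channel::Receiver<Job>) {
    for job in job_receiver {
        // Recovery (e.g. answering the LSP request with an error) is expected to happen
        // inside the job itself, via catch_unwind in the request dispatcher.
        if catch_unwind(AssertUnwindSafe(job)).is_err() {
            std::process::abort();
        }
    }
}

Only the abort half lands in this PR; the catch_unwind in the dispatcher is the planned follow-up.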

Test Plan

I added a panic to the hover request handler, and it aborted the process (which VS Code then restarts up to 5 times).

@MichaReiser added the server (Related to the LSP server) and ty (Multi-file analysis & type inference) labels on May 20, 2025

  // Channel buffer capacity is between 2 and 4, depending on the pool size.
- let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::min(threads * 2, 4));
+ let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::max(threads * 2, 4));
@MichaReiser (Member Author)

The main motivation of the limit is to apply some form of back pressure. However, limiting the queue to 4 on, e.g., a 12-core system feels overly strict because it means we'll drop messages as soon as 4 out of 12 threads have one message queued. We should at least allow a backlog of 2 tasks per thread.
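
Concretely, for the 12-thread example: the current std::cmp::min(12 * 2, 4) caps the queue at 4 jobs in total, whereas std::cmp::max(12 * 2, 4) allows 24 queued jobs, i.e. a backlog of 2 per worker thread.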

@BurntSushi (Member) commented May 20, 2025

For my own edification, can you say more about the relationship between the channel buffer size and dropping messages? Does that mean that if a channel send would block (i.e., there's no receiver ready and waiting to synchronize), then that message is dropped?

@MichaReiser (Member Author)

You're right. I was wrong here.

It's a crossbeam bounded channel that starts blocking the sender if the thread pool falls behind.

I thought that this wasn't the case because the version on main starts to fail with a Disconnected error once all threads have panicked (which drops all channel receivers). Let me revert this change.
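
As a standalone illustration of that behavior (a sketch, not code from this PR): send on a crossbeam bounded channel blocks while the buffer is full and only fails once every receiver has been dropped.

fn main() {
    let (sender, receiver) = crossbeam::channel::bounded::<u32>(2);

    sender.send(1).unwrap();
    sender.send(2).unwrap();
    // A third send would now block until the receiver consumes a message;
    // nothing is silently dropped.

    assert_eq!(receiver.recv().unwrap(), 1);

    drop(receiver);
    // With all receivers gone, send returns a "disconnected" error instead of blocking.
    assert!(sender.send(3).is_err());
}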

@MichaReiser requested review from BurntSushi and dhruvmanila and removed the review requests for AlexWaygood, carljm, dcreager, and sharkdp on May 20, 2025 06:19
@github-actions bot (Contributor) commented May 20, 2025

mypy_primer results

No ecosystem changes detected ✅

@github-actions bot (Contributor) commented May 20, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@BurntSushi (Member) left a comment

This makes sense to me!



@dhruvmanila (Member) left a comment

> Today, a panic in the worker thread pool gets logged (with tracing), but it also tears down the worker thread on which the background task ran. The panic only gets surfaced once the thread pool shuts down (when JoinHandle::join is called). Eventually, job_sender.send panics because the thread pool has run out of worker threads and the job queue overflows.

I'm a bit unsure what this means in practice, specifically the "panic only gets surfaced once the thread pool shuts down" part. I tried adding a panic to the hover handler on main and it does surface the panic in the logs. Or am I misunderstanding?

> I added a panic to the hover request handler, and it aborted the process (which VS Code then restarts up to 5 times).

I'm still not sure why we should abort the process if there's a panic in a specific handler. Wouldn't that degrade the user experience? Like, today, even if there's a panic, the server keeps running and users can keep using other capabilities.

Is there a way to handle it gracefully? I might need to spend some time understanding the scheduler, but I don't want to block this PR on that. Happy to go ahead with this.

@MichaReiser (Member Author)

> I'm a bit unsure what this means in practice, specifically the "panic only gets surfaced once the thread pool shuts down" part. I tried adding a panic to the hover handler on main and it does surface the panic in the logs. Or am I misunderstanding?

Thanks to our global panic handler, it does surface the panic in the logs, but it also tears down the worker thread, and we'll eventually run out of threads. The error value of the panic will not be dropped until we join the threads (which can be problematic if it needs to release any resources).

For that reason, I think it's the right decision to abort the process.

> I'm still not sure why we should abort the process if there's a panic in a specific handler. Wouldn't that degrade the user experience? Like, today, even if there's a panic, the server keeps running and users can keep using other capabilities.

Sort of. It works for as long as there are still enough worker threads; Ruff/ty will abort once all threads are used up. But I agree that the experience is worse. I plan to add specific catch_unwind handling to the request and notification handlers, which will give us the old behavior (except that we never run out of threads). The last step is then to also implement retry logic for threads that unwind due to a salsa::Cancelled, which also needs the catch_unwind in the request handler.
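
A rough sketch of that planned follow-up (hypothetical, not code in this PR; it assumes salsa signals cancellation by unwinding with a salsa::Cancelled payload that can be downcast from the panic value):

use std::panic::{catch_unwind, AssertUnwindSafe};

/// Runs a request handler, retrying when the unwind was a salsa cancellation and
/// turning any other panic into an error instead of tearing down the worker thread.
fn run_with_recovery<T>(mut handler: impl FnMut() -> T) -> Result<T, String> {
    loop {
        match catch_unwind(AssertUnwindSafe(&mut handler)) {
            Ok(value) => return Ok(value),
            // The revision changed mid-computation: retry the query on fresh inputs.
            Err(payload) if payload.downcast_ref::<salsa::Cancelled>().is_some() => continue,
            // Any other panic becomes an error response for this one request.
            Err(_) => return Err("request handler panicked".to_string()),
        }
    }
}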

@MichaReiser (Member Author)

This is actually a more severe problem than I thought. A panicking thread means the server never responds to that client request. The client might then decide not to send any new request for the same method and parameters because there's already a pending one.

I think we should backport my changes to Ruff.

@dhruvmanila (Member)

Thank you for the explanation. I think that makes sense, and we should do the same for Ruff as well; it might be useful to do that after your planned follow-up work?

@MichaReiser (Member Author)

I plan to backport all changes in this stack to Ruff.

@MichaReiser force-pushed the micha/server-panic branch from 8fcdf96 to ccd14a5 on May 26, 2025 11:59
@MichaReiser merged commit 66b082f into main on May 26, 2025
35 checks passed
@MichaReiser deleted the micha/server-panic branch on May 26, 2025 12:09