
Conversation

@MichaReiser (Member)

Summary

This PR changes how ty's LSP handles panics in background worker threads.

Today, a panic in the worker thread pool gets logged (with tracing), but it also tears down the worker thread on which the background task ran.
The panic only gets surfaced once the thread pool shuts down (when JoinHandle::join is called). Eventually, job_sender.send panics
because the thread pool has run out of worker threads and the job queue overflows.

This PR aligns the behavior with rayon by aborting the entire process when any background task unexpectedly panics.
My reasoning is that containing errors shouldn't be the responsibility of the thread pool. Instead, request
dispatching should be wrapped in a catch_unwind call, with any potential recovery handled there. This also
reveals that we don't have the same recovery for tasks running locally (on the main thread).

I plan on adding that recovery to the server dispatch logic as a follow-up (which will also add retry logic).
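
For illustration, a minimal sketch of the intended split (not the actual scheduler code in this PR; it assumes jobs are boxed closures pulled off a crossbeam channel):

use std::panic::{catch_unwind, AssertUnwindSafe};

type Job = Box<dyn FnOnce() + Send>;

// The worker no longer tries to contain panics: any panic that escapes a job is fatal,
// mirroring rayon's behavior.
fn worker_loop(job_receiver: crossbeam::channel::Receiver<Job>) {
    for job in job_receiver {
        // Recovery (e.g. answering the LSP request with an error) is expected to happen
        // inside the job itself, via catch_unwind in the request dispatcher.
        if catch_unwind(AssertUnwindSafe(job)).is_err() {
            std::process::abort();
        }
    }
}

Only the abort half lands in this PR; the catch_unwind in the dispatcher is the planned follow-up.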

Test Plan

I added a panic to the hover request handler, and it aborted the process (which VS Code then restarts up to 5 times).

@MichaReiser added the server (Related to the LSP server) and ty (Multi-file analysis & type inference) labels on May 20, 2025

  // Channel buffer capacity is between 2 and 4, depending on the pool size.
- let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::min(threads * 2, 4));
+ let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::max(threads * 2, 4));
@MichaReiser (Member Author)

The main motivation of the limit is to apply some form of back pressure. However, limiting the queue to 4 on, e.g., a 12-core system feels overly strict because it means we'll drop messages as soon as 4 out of 12 threads have one message queued. We should at least allow a backlog of 2 tasks per thread.
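
Concretely, for the 12-thread example: the current std::cmp::min(12 * 2, 4) caps the queue at 4 jobs in total, whereas std::cmp::max(12 * 2, 4) allows 24 queued jobs, i.e. a backlog of 2 per worker thread.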

@BurntSushi (Member) commented May 20, 2025

For my own edification, can you say more about the relationship between the channel buffer size and dropping messages? Does that mean that if a channel send would block (i.e., there's no receiver ready and waiting to synchronize), then that message is dropped?

@MichaReiser (Member Author)

You're right. I was wrong here.

It's a crossbeam bounded channel that starts blocking the sender if the thread pool falls behind.

I thought that this wasn't the case because the version on main starts to fail with a Disconnected error once all threads have panicked (which drops all channel receivers). Let me revert this change.
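
As a standalone illustration of that behavior (a sketch, not code from this PR): send on a crossbeam bounded channel blocks while the buffer is full and only fails once every receiver has been dropped.

fn main() {
    let (sender, receiver) = crossbeam::channel::bounded::<u32>(2);

    sender.send(1).unwrap();
    sender.send(2).unwrap();
    // A third send would now block until the receiver consumes a message;
    // nothing is silently dropped.

    assert_eq!(receiver.recv().unwrap(), 1);

    drop(receiver);
    // With all receivers gone, send returns a "disconnected" error instead of blocking.
    assert!(sender.send(3).is_err());
}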

@MichaReiser requested review from BurntSushi and dhruvmanila and removed the review requests for AlexWaygood, carljm, dcreager, and sharkdp on May 20, 2025 06:19
@github-actions bot (Contributor) commented May 20, 2025

mypy_primer results

No ecosystem changes detected ✅

@github-actions bot (Contributor) commented May 20, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@BurntSushi (Member) left a comment

This makes sense to me!



@dhruvmanila (Member) left a comment

> Today, a panic in the worker thread pool gets logged (with tracing), but it also tears down the worker thread on which the background task ran. The panic only gets surfaced once the thread pool shuts down (when JoinHandle::join is called). Eventually, job_sender.send panics because the thread pool has run out of worker threads and the job queue overflows.

I'm a bit unsure what this means in practice, specifically the "panic only gets surfaced once the thread pool shuts down" part. I tried adding a panic to the hover handler on main and it does surface the panic in the logs. Or am I misunderstanding?

> I added a panic to the hover request handler, and it aborted the process (which VS Code then restarts up to 5 times).

I'm still not sure why we should abort the process if there's a panic in a specific handler. Wouldn't that degrade the user experience? Like, today, even if there's a panic, the server keeps running and users can keep using other capabilities.

Is there a way to handle it gracefully? I might need to spend some time understanding the scheduler, but I don't want to block this PR on that. Happy to go ahead with this.

@MichaReiser (Member Author)

> I'm a bit unsure what this means in practice, specifically the "panic only gets surfaced once the thread pool shuts down" part. I tried adding a panic to the hover handler on main and it does surface the panic in the logs. Or am I misunderstanding?

Thanks to our global panic handler, it does surface the panic in the logs, but it also tears down the worker thread, and we'll eventually run out of threads. The error value of the panic will not be dropped until we join the threads (which can be problematic if it needs to release any resources).

For that reason, I think it's the right decision to abort the process.

> I'm still not sure why we should abort the process if there's a panic in a specific handler. Wouldn't that degrade the user experience? Like, today, even if there's a panic, the server keeps running and users can keep using other capabilities.

Sort of. It works for as long as there are still enough worker threads; Ruff/ty will abort once all threads are used up. But I agree that the experience is worse. I plan to add specific catch_unwind handling to the request and notification handlers, which will give us the old behavior (except that we never run out of threads). The last step is then to also implement retry logic for threads that unwind due to a salsa::Cancelled, which also needs the catch_unwind in the request handler.
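
A rough sketch of that planned follow-up (hypothetical, not code in this PR; it assumes salsa signals cancellation by unwinding with a salsa::Cancelled payload that can be downcast from the panic value):

use std::panic::{catch_unwind, AssertUnwindSafe};

/// Runs a request handler, retrying when the unwind was a salsa cancellation and
/// turning any other panic into an error instead of tearing down the worker thread.
fn run_with_recovery<T>(mut handler: impl FnMut() -> T) -> Result<T, String> {
    loop {
        match catch_unwind(AssertUnwindSafe(&mut handler)) {
            Ok(value) => return Ok(value),
            // The revision changed mid-computation: retry the query on fresh inputs.
            Err(payload) if payload.downcast_ref::<salsa::Cancelled>().is_some() => continue,
            // Any other panic becomes an error response for this one request.
            Err(_) => return Err("request handler panicked".to_string()),
        }
    }
}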

@MichaReiser (Member Author)

This is actually a more severe problem than I thought. A panicking thread means the server never responds to that client request. The client might then decide not to send any new request for the same method and parameters because there's already a pending one.

I think we should backport my changes to Ruff.

@dhruvmanila (Member)

Thank you for the explanation. I think that makes sense, and we should do the same for Ruff as well; it might be useful to do that after your planned follow-up work?

@MichaReiser (Member Author)

I plan to backport all changes in this stack to Ruff.

@MichaReiser force-pushed the micha/server-panic branch from 8fcdf96 to ccd14a5 on May 26, 2025 11:59
@MichaReiser merged commit 66b082f into main on May 26, 2025
35 checks passed
@MichaReiser deleted the micha/server-panic branch on May 26, 2025 12:09