Remove deadlocks by making the scheduler signal handler signal-safe #94

AlexJones0 · 2026-02-10T16:06:41Z

See also the relevant signal tests in #93. Ideally that PR is merged and this PR is rebased on top so that we can mark the relevant signal tests as passing, rather than a non-strict xfail. Doing things in this order would also provide confidence that this isn't breaking anything else.

Logging and threading primitives in the scheduler's signal handler can occasionally deadlock, which causes the process to hang on a SIGINT/SIGTERM. We also want a signal to interrupt our poll wait/sleep without busy waiting (for performance). That rules out time.sleep (a signal that arrives just before the sleep will not interrupt it, and a pre-check could lead to TOCTOU races) as well as signal.sigtimedwait (which handles signals that arrive during the wait, but misses signals that arrive outside it).

This leaves us with one workable solution: use an OS pipe, register a selector on the read file descriptor, and have the signal handler set a flag with the signal number and write to the write file descriptor. By querying the flag, the main loop always knows whether a signal has been received, and by waiting on the file descriptor we reliably skip the wait when a signal arrives, while the wait itself stays blocking (i.e. not a busy wait). The signal handler is then minimal and async-signal-safe: it only sets a flag and writes to the pipe. The relevant logging logic is moved out of the handler and dispatched by the main loop instead.
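For reviewers unfamiliar with the pattern, the approach described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class name `SignalWaker` and its `handler` and `wait` methods are hypothetical, and the real scheduler may structure this differently.

```python
import os
import selectors
import signal


class SignalWaker:
    """Hypothetical helper: a self-pipe plus a selector for interruptible waits."""

    def __init__(self):
        self.received = None  # signal number, set by the handler
        self._read_fd, self._write_fd = os.pipe()
        # Non-blocking on both ends: the handler can never stall on a full
        # pipe, and draining can never stall on an empty one.
        os.set_blocking(self._read_fd, False)
        os.set_blocking(self._write_fd, False)
        self._selector = selectors.DefaultSelector()
        self._selector.register(self._read_fd, selectors.EVENT_READ)

    def handler(self, signum, frame):
        # Deliberately minimal: no logging and no locks in here.
        self.received = signum
        try:
            os.write(self._write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full, so a wake-up is already pending

    def wait(self, timeout):
        """Block for up to `timeout` seconds; return the signal number or None."""
        if self._selector.select(timeout):
            # Drain the pipe so the next wait blocks again.
            try:
                while os.read(self._read_fd, 1024):
                    pass
            except BlockingIOError:
                pass
        return self.received
```

In a main loop, this would be installed with something like `signal.signal(signal.SIGINT, waker.handler)`, and each poll interval would call `waker.wait(poll_interval)` in place of a sleep. Logging and shutdown then happen in the loop once `wait` returns a signal number. A signal that arrives before the wait starts leaves a byte in the pipe, so the next `wait` returns immediately instead of sleeping through it.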

Edit: The diff is unfortunately not very nice - it might be easier to view with your Git tooling of preference, or just compare the old and new code side-by-side.

If run on top of #93, this can be tested by running e.g. pytest -k test_signal --count 100 -n auto:

Before this PR, I got a result of: 17 xfailed, 586 xpassed in 30.84s
With this PR, I get a result of: 600 xpassed in 29.82s

See relevant comments. Logging and threading primitives in the signal handler can occasionally deadlock, which causes the process to hang around 5% of the time on a SIGINT/SIGTERM. We also want a signal to interrupt our poll wait/sleep without busy waiting (for performance), which rules out time.sleep (a signal that arrives just before the sleep will not interrupt it, and a pre-check leads to TOCTOU races) as well as signal.sigtimedwait (which handles signals that arrive during the wait, but misses signals that arrive outside it).

This leaves us with one clear solution: use an OS pipe, register a selector on the read file descriptor, and have the signal handler set a flag with the signal number and write to the write file descriptor. By querying the flag, the main loop always knows whether a signal has been received, and by waiting on the file descriptor we reliably skip the wait when a signal arrives, while the wait itself stays blocking (i.e. not a busy wait). The signal handler is then minimal and async-signal-safe: it only sets a flag and writes to the pipe. The relevant logging logic is moved out of the handler and dispatched by the main loop instead.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

AlexJones0 requested review from hcallahan-lowrisc, machshev and rswarbrick February 10, 2026 16:08
