Remove deadlocks by making the scheduler signal handler signal-safe #94
+48
−33
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
See also the relevant signal tests in #93. Ideally that PR is merged and this PR is rebased on top so that we can mark the relevant signal tests as passing, rather than a non-strict xfail. Doing things in this order would also provide confidence that this isn't breaking anything else.
We can get deadlocks rarely due to logging and threading primitives in the scheduler's signal handler, which cause the process to hang sometimes on a SIGINT/SIGTERM. We also want to be able to have the signal interrupt our poll wait/sleep without busy waiting (for performance), which means we also cannot use
time.sleep(as an early signal will not interrupt this, and a pre-check could lead to TOC/TOU races), nor can we usesignal.sigtimedwait(which registers its own handlers to handle signals inside the wait, but misses signals outside).This leaves us with one workable solution - use an OS pipe and define a selector on the read file descriptor, and have the signal handler set a flag with the signal number and write to the write file descriptor. By querying the flag we always know if we have handled a signal in our main loop, and by using a fd we reliably skip the wait on a signal, where the wait is blocking (i.e. not a busy wait). The signal handler is then minimal and async-signal-safe, just setting a flag and writing to the pipe. The relevant logging logic is moved to be dispatched by the main loop instead.
Edit: The diff is unfortunately not very nice - it might be easier to view with your Git tooling of preference, or just compare the old and new code side-by-side.
If run on top of #93, this can be tested by running e.g.
pytest -k test_signal --count 100 -n auto:17 xfailed, 586 xpassed in 30.84s600 xpassed in 29.82s