Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread 'stdin_handler' panicked along suspending system into sleep mode #1429

Open
matu3ba opened this issue May 22, 2022 · 6 comments
Open
Labels
stability Issues in relation to stability suspected bug

Comments

@matu3ba
Copy link
Contributor

matu3ba commented May 22, 2022

Unfortunately the issue is not easily reproducable and I dont have a backtrace/the panic handler did not tell what code broke.
Further, I broke the logs in /tmp/zellij-1000/zellij-log/ by starting a fresh instance, but I expect those would not tell much either.

The following output information were printed:

  × Thread 'stdin_handler' panicked.
  ├─▶ At /home/misterspoon/.cargo/registry/src/github.com-1ecc6299db9ec823/zellij-client-0.29.1/src/os_input_output.rs:256:14
  ╰─▶ could not poll stdin for readiness: Os { code: 4, kind: Interrupted, message: "Interrupted system call" }
  help: If you are seeing this message, it means that something went wrong.
        Please report this error to the github issue.
        (https://github.com/zellij-org/zellij/issues)

        Also, if you want to see the backtrace, you can set the `RUST_BACKTRACE` environment variable to `1`.

Basic information

zellij --version: zellij 0.29.1
tput lines: 30 (not relevant)
tput cols: 126 (not relevant)
uname -av or ver(Windows): Linux pc 5.17.9-arch1-1

List of programs you interact with as, PROGRAM --version: output cropped meaningful, for example:
nvim --version: NVIM v0.5.0-dev+1299-g1c2e504d5 (used the appimage release)
alacritty --version: alacritty 0.7.2 (5ac8060b)

Further information
To reproduce, I suspect that suspending system into sleep mode and waking up should reproduce the issue eventually.
Signaling code is very hard, so I would not be surprised that some dependency does not handle the signal interrupt mask or handlers correctly.

If you give me some pointers what you use to poll stdin I'll fix the signaling code myself. Ideally a portable solution should be used.
My suspection is that signal interrupts are not handled at all such that the bubbled up failure to poll crashes the application.

@a-kenji a-kenji added the stability Issues in relation to stability label May 22, 2022
@tlinford
Copy link
Contributor

i think it could be related to #1334, which might explain why you are having trouble reproducing it consistently.

@raphCode
Copy link
Contributor

Does this crash your session / the zellij server or are you able to reattach to your session afterwards?

@raphCode
Copy link
Contributor

We actually have a dedicated signal handling thread here:

let _signal_thread = thread::Builder::new()

fn handle_signals(&self, sigwinch_cb: Box<dyn Fn()>, quit_cb: Box<dyn Fn()>) {

So I am not sure why any signal still interrupts the polling syscall (probably poll or select returning EINTR) here:

self.poll
.poll(&mut self.events, Some(self.timeout))
.expect("could not poll stdin for readiness");

@matu3ba
Copy link
Contributor Author

matu3ba commented May 24, 2022

Thanks for the pointers!

  1. Is EINTR already handled in Rust by retrying or are C semantics used?
  2. SIGSTOP and SIGCONT signals are not handled, but I need to check first what the signaling semantics of mio poll are.
  3. .expect("could not poll stdin for readiness"); reads like it doesnt handle this edge case (but I must check first)

@raphCode
Copy link
Contributor

Disclaimer: I do not know this area of the code well, take the following with a grain of salt:

Judging from my Rust knowlegde, poll() very likely returns a Result which indicates success or failure of the call and contains the return value (on success) or error information (on failure).
unwrap() called on a Result returns the wrapped success value, or panics if the Result indicates failure. Since the latter happened, I assume that poll() returned a failure. To answer question 1, I think nothing is retried automatically, but instead the error is bubbled up.

The best way to fix this would be in my opinion to make sure the poll() doesn't get interrupted / all signals are delivered only to the signal thread (do signals work this way?).
Second best option is probably to just retry the poll if we receive that error.

@matu3ba
Copy link
Contributor Author

matu3ba commented May 24, 2022

do signals work this way?

Signal semantics on the OS work in the following way: There is a process mask and thread mask of "always guaranteed to be delivered signals" and realtime signals (user-definable), which are defined as list.
Signal handlers are registered process-wide, while signals can be send to each individual thread. Each individual thread has a signalmask it (or other threads) can can modify (sigprocmask). Except for pid 1, SIGKILL can not be ignored and leads to thread process teardown after Kernel tasks succeded (no custom handler possible).
SIGABRT can neither be ignored (unless pid 1), but the application can install a custom handler with LONGJMP or do nothing to not run execute the process teardown.
man signal-safety and man 7 signal are a good read.

Overall, signal handling is a pile of non-portable mess as signals are kinda like small interrupt handlers. So its best to mask everything and only allow necessary things in specified regions of code.

See also my PR on the abort() call in Zig ziglang/zig#11565 or musl abort().
One of the worst nightmares is that its not UB to call the signal handler start code from the signal handler (leading to recursive, potentially infinite, loops) and the horrible recursive locking and unlocking code in glibc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stability Issues in relation to stability suspected bug
Projects
None yet
Development

No branches or pull requests

4 participants