Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zombie Process on Tab/Pane Close #1286

Closed
dt665m opened this issue Mar 31, 2022 · 8 comments
Closed

Zombie Process on Tab/Pane Close #1286

dt665m opened this issue Mar 31, 2022 · 8 comments

Comments

@dt665m
Copy link

dt665m commented Mar 31, 2022

I'm seeing zombie processes on tab/pane close. Found out about it through vim open buffers on reopening a previously closed tab that had the same file open. (duplicate/closed issue #518)

Basics

zellij: 0.26.1
os/arch: Darwin/arm64 (M1)
alacritty: 0.11.0-dev
shell: zsh 5.8

Repro Steps:

  • start zellij
  • create new pane or tab
  • open vim
  • close pane or tab
  • ps aux | grep vim
@raphCode
Copy link
Contributor

raphCode commented Apr 4, 2022

I bet this is related to #1029 where whole zellij client processes stay running...

@imsnif
Copy link
Member

imsnif commented Apr 5, 2022

Hey, just FYI: I'm not reproducing this. Could you tell us a bit more about your OS? Can you reproduce this with other non-vim panes? Does it happen all the time?

@raphCode
Copy link
Contributor

raphCode commented Apr 5, 2022

I can reproduce on v0.26.1 and v0.27.0, it happens with all processes started in any pane.
I got it to work every time in ~10 tries.

If the zellij session is closed, all child processes eventually exit.
This means it is important to not close the last pane/tab to reproduce.

I even tried yes and closed its pane. The process and the zellij server happily continue consuming CPU time.

@raphCode
Copy link
Contributor

raphCode commented Apr 5, 2022

I bet this is related to #1029 where whole zellij client processes stay running...

Ok, I was wrong. This actually seems to be an older regression.

bisect log
git bisect start
# good: [2e0e22cdb4e5c667924758f7b7126115c9fab28f] chore(release): v0.14.0
git bisect good 2e0e22cdb4e5c667924758f7b7126115c9fab28f
# bad: [59a9ba08e4d69b0a626dd165fc3b966028bc195a] chore(release): v0.25.0
git bisect bad 59a9ba08e4d69b0a626dd165fc3b966028bc195a
# good: [70acfe74f284f50f69554391050e74a546140478] docs(changelog): wide char midline fix
git bisect good 70acfe74f284f50f69554391050e74a546140478
# bad: [0ed8f06b9fef7a025bfc1750729350742e227243] Prevent a zellij session from attaching to itself. (#911)
git bisect bad 0ed8f06b9fef7a025bfc1750729350742e227243
# bad: [050d6aa9821c2a06a6a8494280ece902de88865d] docs(changelog): add e2e instructions for darwin
git bisect bad 050d6aa9821c2a06a6a8494280ece902de88865d
# bad: [55c5b640ed80b025d65787e4c0fd83e54fe99891] docs(changelog): update cwd fix
git bisect bad 55c5b640ed80b025d65787e4c0fd83e54fe99891
# good: [13f3e747e41ce42daca95aeb67678f0dc7834c0a] docs(changelog): fix unused import on darwin
git bisect good 13f3e747e41ce42daca95aeb67678f0dc7834c0a
# bad: [a14a2f6fb06358f83ed26ed8ef03faab3084dd84] fix(unix): forkpty => openpty (#830)
git bisect bad a14a2f6fb06358f83ed26ed8ef03faab3084dd84
# good: [5e720b02a9eab86cbe732e6a6d400e305a63bfb5] fix(docs): Fix a typo and some grammatical errors in bug_report.md (#826)
git bisect good 5e720b02a9eab86cbe732e6a6d400e305a63bfb5
# good: [35c566f15c55724efc0004a787f6b6ccec8fc7c5] add: `rust-version` (msrv) field to `Cargo.toml` (#828)
git bisect good 35c566f15c55724efc0004a787f6b6ccec8fc7c5
# good: [043a3cf388bbe041d9900a431ae0d718cfc6ac92] docs(changelog): add `rust-version` to `Cargo.toml`
git bisect good 043a3cf388bbe041d9900a431ae0d718cfc6ac92
# first bad commit: [a14a2f6fb06358f83ed26ed8ef03faab3084dd84] fix(unix): forkpty => openpty (#830)

first bad commit: a14a2f6

@imsnif
Copy link
Member

imsnif commented Apr 6, 2022

Hey @raphCode - this was very helpful, thanks!
Since I still can't reproduce this, I wonder if you'll be willing to help debug this further?
If you re-add this line, does it still happen? a14a2f6#diff-740f8b11c48ca7376c998f2382492e722f0c3574959580961c99a6b200aca8caL328

@raphCode
Copy link
Contributor

raphCode commented Apr 13, 2022

With the waitpid call zellij hangs upon closing a pane.
I actually expected this to happen, since the process does not exit in the first place, but now we wait for that to happen.

Here is the diff so you can check if I did it correctly:
diff --git a/zellij-server/src/os_input_output.rs b/zellij-server/src/os_input_output.rs
index f0b49c68..7659a5fd 100644
--- a/zellij-server/src/os_input_output.rs
+++ b/zellij-server/src/os_input_output.rs
@@ -20,6 +20,7 @@ use sysinfo::{ProcessExt, ProcessRefreshKind, System, SystemExt};
 use nix::pty::{openpty, OpenptyResult, Winsize};
 use nix::sys::signal::{kill, Signal};
 use nix::sys::termios;
+use nix::sys::wait::waitpid;

 use nix::unistd;
 use signal_hook::consts::*;
@@ -301,6 +302,7 @@ impl ServerOsApi for ServerOsInputOutput {
     }
     fn kill(&self, pid: Pid) -> Result<(), nix::Error> {
         let _ = kill(pid, Some(Signal::SIGTERM));
+        waitpid(pid, None).unwrap();
         Ok(())
     }
     fn force_kill(&self, pid: Pid) -> Result<(), nix::Error> {

Turns out, bash seems to ignore SIGTERM entirely. This is indicated in the manpage:

SIGNALS
When bash is interactive, in the absence of any traps, it ignores SIGTERM (so that kill 0 does not kill an interac‐
tive shell)

I think sending SIGHUP would be a better choice, first, because we literally "hang up" the pty connection which is exactly the intention behind this signal.
Second, because bash distributes the signal to all its jobs:

The shell exits by default upon receipt of a SIGHUP. Before exiting, an interactive shell resends the SIGHUP to
all jobs, running or stopped. Stopped jobs are sent SIGCONT to ensure that they receive the SIGHUP. To prevent
the shell from sending the signal to a particular job, it should be removed from the jobs table with the disown
builtin (see SHELL BUILTIN COMMANDS below) or marked to not receive SIGHUP using disown -h.

Also, with disown or nohup it is possible to create processes that survive the shell exiting. This is (maybe?) not possible with other signals (or they are outright ignored by the shell).

I believe the regression came from removing the force_kill() call which was executed after a timeout:
a14a2f6#diff-3b917a76d4084cfc0c6717b07324ca62802b509712c7b6a9d53b5436b8435f6dL392

I am not even sure if this function is called anywhere in the new codebase? What would be its use anyway?


I'm not reproducing this.

You are using the fish shell which does not catch SIGTERM and exits upon receiving that.

raphCode added a commit to raphCode/zellij that referenced this issue Apr 13, 2022
SIGHUP correctly states the intention behind sending a signal when a
pane is closed: The controlling terminal is "hung up".
Also, SIGHUP is better suited than SIGTERM since bash ignores the
latter. This led to the zombie processes observed by some users.
Furthermore, SIGHUP has a special meaning in bash's job control, namely
re-sending the signal to all owned jobs before exiting.
@raphCode
Copy link
Contributor

Opened a PR to change to SIGHUP
#1320

@raphCode
Copy link
Contributor

My PR should have solved the problem, so we can close this.

@a-kenji a-kenji closed this as completed Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants