
Zellij Automatically Exits when SSH is disconnected #1029

Closed · Tracked by #1100
buchenglei opened this issue Jan 27, 2022 · 23 comments
Labels: stability (Issues in relation to stability), suspected bug

Comments

@buchenglei

I connected to the machine using SSH and started a new instance with 'zellij -s workspace'. Every time I quit, I need to detach manually. If SSH is disconnected or the network is interrupted, I need to re-run 'zellij -s workspace', but this loses the layout created earlier. Is there a solution?

@Delamare2112

Could you try `systemd-run --scope --user zellij -s workspace`? And to reconnect, `systemd-run --scope --user zellij a workspace`? It could be that the server you are connecting to is set up to stop any processes owned by a user when that user is no longer signed in on the system. If so, I think the server can be configured differently (I don't remember how), or you could `alias zellij="systemd-run --scope --user zellij"` in your shell profile.
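For reference, a minimal sketch of the alias approach (assuming a bash-compatible shell; adjust the profile file to your setup):

```bash
# ~/.bashrc — wrap zellij in a transient user scope so systemd-logind
# does not kill it together with the SSH login session
alias zellij="systemd-run --scope --user zellij"

# after that, the usual commands work unchanged:
#   zellij -s workspace   # start a named session
#   zellij a workspace    # reattach to it
```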

@raphCode
Contributor

raphCode commented Mar 3, 2022

I never had problems with tmux sessions disappearing after an ssh disconnect; in fact, on my test machine tmux sessions survive every time. Zellij should also reach this state.

Curiously, the zellij session vanishes for me only in some situations:
Establish the ssh session, start zellij, then...

  • disconnect ssh via <Enter>~. escape sequence
    ✔️ session survives
  • disconnect network, let ssh session timeout
    ❌ session crashes
  • detach from session, re-attach, disconnect network and let ssh time out
    ❌ session crashes
  • detach from session, disconnect network and let ssh time out
    ✔️ session survives

  • the systemd-run trick does not change any of the above
  • starting this way: zellij 2> zellij.log and letting ssh time out:
    ✔️ session survives but zellij.log contains:
    thread panicked while processing panic. aborting.

When the session crashes upon ssh disconnect, I have not found a way to get the crash message.
See ServerAliveInterval and ServerAliveCountMax ssh options to speed up the timeout tests.
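For instance, a client-side config snippet to make the timeout tests fast (host name and values are placeholders):

```
# ~/.ssh/config
Host testserver
    ServerAliveInterval 5    # probe the server every 5 seconds
    ServerAliveCountMax 2    # disconnect after 2 unanswered probes (~10 s total)
```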

@a-kenji added the suspected bug and stability labels on Mar 4, 2022
@imsnif
Member

imsnif commented Mar 25, 2022

Hey @raphCode - I'm having a hard time reproducing this. Let's assume I don't have access to a server - any ideas how I can simulate a timeout? Closest thing I have is a docker container.

@raphCode
Contributor

I think it should be possible to spin up a docker container with sshd running and connect there.
I bet there is also a way to disconnect the network of a running docker container.

In fact, I guess these steps can be performed on any container infrastructure, not just docker.
Maybe we can add a regression test for this issue to the CI as well.
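A rough sketch of that idea (the image name is a placeholder; any image that runs sshd should do):

```bash
# start a throwaway container with sshd reachable on host port 2222
docker run -d --name zellij-ssh-test -p 2222:22 some-sshd-image   # hypothetical image

# connect and start zellij inside the container
ssh -p 2222 user@localhost

# in another terminal: detach the container from its network to simulate
# the "network interrupted" case, then let the ssh client time out
docker network disconnect bridge zellij-ssh-test

# afterwards, check whether the zellij server survived
docker exec zellij-ssh-test pgrep -a zellij
```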

@raphCode
Contributor

I tried to reproduce this locally but couldn't get it to work. My attempt was to connect to the local sshd tunneled through socat:
`socat tcp:localhost:22 tcp-listen:2222`, then pause socat (Ctrl-Z in the shell) until ssh times out.
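For completeness, a scripted version of that attempt (a listener-first variant of the socat command above; `kill -STOP` has the same effect as the Ctrl-Z suspend):

```bash
# single-shot tunnel to the local sshd
socat tcp-listen:2222,reuseaddr tcp:localhost:22 &
SOCAT_PID=$!

# connect through the tunnel and start zellij in this session
ssh -o ServerAliveInterval=5 -o ServerAliveCountMax=2 -p 2222 localhost

# from another terminal: freeze the tunnel until the ssh client times out
kill -STOP "$SOCAT_PID"
```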

I tried again with my server: I found that the crashes happen randomly in every case, so my analysis above is not correct; some cases just seem more likely to crash than others.
I could even observe some coredumps in journalctl ("trap invalid opcode"); these seem to be mostly from the zellij client.

It kinda reminded me of #882, since that is the only random-looking crash I am aware of. Indeed, running with only one CPU does increase the chance of crashing upon ssh disconnect.

I tried to compile my own versions to test whether my merged PR fixes the problem, but my self-compiled versions never crashed, even versions prior to the PR! I guess this might be due to the debug builds altering the execution timings slightly, so the race condition is not triggered.

Pinging @tlinford as well since he commented on my fix PR. Does the above make any sense / Can #882 really be the cause of this issue as well?

Tell me if I should try some release builds tomorrow to make sure my PR solves this issue.

@raphCode
Contributor

Wrote a small test loop to get some numbers:
Basically, connect to the server, attach to existing zellij session or start new one, exit the ssh connection and check if a zellij process is still alive.

```bash
#!/bin/bash

# assumption: no other zellij processes running besides those created by this script
zellij_cmd="zellij"
host="<insert host here>"
iterations=200
crash_count=0

for (( i = 1; i <= iterations; i++ ))
do
    # open an ssh session, attach to (or start) zellij, then after 1s
    # send the <Enter>~. escape sequence to kill the ssh connection
    { sleep 1; echo ~.; } | ssh "$host" -tt "$zellij_cmd attach || $zellij_cmd"
    # assumption: 'zellij' is in the program name
    ssh "$host" "pgrep [z]ellij" || ((crash_count++))
done

echo "Crashed $crash_count/$iterations times."
```

Results on my server:

  • v0.25.0 from Arch repos, both CPU cores active: 70/402 crashes
  • v0.25.0 from Arch repos, one CPU core active: 402/402 crashes
  • v0.25.0 release build with custom --data-dir, one CPU core active: 11/401 crashes
  • v0.25.0 debug build with custom --data-dir, one CPU core active: 0/200 crashes (added and increased sleep delays because the debug builds run pretty slowly)

With that 100% crash rate I am certain this is #882, which in turn was fixed by #1051.

I am not sure if I did the release builds correctly; I expected the same results as with the system package zellij.
The binary sizes differ too; after stripping, my build is even smaller than the official release.
I ran `cargo make build-release` and set $zellij_cmd in the script to `.../target/release/zellij --data-dir .../target/dev-data`

I also failed to get a working release build at 79421fb, to see if the issue is fixed by /pull/1051. Executing the binary yields this panic immediately:
```
Error occurred in server:

 × Thread 'wasm' panicked.
 ├─▶ Originating Thread(s)
 │     1. ipc_server: NewClient
 │     2. pty_thread: NewTab
 │     3. screen_thread: NewTab
 │     4. plugin_thread: Update
 │
 ├─▶ At zellij-server/src/wasm_vm.rs:145:42
 ╰─▶ called `Result::unwrap()` on an `Err` value: RuntimeError { source: Trap(UnreachableCodeReached), wasm_trace: [FrameInfo { module_name: "<module>", func_index: 154, function_name: None,
     func_start: SourceLoc(91129), instr: SourceLoc(94366) }, FrameInfo { module_name: "<module>", func_index: 86, function_name: None, func_start: SourceLoc(50562), instr: SourceLoc(50791) },
     FrameInfo { module_name: "<module>", func_index: 68, function_name: None, func_start: SourceLoc(42221), instr: SourceLoc(42342) }, FrameInfo { module_name: "<module>", func_index: 343,
     function_name: None, func_start: SourceLoc(209940), instr: SourceLoc(231960) }], native_trace:    0: <unknown>
        1: <unknown>
        2: <unknown>
        3: <unknown>
        4: <unknown>
        5: <unknown>
        6: <unknown>
      }
 help: If you are seeing this message, it means that something went wrong.
       Please report this error to the github issue.
       (https://github.com/zellij-org/zellij/issues)

       Also, if you want to see the backtrace, you can set the `RUST_BACKTRACE` environment variable to `1`.
```


@a-kenji
Contributor

a-kenji commented Mar 29, 2022

What happens if you manually set the data dir again, or delete your data dir? We only update plugins, we don't downgrade them.

@tlinford
Contributor

I also failed to get a working release build at 79421fb, to see if the issue is fixed by /pull/1051. Executing the binary yields this panic immediately:

This happens because the plugins that zellij installs on run are taken from the assets dir, and those are currently not up to date. You can get around this by either manually replacing the plugins in assets with ones from ./target/e2e-data (./target/dev-data would also be ok, but they aren't built with --release so are much slower) and recompiling, or by manually placing them on your server and running with --data-dir.

@raphCode
Contributor

I got it to run, but I can't test it with the script, since the client process seems to stay alive upon ssh disconnect. This means a bunch of `zellij attach` client processes pile up and stay running.
This is weird; I expected them to get killed upon SIGHUP or when ssh closes the pty (does this even happen?).
This happens on v0.26.1 and the main branch; v0.25.0 is fine.
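A quick way to spot the lingerers after a disconnect (a sketch, reusing $host from the script above):

```bash
# lists surviving zellij processes with their full command lines;
# lingering clients show up as 'zellij attach' entries
ssh "$host" 'pgrep -a zellij'
```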

@a-kenji
Contributor

a-kenji commented Mar 30, 2022

Do they get killed with zellij ka?

@colemickens

I'm also not seeing crashes, but I am also seeing attaches piling up. `zellij ka` doesn't affect the ones already piled up or prevent more from piling up.

Also, while testing this out I keep randomly seeing the following, not necessarily even when running the repro script:

  ╰─▶ called `Result::unwrap()` on an `Err` value: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })

@colemickens

When I pointed that test script at localhost, the first three iterations "crashed", then I got one crash with that same buffer error, and then a few started working and attaches started piling up. ^C.

@raphCode
Contributor

Do they get killed with zellij ka?

No, only the server process is killed.
The client processes linger around no matter what. They do not even react to SIGHUP, SIGINT, or SIGTERM.

Would identifying the commit that introduced this behavior via git bisect help?

@a-kenji
Contributor

a-kenji commented Apr 1, 2022

Would identifying the commit that introduced this behavior via git bisect help?

I do think so!

@raphCode
Contributor

raphCode commented Apr 4, 2022

```
git bisect start
# good: [59a9ba08e4d69b0a626dd165fc3b966028bc195a] chore(release): v0.25.0
git bisect good 59a9ba08e4d69b0a626dd165fc3b966028bc195a
# bad: [9c7d13984f498c9d6c712098bd27eb1ed211dbb9] chore(release): v0.26.1
git bisect bad 9c7d13984f498c9d6c712098bd27eb1ed211dbb9
# good: [e0685f65481188ba3f632be6036895984108d298] add(nix): add binary cache `zellij` (#1157)
git bisect good e0685f65481188ba3f632be6036895984108d298
# good: [7de77536abc2e8e2f2c3a2a3ec97dff081008258] refactor(tab): simplify mouse hold and release (#1185)
git bisect good 7de77536abc2e8e2f2c3a2a3ec97dff081008258
# good: [93642b08bf7a7e20caec5f653227dc9d35839e46] chore(release): v0.26.0
git bisect good 93642b08bf7a7e20caec5f653227dc9d35839e46
# bad: [54b0859e401acc4c0f78aa55b53c79f852b07632] fix(nix): fix `makeDesktopItem` (#1215)
git bisect bad 54b0859e401acc4c0f78aa55b53c79f852b07632
# bad: [9f9c16d60b4a07d2a7537c5eca3a3533cf391f78] docs(changelog): add error reporting system
git bisect bad 9f9c16d60b4a07d2a7537c5eca3a3533cf391f78
# good: [29d1ccfdfe06fdf9f355b8c934c37b54cd579aeb] fix: `.envrc`
git bisect good 29d1ccfdfe06fdf9f355b8c934c37b54cd579aeb
# bad: [0b74604a9f069086aa491f46d0f4a3f34520d153] feat: improve error reporting system (#1038)
git bisect bad 0b74604a9f069086aa491f46d0f4a3f34520d153
# first bad commit: [0b74604a9f069086aa491f46d0f4a3f34520d153] feat: improve error reporting system (#1038)
```

first bad commit: 0b74604
A logging change causes the app to hang upon exit? This is crazy :D

My first bisect produced nonsense results, then I tried again but deleted the data-dir directory prior to each run. Maybe this is a hint?

@a-kenji
Contributor

a-kenji commented Apr 4, 2022

My first bisect produced nonsense results, then I tried again but deleted the data-dir directory prior to each run. Maybe this is a hint?

This should happen if you go in reverse. If you go from a low release number to a high one, it can happen on non-release commits. If you build with `cargo make build` and clear the data-dir, then it should work every time.

@raphCode
Contributor

raphCode commented Apr 4, 2022

Yes, I did
`killall -9 zellij; rm -r target/dev-data/; cargo make build build-dev-data-dir`
then I connected through ssh, killed the ssh session, and watched for zombie client processes.
zombies found -> git bisect bad

@a-kenji
Contributor

a-kenji commented Apr 4, 2022

Awesome, thanks for the bisect! This gives us a good start.

@roland-5

roland-5 commented Apr 13, 2022

I tried version 0.28.1 now: on the remote machine I created a normal zellij session, started htop, detached, logged out of the machine, ssh'd in again, and the session was still there with htop in it. In 0.27 and previous versions this was a big problem for me (especially since I never remembered Delamare2112's solution :P).

@raphCode
Contributor

Great to hear it works now!

In 0.27 and previous versions this was a big problem for me

Actually, the fix was already merged into 0.27.0; are you sure you experienced the problem with that version?


The issue with client processes lingering around after an ssh disconnect still persists; should we keep it in this issue or open a new one?

@roland-5

Great to hear it works now!

In 0.27 and previous versions this was a big problem for me

Actually, the fix was already merged into 0.27.0; are you sure you experienced the problem with that version?

I wouldn't swear to whether it was version 0.27 or an older version, though. :P

The issue with client processes lingering around after an ssh disconnect still persists; should we keep it in this issue or open a new one?

I think there should be a new issue for that, and this one should be closed.

@a-kenji
Contributor

a-kenji commented Apr 14, 2022

Thanks for reaching out!

@palto42

palto42 commented Jan 16, 2023

... It could be that the server you are connecting to is set up to stop any processes owned by a user when that user is no longer signed in on the system. If so, I think the server can be configured differently (I don't remember how) ...

@Delamare2112 Thanks for this hint! I found that it is related to a change in systemd v230. Since this version, the default setting of KillUserProcesses= in /etc/systemd/logind.conf changed from "no" to "yes", causing all user processes to be killed after ssh disconnect. See "Systemd defaults KillUserProcesses to 'yes' in logind.conf with v230".

Changing this to KillUserProcesses=no fixed the problem for me.
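For anyone hitting the same thing, the relevant snippet (takes effect after restarting systemd-logind or rebooting):

```ini
# /etc/systemd/logind.conf
[Login]
KillUserProcesses=no
```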
