
Zellij Automatically Exits when SSH is disconnected #1029

Closed · Tracked by #1100
buchenglei opened this issue Jan 27, 2022 · 23 comments
Labels: stability (Issues in relation to stability), suspected bug

Comments

@buchenglei

I connected to the machine using SSH and started a new instance with 'zellij -s workspace'. Every time I quit, I need to detach manually. If SSH is disconnected or the network is interrupted, I need to re-run 'zellij -s workspace', but this loses the layout created earlier. Is there a solution?

@Delamare2112

Could you try `systemd-run --scope --user zellij -s workspace`? And to reconnect, `systemd-run --scope --user zellij a workspace`? It could be that the server you are connecting to is set up to stop any processes owned by a user when that user is no longer signed in on the system. If so, I think the server can be configured differently (I don't remember how), or you could `alias zellij="systemd-run --scope --user zellij"` in your shell profile.
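For reference, a minimal sketch of the alias approach (assuming a bash-compatible shell; adjust the profile file to your setup):

```bash
# ~/.bashrc — wrap zellij in a transient user scope so systemd-logind
# does not kill it together with the SSH login session
alias zellij="systemd-run --scope --user zellij"

# after that, the usual commands work unchanged:
#   zellij -s workspace   # start a named session
#   zellij a workspace    # reattach to it
```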

@raphCode
Contributor

raphCode commented Mar 3, 2022

I never had problems with tmux sessions disappearing after an ssh disconnect; in fact, on my test machine tmux sessions survive every time. Zellij should also reach this state.

Curiously, the zellij session vanishes for me only in some situations:
Establish the ssh session, start zellij, then...

  • disconnect ssh via <Enter>~. escape sequence
    ✔️ session survives
  • disconnect network, let ssh session timeout
    ❌ session crashes
  • detach from session, re-attach, disconnect network and let ssh time out
    ❌ session crashes
  • detach from session, disconnect network and let ssh time out
    ✔️ session survives

  • the systemd-run trick does not change any of the above
  • starting this way: zellij 2> zellij.log and letting ssh time out:
    ✔️ session survives but zellij.log contains:
    thread panicked while processing panic. aborting.

When the session crashes upon ssh disconnect, I have not found a way to get the crash message.
See ServerAliveInterval and ServerAliveCountMax ssh options to speed up the timeout tests.
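For instance, a client-side config snippet to make the timeout tests fast (host name and values are placeholders):

```
# ~/.ssh/config
Host testserver
    ServerAliveInterval 5    # probe the server every 5 seconds
    ServerAliveCountMax 2    # disconnect after 2 unanswered probes (~10 s total)
```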

@a-kenji added the suspected bug and stability labels on Mar 4, 2022
@imsnif
Member

imsnif commented Mar 25, 2022

Hey @raphCode - I'm having a hard time reproducing this. Let's assume I don't have access to a server - any ideas how I can simulate a timeout? Closest thing I have is a docker container.

@raphCode
Contributor

I think it should be possible to spin up a docker container with sshd running and connect there.
I bet there is also a way to disconnect the network of a running docker container.

In fact, I guess these steps can be performed on any container infrastructure, not just docker.
Maybe we can add a regression test for this issue to the CI as well.
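A rough sketch of that idea (the image name is a placeholder; any image that runs sshd should do):

```bash
# start a throwaway container with sshd reachable on host port 2222
docker run -d --name zellij-ssh-test -p 2222:22 some-sshd-image   # hypothetical image

# connect and start zellij inside the container
ssh -p 2222 user@localhost

# in another terminal: detach the container from its network to simulate
# the "network interrupted" case, then let the ssh client time out
docker network disconnect bridge zellij-ssh-test

# afterwards, check whether the zellij server survived
docker exec zellij-ssh-test pgrep -a zellij
```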

@raphCode
Contributor

I tried to reproduce this locally but couldn't get it to work. My attempt was to connect to the local sshd tunneled through socat:
`socat tcp:localhost:22 tcp-listen:2222`, then pause socat (Ctrl-Z in the shell) until ssh times out.
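For completeness, a scripted version of that attempt (a listener-first variant of the socat command above; `kill -STOP` has the same effect as the Ctrl-Z suspend):

```bash
# single-shot tunnel to the local sshd
socat tcp-listen:2222,reuseaddr tcp:localhost:22 &
SOCAT_PID=$!

# connect through the tunnel and start zellij in this session
ssh -o ServerAliveInterval=5 -o ServerAliveCountMax=2 -p 2222 localhost

# from another terminal: freeze the tunnel until the ssh client times out
kill -STOP "$SOCAT_PID"
```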

I tried again with my server: I found that the crashes happen randomly in every case, so my analysis above is not correct; some cases just seem more likely to crash than others.
I could even observe some coredumps in journalctl ("trap invalid opcode"); these seem to be mostly from the zellij client.

It kinda reminded me of #882, since that is the only random-looking crash I am aware of. Indeed, running with only one CPU does increase the chance of crashing upon ssh disconnect.

I tried to compile my own versions to test whether my merged PR fixes the problem, but my self-compiled versions never crashed, even versions prior to the PR! I guess this might be due to the debug builds altering the execution timings slightly, so the race condition is not triggered.

Pinging @tlinford as well since he commented on my fix PR. Does the above make any sense / Can #882 really be the cause of this issue as well?

Tell me if I should try some release builds tomorrow to make sure my PR solves this issue.

@raphCode
Contributor

Wrote a small test loop to get some numbers:
Basically, connect to the server, attach to existing zellij session or start new one, exit the ssh connection and check if a zellij process is still alive.

```bash
#!/bin/bash

# assumption: no other zellij processes running besides those created by this script
zellij_cmd="zellij"
host="<insert host here>"
iterations=200
crash_count=0

for (( i = 1; i <= iterations; i++ ))
do
    # open an ssh session, attach to (or start) zellij, then after 1s
    # send the <Enter>~. escape sequence to kill the ssh connection
    { sleep 1; echo ~.; } | ssh "$host" -tt "$zellij_cmd attach || $zellij_cmd"
    # assumption: 'zellij' is in the program name
    ssh "$host" "pgrep [z]ellij" || ((crash_count++))
done

echo "Crashed $crash_count/$iterations times."
```

Results on my server:

  • v0.25.0 from Arch repos, both CPU cores active: 70/402 crashes
  • v0.25.0 from Arch repos, one CPU core active: 402/402 crashes
  • v0.25.0 release build with custom --data-dir, one CPU core active: 11/401 crashes
  • v0.25.0 debug build with custom --data-dir, one CPU core active: 0/200 crashes (added and increased sleep delays because the debug builds run pretty slowly)

With that 100% crash rate I am certain this is #882, which in turn was fixed by #1051.

I am not sure if I did the release builds correctly; I expected the same results as with the system package zellij.
The binary sizes differ too; after stripping, my build is even smaller than the official release.
I ran `cargo make build-release` and set $zellij_cmd in the script to `.../target/release/zellij --data-dir .../target/dev-data`

I also failed to get a working release build at 79421fb, to see if the issue is fixed by /pull/1051. Executing the binary yields this panic immediately:
```
Error occurred in server:

 × Thread 'wasm' panicked.
 ├─▶ Originating Thread(s)
 │     1. ipc_server: NewClient
 │     2. pty_thread: NewTab
 │     3. screen_thread: NewTab
 │     4. plugin_thread: Update
 │
 ├─▶ At zellij-server/src/wasm_vm.rs:145:42
 ╰─▶ called `Result::unwrap()` on an `Err` value: RuntimeError { source: Trap(UnreachableCodeReached), wasm_trace: [FrameInfo { module_name: "<module>", func_index: 154, function_name: None,
     func_start: SourceLoc(91129), instr: SourceLoc(94366) }, FrameInfo { module_name: "<module>", func_index: 86, function_name: None, func_start: SourceLoc(50562), instr: SourceLoc(50791) },
     FrameInfo { module_name: "<module>", func_index: 68, function_name: None, func_start: SourceLoc(42221), instr: SourceLoc(42342) }, FrameInfo { module_name: "<module>", func_index: 343,
     function_name: None, func_start: SourceLoc(209940), instr: SourceLoc(231960) }], native_trace:    0: <unknown>
        1: <unknown>
        2: <unknown>
        3: <unknown>
        4: <unknown>
        5: <unknown>
        6: <unknown>
      }
 help: If you are seeing this message, it means that something went wrong.
       Please report this error to the github issue.
       (https://github.com/zellij-org/zellij/issues)

       Also, if you want to see the backtrace, you can set the `RUST_BACKTRACE` environment variable to `1`.
```


@a-kenji
Contributor

a-kenji commented Mar 29, 2022

What happens if you manually set the data dir again, or delete your data dir? We only update plugins, we don't downgrade them.

@tlinford
Contributor

I also failed to get a working release build at 79421fb, to see if the issue is fixed by /pull/1051. Executing the binary yields this panic immediately:

This happens because the plugins that zellij installs on run are taken from the assets dir, and those are currently not up to date. You can get around this by either manually replacing the plugins in assets with ones from ./target/e2e-data (./target/dev-data would also be ok, but they aren't built with --release so are much slower) and recompiling, or by manually placing them on your server and running with --data-dir.

@raphCode
Contributor

I got it to run, but I can't test it with the script, since the client process seems to stay alive upon ssh disconnect. This means a bunch of `zellij attach` client processes pile up and stay running.
This is weird; I expected them to get killed upon SIGHUP or when ssh closes the pty (does this even happen?).
This happens on v0.26.1 and the main branch; v0.25.0 is fine.
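A quick way to spot the lingerers after a disconnect (a sketch, reusing $host from the script above):

```bash
# lists surviving zellij processes with their full command lines;
# lingering clients show up as 'zellij attach' entries
ssh "$host" 'pgrep -a zellij'
```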

@a-kenji
Contributor

a-kenji commented Mar 30, 2022

Do they get killed with zellij ka?

@colemickens

I'm also not seeing crashes, but I am also seeing attaches piling up. `zellij ka` doesn't affect the ones already piled up or prevent more from piling up.

Also, while testing this out I keep randomly seeing the following, not necessarily even when running the repro script:

  ╰─▶ called `Result::unwrap()` on an `Err` value: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })

@colemickens

When I pointed that test script at localhost, the first three iterations "crashed", then I got one crash with that same buffer error, and then a few started working and attaches started piling up. ^C.

@raphCode
Contributor

Do they get killed with zellij ka?

No, only the server process is killed.
The client processes linger around no matter what. They do not even react to SIGHUP, SIGINT, or SIGTERM.

Would identifying the commit that introduced this behavior via git bisect help?

@a-kenji
Contributor

a-kenji commented Apr 1, 2022

Would identifying the commit that introduced this behavior via git bisect help?

I do think so!

@raphCode
Contributor

raphCode commented Apr 4, 2022

```
git bisect start
# good: [59a9ba08e4d69b0a626dd165fc3b966028bc195a] chore(release): v0.25.0
git bisect good 59a9ba08e4d69b0a626dd165fc3b966028bc195a
# bad: [9c7d13984f498c9d6c712098bd27eb1ed211dbb9] chore(release): v0.26.1
git bisect bad 9c7d13984f498c9d6c712098bd27eb1ed211dbb9
# good: [e0685f65481188ba3f632be6036895984108d298] add(nix): add binary cache `zellij` (#1157)
git bisect good e0685f65481188ba3f632be6036895984108d298
# good: [7de77536abc2e8e2f2c3a2a3ec97dff081008258] refactor(tab): simplify mouse hold and release (#1185)
git bisect good 7de77536abc2e8e2f2c3a2a3ec97dff081008258
# good: [93642b08bf7a7e20caec5f653227dc9d35839e46] chore(release): v0.26.0
git bisect good 93642b08bf7a7e20caec5f653227dc9d35839e46
# bad: [54b0859e401acc4c0f78aa55b53c79f852b07632] fix(nix): fix `makeDesktopItem` (#1215)
git bisect bad 54b0859e401acc4c0f78aa55b53c79f852b07632
# bad: [9f9c16d60b4a07d2a7537c5eca3a3533cf391f78] docs(changelog): add error reporting system
git bisect bad 9f9c16d60b4a07d2a7537c5eca3a3533cf391f78
# good: [29d1ccfdfe06fdf9f355b8c934c37b54cd579aeb] fix: `.envrc`
git bisect good 29d1ccfdfe06fdf9f355b8c934c37b54cd579aeb
# bad: [0b74604a9f069086aa491f46d0f4a3f34520d153] feat: improve error reporting system (#1038)
git bisect bad 0b74604a9f069086aa491f46d0f4a3f34520d153
# first bad commit: [0b74604a9f069086aa491f46d0f4a3f34520d153] feat: improve error reporting system (#1038)
```

first bad commit: 0b74604
A logging change causes the app to hang upon exit? This is crazy :D

My first bisect produced nonsense results, then I tried again but deleted the data-dir directory prior to each run. Maybe this is a hint?

@a-kenji
Contributor

a-kenji commented Apr 4, 2022

My first bisect produced nonsense results, then I tried again but deleted the data-dir directory prior to each run. Maybe this is a hint?

This should happen if you go in reverse. If you go from a low release number to a high one, it can happen on non-release commits. If you build with `cargo make build` and clear the data-dir, then it should work every time.

@raphCode
Contributor

raphCode commented Apr 4, 2022

Yes, I did
`killall -9 zellij; rm -r target/dev-data/; cargo make build build-dev-data-dir`
then I connected through ssh, killed the ssh session, and watched for zombie client processes.
zombies found -> git bisect bad

@a-kenji
Contributor

a-kenji commented Apr 4, 2022

Awesome, thanks for the bisect! This gives us a good start.

@roland-5

roland-5 commented Apr 13, 2022

I tried version 0.28.1 now: on the remote machine I created a normal zellij session, started htop, detached, logged out of the machine, ssh'd in again, and the session was still there with htop in it. In 0.27 and previous versions this was a big problem for me (especially since I never remembered Delamare2112's solution :P).

@raphCode
Contributor

Great to hear it works now!

In 0.27 and previous versions this was a big problem for me

Actually, the fix was already merged into 0.27.0; are you sure you experienced the problem with that version?


The issue with client processes lingering around after an ssh disconnect still persists; should we keep it in this issue or open a new one?

@roland-5

Great to hear it works now!

In 0.27 and previous versions this was a big problem for me

Actually, the fix was already merged into 0.27.0; are you sure you experienced the problem with that version?

I wouldn't swear to whether it was version 0.27 or an older version, though. :P

The issue with client processes lingering around after an ssh disconnect still persists; should we keep it in this issue or open a new one?

I think there should be a new issue for that, and this one should be closed.

@a-kenji
Contributor

a-kenji commented Apr 14, 2022

Thanks for reaching out!

@palto42

palto42 commented Jan 16, 2023

... It could be that the server you are connecting to is set up to stop any processes owned by a user when that user is no longer signed in on the system. If so, I think the server can be configured differently (I don't remember how) ...

@Delamare2112 Thanks for this hint! I found that it is related to a change in systemd v230. Since this version, the default setting of KillUserProcesses= in /etc/systemd/logind.conf changed from "no" to "yes", causing all user processes to be killed after ssh disconnect. See "Systemd defaults KillUserProcesses to 'yes' in logind.conf with v230".

Changing this to KillUserProcesses=no fixed the problem for me.
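For anyone hitting the same thing, the relevant snippet (takes effect after restarting systemd-logind or rebooting):

```ini
# /etc/systemd/logind.conf
[Login]
KillUserProcesses=no
```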
