Chat & Shell crash with SIGKILL on first user message — "Starting async generator loop" then process dies (Linux, all versions 1.28.0 → 1.30.0) #707

@fg59-flo

Summary

On a clean Linux deployment (Ubuntu 24.04 LTS, Node 22.22.2), the CloudCLI Node process is killed with SIGKILL ~1 second after the user sends any Chat message or starts a Shell PTY session. The UI hangs (no response visible) and systemd respawns the service in a loop.

This appears to be the same root issue as #496 ("Claude Code process exited with code 1, but cli ok") and #486 (closed). Reproduced on every version tested: 1.28.0 (oldest available on npm), 1.29.5, and 1.30.0 (latest).

Environment

| Item | Value |
| --- | --- |
| OS | Ubuntu 24.04.3 LTS |
| Node | v22.22.2 (via nvm) |
| @cloudcli-ai/cloudcli versions tested | 1.28.0, 1.29.5, 1.30.0 (all failing) |
| @anthropic-ai/claude-agent-sdk (embedded) | 0.2.119 (same in all 3 versions) |
| claude binary | 2.1.119 (Claude Code, in $PATH) |
| Auth | OAuth Claude Max ×20 plan via ~/.claude/.credentials.json (not API key) |
| systemd unit | cloudcli@flo.service (Type=simple, User=flo, Restart=always) |

Reproduction

  1. Start cloudcli (any version 1.28.0+) on Linux with valid Claude OAuth credentials in ~/.claude/.credentials.json.
  2. Open the Web UI, log in, select a project, click New Session in Chat mode.
  3. Send any message (even a single word like ping).
  4. Observe: no response appears, the UI hangs.
  5. Server-side log:
[DEBUG] User message: ping
📁 Project: /home/flo
🔄 Session: New
Starting async generator loop for session: NEW
                                                  ← ~1 second later ↓
systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL
systemd[1]: cloudcli@flo.service: Failed with result 'signal'.

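The same SIGKILL happens in Shell mode once the user confirms "Yes, I trust this folder", after only an instant of valid Claude UI rendering. In the Shell-mode log under Logs, the PTY exits with code 1 about 1 s after the confirmation, and systemd reports the SIGKILL about 14 s after that. The service then restarts, the UI reconnects to the new instance and shows the same trust-folder prompt, which looks like a "loop" to the user.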
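For triaging journals like the one above, the `code=…, status=…` field can be decoded mechanically. The following is a minimal sketch; `parseMainExit` is a hypothetical helper name, not part of cloudcli or systemd. It treats `code=killed` and `code=dumped` as deaths by signal, which is how systemd labels them as far as I know.

```javascript
// Hypothetical helper: classify a systemd "Main process exited" journal line.
// Returns null when the line is not an exit report.
function parseMainExit(line) {
  const m = /Main process exited, code=(\w+), status=(\d+)\/(\S+)/.exec(line);
  if (!m) return null;
  const [, code, status, name] = m;
  return {
    code,                   // "exited", "killed", "dumped", ...
    status: Number(status), // exit code, or signal number when killed/dumped
    name,                   // e.g. "KILL", "ABRT", "FAILURE"
    bySignal: code === "killed" || code === "dumped",
  };
}

const line =
  "systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL";
console.log(parseMainExit(line));
```

For the line in this report, the helper yields signal number 9 (`KILL`) with `bySignal` set, so the process did not exit on its own with an error code.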

What is NOT the cause (ruled out by testing)

I went through every plausible angle. None explains the crash:

  • claude binary: claude -p "ping" over SSH returns pong, exit 0, in ~1.5s.
  • @anthropic-ai/claude-agent-sdk — running it standalone outside cloudcli works perfectly:
    // Standalone test (replaces cloudcli's queryClaudeSDK)
    import { query } from "@anthropic-ai/claude-agent-sdk";
    for await (const m of query({ prompt: "ping", options: { model: "opus" } })) {
      if (m.type === "result") console.log("RESULT:", m.result);
    }
    // → RESULT: pong  (in ~6.9s, exit 0, hooks fire correctly, no crash)
  • claude SessionStart hooks (claude-mem in our case) — disabled them, crash identical.
  • systemd / cgroup / OOM:
    • MemoryPeak = 82 MB
    • MemoryMax = infinity
    • WatchdogUSec = 0
    • kernel dmesg has no OOM event
    • coredumpctl list is empty
    • No entries in journalctl -u systemd-oomd
  • systemd itself — running cloudcli in foreground via nohup env PORT=3001 cloudcli & (no systemd) → process dies silently within 1s of "Starting async generator loop", same way. So the SIGKILL is reported by systemd but originates inside the Node process (or its native deps).
  • React StrictMode + double /shell WS — yes, 2 simultaneous shell WS are observed (StrictMode bit ON), but handleShellConnection in server/index.js shares a single PTY across WS via ptySessionsMap, so this is a non-issue causally.
  • CloudCLI version regression — bug exists in every release tested (1.28.0, 1.29.5, 1.30.0). Embedded SDK is the same 0.2.119 in all of them.
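The foreground check in the list above (running cloudcli without systemd) leaves no record of how the process died. One way to capture that is to launch it from a small Node wrapper and read the exit code or signal that `spawnSync` reports. This is only a sketch: `runAndDescribe` is a hypothetical name, and the `cloudcli` invocation in the trailing comment is an assumption based on the `nohup env PORT=3001 cloudcli` command used in this report.

```javascript
const { spawnSync } = require("node:child_process");

// Run a command in the foreground, wait for it to end, and describe how it ended.
// `extraEnv` is merged over the current environment.
function runAndDescribe(command, args = [], extraEnv = {}) {
  const r = spawnSync(command, args, {
    stdio: "inherit",
    env: { ...process.env, ...extraEnv },
  });
  if (r.error) return `failed to start: ${r.error.message}`;
  if (r.signal) return `killed by ${r.signal}`;
  return `exited with code ${r.status}`;
}

// Demonstration with a child that sends SIGKILL to itself.
console.log(
  runAndDescribe(process.execPath, ["-e", "process.kill(process.pid, 'SIGKILL')"])
);

// Assumed usage for this issue (not executed here):
//   console.log(runAndDescribe("cloudcli", [], { PORT: "3001" }));
```

If the wrapper also reports `killed by SIGKILL` for cloudcli, that would confirm the signal without systemd in the picture, although it still would not say who sent it.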

What might be the cause

I can't conclusively isolate it without instrumenting the SDK or running under a debugger, but the strongest remaining hypothesis is:

The way queryClaudeSDK() (in server/claude-sdk.js) invokes the SDK from inside a WebSocket 'message' handler triggers a fatal abort in a native dependency (likely related to stdio piping, signals, or process forking) that does not happen when the SDK is invoked from a plain Node script.

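Since the standalone SDK call works and the cloudcli call dies silently with no stack trace, the death is most likely triggered in a native binding (e.g. node-pty, better-sqlite3, or something the SDK loads transitively) rather than in JavaScript. One caveat: a plain abort() raises SIGABRT, which systemd would report as status=6/ABRT, so status=9/KILL suggests that something explicitly sends SIGKILL to the server's PID or process group. A minimal repro that wraps the same query() call inside a WebSocketServer 'message' handler would help.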
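As a starting point for that minimal repro, the handler shape could look like the sketch below. It is not cloudcli's actual code: `makeMessageHandler` and `send` are hypothetical names. The query function is passed in so the handler can be exercised with a fake generator first and with the real SDK `query` afterwards. The wiring in the trailing comment assumes the `ws` package and an arbitrary port.

```javascript
// Build a 'message' handler that consumes an SDK-style async generator
// from inside a WebSocket callback, mirroring the suspected failing shape.
function makeMessageHandler(queryFn, send) {
  return async function onMessage(raw) {
    const prompt = String(raw); // ws delivers a Buffer; convert it to text
    try {
      for await (const m of queryFn({ prompt, options: { model: "opus" } })) {
        if (m.type === "result") send(JSON.stringify({ result: m.result }));
      }
    } catch (err) {
      // A JavaScript-level failure would surface here; a SIGKILL would not.
      send(JSON.stringify({ error: String(err) }));
    }
  };
}

// Assumed wiring (requires `ws` and the SDK to be installed; not executed here):
//   import { WebSocketServer } from "ws";
//   import { query } from "@anthropic-ai/claude-agent-sdk";
//   const wss = new WebSocketServer({ port: 8080 });
//   wss.on("connection", (ws) =>
//     ws.on("message", makeMessageHandler(query, (s) => ws.send(s))));
```

If this repro survives a `ping` while cloudcli does not, the difference would have to lie in what cloudcli does around the call (PTY handling, session bookkeeping, process management), not in the WebSocket context itself.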

Workaround in use

Until this is fixed, we use claude directly via SSH/RustDesk on the VM. The web UI is unusable for both Chat and Shell.

Logs

1. Crash signature — Chat mode (cloudcli 1.30.0, repeats identically on 1.29.5 and 1.28.0)
Apr 26 16:35:33 claude-flo bash[1039599]: [DEBUG] User message: Réponds juste ping
Apr 26 16:35:33 claude-flo bash[1039599]: 📁 Project: /home/flo
Apr 26 16:35:33 claude-flo bash[1039599]: 🔄 Session: New
Apr 26 16:35:33 claude-flo bash[1039599]: Starting async generator loop for session: NEW
Apr 26 16:35:34 claude-flo systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL
Apr 26 16:35:35 claude-flo systemd[1]: cloudcli@flo.service: Failed with result 'signal'.
Apr 26 16:35:35 claude-flo systemd[1]: cloudcli@flo.service: Consumed 3.396s CPU time.
Apr 26 16:35:40 claude-flo systemd[1]: cloudcli@flo.service: Scheduled restart job, restart counter is at 1.
2. Crash signature — Shell mode (after "Yes I trust this folder")
Apr 26 15:53:35 bash[400358]: [INFO] Using Claude Agents SDK for Claude integration
Apr 26 15:53:35 bash[400358]: 📨 Shell message received: init
Apr 26 15:53:35 bash[400358]: [INFO] Starting shell in: /home/flo
Apr 26 15:53:35 bash[400358]: 🔧 Executing shell command: claude
Apr 26 15:53:35 bash[400358]: 📐 Using terminal dimensions: 149 x 43
Apr 26 15:53:35 bash[400358]: 🟢 Shell process started with PTY, PID: …
Apr 26 15:54:00 bash[400358]: 📨 Shell message received: input    ← user confirms Yes
Apr 26 15:54:00 bash[400358]: 📨 Shell message received: input
Apr 26 15:54:01 bash[400358]: 🔚 Shell process exited with code: 1 signal: 0
Apr 26 15:54:15 systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL
3. SDK works perfectly in standalone (same Node, same SDK, same claude binary, same project dir)
# Node script using the SDK directly (skips cloudcli's WebSocket layer)
$ cat > /tmp/test-sdk.mjs << 'EOF'
import { query } from "/path/to/@anthropic-ai/claude-agent-sdk/sdk.mjs";
console.log("Calling query...");
for await (const m of query({ prompt: "ping", options: { model: "opus" } })) {
  console.log("MSG:", JSON.stringify(m)); // print every streamed message, including the final result
}
console.log("DONE");
EOF
$ node /tmp/test-sdk.mjs
Calling query...
MSG: {"type":"system","subtype":"hook_started","hook_name":"SessionStart:startup", ...}
MSG: {"type":"system","subtype":"init","cwd":"/home/flo","session_id":"...","tools":[...]}
MSG: {"type":"assistant","message":{"model":"claude-opus-4-7", ...,"content":[{"type":"text","text":"pong"}], ...}}
MSG: {"type":"result","subtype":"success","is_error":false,"duration_ms":6886, ...,"result":"pong","session_id":"..."}
DONE
$ echo $?
0
4. systemd service unit (irrelevant since crash also happens in foreground)
[Service]
Type=simple
User=flo
Environment=NODE_ENV=production
Environment=PORT=3001
Environment=HOST=127.0.0.1
ExecStart=/bin/bash -c 'export NVM_DIR=/home/%i/.nvm && . $NVM_DIR/nvm.sh && exec cloudcli'
Restart=always
RestartSec=5

Confirmed crash is independent of systemd: launching via nohup env PORT=3001 cloudcli > /tmp/cloudcli-fg.log 2>&1 & then sending a Chat message reproduces the same silent process death within ~1s of "Starting async generator loop". The PID disappears, no stack trace, no coredump.

5. Memory / OOM is NOT the cause
$ systemctl show cloudcli@flo | grep -E "Memory|Watchdog"
MemoryCurrent=50593792
MemoryPeak=82640896           ← 82 MB peak, way under any limit
MemoryMax=infinity
MemoryHigh=infinity
WatchdogUSec=0                ← no watchdog
ManagedOOMMemoryPressure=auto

$ sudo dmesg -T | grep -iE "oom|kill"
(empty)

$ sudo journalctl -u systemd-oomd --since "1 hour ago"
-- No entries --

$ sudo coredumpctl list --since "2h ago"
(empty)
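The memory figures above can also be checked programmatically, for example when comparing several crash cycles. The sketch below parses `systemctl show` style KEY=VALUE output and reports the peak in MB and whether any limit applies. `parseShow` and `memoryHeadroom` are hypothetical helper names, and the sample input is the output from this report with the arrow annotations removed.

```javascript
// Parse KEY=VALUE lines as printed by `systemctl show`.
function parseShow(text) {
  const props = {};
  for (const line of text.split("\n")) {
    const i = line.indexOf("=");
    if (i > 0) props[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  return props;
}

// Compare the recorded peak against MemoryMax ("infinity" means no limit).
function memoryHeadroom(props) {
  const peak = Number(props.MemoryPeak);
  const max = props.MemoryMax === "infinity" ? Infinity : Number(props.MemoryMax);
  return {
    peakMB: peak / 1e6,
    limited: Number.isFinite(max),
    headroomBytes: max - peak,
  };
}

const sample = [
  "MemoryCurrent=50593792",
  "MemoryPeak=82640896",
  "MemoryMax=infinity",
  "MemoryHigh=infinity",
  "WatchdogUSec=0",
].join("\n");

console.log(memoryHeadroom(parseShow(sample)));
```

With the values from this report, the unit has no memory limit at all, which is consistent with the empty dmesg and systemd-oomd output.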

Happy to provide more logs, diff the SDK invocation paths, or run additional diagnostics. This blocks the entire web UI for our team — we currently fall back to running claude directly via SSH/RustDesk.
