fix(tmux): don't mark live sessions dead on transient tmux stalls#1216
fix(tmux): don't mark live sessions dead on transient tmux stalls#1216Kevsosmooth wants to merge 2 commits into
Conversation
When the tmux server is briefly contended (e.g. another session is being torn down), agent-deck could misread the transient as "session gone" and flip live sessions to StatusError (red X) all at once, needing a manual restart. Two contributing paths, both fixed: - Session.Exists() ran `tmux has-session` with no timeout and treated any non-success as absent. Bound the probe; treat a timeout as indeterminate (assume the session still exists, a later poll resolves it). A probe that completes with a non-success status is unchanged, preserving the asheshgoplani#755 isolated-socket behavior. - The pipe reconnect loop deleted a pipe on the FIRST failed has-session probe, ignoring its own 5-retry/backoff budget. Retry early misses and only conclude the session is gone if still absent on the final attempt. Adds regression tests for both. Follow-up tracked separately: the session cache can still be overwritten by a transient partial `list-windows` result.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR adds a timeout mechanism to tmux session existence probes and implements retry logic to handle transient probe failures during reconnection. A timeout of 2 seconds bounds the ChangesSession Existence Probe Timeout and Reconnection Retry
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 7✅ Passed checks (7 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/tmux/pipemanager.go`:
- Around line 419-420: The reconnect path in PipeManager.watchPipe() currently
calls tmuxSessionExistsOnSocket(sessionName) which blocks on tmux has-session
without a timeout; change it to use the same timeout-bounded probe as
Session.Exists() by creating a context with hasSessionProbeTimeout (via
context.WithTimeout) and calling tmuxExecContext (or a context-aware wrapper) so
the has-session probe is bounded; ensure you treat a context deadline/timeout as
an indeterminate result (same semantics as Session.Exists()) before passing the
value to shouldConcludeSessionGone(), keeping the callsite in
PipeManager.watchPipe() and the probe logic in
tmuxSessionExistsOnSocket()/tmuxExecContext consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 30023a8c-58ef-42df-b783-4adc5a71deeb
📒 Files selected for processing (4)
internal/tmux/exists_timeout_test.gointernal/tmux/pipemanager.gointernal/tmux/pipemanager_reconnect_test.gointernal/tmux/tmux.go
|
Thanks @Kevsosmooth — this targets a real recurring pain (transient tmux contention flipping live sessions to error ✕ even though the agents are alive). Verified locally: internal/tmux + internal/session tests pass, golangci clean, and the reconnect/timeout regression tests are a good guard. As a first-time-contributor PR it needs a maintainer approve-workflows click before CI runs, then it can merge. Appreciated — this pairs well with the notifier-reliability work we just shipped (v1.9.43). |
watchPipe()'s reconnect loop probed existence via tmuxSessionExistsOnSocket(), which ran `tmux has-session` with no timeout. A stalled tmux server would block the reconnect goroutine before the retry/backoff and shouldConcludeSessionGone() logic could run, leaving a live session's pipe unreconnected. Bound it with hasSessionProbeTimeout and treat a timeout as indeterminate (assume the session still exists), mirroring Session.Exists(). Adds a regression test. Addresses review feedback on asheshgoplani#1216.
|
Actionable comments posted: 0 |
|
Hey @asheshgoplani — thanks for the review! I'm new to contributing here so bear with me — whenever you get a sec, mind kicking off the workflow run? The checks are stuck on the first-time-contributor approval, so CI hasn't run yet. Just want to see if it all passes and works out. No rush, happy to tweak anything. |
Summary
Under brief tmux-server contention — e.g. when one session is torn down by an external process while others are active — agent-deck can misread the transient as "session gone" and flip multiple live sessions to
error(red ✕) at the same instant, requiring a manual restart (Shift+R). The agent processes are still alive; agent-deck loses contact and mislabels them.Root cause: two existence-check paths treat any
tmux has-sessionfailure as definitive absence, with no distinction between "the server answered: no such session" and "the server was briefly unreachable/slow."Changes
1.
Session.Exists()(internal/tmux/tmux.go) rantmux has-sessionwith no timeout and returnedcmd.Run() == nil. A hung/slow probe could block a status poll, and a transient failure was treated as a dead session. The probe is now bounded by a timeout, and a timed-out probe is treated as indeterminate (assume the session still exists; a later poll resolves it). A probe that completes with a non-success status is unchanged — so the isolated-socket behavior from #755 is preserved (existing tests still pass).2. Pipe reconnect loop (
internal/tmux/pipemanager.go) deleted a session's control pipe on the first failedhas-sessionprobe, ignoring its own 5-attempt retry/backoff budget. One transiently-failing probe therefore permanently dropped a live session's pipe — and since a single torn-down session contends the server, this cascaded to neighboring sessions. The give-up decision is now factored intoshouldConcludeSessionGone(...): early misses are retried, and a session is only concluded gone if still absent on the final attempt.Both changes have new regression tests.
Not included (follow-up)
A third path is deliberately left out because fixing it correctly is riskier and deserves its own review:
RefreshSessionCachecan overwrite the shared session cache with a transient partialtmux list-windows -aresult, evicting live sessions. Telling "the server returned a short list under load" apart from "the user actually closed sessions" needs more care. Happy to do this as a separate PR.Test plan
go test ./internal/tmux/ -run 'TestSession_Exists'— new timeout test + existingtmux_socket_namehonored on spawn but ignored by status queries #755 guards greengo test ./internal/tmux/ -run TestShouldConcludeSessionGone_RetriesEarlyMisses— greengo build ./...andgo vet ./internal/tmux/cleaninternal/tmuxsuite green (one unrelated, pre-existing, environment-specific failure —TestStart_Service_SuccessPath— also fails onmainin my environment)Summary by CodeRabbit
Bug Fixes
Tests