Skip to content

fix(tmux): don't mark live sessions dead on transient tmux stalls#1216

Open
Kevsosmooth wants to merge 2 commits into
asheshgoplani:mainfrom
Kevsosmooth:fix/tmux-has-session-false-positive
Open

fix(tmux): don't mark live sessions dead on transient tmux stalls#1216
Kevsosmooth wants to merge 2 commits into
asheshgoplani:mainfrom
Kevsosmooth:fix/tmux-has-session-false-positive

Conversation

@Kevsosmooth
Copy link
Copy Markdown

@Kevsosmooth Kevsosmooth commented May 28, 2026

Summary

Under brief tmux-server contention — e.g. when one session is torn down by an external process while others are active — agent-deck can misread the transient as "session gone" and flip multiple live sessions to error (red ✕) at the same instant, requiring a manual restart (Shift+R). The agent processes are still alive; agent-deck loses contact and mislabels them.

Root cause: two existence-check paths treat any tmux has-session failure as definitive absence, with no distinction between "the server answered: no such session" and "the server was briefly unreachable/slow."

Changes

1. Session.Exists() (internal/tmux/tmux.go) ran tmux has-session with no timeout and returned cmd.Run() == nil. A hung/slow probe could block a status poll, and a transient failure was treated as a dead session. The probe is now bounded by a timeout, and a timed-out probe is treated as indeterminate (assume the session still exists; a later poll resolves it). A probe that completes with a non-success status is unchanged — so the isolated-socket behavior from #755 is preserved (existing tests still pass).

2. Pipe reconnect loop (internal/tmux/pipemanager.go) deleted a session's control pipe on the first failed has-session probe, ignoring its own 5-attempt retry/backoff budget. One transiently-failing probe therefore permanently dropped a live session's pipe — and since a single torn-down session contends the server, this cascaded to neighboring sessions. The give-up decision is now factored into shouldConcludeSessionGone(...): early misses are retried, and a session is only concluded gone if still absent on the final attempt.

Both changes have new regression tests.

Not included (follow-up)

A third path is deliberately left out because fixing it correctly is riskier and deserves its own review: RefreshSessionCache can overwrite the shared session cache with a transient partial tmux list-windows -a result, evicting live sessions. Telling "the server returned a short list under load" apart from "the user actually closed sessions" needs more care. Happy to do this as a separate PR.

Test plan

  • go test ./internal/tmux/ -run 'TestSession_Exists' — new timeout test + existing tmux_socket_name honored on spawn but ignored by status queries #755 guards green
  • go test ./internal/tmux/ -run TestShouldConcludeSessionGone_RetriesEarlyMisses — green
  • go build ./... and go vet ./internal/tmux/ clean
  • Full internal/tmux suite green (one unrelated, pre-existing, environment-specific failure — TestStart_Service_SuccessPath — also fails on main in my environment)

Summary by CodeRabbit

  • Bug Fixes

    • Avoids incorrectly marking active sessions as absent when session-probe commands stall or time out.
    • Adds bounded probe timeouts so slow responses are treated as indeterminate, not definitive absence.
    • Improves reconnection retry logic to handle transient probe misses without premature session deletion.
  • Tests

    • Added tests validating probe-timeout handling and reconnection retry behavior.

Review Change Stack

When the tmux server is briefly contended (e.g. another session is being
torn down), agent-deck could misread the transient as "session gone" and
flip live sessions to StatusError (red X) all at once, needing a manual
restart.

Two contributing paths, both fixed:
- Session.Exists() ran `tmux has-session` with no timeout and treated any
  non-success as absent. Bound the probe; treat a timeout as indeterminate
  (assume the session still exists, a later poll resolves it). A probe that
  completes with a non-success status is unchanged, preserving the asheshgoplani#755
  isolated-socket behavior.
- The pipe reconnect loop deleted a pipe on the FIRST failed has-session
  probe, ignoring its own 5-retry/backoff budget. Retry early misses and
  only conclude the session is gone if still absent on the final attempt.

Adds regression tests for both. Follow-up tracked separately: the session
cache can still be overwritten by a transient partial `list-windows`
result.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4ba14856-fc59-47d6-9b5d-9be0f2f7ea81

📥 Commits

Reviewing files that changed from the base of the PR and between 43c1190 and cfc2eba.

📒 Files selected for processing (2)
  • internal/tmux/pipemanager.go
  • internal/tmux/pipemanager_reconnect_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/tmux/pipemanager.go

📝 Walkthrough

Walkthrough

This PR adds a timeout mechanism to tmux session existence probes and implements retry logic to handle transient probe failures during reconnection. A timeout of 2 seconds bounds the has-session subprocess, treating timeouts as indeterminate rather than session absence. The reconnection loop now distinguishes between transient probe failures and permanent session disappearance based on retry attempt count, allowing retries with backoff before concluding the session is gone.

Changes

Session Existence Probe Timeout and Reconnection Retry

Layer / File(s) Summary
Session existence probe with timeout handling
internal/tmux/tmux.go, internal/tmux/pipemanager.go, internal/tmux/exists_timeout_test.go, internal/tmux/pipemanager_reconnect_test.go
Package-level hasSessionProbeTimeout (default 2s) bounds the tmux has-session subprocess. Session.Exists() and tmuxSessionExistsOnSocket now run the probe via a context with this timeout; if the deadline is exceeded the probe is treated as indeterminate (returns true) instead of false. Regression tests inject a hanging tmux to validate timeout behavior.
Reconnection retry logic for transient probe failures
internal/tmux/pipemanager.go, internal/tmux/pipemanager_reconnect_test.go
Adds unexported shouldConcludeSessionGone helper to decide when a probe miss should be treated as permanent. watchPipe now probes existence, uses the helper to gate the "session gone" delete path, and on early misses logs a retry, increases/clamps backoff, and continues retrying. Unit test verifies retry vs final-miss behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 7
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title follows Conventional Commits format with 'fix:' prefix and clearly describes the main change: preventing live sessions from being marked dead due to transient tmux server stalls.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 70.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Remote_parity ✅ Passed PR changes only internal/tmux/ (Session.Exists timeout and pipe reconnect logic). No changes to TUI in internal/ui/ or cmd/agent-deck/, so remote_parity requirement doesn't apply.
Test_coverage_per_surface ✅ Passed This PR is a bug fix to internal tmux session existence checking, not a new feature. The check requires tests for "new features" across all surfaces (TUI/Web/CLI/remote), which is not applicable here.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/tmux/pipemanager.go`:
- Around line 419-420: The reconnect path in PipeManager.watchPipe() currently
calls tmuxSessionExistsOnSocket(sessionName) which blocks on tmux has-session
without a timeout; change it to use the same timeout-bounded probe as
Session.Exists() by creating a context with hasSessionProbeTimeout (via
context.WithTimeout) and calling tmuxExecContext (or a context-aware wrapper) so
the has-session probe is bounded; ensure you treat a context deadline/timeout as
an indeterminate result (same semantics as Session.Exists()) before passing the
value to shouldConcludeSessionGone(), keeping the callsite in
PipeManager.watchPipe() and the probe logic in
tmuxSessionExistsOnSocket()/tmuxExecContext consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 30023a8c-58ef-42df-b783-4adc5a71deeb

📥 Commits

Reviewing files that changed from the base of the PR and between b098f0f and 43c1190.

📒 Files selected for processing (4)
  • internal/tmux/exists_timeout_test.go
  • internal/tmux/pipemanager.go
  • internal/tmux/pipemanager_reconnect_test.go
  • internal/tmux/tmux.go

Comment thread internal/tmux/pipemanager.go
@asheshgoplani
Copy link
Copy Markdown
Owner

Thanks @Kevsosmooth — this targets a real recurring pain (transient tmux contention flipping live sessions to error ✕ even though the agents are alive). Verified locally: internal/tmux + internal/session tests pass, golangci clean, and the reconnect/timeout regression tests are a good guard. As a first-time-contributor PR it needs a maintainer approve-workflows click before CI runs, then it can merge. Appreciated — this pairs well with the notifier-reliability work we just shipped (v1.9.43).

watchPipe()'s reconnect loop probed existence via tmuxSessionExistsOnSocket(),
which ran `tmux has-session` with no timeout. A stalled tmux server would block
the reconnect goroutine before the retry/backoff and shouldConcludeSessionGone()
logic could run, leaving a live session's pipe unreconnected.

Bound it with hasSessionProbeTimeout and treat a timeout as indeterminate
(assume the session still exists), mirroring Session.Exists(). Adds a regression
test. Addresses review feedback on asheshgoplani#1216.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@Kevsosmooth
Copy link
Copy Markdown
Author

Hey @asheshgoplani — thanks for the review! I'm new to contributing here so bear with me — whenever you get a sec, mind kicking off the workflow run? The checks are stuck on the first-time-contributor approval, so CI hasn't run yet. Just want to see if it all passes and works out. No rush, happy to tweak anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants