feat(sshx): persistent, self-healing pooled SSH connections for the fleet #16

Merged
mrdhira merged 1 commit into main from feat/ssh-connection-pool
Jul 1, 2026

@mrdhira mrdhira commented Jul 1, 2026

Why

kay fleet dialed a fresh TCP+SSH connection to every host on every tick
(default 5s) and closed it. For N hosts that is N full key-exchange +
public-key-auth handshakes every 5s — wasting CPU (curve25519 + signature
verify), network round-trips, and, most visibly on the server, connection churn
that floods auth.log and pressures sshd MaxStartups (which drops new
unauthenticated connections past its limit). The single-host dashboard already reused one
connection; the fleet now does too.

What changed

  • internal/sshx/pool.go adds Pool + Managed, stdlib + x/crypto only:
    • Managed: one self-healing connection per host — exponential backoff with
      full jitter, a deadline-guarded keepalive probe (guards against golang/go#21478, "x/crypto/ssh: no simple way to implement efficient keep alives"),
      and Run returns ErrNotReady instantly instead of blocking while down.
    • Pool: a shared dial-concurrency cap (16) so a fleet's cold start can't
      self-trip MaxStartups or exhaust sockets.
  • internal/sshx/client.go — deadline-guarded Ping(); keepalive uses it.
  • internal/fleet: Session owns one pool across drill-ins; collects only
    ready hosts; a host is enterable the instant it connects.
  • cmd/kay — drill-in reuses the pooled connection (no second
    handshake) and inherits its self-healing (Selection.Client is
    metrics.Runner).
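
The two Managed behaviours listed above (backoff with full jitter, and failing fast with ErrNotReady while the connection is down) can be sketched roughly as follows. This is a stdlib-only illustration, not the code in internal/sshx/pool.go: the names fullJitter, managed, runner and fakeConn, the error text, and the 500ms/30s bounds are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// ErrNotReady mirrors the sentinel named in the PR; this definition is assumed.
var ErrNotReady = errors.New("sshx: connection not ready")

// fullJitter returns a random delay in [0, min(max, base*2^attempt)].
// Drawing from the whole range spreads reconnects from many hosts apart.
func fullJitter(base, max time.Duration, attempt int) time.Duration {
	ceiling := base
	for i := 0; i < attempt && ceiling < max; i++ {
		ceiling *= 2
	}
	if ceiling > max {
		ceiling = max
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

// runner is a hypothetical stand-in for a live SSH connection.
type runner interface {
	Run(cmd string) (string, error)
}

// managed holds the current connection; conn is nil while reconnecting.
type managed struct {
	mu   sync.RWMutex
	conn runner
}

// Run never waits for a reconnect: it fails fast when there is no connection.
func (m *managed) Run(cmd string) (string, error) {
	m.mu.RLock()
	c := m.conn
	m.mu.RUnlock()
	if c == nil {
		return "", ErrNotReady
	}
	return c.Run(cmd)
}

// set installs a connection after a successful dial (or nil after a failure).
func (m *managed) set(c runner) {
	m.mu.Lock()
	m.conn = c
	m.mu.Unlock()
}

// fakeConn is a test double that echoes the command.
type fakeConn struct{}

func (fakeConn) Run(cmd string) (string, error) { return "ok: " + cmd, nil }

func main() {
	m := &managed{}
	_, err := m.Run("uptime")
	fmt.Println(errors.Is(err, ErrNotReady)) // true

	m.set(fakeConn{})
	out, _ := m.Run("uptime")
	fmt.Println(out) // ok: uptime

	// Delays are random, so only their upper bound is predictable.
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Println(fullJitter(500*time.Millisecond, 30*time.Second, attempt))
	}
}
```

Returning an error instead of blocking lets a fleet refresh skip a down host for that tick and pick it up again once the reconnect loop installs a new connection.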

Benchmarks & testing

  • In-repo micro-benchmark internal/sshx/pool_bench_test.go — pool dispatch
    overhead: ~24 ns/op, 0 allocs (the abstraction is free; the cost is SSH
    I/O). Design + the full library-comparison / load / stress methodology live in
    the Camelot technical-design vault (ssh-connection-pool.md).
  • make ci green — gofmt, vet, -race, golangci-lint 0 issues, gosec Issues:0,
    govulncheck clean. internal/sshx coverage 82%. Black-box tests cover the
    exported pool API; white-box tests drive the self-healing paths via an injected
    fake connection.
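
As an illustration of how an injected fake connection can exercise a deadline-guarded probe, here is a minimal stdlib-only sketch. It is not the repository's test code: prober, ping, hungConn, liveConn and errPingTimeout are hypothetical names, and a real Ping() would send an SSH keepalive request where Probe is called.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// prober is a hypothetical seam for injecting a fake connection in tests.
type prober interface {
	Probe() error
}

var errPingTimeout = errors.New("ping: deadline exceeded")

// ping runs p.Probe in a goroutine and gives up after timeout, so a probe
// that never returns cannot stall the caller.
func ping(p prober, timeout time.Duration) error {
	// Buffered so a late result never blocks the probing goroutine.
	done := make(chan error, 1)
	go func() { done <- p.Probe() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errPingTimeout
	}
}

// hungConn simulates a half-open connection: Probe blocks until released.
type hungConn struct{ release chan struct{} }

func (h hungConn) Probe() error {
	<-h.release
	return nil
}

// liveConn simulates a healthy connection that answers immediately.
type liveConn struct{}

func (liveConn) Probe() error { return nil }

func main() {
	fmt.Println(ping(liveConn{}, time.Second)) // <nil>

	h := hungConn{release: make(chan struct{})}
	fmt.Println(ping(h, 50*time.Millisecond)) // ping: deadline exceeded
	close(h.release)                          // let the blocked goroutine exit
}
```

In this sketch the blocked goroutine only exits once the fake is released; with a real connection, the caller would typically close it after a timeout, which should unblock the pending request.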

Security

No safe-by-default invariant changes: public-key auth only, TOFU host-key
pinning, 0600 files. Backoff+jitter and the dial cap are defensive (no
reconnect storm, no self-inflicted MaxStartups).

🤖 Generated with Claude Code

…leet

kay fleet dialed a fresh TCP+SSH connection to every host on every tick and
closed it — N full KEX + public-key-auth handshakes every 5s. Reuse the
transport instead: one long-lived connection per host, a new session per refresh.
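
To make the handshake arithmetic concrete, the sketch below counts handshakes under both strategies. Everything in it is hypothetical (dialer, conn, perTick, pooled); it only counts calls and performs no network I/O.

```go
package main

import "fmt"

// dialer is a hypothetical counter for how many handshakes would occur.
type dialer struct{ handshakes int }

// conn stands in for an established transport.
type conn struct{ sessions int }

// dial represents one full TCP + KEX + public-key-auth handshake.
func (d *dialer) dial(host string) *conn {
	d.handshakes++
	return &conn{}
}

// refresh stands in for opening a new session on an existing transport.
func (c *conn) refresh() { c.sessions++ }

// perTick models the old behaviour: dial every host on every tick.
func perTick(d *dialer, hosts []string, ticks int) {
	for t := 0; t < ticks; t++ {
		for _, h := range hosts {
			d.dial(h).refresh()
		}
	}
}

// pooled models the new behaviour: dial each host once, then reuse it.
func pooled(d *dialer, hosts []string, ticks int) {
	conns := make(map[string]*conn, len(hosts))
	for _, h := range hosts {
		conns[h] = d.dial(h)
	}
	for t := 0; t < ticks; t++ {
		for _, h := range hosts {
			conns[h].refresh()
		}
	}
}

func main() {
	hosts := []string{"a", "b", "c"}
	ticks := 12 // one minute at the default 5s tick

	var before, after dialer
	perTick(&before, hosts, ticks)
	pooled(&after, hosts, ticks)
	fmt.Println(before.handshakes, after.handshakes) // 36 3
}
```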

New internal/sshx types (stdlib + x/crypto only, so the package stays
extractable):
  - Managed: one self-healing connection per host — exponential backoff with
    full jitter, deadline-guarded keepalive probe (golang/go#21478), returns
    ErrNotReady instantly instead of blocking while down.
  - Pool: a shared dial-concurrency cap (pdsh-style fanout) so a large fleet's
    cold start can't trip sshd MaxStartups or exhaust local sockets.
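
One common way to build such a dial-concurrency cap is a buffered channel used as a counting semaphore. The sketch below assumes that approach; the PR does not say how Pool implements the cap, and dialAll with its peak-tracking is a hypothetical helper written only to show that the bound holds.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// dialAll calls dial once per host with at most limit dials in flight and
// returns the highest concurrency it observed.
func dialAll(hosts []string, limit int, dial func(host string)) int64 {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64

	for _, h := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a dial slot
			defer func() { <-sem }() // release it when done

			n := atomic.AddInt64(&inFlight, 1)
			for { // record a new peak if this is one
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			dial(h)
			atomic.AddInt64(&inFlight, -1)
		}(h)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	hosts := make([]string, 100)
	for i := range hosts {
		hosts[i] = fmt.Sprintf("host-%03d", i)
	}
	// The sleep stands in for the time a real handshake would take.
	peak := dialAll(hosts, 16, func(string) { time.Sleep(5 * time.Millisecond) })
	fmt.Println(peak <= 16) // true
}
```

With the cap in place, a cold start against 100 hosts proceeds in waves of at most 16 concurrent dials rather than opening 100 unauthenticated connections at once.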

fleet.Session owns one pool for the whole interactive session; drilling into a
host reuses its live connection (Selection.Client is metrics.Runner) with no
second handshake and inherits the pool's self-healing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions Bot commented Jul 1, 2026


Test coverage: 75.0%

per package
  cmd/kay              40.8%
  internal/config      83.9%
  internal/dashboard   81.1%
  internal/fleet       67.7%
  internal/keys        77.4%
  internal/metrics     89.9%
  internal/sshx        81.6%
  internal/tui         84.8%

@mrdhira mrdhira self-assigned this Jul 1, 2026
@mrdhira mrdhira merged commit 45b5d1d into main Jul 1, 2026
10 checks passed