feat(sshx): persistent, self-healing pooled SSH connections for the fleet #16
Merged
kay fleet dialed a fresh TCP+SSH connection to every host on every tick and
closed it — N full KEX + public-key-auth handshakes every 5s. Reuse the
transport instead: one long-lived connection per host, a new session per refresh.
New internal/sshx types (stdlib + x/crypto only, so the package stays
extractable):
- Managed: one self-healing connection per host — exponential backoff with
full jitter, deadline-guarded keepalive probe (golang/go#21478), returns
ErrNotReady instantly instead of blocking while down.
- Pool: a shared dial-concurrency cap (pdsh-style fanout) so a large fleet's
cold start can't trip sshd MaxStartups or exhaust local sockets.
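
The backoff described for Managed can be sketched roughly as below. This is a minimal illustration of exponential backoff with full jitter, not the code in internal/sshx: the name fullJitter and the baseDelay/maxDelay values are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values only; the PR does not state its backoff bounds.
const (
	baseDelay = 250 * time.Millisecond
	maxDelay  = 30 * time.Second
)

// fullJitter returns a random delay in [0, ceiling), where ceiling is
// base doubled once per attempt and clamped to max.
func fullJitter(attempt int, base, max time.Duration) time.Duration {
	if base <= 0 || max <= 0 {
		return 0
	}
	ceiling := base
	// Stop doubling once the ceiling reaches max, which also avoids overflow.
	for i := 0; i < attempt && ceiling < max; i++ {
		ceiling *= 2
	}
	if ceiling > max {
		ceiling = max
	}
	// Full jitter: draw uniformly from the whole window, not just its top.
	return time.Duration(rand.Int63n(int64(ceiling)))
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		d := fullJitter(attempt, baseDelay, maxDelay)
		fmt.Printf("attempt %d: wait %v before redial\n", attempt, d)
	}
}
```

Drawing from the whole window spreads redials from many hosts over time, so hosts that dropped together are unlikely to reconnect together.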
fleet.Session owns one pool for the whole interactive session; drilling into a
host reuses its live connection (Selection.Client is metrics.Runner) with no
second handshake and inherits the pool's self-healing.
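
Because one pool serves the whole session, its dial cap bounds every cold-start dial. A minimal sketch of such a cap, assuming a buffered channel used as a counting semaphore, follows. The names dialAll, fakeDial, and dialCap are hypothetical; the value 16 is the cap given in the PR description, and the real Pool API may look quite different.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// dialCap mirrors the cap quoted in the PR description.
const dialCap = 16

// dialAll calls dial once per host, letting at most limit dials run at the
// same time. It returns the highest number of dials seen in flight.
func dialAll(hosts []string, limit int, dial func(host string)) int64 {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // one token per in-flight dial
	var wg sync.WaitGroup
	var inFlight, peak int64

	for _, h := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks at the cap)
			defer func() { <-sem }() // release the slot

			n := atomic.AddInt64(&inFlight, 1)
			for { // record a new peak if this dial set one
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			dial(host)
			atomic.AddInt64(&inFlight, -1)
		}(h)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

// fakeDial stands in for a TCP+SSH handshake.
func fakeDial(string) { time.Sleep(5 * time.Millisecond) }

func main() {
	hosts := make([]string, 100)
	for i := range hosts {
		hosts[i] = fmt.Sprintf("host-%03d", i)
	}
	peak := dialAll(hosts, dialCap, fakeDial)
	fmt.Printf("peak concurrent dials: %d (cap %d)\n", peak, dialCap)
}
```

The cap only gates handshakes; once a connection is up it holds no slot, so steady-state refreshes are not throttled in this sketch.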
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Test coverage: 75.0% per package
Why
kay fleet dialed a fresh TCP+SSH connection to every host on every tick
(default 5s) and closed it. For N hosts that is N full key-exchange +
public-key-auth handshakes every 5s, wasting CPU (curve25519 + signature
verify), network round-trips, and, most visibly on the server, connection churn
that floods auth.log and pressures sshd MaxStartups (which drops connections
past its limit). The single-host dashboard already reused one connection; the
fleet now does too.
What changed
- internal/sshx/pool.go: Pool + Managed, stdlib + x/crypto only:
  - Managed: one self-healing connection per host, with exponential backoff
    with full jitter, a deadline-guarded keepalive probe (guards against
    golang/go#21478, "x/crypto/ssh: no simple way to implement efficient keep
    alives"), and a Run that returns ErrNotReady instantly instead of blocking
    while down.
  - Pool: a shared dial-concurrency cap (16) so a fleet's cold start can't
    self-trip MaxStartups or exhaust sockets.
- internal/sshx/client.go: deadline-guarded Ping(); keepalive uses it.
- internal/fleet: Session owns one pool across drill-ins; collects only ready
  hosts; a host is enterable the instant it connects.
- cmd/kay: drill-in reuses the pooled connection (no second handshake) and
  inherits its self-healing (Selection.Client is metrics.Runner).
Benchmarks & testing
- internal/sshx/pool_bench_test.go: pool dispatch overhead: ~24 ns/op, 0 allocs
  (the abstraction is free; the cost is SSH I/O). Design + the full
  library-comparison / load / stress methodology live in the Camelot
  technical-design vault ([4]ssh-connection-pool.md).
- make ci green: gofmt, vet, -race, golangci-lint 0 issues, gosec Issues: 0,
  govulncheck clean.
- internal/sshx coverage 82%. Black-box tests cover the exported pool API;
  white-box tests drive the self-healing paths via an injected fake connection.
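
A rough sketch of how a deadline-guarded probe can be exercised with an injected fake follows. It is not the PR's Ping() or its test code: the prober interface, ping, hungConn, okConn, and probeTimeout are hypothetical names introduced here. In a real client the Probe step would be something like an x/crypto/ssh keepalive request, which can block indefinitely when the peer has silently gone away.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const probeTimeout = 50 * time.Millisecond // illustrative value

// prober is the small slice of a connection that the probe needs.
type prober interface {
	Probe() error // may block forever on a dead peer
	Close() error
}

var errProbeTimeout = errors.New("keepalive probe timed out")

// ping runs c.Probe under a deadline. On timeout it closes c so the blocked
// probe goroutine can unwind, then returns errProbeTimeout.
func ping(c prober, timeout time.Duration) error {
	done := make(chan error, 1) // buffered so the goroutine never leaks on send
	go func() { done <- c.Probe() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case err := <-done:
		return err
	case <-timer.C:
		_ = c.Close()
		return errProbeTimeout
	}
}

// hungConn is a fake whose Probe blocks until Close is called.
type hungConn struct {
	once   sync.Once
	closed chan struct{}
}

func newHungConn() *hungConn { return &hungConn{closed: make(chan struct{})} }

func (h *hungConn) Probe() error {
	<-h.closed
	return errors.New("connection closed")
}

func (h *hungConn) Close() error {
	h.once.Do(func() { close(h.closed) })
	return nil
}

// okConn is a fake whose Probe succeeds immediately.
type okConn struct{}

func (okConn) Probe() error { return nil }
func (okConn) Close() error { return nil }

func main() {
	fmt.Println("healthy:", ping(okConn{}, probeTimeout))
	fmt.Println("hung:   ", ping(newHungConn(), probeTimeout))
}
```

Injecting the fake through a small interface lets a test reach the timeout path in milliseconds without a real sshd.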
Security
No safe-by-default invariant changes: public-key auth only, TOFU host-key
pinning, 0600 files. Backoff+jitter and the dial cap are defensive (no
reconnect storm, no self-inflicted MaxStartups drops).
🤖 Generated with Claude Code