2 changes: 1 addition & 1 deletion ARCHITECTURE.md
@@ -18,7 +18,7 @@ internal/
 ├── fleet multi-host fleet overview (kay fleet) [app]
 ├── keys key generation + PEM I/O [app]
 ├── metrics remote metric collection + parsing [library]
-├── sshx the single SSH client path (dial/run/shell) [library]
+├── sshx SSH client path + self-healing connection pool [library]
 └── tui minimal terminal UI toolkit [library]
 ```
 
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -43,6 +43,18 @@ adheres to [Semantic Versioning](https://semver.org/).
 
 ### Changed
 
+- **Persistent fleet connections** — `kay fleet` now keeps one long-lived,
+self-healing SSH connection per host and reuses it for every refresh, instead
+of dialing a brand-new connection each tick. Reusing the transport skips the
+KEX + public-key-auth handshake on all but the first connect, which cuts CPU,
+network round-trips, and — most visibly on the server — the connection churn
+that spams `auth.log` and pressures sshd's `MaxStartups`. A shared dial cap
+bounds concurrent connects so a large fleet's cold start can't self-throttle,
+and reconnects use exponential backoff with jitter. Drilling into a host now
+**reuses** the connection the fleet already established (no second handshake),
+and the drilled-in dashboard inherits the connection's self-healing. New
+`internal/sshx` types `Pool` and `Managed` implement this, stdlib-only.
+
 - **Responsive startup** — the dashboard's first metric collection now runs
 asynchronously behind a "connecting…" screen instead of blocking, and both the
 dashboard and `kay fleet` ignore input (except quit) until the first data
36 changes: 33 additions & 3 deletions README.md
@@ -156,14 +156,43 @@ State lives in `<user-config-dir>/kay/` (`config.json`, `known_hosts`, and a
 
 ### Fleet
 
-`kay fleet` dials every registered server concurrently and renders one live row
-per host — alias, reachability, CPU, memory, load, and Docker container counts
-so you can scan the whole realm at a glance. Press **Enter** on a host to drill
+`kay fleet` connects to every registered server concurrently and renders one live
+row per host — alias, reachability, CPU, memory, load, and Docker container counts
+so you can scan the whole realm at a glance. Press **Enter** on a host to drill
 straight into its full dashboard, and **Esc**/**q** to return to the overview —
 the terminal is handed over seamlessly (one screen, one input reader, no flicker),
 and **Ctrl-C** exits the whole app. It shares the same refresh controls as the
 dashboard (`r`, `+/-`, `q`) and honours `--anonymize`.
 
+#### Persistent, self-healing connections
+
+The fleet keeps **one long-lived SSH connection per host** and reuses it for every
+refresh, rather than reconnecting each tick. This matters at scale:
+
+- **Cheap refreshes.** An SSH connection multiplexes many sessions. After the
+first connect pays the key-exchange + public-key-auth handshake, each refresh is
+just a new session over the existing transport — no re-handshake, negligible CPU
+and network. The single-host `kay dashboard` has always worked this way; the
+fleet now does too.
+- **Kinder to your servers.** Reconnecting every few seconds runs a full
+auth/PAM cycle per connection, floods each host's `auth.log`, and — for a whole
+fleet reconnecting in lockstep — pushes against sshd's `MaxStartups` throttle
+(which starts *dropping* connections past its limit). Persistent connections
+eliminate that churn.
+- **Bounded cold start.** Concurrent connects are capped (16 at a time) so
+bringing up a large fleet can't self-throttle or exhaust local sockets.
+- **Self-healing.** A dropped connection is detected (by a periodic keepalive
+probe or a failed refresh) and re-established automatically, with exponential
+backoff + jitter so many hosts recovering from one blip don't stampede. A host
+that is still connecting or offline shows a brief message on **Enter** instead of
+drilling in; a ready host opens instantly.
+- **Zero-handshake drill-in.** Pressing **Enter** hands the dashboard the exact
+connection the fleet already holds — no second handshake — and that dashboard
+inherits the same self-healing.
+
+Design notes and the benchmark/load/stress methodology live in the Camelot
+technical-design vault (`docs/technical-design/[4]ssh-connection-pool.md`).
+
 ### Verifying locally with your own sshd
 
 You can exercise the full flow against a local SSH server without a remote box:
@@ -259,6 +288,7 @@ standard tools.
 | Assisted key install over an existing connection | ✅ Done | `install --push` (password bootstrap) |
 | Per-pane titles on two-column Overview | ✅ Done | System \| Top processes |
 | Multi-server fleet overview (one row per host) | ✅ Done | `kay fleet` — concurrent multi-host live table |
+| Persistent, self-healing fleet SSH connections | ✅ Done | v0.2 — one long-lived connection per host (`sshx.Pool`/`Managed`); reuse, backoff+jitter, dial cap, zero-handshake drill-in |
 | Richer Overview (docker health counts, sparklines) | ✅ Done | More than gauges |
 | Demo/anonymize mode (`--anonymize` / `KAY_DEMO`) | ✅ Done | Masks host/user/alias/Docker names for screenshots |
 | CI quality gates (lint · gosec · govulncheck) | ✅ Done | golangci-lint 0 issues + gosec + govulncheck in CI and `make ci` |
31 changes: 16 additions & 15 deletions cmd/kay/main.go
@@ -467,21 +467,25 @@ func cmdFleet(args []string) error {
 	if !term.IsTerminal(int(os.Stdin.Fd())) {
 		return fleet.Run(hosts, fopts)
 	}
-	return fleetDrill(st, hosts, fopts, *insecure, *readonly)
+	return fleetDrill(hosts, fopts, *readonly)
 }
 
 // fleetDrill runs the interactive fleet overview with drill-in: it owns a single
-// screen and input reader for the whole session, so pressing Enter on a host
-// hands the terminal to that host's dashboard and back with no flicker and no
-// competing stdin readers.
-func fleetDrill(st *config.Store, hosts []fleet.Host, fopts fleet.Options, insecure, readOnly bool) error {
+// screen, input reader, and connection pool for the whole session, so pressing
+// Enter on a host hands the terminal to that host's dashboard and back with no
+// flicker, no competing stdin readers, and no second SSH handshake — the
+// dashboard reuses the connection the fleet already established.
+func fleetDrill(hosts []fleet.Host, fopts fleet.Options, readOnly bool) error {
 	tui.SetColorMode(fopts.Color)
 	scr, err := tui.NewScreen()
 	if err != nil {
 		return err
 	}
 	defer scr.Close()
 
+	sess := fleet.NewSession(hosts)
+	defer sess.Close()
+
 	events := make(chan tui.Event, 16)
 	go func() {
 		r := tui.NewReader(os.Stdin)
@@ -495,27 +499,24 @@ func fleetDrill(st *config.Store, hosts []fleet.Host, fopts fleet.Options, insec
 	}()
 
 	for {
-		host, err := fleet.RunView(scr, events, hosts, fopts)
+		sel, err := sess.RunView(scr, events, fopts)
 		if err != nil {
 			return err
 		}
-		if host == nil {
+		if sel == nil {
 			return nil // user quit the fleet
 		}
-		srv := host.Server
-		client, derr := host.Dial()
-		if derr != nil {
-			continue // can't reach it right now; back to the overview
-		}
+		srv := sel.Host.Server
 		dopts := dashboard.Options{
 			Interval:  fopts.Interval,
 			Color:     fopts.Color,
 			ReadOnly:  readOnly,
 			Anonymize: fopts.Anonymize,
-			Redial:    func() (dashboard.Client, error) { return dial(st, &srv, insecure) },
+			// No Redial: the reused connection is pool-managed and self-heals, so
+			// the dashboard just retries its metrics over the same seam.
 		}
-		exitApp, derr := dashboard.RunView(scr, events, client, srv, dopts)
-		_ = client.Close()
+		// Reuse the pooled connection; the pool owns it, so we must NOT close it here.
+		exitApp, derr := dashboard.RunView(scr, events, sel.Client, srv, dopts)
 		if derr != nil {
 			return derr
 		}