Commit 7dc8b2d

🚀 release(v1.5): rc.28 — handshake cold-start race + perf/security/refactor batch (#396)
## Summary

Cuts v1.5.0-rc.28 — 19 commits since rc.27, all on `feature/v1.5-rc28`.

### Highlights

- **#386 follow-through (`7ef4aa61`)** — `AgentClient._doHandshake` now defers `pruneOldContainers` when handshake returns 0 containers, eliminating the cold-start race where a freshly-restarted agent would wipe the controller's last-known container count between handshake and first watch cycle. Combined with the rc.25 (`d02080ae`) snapshot suppression and rc.26 (`512c3751`) stats-changed broadcast, the agent count display is now robust across the full restart lifecycle. Cesc1986's rc.27 reporter log proved the failure mode (`agent-client.ml Handshake → 0` after restart, wiping 5 known containers).
- **#386 concurrent-handshake guard (`3da42587`)** — in-flight `_doHandshake` Promise reused across overlapping reconnects.
- **Security caps (`5e02f6df`, `9b6e1b97`)** — auth JSON body limit + global SSE connection cap; doc note that Command trigger inherits all `DD_*` env.
- **Perf batch (`07e3580e`, `b2999eb7`, `dceec898`)** — store/watcher/SSE replay hot paths, bundle splitting, SSE jitter, OpenAPI cache, alert fan-out, containers menu scan, WS log backpressure.
- **Refactors (`47c0cf44`, `4d34b5e7`, `6ee07006`)** — notifications outbox openapi docs + drain-safe stop, registry contract hygiene, `readJsonResponse` adoption.
- **UI fix (`de07d412`)** — include scanning/sbom-generating in phase enum.
- **CI polish (`794497e5`, `06e56be3`, `612e81c4`, `06eed8dc`, `82caeb08`, `3721f66b`)** — stub `getContainersRaw` in same-source watcher tests, vite codeSplitting i18n group, `runTrigger` json type narrowing, Crowdin Sync emoji prefix, transient 401 checkout retry, exclude pure-data files from UI mutation slice.

### CHANGELOG

- `[Unreleased]` now contains only the #386 deferred-prune entry (will be stamped to `[1.5.0-rc.28]` by `release-cut.yml`).
- `[1.5.0-rc.27]` section added retroactively for the two #289 entries that shipped in rc.27 release notes but were never stamped into the file.

## Test plan

- [x] All pre-push gates green on `bca0b7bb`: clean-tree, biome, qlty, qlty-smells, typecheck-ui, **100% coverage**, build
- [x] `npx vitest run agent/AgentClient.test.ts` → 455/455 pass (includes 4 new regression tests for handshake-zero prune-skip)
- [ ] Required CI checks green on this PR
- [ ] After merge: `gh workflow run release-cut.yml --ref main -f release_tag=v1.5.0-rc.28`
- [ ] Reporter (Cesc1986) verifies #386 fixed on rc.28 image

Fixes: #386
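The concurrent-handshake guard mentioned in the highlights can be pictured with a small sketch. This is not the code from `3da42587`; `HandshakeGate` and its members are hypothetical names, and the only behaviour taken from the summary is that an in-flight handshake Promise is reused by overlapping callers.

```typescript
// Hypothetical illustration of "reuse the in-flight Promise"; not drydock's actual code.
class HandshakeGate<T> {
  private inFlight: Promise<T> | undefined;
  private readonly doHandshake: () => Promise<T>;

  constructor(doHandshake: () => Promise<T>) {
    this.doHandshake = doHandshake;
  }

  run(): Promise<T> {
    // A reconnect that overlaps a running handshake gets the same Promise back.
    if (this.inFlight !== undefined) {
      return this.inFlight;
    }
    const pending = this.doHandshake().finally(() => {
      // Clear the slot once settled so the next reconnect starts a fresh handshake.
      this.inFlight = undefined;
    });
    this.inFlight = pending;
    return pending;
  }
}
```

Two calls to `run()` made before the first handshake settles share one underlying request; a call made after it settles triggers a new one.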
1 parent afee8cb commit 7dc8b2d

51 files changed

Lines changed: 1138 additions & 151 deletions

.github/workflows/e2e-playwright.yml

Lines changed: 1 addition & 1 deletion
@@ -133,7 +133,7 @@ jobs:
  - name: Install e2e dependencies
  uses: nick-fields/retry@ce71cc2ab81d554ebbe88c79ab5975992d79ba08 # v3.0.2
  with:
- timeout_minutes: 5
+ timeout_minutes: 10 # @playwright/browser-chromium postinstall downloads ~167 MiB; 5 min was tight enough to flake on slow CDN moments
  max_attempts: 3
  retry_wait_seconds: 30
  command: cd e2e && npm ci

.github/workflows/i18n-crowdin.yml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
- name: Crowdin Sync
+ name: "🌐 i18n: Crowdin Sync"

  on:
  push:

.github/workflows/quality-mutation-monthly.yml

Lines changed: 31 additions & 2 deletions
@@ -128,7 +128,10 @@ jobs:
  # everywhere it appears: it
  # uses import.meta.glob(), a Vite compile-time macro whose argument
  # must stay a literal — Stryker instrumentation breaks the glob
- # parser and aborts the dry run.
+ # parser and aborts the dry run. src/icons.ts and src/i18n/locales.ts
+ # are also excluded: they are pure-data files whose string-literal
+ # mutants cause transitive jsdom hangs in Vue tests and provide no
+ # logic-quality signal.
  - name: ui-views-containers
  package: ui
  incremental_file: reports/stryker-incremental-ui-views-containers.json
@@ -173,7 +176,7 @@ jobs:
  package: ui
  incremental_file: reports/stryker-incremental-ui-shell-app.json
  mutate: >-
- src/components/**/*.ts,src/boot/**/*.ts,src/layouts/**/*.ts,src/i18n/**/*.ts,src/main.ts,src/icons.ts,!src/boot/i18n.ts,!**/*.d.ts,!**/*.test.ts,!**/*.fuzz.test.ts,!**/*.typecheck.ts,!dist/**,!coverage/**
+ src/components/**/*.ts,src/boot/**/*.ts,src/layouts/**/*.ts,src/i18n/**/*.ts,src/main.ts,src/icons.ts,!src/boot/i18n.ts,!src/icons.ts,!src/i18n/locales.ts,!**/*.d.ts,!**/*.test.ts,!**/*.fuzz.test.ts,!**/*.typecheck.ts,!dist/**,!coverage/**

  steps:
  - name: Harden Runner
@@ -182,6 +185,19 @@ jobs:
  egress-policy: audit

  - name: Checkout
+ id: checkout
+ uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+ continue-on-error: true
+ with:
+ persist-credentials: false
+
+ - name: Wait before checkout retry
+ if: steps.checkout.outcome == 'failure'
+ shell: bash
+ run: sleep 30
+
+ - name: Checkout (retry)
+ if: steps.checkout.outcome == 'failure'
  uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
  with:
  persist-credentials: false
@@ -265,6 +281,19 @@ jobs:
  egress-policy: audit

  - name: Checkout
+ id: checkout
+ uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+ continue-on-error: true
+ with:
+ persist-credentials: false
+
+ - name: Wait before checkout retry
+ if: steps.checkout.outcome == 'failure'
+ shell: bash
+ run: sleep 30
+
+ - name: Checkout (retry)
+ if: steps.checkout.outcome == 'failure'
  uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
  with:
  persist-credentials: false

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -12,6 +12,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1212

1313
### Fixed
1414

15+
- **[#386](https://github.com/CodesWhat/drydock/issues/386) follow-through — a fresh-restart agent whose in-memory store has not yet been re-populated no longer wipes the controller's last-known container state on handshake.** A user running drydock in a controller + agent topology reported on rc.27 that their `ml` agent still rendered 0 running containers in the controller UI even after the rc.25 watcher-snapshot suppression (`d02080ae`) and the rc.26 stats-changed broadcast (`512c3751`). Cause: when `AgentClient._doHandshake` (`app/agent/AgentClient.ts`) reconnects to an agent's SSE stream it handshakes via `GET /api/containers`, which serves from the agent's *in-memory* `storeContainer`. If the agent process has just restarted and its `watchatstart` cron has not yet finished its first run (the agent's store is non-persistent across restarts), that endpoint legitimately returns `[]` even though the docker daemon has N running containers. `_doHandshake` then called `pruneOldContainers([])` unconditionally, deleting every controller-side container the agent had previously contributed — even though the agent's first `dd:watcher-snapshot` was about to repopulate the store seconds later. The rc.25 fix in `Docker.watch()` only suppresses *outgoing* snapshots from the agent when enumeration fails on the agent itself; it does not cover the controller-side handshake path. The fix makes the handshake's prune step ambiguity-aware: `_doHandshake` now skips `pruneOldContainers` whenever `containers.length === 0` and emits a `Handshake returned 0 containers; preserving last-known state until the first watch cycle completes` warning (only after `hasConnectedOnce`, so the first-ever connection of a genuinely empty agent stays silent). 
Pruning is deferred to the next authoritative `dd:watcher-snapshot`, which is already gated on `!containerEnumerationFailed && enrichmentErrors === 0` (`app/watchers/providers/docker/Docker.ts:1136`) and is therefore unambiguous: a 0-container snapshot means the agent really has 0 running containers. Non-zero handshakes continue to prune normally — the behaviour change is scoped strictly to the 0-container case that exposed the cold-start race.
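The prune decision described in this entry can be sketched as a small pure function. This is an illustration, not the contents of `app/agent/AgentClient.ts`: `applyHandshakeResult`, `HandshakeDeps`, and the `Container` shape are hypothetical, while the `containers.length === 0` check, the `hasConnectedOnce` gate, and the warning text come from the entry above.

```typescript
// Hypothetical shapes for illustration; the real types live in drydock's codebase.
interface Container {
  id: string;
  name: string;
}

interface HandshakeDeps {
  pruneOldContainers: (current: Container[]) => void;
  warn: (message: string) => void;
}

type HandshakeOutcome = "pruned" | "deferred";

function applyHandshakeResult(
  containers: Container[],
  hasConnectedOnce: boolean,
  deps: HandshakeDeps,
): HandshakeOutcome {
  if (containers.length === 0) {
    // An empty handshake is ambiguous (cold start vs. genuinely empty agent),
    // so last-known state is kept until an authoritative snapshot arrives.
    if (hasConnectedOnce) {
      deps.warn(
        "Handshake returned 0 containers; preserving last-known state until the first watch cycle completes",
      );
    }
    return "deferred";
  }
  // Non-zero handshakes keep the existing behaviour and prune normally.
  deps.pruneOldContainers(containers);
  return "pruned";
}
```

Keeping the decision free of I/O makes the three cases (first-ever empty connect, empty reconnect, non-empty handshake) easy to cover in unit tests.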
16+
17+
## [1.5.0-rc.27] — 2026-05-24
18+
19+
### Fixed
20+
1521
- **[#289](https://github.com/CodesWhat/drydock/issues/289) — Agent-hosted container updates no longer leave an orphaned queued operation row on the controller that the 30-minute TTL sweep force-fails into a misleading "update failed" Pushover/Telegram notification long after the update actually succeeded.** A user running drydock in a controller + agent topology reported on rc.25 that an "Update All" of Tautulli on two hosts produced the success notification only for the controller-host container; the agent-host container's success notification was missing and, ~30 minutes later, a second Pushover arrived saying `[mediavault] Container Tautulli update failed — Marked failed after exceeding active update TTL (1800000ms) while queued.` even though the update had in fact succeeded on the agent. Cause: when the controller queues a container update via `createAcceptedContainerUpdateRequest` (`app/updates/request-update.ts`) it mints a controller-side `operationId` and inserts a `queued` row; the dispatcher then calls `entry.trigger.trigger(entry.container, { operationId })`. For containers hosted on an agent the trigger is `AgentTrigger`, whose `trigger(container)` previously accepted only the container and discarded the `runtimeContext`. `AgentClient.runRemoteTrigger` posted `{id, name}` to the agent without the operationId, so the agent's `/api/triggers/:type/:name` endpoint called `requestContainerUpdate` with no operationId and minted its own row; the agent's `dd:update-applied` / `dd:update-operation-changed` events then arrived back at the controller carrying the agent-side id, which the controller routed through `toAgentScopedId` into a third, agent-scoped row (`agent-<name>-<remote-id>`). 
The original controller-side queued row was therefore never touched, sat queued past the `UPDATE_OPERATION_ACTIVE_TTL_MS` deadline in `app/store/update-operation.ts:295-300`, and was force-failed by the TTL sweep — which fired the misleading "failed" notification with the row's still-valid container snapshot (hence the correct `[mediavault]` agent prefix). The fix threads the controller's `operationId` end-to-end so a single row is the source of truth for the whole lifecycle: `AgentTrigger.trigger` / `triggerBatch` now accept and forward `runtimeContext`; `AgentClient.runRemoteTrigger` / `runRemoteTriggerBatch` extract per-container operationIds via the existing `getRequestedOperationId` helper and include them in the agent payload (`{id, name, operationId}` for single triggers; `{...container, operationId}` per entry for batches); the agent-side controller `runTrigger` accepts an `operationId` in the request body (validated by `triggerRequestBodySchema`) and threads it into `requestContainerUpdate`; the agent-side batch endpoint extracts per-container operationIds into an `{operationIds}` runtimeContext before forwarding to the local trigger; `EnqueueContainerUpdateOptions` gains an `operationId` field honored by `createAcceptedContainerUpdateRequest` (single-container batches only; multi-container batches still mint per-container UUIDs); and a new `AgentClient.resolveAgentOperationId` helper checks the controller's operation store for an existing row at the raw (unscoped) id and reuses it when found — falling back to the `toAgentScopedId` form only when the agent does not echo a known controller id, preserving backwards compatibility with older agents. The controller-side queued row therefore transitions directly to `in-progress` and `succeeded`/`failed` from the agent's lifecycle events, no parallel agent-scoped row is created, the TTL sweep has nothing stale to fail, and the spurious "update failed" notification disappears.
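The id-resolution rule at the end of this entry (reuse the controller's row when the raw id is known, otherwise fall back to the agent-scoped form) can be sketched as follows. The signatures here are assumptions for illustration: the real `AgentClient.resolveAgentOperationId` and `toAgentScopedId` may take different arguments, and `OperationLookup` is a hypothetical stand-in for the controller's operation store. Only the `agent-<name>-<remote-id>` format and the lookup-then-fallback order come from the entry.

```typescript
// Hypothetical minimal view of the controller's operation store.
interface OperationLookup {
  has(id: string): boolean;
}

// Scoped form described in the entry: agent-<name>-<remote-id>.
function toAgentScopedId(agentName: string, remoteId: string): string {
  return `agent-${agentName}-${remoteId}`;
}

function resolveAgentOperationId(
  store: OperationLookup,
  agentName: string,
  remoteId: string,
): string {
  // If the agent echoed an id the controller minted, keep using that single row.
  if (store.has(remoteId)) {
    return remoteId;
  }
  // Older agents mint their own ids, so fall back to the agent-scoped row.
  return toAgentScopedId(agentName, remoteId);
}
```

A `Set<string>` of known operation ids satisfies `OperationLookup`, which is enough to exercise both branches.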
1622

1723
- **[#289](https://github.com/CodesWhat/drydock/issues/289) — Update-applied and update-failed notification triggers (Pushover, Telegram, etc.) and UI success toasts no longer silently drop for containers running on a connected agent.** A user running drydock in a controller + agent topology reported on rc.25 that an "Update All" across two hosts produced the success toast and Pushover notification only for the container on the controller host, never for the same-name container on the agent host. Cause: when the agent finishes an update it sends a `dd:update-applied` SSE payload to the controller carrying a full `container` snapshot. The controller's `AgentClient.handleEvent` routes this through `maybeMarkAgentOperationSucceededFromAppliedPayload` → `markAgentOperationTerminal` → `ensureAgentOperationForTerminal` → `updateOperationStore.insertOperation` + `markOperationTerminal`, but `buildAgentOperationBase` in `app/agent/AgentClient.ts` constructed the inserted row from `{id, kind, containerName, containerId, newContainerId}` only — the container snapshot was dropped on the floor. When `markOperationTerminal` then fired `emitTerminalLifecycleEvent` (`app/store/update-operation.ts`), the resulting `emitContainerUpdateApplied` / `emitContainerUpdateFailed` payload built by `buildTerminalLifecycleEventBase` lacked `container`. The notification handler `handleContainerUpdateAppliedEvent` (`app/triggers/providers/Trigger.ts`) then fell back to `findContainerByBusinessId(containerName)`, which compares the agent's bare `containerName` (e.g. `tautulli`) against the controller-side `fullName` (e.g. `mediavault_docker_tautulli`) and silently dropped — the same class of `findContainerByBusinessId` miss as [#385](https://github.com/CodesWhat/drydock/issues/385) but on the agent-scoped operation path that [#385](https://github.com/CodesWhat/drydock/issues/385) did not cover. 
The fix threads the agent's container snapshot through every level of the agent-scoped operation pipeline — `buildAgentOperationBase`, `ensureAgentOperationForTerminal`, `markAgentOperationTerminal`, `maybeMarkAgentOperationSucceededFromAppliedPayload`, and `maybeMarkAgentOperationFailedFromFailedPayload` — stamping `agent: this.name` so the controller's view of the container is consistent. The `dd:update-operation-changed`-before-`dd:update-applied` race is handled by patching the container snapshot onto the existing active row via `updateOperation` before the terminal emit runs (only when the existing row lacks a container, never overwriting an existing snapshot). `container` is added to `MutableUpdateOperationFields` in `app/store/update-operation.ts` so terminal and active patches accept it. The store's terminal-lifecycle emit therefore naturally carries the agent's container into `emitContainerUpdateApplied` / `emitContainerUpdateFailed`, the `payloadContainer` shortcut in the trigger handler succeeds, and both the notification trigger and the SSE toast fire end-to-end on the controller for agent-originated updates.
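The "patch only when the row lacks a container" rule for the `dd:update-operation-changed`-before-`dd:update-applied` race can be sketched like this. The function and type names are hypothetical and the row shape is simplified; the behaviour shown (never overwrite an existing snapshot, stamp the agent name on the one being attached) is what the entry describes.

```typescript
// Simplified, hypothetical shapes; the real row type lives in app/store/update-operation.ts.
interface ContainerSnapshot {
  id: string;
  name: string;
  agent?: string;
}

interface OperationRow {
  id: string;
  status: string;
  container?: ContainerSnapshot;
}

function patchContainerIfMissing(
  row: OperationRow,
  snapshot: ContainerSnapshot | undefined,
  agentName: string,
): OperationRow {
  // An existing snapshot is never overwritten, and there is nothing to do without a payload.
  if (row.container !== undefined || snapshot === undefined) {
    return row;
  }
  // Stamp the agent name so the controller's view of the container is consistent.
  return { ...row, container: { ...snapshot, agent: agentName } };
}
```

Returning a new object rather than mutating the row keeps the sketch side-effect free; the actual code is described as applying the patch through `updateOperation`.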

README.md

Lines changed: 6 additions & 2 deletions
@@ -145,8 +145,10 @@ See the [Quick Start guide](https://getdrydock.com/docs/quickstart) for Docker C

  <hr>

+ <h2 align="center" id="recent-updates">🆕 Recent Updates</h2>
+
  <details>
- <summary><h2 align="center" id="recent-updates" style="display:inline-block">🆕 Recent Updates</h2></summary>
+ <summary><strong>Latest release highlights</strong></summary>

  - **Unified update-completion toasts** — All terminal "Updated / Update failed / Rolled back" toasts now fire from a single global handler mounted at `App.vue`, with toast emission gated on the matching container-state SSE event so the toast appears the moment the row's "Updating" badge clears. Closes a long-standing intermittent-drop bug where `ContainerUpdateDialog`, `useContainerSsePatchPipeline`, and the dashboard each fired (or didn't) based on which view happened to be mounted. Includes a Last-Event-ID query-param fallback so missed terminal events get replayed from the server-side ring buffer on SSE reconnect. ([#289](https://github.com/CodesWhat/drydock/issues/289), [#290](https://github.com/CodesWhat/drydock/issues/290), [#291](https://github.com/CodesWhat/drydock/issues/291))
  - **17 UI locales** — v1.5.0 ships with 17 locales: English, Simplified Chinese, Traditional Chinese, Italian, Spanish, German, French, Brazilian Portuguese, Dutch, Polish, Turkish, Japanese, Korean, Russian, Vietnamese, Ukrainian, and Arabic. Simplified and Traditional Chinese were contributed by [TianMiao](https://github.com/TianMiao) ([PR #331](https://github.com/CodesWhat/drydock/discussions/331), [PR #344](https://github.com/CodesWhat/drydock/pull/344)); the remaining 14 non-English locales were added in subsequent RCs. Switch language in **Config > Appearance**. Crowdin sync is configured for ongoing translation contributions.
@@ -398,8 +400,10 @@ Drop-in replacement — swap the image, restart, done. All `WUD_*` env vars and

  <hr>

+ <h2 align="center" id="roadmap">🗺️ Roadmap</h2>
+
  <details>
- <summary><h2 align="center" id="roadmap" style="display:inline-block">🗺️ Roadmap</h2></summary>
+ <summary><strong>Version themes & highlights</strong></summary>

  High-level themes only — see [CHANGELOG.md](CHANGELOG.md) for per-release detail.

