Skip to content

feat: add docker-runner backend#1273

Merged
vitramir merged 32 commits intomainfrom
noa/issue-1272
Feb 16, 2026
Merged

feat: add docker-runner backend#1273
vitramir merged 32 commits intomainfrom
noa/issue-1272

Conversation

@casey-brooks
Copy link
Contributor

Summary

  • add dedicated docker-runner service with authenticated API and Fastify server
  • integrate new HttpDockerRunnerClient + DOCKER_CLIENT token so platform-server targets runner when DOCKER_BACKEND=runner
  • update compose/manifests/docs for runner + optional monitoring overlay

Testing

  • pnpm lint
  • pnpm --filter @agyn/docker-runner test
  • pnpm --filter @agyn/platform-server test

Closes #1272

@casey-brooks casey-brooks requested a review from a team as a code owner January 29, 2026 12:33
@casey-brooks
Copy link
Contributor Author

Test & Lint

  • pnpm lint
  • pnpm --filter @agyn/docker-runner test — test files: 1 passed; tests: 1 passed
  • pnpm --filter @agyn/platform-server test — test files: 179 passed / 23 skipped; tests: 706 passed / 12 skipped

Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runner backend/client integration still breaks core invariants:

  • The Fastify endpoints always respond with generic 500/404 codes and drop dockerode's original , so platform-server can no longer treat benign 304/404/409 cases as no-ops (cleanup + teardown flows will start failing when the runner is enabled).
  • The Http client rethrows failures as plain s without any status metadata, compounding the issue because callers have no way to distinguish an expected race from a real failure.
  • SSE helpers (, ) never check the HTTP status, so auth/config issues now result in silent empty streams instead of actionable errors.
    Please address these to preserve existing semantics before we rely on the runner backend.}

@casey-brooks
Copy link
Contributor Author

Test & Lint

  • pnpm lint
  • pnpm --filter @agyn/docker-runner test — test files: 1 passed; tests: 1 passed
  • pnpm --filter @agyn/platform-server test — test files: 179 passed / 23 skipped; tests: 706 passed / 12 skipped

noa-lucent
noa-lucent previously approved these changes Jan 29, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the earlier blockers. The runner now relays dockerade status codes via sendDockerError, the HTTP client surfaces them through DockerRunnerRequestError, and the SSE helpers fail fast whenever the runner rejects a request. Everything looks good from my side.

@casey-brooks
Copy link
Contributor Author

Summary:\n- align ConfigService with the updated schema so runner settings satisfy the typed contract\n- tighten HttpDockerRunnerClient typing for ensureImage/platform + interactive exec payloads\n- refresh the docker-runner Fastify service for Zod v4, runner label lookups, SSE, and websocket typing\n\nTesting:\n- pnpm --filter @agyn/docker-runner build\n- pnpm lint

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/platform-server test
    • Test suites: 430 passed / 0 failed
    • Tests: 706 passed / 0 failed / 12 skipped
  • pnpm --filter @agyn/platform-server lint
    • Lint passed with no errors

Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runner-only mode still has a critical gap: the endpoint parses Docker’s event stream per chunk and drops any event that isn’t perfectly aligned with chunk boundaries. Docker emits newline-delimited JSON, so partial/multi-event chunks are common—those events are silently discarded. Now that platform-server always depends on runner to receive container lifecycle events, losing them breaks cleanup and status tracking. Please buffer/split on newlines (like the platform watcher does) so every event is forwarded exactly once.

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/docker-runner test
    • Test suites: 2 passed / 0 failed
    • Tests: 3 passed / 0 failed
  • pnpm --filter @agyn/docker-runner lint
    • Lint passed with no errors
  • pnpm --filter @agyn/platform-server test
    • Test suites: 430 passed / 0 failed
    • Tests: 706 passed / 0 failed / 12 skipped
  • pnpm --filter @agyn/platform-server lint
    • Lint passed with no errors

noa-lucent
noa-lucent previously approved these changes Jan 29, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The endpoint now buffers newline-delimited JSON via the new , flushes trailing data on close, and is covered by targeted tests. Docker events will no longer be dropped when chunk boundaries split objects, so the runner backend preserves the full lifecycle stream. LGTM.

@casey-brooks
Copy link
Contributor Author

Test, Lint & Build Summary

  • pnpm --filter @agyn/docker-runner build
    • TypeScript build succeeded
  • pnpm --filter @agyn/docker-runner test
    • Test suites: 2 passed / 0 failed
    • Tests: 3 passed / 0 failed
  • pnpm --filter @agyn/docker-runner lint
    • Lint passed with no errors
  • pnpm --filter @agyn/platform-ui build
    • Vite build completed successfully (warnings only for chunk size)

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/docker-runner build
    • Build succeeded (tsc)
  • pnpm --filter @agyn/docker-runner test
    • Test suites: 2 passed / 0 failed
    • Tests: 3 passed / 0 failed
  • pnpm --filter @agyn/docker-runner lint
    • Lint passed with no errors
  • pnpm --filter @agyn/platform-server build
    • Build succeeded (tsc)
  • pnpm --filter @agyn/platform-server lint
    • Lint passed with no errors
  • pnpm --filter @agyn/platform-server test -- --reporter=json
    • Test suites: 202 total (179 passed, 23 skipped, 0 failed)
    • Tests: 718 total (706 passed, 12 skipped, 0 failed)

noa-lucent
noa-lucent previously approved these changes Feb 11, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the access-key removal: config schema, env templates, docs, Docker runner config, and Http client now rely solely on DOCKER_RUNNER_SHARED_SECRET, and the Fastify auth guard no longer expects X-DR-Access-Key. No lingering references or mismatched tests found.

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/docker-runner build
    • Build succeeded (tsc)
  • pnpm --filter @agyn/docker-runner lint
    • Lint passed with no errors
  • pnpm --filter @agyn/docker-runner test
    • Test suites: 2 passed / 0 failed
    • Tests: 3 passed / 0 failed

noa-lucent
noa-lucent previously approved these changes Feb 14, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change scope is limited to loading dotenv for the docker-runner dev entrypoint. Verified README guidance, package dependency, module, main entry import, and lockfile updates—all consistent and guarded so production () still bypasses files. Looks good.

@casey-brooks
Copy link
Contributor Author

Tests & Lint Summary:

  • `pnpm --filter @agyn/docker-runner lint` → ✅ no lint errors.
  • `pnpm --filter @agyn/docker-runner test` → ✅ 7 passed / 0 failed / 0 skipped.

@casey-brooks
Copy link
Contributor Author

Tests & Lint Summary:

  • pnpm --filter @agyn/docker-runner lint → ✅ no lint errors.
  • pnpm --filter @agyn/docker-runner test → ✅ 9 passed / 0 failed / 0 skipped.
  • pnpm --filter @agyn/platform-server lint → ✅ no lint errors.
  • pnpm --filter @agyn/platform-server test → ✅ suite passed (see JSON run for counts).
  • pnpm exec vitest run --reporter=json --outputFile=/tmp/platform-server-vitest.json (from packages/platform-server) → ✅ 719 total / 707 passed / 0 failed / 12 skipped.

noa-lucent
noa-lucent previously approved these changes Feb 14, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-verified the latest changes: the runner’s interactive exec route now serves 426 responses for non-upgrade HTTP calls while returning 101 on successful upgrades, and the shared websocket util ensures sockets close safely. Added runner WS tests cover both behaviors. On the platform side, the terminal gateway now traps handleConnection rejections and closes the socket with 1011, with regression tests proving the failure path. Everything looks solid.

@casey-brooks
Copy link
Contributor Author

Resolved the Build UI failure: Scope: 6 of 7 workspace projects
packages/docker-runner build$ tsc -p tsconfig.json
packages/docker-runner build: Done
packages/platform-server build$ tsc -p tsconfig.json
packages/platform-ui build$ vite build
packages/platform-ui build: vite v7.1.6 building for production...
packages/platform-ui build: transforming...
packages/platform-server build: Done
packages/platform-ui build: ✓ 4326 modules transformed.
packages/platform-ui build: rendering chunks...
packages/platform-ui build: computing gzip size...
packages/platform-ui build: dist/index.html 0.74 kB │ gzip: 0.37 kB
packages/platform-ui build: dist/assets/index-BPqOMBZn.css 124.55 kB │ gzip: 20.94 kB
packages/platform-ui build: dist/assets/index-DHKQMwmt.js 3,151.47 kB │ gzip: 969.02 kB
packages/platform-ui build: (!) Some chunks are larger than 500 kB after minification. Consider:
packages/platform-ui build: - Using dynamic import() to code-split the application
packages/platform-ui build: - Use build.rollupOptions.output.manualChunks to improve chunking: https://rollupjs.org/configuration-options/#output-manualchunks
packages/platform-ui build: - Adjust chunk size limit for this warning via build.chunkSizeWarningLimit.
packages/platform-ui build: ✓ built in 10.41s
packages/platform-ui build: Done choked on src/service/app.ts because the new websocket route handler was typed as a Fastify handler (taking SocketStream) but registered as an @fastify/websocket handler expecting the raw ws socket. TypeScript treated the signature mismatch as an error and also rejected the schema.hide flag. I switched the handler to a plain function (cast only when registering) and cast the hidden schema to Fastify's type, matching runtime behavior.\n\nVerification on latest head:\n- pnpm --filter @agyn/docker-runner lint\n- pnpm --filter @agyn/docker-runner test (9 passed / 0 failed / 0 skipped)\n- pnpm --filter @agyn/docker-runner build\n- pnpm --filter @agyn/platform-ui build (per request, succeeds)

@casey-brooks
Copy link
Contributor Author

Root cause for missing nix installs was two-fold: (1) the runner API rejected idleTimeoutMs: 0, so our initial command -v nix probe never executed, and (2) the HTTP client always aborted runner requests at the global 30s timeout even when callers requested longer exec timeouts (e.g., nix profile install at 60s). Combined, ensureNixPackages never actually ran.\n\nFixes in this PR:\n- allow timeoutMs/idleTimeoutMs values of zero in /v1/exec/run validation.\n- extend the HttpDockerRunnerClient request timeout to max(requestTimeoutMs, max(timeoutMs, idleTimeoutMs) + 5s) so long installs are allowed to complete.\n- regression tests: ContainerService now has coverage ensuring string commands run via /bin/sh -lc (export FOO=bar && echo emits bar), and the HTTP client tests assert the extended timeout behavior.\n\nVerification:\n- pnpm --filter @agyn/docker-runner lint\n- pnpm --filter @agyn/docker-runner test (10 passed / 0 failed / 0 skipped)\n- pnpm --filter @agyn/platform-server lint\n- pnpm --filter @agyn/platform-server test (721 total / 709 passed / 12 skipped / 0 failed)\n- pnpm exec vitest run --reporter=json --outputFile=/tmp/platform-server-vitest.json (counts above)\n- pnpm --filter @agyn/platform-ui build

noa-lucent
noa-lucent previously approved these changes Feb 14, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-checked the timeout fixes: runner now accepts 0ms exec/idle timeouts, and the platform client extends HTTP timeouts to cover longer exec requests (with coverage tests). Added exec shell-semantic regression test ensures string commands still run via /bin/sh. Looks solid.

@casey-brooks
Copy link
Contributor Author

Root cause: when we switched workspace provisioning over to the docker-runner HTTP client we stopped instantiating the local ContainerService, so nothing ever called ContainerRegistry.registerStart. The runner process itself still registers starts (when it has a registry), but platform-server’s HttpDockerRunnerClient path never persisted anything, so new threads showed zero containers even though the runtime was live.\n\nFixes in this change set:\n- Injected ContainerRegistry into DockerWorkspaceRuntimeProvider and now upsert every ensured workspace by inspecting the container via the runner and registering it (including labels, thread ID, TTL, mounts, etc.). This covers both newly-created and reused containers, so existing threads healed automatically.\n- Updated InfraModule wiring to pass the registry to the provider.\n- Added docker.workspace.provider.test.ts to assert that ensuring a workspace calls registerStart with the correct thread ID / labels.\n\nVerification:\n- pnpm --filter @agyn/docker-runner lint\n- pnpm --filter @agyn/docker-runner test → 10 passed / 0 failed / 0 skipped\n- pnpm --filter @agyn/platform-server lint\n- pnpm --filter @agyn/platform-server test → 434 suites / 722 total / 710 passed / 0 failed / 12 skipped\n- pnpm exec vitest run --reporter=json --outputFile=/tmp/platform-server-vitest.json (for stats above)\n- pnpm --filter @agyn/platform-ui build

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • LOG_LEVEL=error pnpm --filter @agyn/platform-server exec -- vitest run --reporter=basic --silent
    • Test Files: 190 passed / 23 skipped (213 total)
    • Tests: 755 passed / 12 skipped (767 total)
  • pnpm --filter @agyn/platform-server lint
    • ESLint completed with no issues

Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebase looks clean. The only delta pulls the shared runnerConfigDefaults into the fs persistence graph test, ensuring the config schema still validates after the merge. No regressions spotted—still good to go.

@vitramir vitramir added this pull request to the merge queue Feb 16, 2026
Merged via the queue into main with commit c53bbed Feb 16, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extract Docker interface into isolated docker-runner service

4 participants