Merged
32 commits
5d5b4df
feat: add docker runner backend
casey-brooks Jan 29, 2026
da8242e
fix: preserve runner error semantics
casey-brooks Jan 29, 2026
1e45a1d
fix: restore runner type safety
casey-brooks Jan 29, 2026
39c1fa2
fix: align runner schemas
casey-brooks Jan 29, 2026
61d971e
feat(platform-server): enforce docker runner usage
casey-brooks Jan 29, 2026
703f5e6
fix(docker-runner): buffer newline events
casey-brooks Jan 29, 2026
474bfe1
fix(docker-runner): relax event handler
casey-brooks Jan 29, 2026
118f65b
refactor(auth): drop runner access key
casey-brooks Feb 11, 2026
074982d
chore(runner): load dotenv for dev
casey-brooks Feb 14, 2026
d830c25
chore: trigger ci
casey-brooks Feb 14, 2026
80736e5
chore(ci): allow manual dispatch
casey-brooks Feb 14, 2026
38df4a4
fix(runner): harden websocket closing
casey-brooks Feb 14, 2026
edd2675
fix(runner-platform): ensure ws upgrade stability
casey-brooks Feb 14, 2026
2ed736c
fix(runner): align ws handler types
casey-brooks Feb 14, 2026
2bebcff
fix(workspace): restore runner exec semantics
casey-brooks Feb 14, 2026
b521a6f
fix(workspace): persist runner starts in registry
casey-brooks Feb 14, 2026
a28dca3
feat(containers): add delete endpoint
casey-brooks Feb 14, 2026
043cd77
fix(containers): handle delete runner errors
casey-brooks Feb 14, 2026
d8dda50
test(containers): cover runner delete flow
casey-brooks Feb 15, 2026
8885ad5
fix(containers): force delete when stop fails
casey-brooks Feb 15, 2026
bc96b55
fix(containers): swallow stop errors before delete
casey-brooks Feb 15, 2026
0d3d80f
feat(containers): harden delete observability
casey-brooks Feb 15, 2026
424891a
feat(docker-runner): add docker lifecycle tests
casey-brooks Feb 15, 2026
fe51324
fix(docker-runner): guard route logging types
casey-brooks Feb 15, 2026
4c0801e
fix(containers): guarantee structured delete errors
casey-brooks Feb 15, 2026
7e48d8e
feat(containers): harden docker runner connectivity
casey-brooks Feb 16, 2026
7442ff9
test(platform-server): add docker fullstack flow
casey-brooks Feb 16, 2026
21b9445
fix(platform-server): log structured delete failures
casey-brooks Feb 16, 2026
eb1943b
fix(platform-server): restore container delete di wiring
casey-brooks Feb 16, 2026
11833a9
fix(platform-server): retain delete error stacks
casey-brooks Feb 16, 2026
2612015
fix(terminal): handle missing containers gracefully
casey-brooks Feb 16, 2026
a329e1d
test(graph): add runner defaults to fs persistence
casey-brooks Feb 16, 2026
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -6,6 +6,7 @@ on:
push:
branches: [ main ]
merge_group:
workflow_dispatch:

jobs:
lint:
1 change: 1 addition & 0 deletions .gitignore
@@ -10,6 +10,7 @@ packages/platform-server/dist
packages/platform-server/vitest-report.json
# LangGraph API
.langgraph_api
data/
agent_instances
# Local MCP tool cache
codex-tools-mcp
14 changes: 12 additions & 2 deletions README.md
@@ -117,7 +117,9 @@ pnpm install
```bash
docker compose up -d
# Starts postgres (5442), agents-db (5443), vault (8200), ncps (8501),
# litellm (127.0.0.1:4000), prometheus (9090), grafana (3000), cadvisor (8080)
# litellm (127.0.0.1:4000), docker-runner (7071)
# Optional monitoring (prometheus/grafana) lives in docker-compose.monitoring.yml.
# Enable with: docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
```

4) Apply server migrations and generate Prisma client:
@@ -136,9 +138,13 @@ pnpm --filter @agyn/platform-server run prisma:generate
pnpm --filter @agyn/platform-server dev
# UI (Vite dev server)
pnpm --filter @agyn/platform-ui dev
# docker-runner (Fastify dev server)
pnpm --filter @agyn/docker-runner dev
```
Server listens on PORT (default 3010; see packages/platform-server/src/index.ts and Dockerfile), UI dev server on default Vite port.

The docker-runner dev script automatically loads the first `.env` it finds (prefers repo root, falls back to packages/docker-runner) when `NODE_ENV` is not `production`. Production `pnpm start` keeps relying solely on the surrounding environment, so missing `.env` files do not crash the process.
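
The lookup order described above can be sketched roughly as follows. This is only an illustration under assumptions: `pickEnvFile` and `resolveEnvFile` are hypothetical names, not the runner's actual code, and the file-existence check is injectable so the logic can be exercised without touching disk.

```typescript
import { existsSync } from "node:fs";
import { resolve } from "node:path";

// Hypothetical helper: return the first candidate path that exists, if any.
export function pickEnvFile(
  candidates: string[],
  exists: (path: string) => boolean = existsSync,
): string | undefined {
  return candidates.find((candidate) => exists(candidate));
}

// Hypothetical helper: prefer <repoRoot>/.env, fall back to <packageDir>/.env,
// and skip the lookup entirely when NODE_ENV is "production".
export function resolveEnvFile(
  nodeEnv: string | undefined,
  repoRoot: string,
  packageDir: string,
  exists: (path: string) => boolean = existsSync,
): string | undefined {
  if (nodeEnv === "production") {
    return undefined; // production relies on the surrounding environment only
  }
  return pickEnvFile(
    [resolve(repoRoot, ".env"), resolve(packageDir, ".env")],
    exists,
  );
}
```

A caller could pass the returned path to a loader such as dotenv's `config({ path })`; when the result is `undefined`, nothing is loaded and startup simply continues.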

- Production (Docker):
- Use published images from GHCR (see .github/workflows/docker-ghcr.yml):
- ghcr.io/agynio/platform-server
@@ -179,6 +185,9 @@ Key environment variables (server) from packages/platform-server/.env.example an
- Workspace/Docker:
- WORKSPACE_NETWORK_NAME (default agents_net)
- DOCKER_MIRROR_URL (default http://registry-mirror:5000)
- DOCKER_RUNNER_BASE_URL (required; default http://docker-runner:7071)
- DOCKER_RUNNER_SHARED_SECRET (required HMAC credential)
- DOCKER_RUNNER_TIMEOUT_MS (optional request timeout; default 30000)
- Nix/NCPS:
- NCPS_ENABLED (default false)
- NCPS_URL_SERVER, NCPS_URL_CONTAINER (default http://ncps:8501)
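
As a rough illustration of how the three `DOCKER_RUNNER_*` variables listed above might be read, here is a minimal sketch. `loadDockerRunnerConfig` and `DockerRunnerConfig` are hypothetical names, and treating the base URL as "defaulted when unset" is one interpretation of "required; default"; the real server's validation may differ.

```typescript
// Hypothetical shape for the parsed runner settings.
export interface DockerRunnerConfig {
  baseUrl: string;
  sharedSecret: string;
  timeoutMs: number;
}

// Hypothetical parser: applies the documented defaults and rejects a missing secret.
export function loadDockerRunnerConfig(
  env: Record<string, string | undefined>,
): DockerRunnerConfig {
  // Assumption: fall back to the documented default when the variable is unset.
  const baseUrl = env.DOCKER_RUNNER_BASE_URL || "http://docker-runner:7071";

  const sharedSecret = env.DOCKER_RUNNER_SHARED_SECRET;
  if (!sharedSecret) {
    throw new Error("DOCKER_RUNNER_SHARED_SECRET is required");
  }

  // Optional request timeout; the documented default is 30000 ms.
  const rawTimeout = env.DOCKER_RUNNER_TIMEOUT_MS;
  const timeoutMs =
    rawTimeout === undefined || rawTimeout === "" ? 30000 : Number(rawTimeout);
  if (!Number.isInteger(timeoutMs) || timeoutMs <= 0) {
    throw new Error("DOCKER_RUNNER_TIMEOUT_MS must be a positive integer");
  }

  return { baseUrl, sharedSecret, timeoutMs };
}
```

Typical use would be `loadDockerRunnerConfig(process.env)` once at startup, so a missing secret fails fast instead of surfacing on the first runner call.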
@@ -210,7 +219,8 @@ UI variables (packages/platform-ui/.env.example):
- vault — HashiCorp Vault (8200), auto-init helper vault-auto-init
- ncps — Nix cache proxy (8501)
- litellm + litellm-db — LLM proxy with UI (4000 loopback)
- cadvisor (8080), prometheus (9090), grafana (3000)
- docker-runner — authenticated Docker API proxy (7071, mounts /var/run/docker.sock)
- Optional monitoring overlay (docker-compose.monitoring.yml) adds prometheus (9090) and grafana (3000) without mounting the Docker socket; provide your own scrape targets via configuration.

To start services:
```bash
46 changes: 46 additions & 0 deletions docker-compose.monitoring.yml
@@ -0,0 +1,46 @@
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=7d"
ports:
- "9090:9090"
volumes:
- type: bind
source: ./monitoring/prometheus
target: /etc/prometheus
read_only: true
- prometheus-data:/prometheus
networks:
- agents_net

grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
depends_on:
- prometheus
ports:
- "3000:3000"
volumes:
- type: bind
source: ./monitoring/grafana/provisioning
target: /etc/grafana/provisioning
read_only: true
- grafana-data:/var/lib/grafana
networks:
- agents_net

volumes:
prometheus-data:
driver: local
grafana-data:
driver: local

networks:
agents_net:
external: true
66 changes: 8 additions & 58 deletions docker-compose.yml
@@ -247,66 +247,20 @@ services:
networks:
- agents_net

cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
docker-runner:
build:
context: .
dockerfile: packages/docker-runner/Dockerfile
restart: unless-stopped
ports:
- "8080:8080"
environment:
DOCKER_RUNNER_SHARED_SECRET: ${DOCKER_RUNNER_SHARED_SECRET:-dev-shared-secret}
DOCKER_RUNNER_PORT: ${DOCKER_RUNNER_PORT:-7071}
volumes:
- type: bind
source: /var/run/docker.sock
target: /var/run/docker.sock
read_only: true
- type: bind
source: /sys
target: /sys
read_only: true
- type: bind
source: /var/lib/docker
target: /var/lib/docker
read_only: true
command:
- --docker_only=true
- --disable_metrics=hugetlb,perf_event,resctrl,tcp,udp,process,referenced_memory,disk
networks:
- agents_net

prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=7d"
depends_on:
- cadvisor
ports:
- "9090:9090"
volumes:
- type: bind
source: ./monitoring/prometheus
target: /etc/prometheus
read_only: true
- prometheus-data:/prometheus
networks:
- agents_net

grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
depends_on:
- prometheus
ports:
- "3000:3000"
volumes:
- type: bind
source: ./monitoring/grafana/provisioning
target: /etc/grafana/provisioning
read_only: true
- grafana-data:/var/lib/grafana
- "${DOCKER_RUNNER_PORT:-7071}:7071"
networks:
- agents_net

@@ -320,10 +274,6 @@ volumes:
driver: local
agents_pgdata:
driver: local
prometheus-data:
driver: local
grafana-data:
driver: local

networks:
# Shared user-defined bridge with deterministic name so non-compose containers
13 changes: 12 additions & 1 deletion docs/containers/workspaces.md
@@ -51,4 +51,15 @@ Terminal WebSocket
- Close semantics:
- The gateway closes with code `1000` for normal termination (e.g., client request or exec exit) and `1008` when the request is invalid or the session cannot be validated (`workspace_id_required`, `invalid_query`, `workspace_mismatch`, etc.).
- Before issuing close frames the server always sends the corresponding `error` or `status` payload so clients can surface user-facing feedback.
- Socket shutdown attempts `ws.close(code, reason)` first, then falls back to `ws.terminate()` and finally invokes `ws.end()` to guarantee transport teardown even when Fastify exposes only a `SocketStream` façade.
- Socket shutdown attempts `ws.close(code, reason)` first, then falls back to `ws.terminate()` and finally invokes `ws.end()` to guarantee transport teardown even when Fastify exposes only a `SocketStream` façade.
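
The shutdown fallback described above might look roughly like the sketch below. It is an interpretation, not the gateway's actual code: `ClosableSocket` and `shutdownSocket` are hypothetical names, `terminate()` is tried only when `close()` is missing or throws, and `end()` is always attempted last.

```typescript
// Hypothetical structural type: every method is optional because a wrapper
// (such as a stream façade) may expose only some of them.
export interface ClosableSocket {
  close?: (code?: number, reason?: string) => void;
  terminate?: () => void;
  end?: () => void;
}

// Hypothetical helper: returns the names of the teardown calls that succeeded.
export function shutdownSocket(
  socket: ClosableSocket,
  code: number,
  reason: string,
): string[] {
  const succeeded: string[] = [];

  // Run one teardown step, swallowing errors so later steps still execute.
  const attempt = (name: string, fn: (() => void) | undefined): boolean => {
    if (!fn) return false;
    try {
      fn();
      succeeded.push(name);
      return true;
    } catch {
      return false;
    }
  };

  const close = socket.close?.bind(socket);
  const terminate = socket.terminate?.bind(socket);
  const end = socket.end?.bind(socket);

  // 1. Graceful close with the documented code (1000 or 1008) and reason.
  const closed = attempt("close", close ? () => close(code, reason) : undefined);
  // 2. Assumed fallback: hard terminate only if the graceful close did not work.
  if (!closed) attempt("terminate", terminate);
  // 3. Always try to end the underlying transport.
  attempt("end", end);

  return succeeded;
}
```

For an invalid session this would be called as `shutdownSocket(ws, 1008, "workspace_mismatch")`, after the corresponding `error` payload has been sent.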

## Test-only provisioning endpoint

- The docker-backed full-stack integration test (`packages/platform-server/__tests__/containers.fullstack.docker.integration.test.ts`)
boots a real docker-runner + platform server pair and exercises the HTTP lifecycle.
- Because there is no public "create workspace" REST endpoint, the test registers a private controller at
`POST /test/workspaces`. This controller uses the production `WorkspaceProvider.ensureWorkspace` flow and
stores the resulting container/thread IDs so that `/api/containers/:id` deletion can be exercised end-to-end.
- The controller always provisions an `nginx:1.25-alpine` workspace on the `bridge` network and tags all
containers with `TEST_SUITE=containers-fullstack` for deterministic cleanup.
- The route is only mounted inside the integration test module; it is **not** part of the public API surface.
3 changes: 2 additions & 1 deletion docs/product-spec.md
@@ -116,6 +116,7 @@ Configuration matrix (server env vars)
- VAULT_ENABLED: true|false (default false)
- VAULT_ADDR, VAULT_TOKEN
- DOCKER_MIRROR_URL (default http://registry-mirror:5000)
- DOCKER_RUNNER_BASE_URL, DOCKER_RUNNER_SHARED_SECRET (required for docker-runner), plus optional DOCKER_RUNNER_TIMEOUT_MS (default 30000).
- MCP_TOOLS_STALE_TIMEOUT_MS
- LANGGRAPH_CHECKPOINTER: postgres (default)
- POSTGRES_URL (postgres connection string)
@@ -131,7 +132,7 @@ HTTP API and sockets (pointers)
Runbooks
- Local dev
- Prereqs: Node 18+, pnpm, Docker, Postgres.
- Set: LLM_PROVIDER=litellm, LITELLM_BASE_URL, LITELLM_MASTER_KEY, GITHUB_*, GH_TOKEN, AGENTS_DATABASE_URL. Optional VAULT_* and DOCKER_MIRROR_URL.
- Set: LLM_PROVIDER=litellm, LITELLM_BASE_URL, LITELLM_MASTER_KEY, GITHUB_*, GH_TOKEN, AGENTS_DATABASE_URL, DOCKER_RUNNER_BASE_URL, DOCKER_RUNNER_SHARED_SECRET. Optional VAULT_* and DOCKER_MIRROR_URL.
- Start deps (compose or local Postgres)
- Server: pnpm -w -F @agyn/platform-server dev
- UI: pnpm -w -F @agyn/platform-ui dev
6 changes: 6 additions & 0 deletions docs/technical-overview.md
@@ -66,6 +66,12 @@ Per-workspace Docker-in-Docker and registry mirror
- Readiness: the server waits for the DinD engine to be ready before executing any initial scripts.
- To override the mirror, set environment variable `DOCKER_MIRROR_URL` to an alternate URL.

Remote Docker runner
- The platform-server always routes container lifecycle, exec, and log streaming calls through the `@agyn/docker-runner` service.
- The runner exposes authenticated Fastify HTTP/SSE/WebSocket endpoints with HMAC headers derived solely from `DOCKER_RUNNER_SHARED_SECRET`.
- Only the docker-runner service mounts `/var/run/docker.sock` in default stacks; platform-server and auxiliary services talk to it over the internal network (default http://docker-runner:7071).
- Container events are forwarded via SSE so the existing watcher pipeline (ContainerEventProcessor, cleanup jobs, metrics) remains unchanged.
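
To make the idea of HMAC headers derived from a shared secret concrete, here is a minimal sketch. Everything beyond the use of a shared secret is an assumption: the header names (`x-runner-timestamp`, `x-runner-signature`), the signed string layout, the SHA-256 digest, and the 60-second skew window are hypothetical and may not match the runner's real wire format.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical header names; the actual runner may use different ones.
export interface SignedHeaders {
  "x-runner-timestamp": string;
  "x-runner-signature": string;
}

// Assumed canonical string: method, path, timestamp, and body joined by newlines.
function computeSignature(
  secret: string,
  method: string,
  path: string,
  timestamp: string,
  body: string,
): string {
  return createHmac("sha256", secret)
    .update([method.toUpperCase(), path, timestamp, body].join("\n"))
    .digest("hex");
}

// Client side: derive the headers from the shared secret alone.
export function signRequest(
  secret: string,
  method: string,
  path: string,
  body: string,
  now: number = Date.now(),
): SignedHeaders {
  const timestamp = String(now);
  return {
    "x-runner-timestamp": timestamp,
    "x-runner-signature": computeSignature(secret, method, path, timestamp, body),
  };
}

// Server side: recompute the signature and compare in constant time.
export function verifyRequest(
  secret: string,
  method: string,
  path: string,
  body: string,
  headers: SignedHeaders,
  now: number = Date.now(),
  maxSkewMs: number = 60_000, // assumed replay window
): boolean {
  const timestamp = headers["x-runner-timestamp"];
  const sentAt = Number(timestamp);
  if (!Number.isFinite(sentAt) || Math.abs(now - sentAt) > maxSkewMs) {
    return false; // reject stale or malformed timestamps
  }
  const expected = Buffer.from(
    computeSignature(secret, method, path, timestamp, body),
    "hex",
  );
  const actual = Buffer.from(headers["x-runner-signature"], "hex");
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

Binding the timestamp into the signed string is a common way to limit replay of captured requests, and `timingSafeEqual` avoids leaking how many leading signature bytes matched.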

Defaults and toggles
- LiveGraphRuntime serializes apply operations by default.
- PRTrigger intervalMs default 60000; includeAuthored default false.
5 changes: 0 additions & 5 deletions monitoring/prometheus/prometheus.yml
@@ -7,8 +7,3 @@ scrape_configs:
static_configs:
- targets:
- prometheus:9090

- job_name: cadvisor
static_configs:
- targets:
- cadvisor:8080
49 changes: 49 additions & 0 deletions packages/docker-runner/Dockerfile
@@ -0,0 +1,49 @@
# syntax=docker/dockerfile:1.7

FROM node:20-slim AS base

ENV PNPM_HOME=/pnpm \
PNPM_STORE_PATH=/pnpm-store \
PATH=/pnpm:$PATH

RUN corepack enable \
&& corepack prepare pnpm@10.5.0 --activate

RUN apt-get update \
&& apt-get install -y --no-install-recommends git \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

COPY pnpm-lock.yaml pnpm-workspace.yaml package.json ./

RUN pnpm fetch

FROM base AS build

COPY . .

RUN pnpm install --filter @agyn/docker-runner... --offline --frozen-lockfile

RUN pnpm --filter @agyn/docker-runner run build

RUN pnpm deploy --filter @agyn/docker-runner --prod --legacy /opt/app

FROM node:20-slim AS runtime

ENV NODE_ENV=production \
PORT=7071

WORKDIR /opt/app/packages/docker-runner

RUN apt-get update \
&& apt-get install -y --no-install-recommends git \
&& rm -rf /var/lib/apt/lists/*

COPY --from=build --chown=node:node /opt/app /opt/app

USER node

EXPOSE 7071

CMD ["node", "dist/service/main.js"]