[fix] Repair runner-to-API sessions routing and split internal vs public URLs#5059
Conversation
The API app is mounted at /api (SCRIPT_NAME=/api + FastAPI root_path), but the Traefik router also stripped /api via a stripprefix middleware. The app never saw /api, so its redirects dropped the prefix and became 308s to the wrong path behind a TLS proxy. Remove the strip; forward /api/* unchanged. Claude-Session: https://claude.ai/code/session_01STbkKsnAUZn5v9PiDRiSsY
… URL contract The runner's sessions calls (heartbeat/records/mounts/interactions/states) all 404'd because AGENTA_API_URL grew an /api suffix on the direct api:8000 hop, where routes live at root — five subsystems down from one prefix. - API: ApiPrefixStripMiddleware accepts /api-prefixed paths, so traefik-strip, direct, and ALB-verbatim topologies all route with either URL shape. - Env contract: *_URL is public (host + prefix), *_INTERNAL_URL is the direct in-network hop (container DNS, no prefix). Runner reads AGENTA_API_INTERNAL_URL; services reads AGENTA_RUNNER_INTERNAL_URL; AGENTA_RUNNER_URL is dropped (no public runner surface). - Runner: one shared apiBase() replaces five copy-pasted fallbacks; otel keeps its cloud tail behind the same chain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-prefix' into fix/broken-internal-urls
Collection endpoints keep their trailing slash; only the env-provided base is normalized so a http://api:8000/ value can't produce //-joined paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Agenta Team seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
There was a problem hiding this comment.
Pull request overview
Fixes runner→API session/telemetry routing regressions caused by ambiguous “API base URL” semantics by separating internal vs public URL configuration, centralizing runner base-url resolution, and making the API tolerant of /api-prefixed inbound paths across different proxy topologies.
Changes:
- Introduces
AGENTA_API_INTERNAL_URL/AGENTA_RUNNER_INTERNAL_URLand updates runner/services/compose/docs to use internal-hop URLs by default. - Refactors the runner to use a shared
apiBase()helper for all runner→API calls and aligns OTLP exporter fallback selection. - Adds API middleware to strip a leading
/apiprefix so the API routes correctly whether a proxy strips the prefix or forwards it verbatim.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| services/runner/src/tracing/otel.ts | Prefer internal API URL for OTLP exporter base; keep public/cloud fallback chain. |
| services/runner/src/sessions/persist.ts | Switch to shared apiBase() helper for ingest. |
| services/runner/src/sessions/interactions.ts | Switch to shared apiBase() helper for interactions ingest. |
| services/runner/src/sessions/alive.ts | Switch to shared apiBase() helper for heartbeats. |
| services/runner/src/server.ts | Use shared apiBase() for session state persistence; import helper. |
| services/runner/src/engines/sandbox_agent.ts | Switch to shared apiBase() helper for runner→API calls. |
| services/runner/src/apiBase.ts | New shared helper implementing internal→public→default base resolution + trailing-slash trimming. |
| services/runner/README.md | Update runner URL env var name in documentation. |
| services/oss/tests/pytest/unit/agent/test_select_backend.py | Update env var name used by backend-selection tests. |
| services/oss/src/agent/config.py | Rename env var read by runner_url() to the new internal URL contract. |
| services/oss/src/agent/app.py | Update docstrings/comments to the renamed runner URL env var. |
| hosting/docker-compose/oss/docker-compose.gh.yml | Stop stripping /api at Traefik for GH; rename runner/API env vars to internal variants. |
| hosting/docker-compose/oss/docker-compose.gh.ssl.yml | Same as above for GH SSL compose. |
| hosting/docker-compose/oss/docker-compose.gh.local.yml | Same as above for GH local compose. |
| hosting/docker-compose/oss/docker-compose.dev.yml | Rename runner/API env vars to internal variants in dev compose. |
| hosting/docker-compose/ee/docker-compose.gh.yml | Stop stripping /api at Traefik for EE GH; rename runner/API env vars to internal variants. |
| hosting/docker-compose/ee/docker-compose.gh.local.yml | Same as above for EE GH local compose. |
| hosting/docker-compose/ee/docker-compose.dev.yml | Rename runner/API env vars to internal variants in EE dev compose. |
| docs/design/agent-workflows/documentation/running-the-agent.md | Update docs to renamed runner URL env var. |
| docs/design/agent-workflows/documentation/architecture.md | Update architecture diagram prose to renamed runner URL env var. |
| api/entrypoints/routers.py | Add middleware to strip /api prefix on inbound requests for consistent routing across proxies. |
Comments suppressed due to low confidence (1)
services/oss/tests/pytest/unit/agent/test_select_backend.py:55
- This test uses
http://sandbox-agent:8765as the example runner URL, but the docker-compose defaults setAGENTA_RUNNER_INTERNAL_URLtohttp://runner:8765. Using the same hostname here reduces confusion when debugging failures across docs/tests/compose.
def test_runner_url_selects_http_transport(monkeypatch):
monkeypatch.setenv("AGENTA_RUNNER_INTERNAL_URL", "http://sandbox-agent:8765")
backend = select_backend(_sel("pi_core", "local"))
assert backend._url == "http://sandbox-agent:8765"
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def runner_url() -> Optional[str]: | ||
| """HTTP URL for the deployed agent runner service, when configured.""" | ||
| value = os.getenv("AGENTA_RUNNER_URL") | ||
| """HTTP URL for the deployed agent runner (internal direct hop), when configured.""" | ||
| value = os.getenv("AGENTA_RUNNER_INTERNAL_URL") | ||
| return value.strip() if value and value.strip() else None |
| The Python service finds the runner through `AGENTA_RUNNER_URL`, which defaults to | ||
| The Python service finds the runner through `AGENTA_RUNNER_INTERNAL_URL`, which defaults to | ||
| `http://sandbox-agent:8765` in every compose stage (for example | ||
| `hosting/docker-compose/ee/docker-compose.dev.yml:421`). |
| if scope["type"] in ("http", "websocket"): | ||
| path = scope.get("path", "") | ||
| if path == "/api" or path.startswith("/api/"): | ||
| scope = dict(scope) | ||
| scope["path"] = path[4:] or "/" | ||
| raw = scope.get("raw_path") | ||
| if isinstance(raw, (bytes, bytearray)) and raw[:4] == b"/api": | ||
| scope["raw_path"] = bytes(raw)[4:] or b"/" | ||
| await self.app(scope, receive, send) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 37 out of 38 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
docs/design/agent-workflows/documentation/architecture.md:44
- This diagram still calls the Compose service
sandbox-agentand points atservices/agent/..., but the runner sidecar has moved toservices/runner/and the Compose service is nowrunner. Updating these references avoids sending readers to non-existent paths/service names.
| POST /run over HTTP (AGENTA_RUNNER_INTERNAL_URL set)
| or spawn the runner CLI in a source checkout
v
agent runner sidecar (Node)
compose service: sandbox-agent
HTTP server on :8765
services/agent/src/server.ts
| if stripped: | ||
| scope = dict(scope) | ||
| scope["path"] = path | ||
| scope["raw_path"] = raw |
| @pytest.mark.asyncio | ||
| async def test_bare_api_root_strips_to_slash(): | ||
| scope = await _run(_scope("/api")) | ||
| assert scope["path"] == "/" | ||
|
|
| drives a harness over ACP through the sandbox-agent daemon. In Docker Compose the sidecar is | ||
| named `sandbox-agent`, and the service reaches it through `AGENTA_RUNNER_URL` | ||
| named `sandbox-agent`, and the service reaches it through `AGENTA_RUNNER_INTERNAL_URL` | ||
| (`services/oss/src/agent/config.py:46`). |
- handler.py: permission_default (the SDK deleted permission_policy in phase 3) - fold(): the terminal result's stop_reason wins over the done event (the live runner emits done without stopReason, so a real pause would otherwise drop its envelope metadata) and pending_interaction carries a derived top-level tool name - sandbox_agent.ts: fold upstream keepers lost to ours-wins conflict resolution (shared apiBase module, claude-model + resolved-model log lines) - service pause test pinned to the realistic stream shape (done event with no stopReason)
Context
Every runner-to-API sessions call (heartbeat, records ingest, mount signing, interactions, states) has silently 404'd since July 1. Transcripts were dropped, no durable cwd was mounted, and the sessions tables stayed empty. The runs themselves kept working, so nothing surfaced.
Root cause: the runner's
AGENTA_API_URLgrew an/apisuffix (commits91ad02fdc6,f8765a9b89) to fix an AWS reachability incident. The suffix is correct through a proxy that strips it (Traefik) but wrong on the directapi:8000hop, where routes live at the root. The API logged the calls asPOST /api/api/sessions/... 404.The deeper problem is that one variable name carried two meanings: the public URL (
https://host/api, proxy-shaped) and the internal container hop (http://api:8000, no prefix). Whichever shape you picked broke the other consumer.Changes
Env contract. Path prefixes belong to the public edge; internal hops are addressed by container DNS with no prefix:
AGENTA_RUNNER_URLis removed everywhere (code, composes, tests, docs). The runner has no public surface, so a public name for it should not exist. Not released, so no back-compat shim.Runner. One shared
apiBase()helper replaces five copy-pasted env fallbacks (alive, persist, interactions, sandbox_agent, server). It readsAGENTA_API_INTERNAL_URL, falls back toAGENTA_API_URL, thenhttp://api:8000, and trims trailing slashes off the base only (collection endpoints keep theirs). The OTLP exporter uses the same chain with its existing cloud tail.API. New
ApiPrefixStripMiddlewareaccepts a literal/apiprefix on incoming paths. An ALB forwards the public path verbatim; Traefik strips it; a direct hop has none. The app now routes all three shapes, so a wrongly shaped base URL can no longer produce silent 404s.Before:
GET http://api:8000/api/healthreturned 404.After: both
/healthand/api/healthreturn 200, and trailing-slash redirects keep the shape the client spoke (/api/docs/redirects to/api/docs,/docs/to/docs).Merged PR #5005 (
fix/self-host-traefik-api-double-prefix): Traefik stops stripping/apiin the gh/ssl composes. Its no-strip topology previously depended onSCRIPT_NAMEhandling; the middleware now guarantees it. Close #5005 when this lands.Tests / notes
tsc --noEmitclean, 379/379 vitest.test_select_backend.py7/7 after the env rename./health, both shapes of a collection URL behave identically, redirectLocationpreserves prefix and slash form, and a real agent run resolved and completed with the corrected URL.platform/compose and env templates.What to QA
docker logs agenta-ee-runner-1: nosessions/... HTTP 404lines, andselect count(*) from session_streamsis greater than 0 after the run.http://localhost/apistill work (login, playground, secrets page).