Skip to content

[fix] Repair runner-to-API sessions routing and split internal vs public URLs#5059

Merged
junaway merged 6 commits into
big-agentsfrom
fix/broken-internal-urls
Jul 3, 2026
Merged

[fix] Repair runner-to-API sessions routing and split internal vs public URLs#5059
junaway merged 6 commits into
big-agentsfrom
fix/broken-internal-urls

Conversation

@junaway

@junaway junaway commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Context

Every runner-to-API sessions call (heartbeat, records ingest, mount signing, interactions, states) has silently 404'd since July 1. Transcripts were dropped, no durable cwd was mounted, and the sessions tables stayed empty. The runs themselves kept working, so nothing surfaced.

Root cause: the runner's AGENTA_API_URL grew an /api suffix (commits 91ad02fdc6, f8765a9b89) to fix an AWS reachability incident. The suffix is correct through a proxy that strips it (Traefik) but wrong on the direct api:8000 hop, where routes live at the root. The API logged the calls as POST /api/api/sessions/... 404.

The deeper problem is that one variable name carried two meanings: the public URL (https://host/api, proxy-shaped) and the internal container hop (http://api:8000, no prefix). Whichever shape you picked broke the other consumer.

Changes

Env contract. Path prefixes belong to the public edge; internal hops are addressed by container DNS with no prefix:

AGENTA_API_URL             = https://<host>/api    (public: web, SDK, fallback)
AGENTA_API_INTERNAL_URL    = http://api:8000       (direct: runner egress)
AGENTA_RUNNER_INTERNAL_URL = http://runner:8765    (direct: services to runner)

AGENTA_RUNNER_URL is removed everywhere (code, composes, tests, docs). The runner has no public surface, so a public name for it should not exist. Not released, so no back-compat shim.

Runner. One shared apiBase() helper replaces five copy-pasted env fallbacks (alive, persist, interactions, sandbox_agent, server). It reads AGENTA_API_INTERNAL_URL, falls back to AGENTA_API_URL, then http://api:8000, and trims trailing slashes off the base only (collection endpoints keep theirs). The OTLP exporter uses the same chain with its existing cloud tail.

API. New ApiPrefixStripMiddleware accepts a literal /api prefix on incoming paths. An ALB forwards the public path verbatim; Traefik strips it; a direct hop has none. The app now routes all three shapes, so a wrongly shaped base URL can no longer produce silent 404s.

Before: GET http://api:8000/api/health returned 404.
After: both /health and /api/health return 200, and trailing-slash redirects keep the shape the client spoke (/api/docs/ redirects to /api/docs, /docs/ to /docs).

Merged PR #5005 (fix/self-host-traefik-api-double-prefix): Traefik stops stripping /api in the gh/ssl composes. Its no-strip topology previously depended on SCRIPT_NAME handling; the middleware now guarantees it. Close #5005 when this lands.

Tests / notes

  • Runner: tsc --noEmit clean, 379/379 vitest.
  • Services: test_select_backend.py 7/7 after the env rename.
  • Live on the dev stack: both path shapes return 200 on /health, both shapes of a collection URL behave identically, redirect Location preserves prefix and slash form, and a real agent run resolved and completed with the corrected URL.
  • Dev composes keep Traefik stripping (out of fix(hosting): don't strip /api at Traefik (fixes broken self-host redirects) #5005's scope); gh composes stop. Both route identically through the middleware.
  • Follow-up (separate repo): the same two renames in platform/ compose and env templates.

What to QA

  • Run an agent in the playground, then check docker logs agenta-ee-runner-1: no sessions/... HTTP 404 lines, and select count(*) from session_streams is greater than 0 after the run.
  • Regression: web app and SDK calls through http://localhost/api still work (login, playground, secrets page).
  • Self-host (gh compose): API requests with and without trailing slashes still resolve; no 308 redirect loops behind a TLS proxy.

Agenta Team and others added 4 commits July 1, 2026 20:22
The API app is mounted at /api (SCRIPT_NAME=/api + FastAPI root_path), but
the Traefik router also stripped /api via a stripprefix middleware. The app
never saw /api, so its redirects dropped the prefix and became 308s to the
wrong path behind a TLS proxy. Remove the strip; forward /api/* unchanged.

Claude-Session: https://claude.ai/code/session_01STbkKsnAUZn5v9PiDRiSsY
… URL contract

The runner's sessions calls (heartbeat/records/mounts/interactions/states) all
404'd because AGENTA_API_URL grew an /api suffix on the direct api:8000 hop,
where routes live at root — five subsystems down from one prefix.

- API: ApiPrefixStripMiddleware accepts /api-prefixed paths, so traefik-strip,
  direct, and ALB-verbatim topologies all route with either URL shape.
- Env contract: *_URL is public (host + prefix), *_INTERNAL_URL is the direct
  in-network hop (container DNS, no prefix). Runner reads
  AGENTA_API_INTERNAL_URL; services reads AGENTA_RUNNER_INTERNAL_URL;
  AGENTA_RUNNER_URL is dropped (no public runner surface).
- Runner: one shared apiBase() replaces five copy-pasted fallbacks; otel keeps
  its cloud tail behind the same chain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collection endpoints keep their trailing slash; only the env-provided base is
normalized so a http://api:8000/ value can't produce //-joined paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 3, 2026 15:29
@vercel

vercel Bot commented Jul 3, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment Jul 3, 2026 4:41pm

Request Review

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 59c48a07-1e7d-4f99-9644-908f02c73cb2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/broken-internal-urls

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. Bug Report Something isn't working labels Jul 3, 2026
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ jp-agenta
❌ Agenta Team


Agenta Team seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes runner→API session/telemetry routing regressions caused by ambiguous “API base URL” semantics by separating internal vs public URL configuration, centralizing runner base-url resolution, and making the API tolerant of /api-prefixed inbound paths across different proxy topologies.

Changes:

  • Introduces AGENTA_API_INTERNAL_URL / AGENTA_RUNNER_INTERNAL_URL and updates runner/services/compose/docs to use internal-hop URLs by default.
  • Refactors the runner to use a shared apiBase() helper for all runner→API calls and aligns OTLP exporter fallback selection.
  • Adds API middleware to strip a leading /api prefix so the API routes correctly whether a proxy strips the prefix or forwards it verbatim.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
services/runner/src/tracing/otel.ts Prefer internal API URL for OTLP exporter base; keep public/cloud fallback chain.
services/runner/src/sessions/persist.ts Switch to shared apiBase() helper for ingest.
services/runner/src/sessions/interactions.ts Switch to shared apiBase() helper for interactions ingest.
services/runner/src/sessions/alive.ts Switch to shared apiBase() helper for heartbeats.
services/runner/src/server.ts Use shared apiBase() for session state persistence; import helper.
services/runner/src/engines/sandbox_agent.ts Switch to shared apiBase() helper for runner→API calls.
services/runner/src/apiBase.ts New shared helper implementing internal→public→default base resolution + trailing-slash trimming.
services/runner/README.md Update runner URL env var name in documentation.
services/oss/tests/pytest/unit/agent/test_select_backend.py Update env var name used by backend-selection tests.
services/oss/src/agent/config.py Rename env var read by runner_url() to the new internal URL contract.
services/oss/src/agent/app.py Update docstrings/comments to the renamed runner URL env var.
hosting/docker-compose/oss/docker-compose.gh.yml Stop stripping /api at Traefik for GH; rename runner/API env vars to internal variants.
hosting/docker-compose/oss/docker-compose.gh.ssl.yml Same as above for GH SSL compose.
hosting/docker-compose/oss/docker-compose.gh.local.yml Same as above for GH local compose.
hosting/docker-compose/oss/docker-compose.dev.yml Rename runner/API env vars to internal variants in dev compose.
hosting/docker-compose/ee/docker-compose.gh.yml Stop stripping /api at Traefik for EE GH; rename runner/API env vars to internal variants.
hosting/docker-compose/ee/docker-compose.gh.local.yml Same as above for EE GH local compose.
hosting/docker-compose/ee/docker-compose.dev.yml Rename runner/API env vars to internal variants in EE dev compose.
docs/design/agent-workflows/documentation/running-the-agent.md Update docs to renamed runner URL env var.
docs/design/agent-workflows/documentation/architecture.md Update architecture diagram prose to renamed runner URL env var.
api/entrypoints/routers.py Add middleware to strip /api prefix on inbound requests for consistent routing across proxies.
Comments suppressed due to low confidence (1)

services/oss/tests/pytest/unit/agent/test_select_backend.py:55

  • This test uses http://sandbox-agent:8765 as the example runner URL, but the docker-compose defaults set AGENTA_RUNNER_INTERNAL_URL to http://runner:8765. Using the same hostname here reduces confusion when debugging failures across docs/tests/compose.
def test_runner_url_selects_http_transport(monkeypatch):
    monkeypatch.setenv("AGENTA_RUNNER_INTERNAL_URL", "http://sandbox-agent:8765")

    backend = select_backend(_sel("pi_core", "local"))

    assert backend._url == "http://sandbox-agent:8765"


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 46 to 49
def runner_url() -> Optional[str]:
"""HTTP URL for the deployed agent runner service, when configured."""
value = os.getenv("AGENTA_RUNNER_URL")
"""HTTP URL for the deployed agent runner (internal direct hop), when configured."""
value = os.getenv("AGENTA_RUNNER_INTERNAL_URL")
return value.strip() if value and value.strip() else None
Comment on lines 19 to 26
The Python service finds the runner through `AGENTA_RUNNER_URL`, which defaults to
The Python service finds the runner through `AGENTA_RUNNER_INTERNAL_URL`, which defaults to
`http://sandbox-agent:8765` in every compose stage (for example
`hosting/docker-compose/ee/docker-compose.dev.yml:421`).
Comment thread api/entrypoints/routers.py Outdated
Comment on lines +498 to +506
if scope["type"] in ("http", "websocket"):
path = scope.get("path", "")
if path == "/api" or path.startswith("/api/"):
scope = dict(scope)
scope["path"] = path[4:] or "/"
raw = scope.get("raw_path")
if isinstance(raw, (bytes, bytearray)) and raw[:4] == b"/api":
scope["raw_path"] = bytes(raw)[4:] or b"/"
await self.app(scope, receive, send)
Copilot AI review requested due to automatic review settings July 3, 2026 16:39
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jul 3, 2026
@junaway junaway mentioned this pull request Jul 3, 2026
12 tasks
@junaway junaway merged commit cd633b5 into big-agents Jul 3, 2026
21 of 23 checks passed

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 37 out of 38 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

docs/design/agent-workflows/documentation/architecture.md:44

  • This diagram still calls the Compose service sandbox-agent and points at services/agent/..., but the runner sidecar has moved to services/runner/ and the Compose service is now runner. Updating these references avoids sending readers to non-existent paths/service names.
    | POST /run over HTTP (AGENTA_RUNNER_INTERNAL_URL set)
    | or spawn the runner CLI in a source checkout
    v
agent runner sidecar (Node)
    compose service: sandbox-agent
    HTTP server on :8765
    services/agent/src/server.ts

Comment on lines +23 to +26
if stripped:
scope = dict(scope)
scope["path"] = path
scope["raw_path"] = raw
Comment on lines +47 to +51
@pytest.mark.asyncio
async def test_bare_api_root_strips_to_slash():
scope = await _run(_scope("/api"))
assert scope["path"] == "/"

Comment on lines 58 to 60
drives a harness over ACP through the sandbox-agent daemon. In Docker Compose the sidecar is
named `sandbox-agent`, and the service reaches it through `AGENTA_RUNNER_URL`
named `sandbox-agent`, and the service reaches it through `AGENTA_RUNNER_INTERNAL_URL`
(`services/oss/src/agent/config.py:46`).
mmabrouk added a commit that referenced this pull request Jul 4, 2026
- handler.py: permission_default (the SDK deleted permission_policy in phase 3)
- fold(): the terminal result's stop_reason wins over the done event (the live runner
  emits done without stopReason, so a real pause would otherwise drop its envelope
  metadata) and pending_interaction carries a derived top-level tool name
- sandbox_agent.ts: fold upstream keepers lost to ours-wins conflict resolution
  (shared apiBase module, claude-model + resolved-model log lines)
- service pause test pinned to the realistic stream shape (done event with no stopReason)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Report Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants