docs(agent-workflows): build-kit tools cleanup design (tool homes, test_run, skills port) by mmabrouk · Pull Request #5060 · Agenta-AI/agenta

mmabrouk · 2026-07-03T15:47:56Z

Context

The playground builder agent carries 19 tools. In practice it wanders between them, picks the wrong one, and cannot test the app it just built. The lab's outside kit proved that a small tool set plus a verified run loop produces reliable builds. On top of that, three of the four authoring skills appear to never have been delivered to the agent by the overlay.

What the docs propose (current state, after review round 1)

Docs only; no code changes. This PR adds the design workspace for the build-kit cleanup, plus the two-part tools review that motivated it.

Where logic-bearing tools live (tool-home-options.md): Option C. The platform-op catalog grows a handler mode; the tool executes server-side on the existing tool-call plane via a reserved callRef, with the relay injecting run context. The resource API stays atomic. Option B (runner-side logic) is rejected: no business logic in the runner. The two survivors, A' (a gateway-config arm with an agenta provider) and C, are the same runtime design; the new "A' vs C, concretely" section compares them field by field against the real dataclasses and recommends C. Codex's independent opinion lands as a PR comment.
Gateway semantics, plus a rename proposal (tool-home-options.md): "gateway" means the call runs through the Agenta gateway; Agenta-implemented actions belong on that plane by design. Rename candidates for the executor term: server (recommended; mirrors client), hosted, platform_run.
test_run contract and the query_spans stopgap (api-design.md): a headless self-test with verdicts ported from the lab's check-tools.sh; query_spans ships first as a pure data add.
Hard-migrate renames: find_capabilities -> discover_tools, find_triggers -> discover_triggers; no aliases.
Overlay cuts: 19 -> 13 default tools (8 core + 5 event ops); every cut op stays in the catalog for opt-in.
One-playbook skills port (skills-port.md): one new build-an-agent skill replaces the three authoring skills, and the overlay embeds ONLY it. agenta-getting-started leaves the overlay and stays harness-forced (universal baseline, not build knowledge). A thin-enough-to-fold-into-the-preamble alternative is flagged.
Six-slice plan (plan.md), sequenced behind the approval-boundary lane on op_catalog.py.

Open decisions live in status.md: overlay scope, the test_run shape (sync plus delta recommended), shipping query_spans now, the gateway rename, and the A'-vs-C pick.

Scope / risk

Docs only; nothing here changes runtime behavior. Two workspace files (context.md, research.md) are absent from this diff: a concurrent lane's sweep commit absorbed them (hand-off filed on the coordination board; readable in #5041's diff meanwhile).

Review history

Round 1 (Mahmoud, 2026-07-04), addressed in commit f54b9e0: corrected the meaning of "gateway" and added the rename proposal, rejected Option B, added the concrete A'-vs-C comparison, made the playbook the overlay's only skill.
CodeRabbit round 1: six comments; five fixed (tracing citation, portable lab path, overlay-scope claim, alignment-matrix markdown, TL;DR counts), one partly declined (aliases stay rejected by decision; an explicit migration-sweep step was added instead).

https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT

vercel · 2026-07-03T15:48:02Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	Jul 4, 2026 9:25pm

mmabrouk · 2026-07-03T15:48:02Z

Review requested on: (1) tool-home-options.md — is Option C (handler mode on the tool-call plane) the right home for test_run? (2) plan.md slice ordering vs the approval-boundary lane; (3) the three open decisions in status.md (overlay scope, test_run sync+delta, query_spans now).

coderabbitai · 2026-07-03T15:48:04Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a5ac7c32-fa07-4474-9162-5b253da224ea

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR adds documentation-only design and review artifacts across two folders: build-kit-tools-cleanup (README, api-design, plan, skills-port, status, tool-home-options) and builder-agent-reliability/tools-review (README, part-1-external-tools, part-2-internal-tools). No source code is modified.

Changes

Build-kit tools cleanup design workspace

Layer / File(s)	Summary
Workspace overview `.../build-kit-tools-cleanup/README.md`	TL;DR of major decisions, reading order, and settled inputs for the redesign.
test_run / query_spans API contract `.../build-kit-tools-cleanup/api-design.md`	Defines request/response shapes, Option C server handler behavior, guards, shape decision, and query_spans read-only op.
Tool execution placement options `.../build-kit-tools-cleanup/tool-home-options.md`	Compares Options A/B/C/D for internal tool execution and recommends Option C.
Migration plan (PR slices) `.../build-kit-tools-cleanup/plan.md`	Ordered slices 0-6 for renames, overlay cuts, query_spans, skills port, test_run, verification, and risks.
Skills consolidation `.../build-kit-tools-cleanup/skills-port.md`	Proposes a single ordered playbook skill replacing three existing skills, with mandatory rules and delivery steps.
Workspace status `.../build-kit-tools-cleanup/status.md`	Tracks decided/open items, blockers, verification checklist, and document links.

Builder-agent reliability tools review

Layer / File(s)	Summary
Reliability review overview `.../tools-review/README.md`	Compares inside platform ops vs outside scripts, gaps, and agreed rename decision.
External tools review `.../tools-review/part-1-external-tools.md`	Alignment matrix, lab evidence, verdicts, and recommendations for outside scripts.
Internal tools review `.../tools-review/part-2-internal-tools.md`	Platform op execution model, verdicts, test_run gap/proposal, and recommended overlay tool set.

Estimated code review effort: 2 (Simple) | ~15 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Agent
  participant PlatformOpHandler
  participant WorkflowRevision
  participant InvokeAPI
  participant SpansAPI
  Agent->>PlatformOpHandler: test_run(inputs, delta, expectations)
  PlatformOpHandler->>WorkflowRevision: resolve committed revision
  PlatformOpHandler->>WorkflowRevision: apply delta in memory
  PlatformOpHandler->>InvokeAPI: signed internal /invoke call
  InvokeAPI-->>PlatformOpHandler: streamed run output
  PlatformOpHandler->>SpansAPI: query spans (fallback)
  SpansAPI-->>PlatformOpHandler: run evidence
  PlatformOpHandler-->>Agent: output, tools, approvals, resolved, verdict

Possibly related PRs

Agenta-AI/agenta#4906: Both PRs relate to the platform tool execution contract around op_catalog.py and context-binding fields referenced in the build-kit cleanup design.
Agenta-AI/agenta#4930: The skills-port proposal replaces skills (build-your-first-app, discover-and-wire-tools, set-up-triggers) originally introduced in this PR.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly summarizes the docs-only build-kit cleanup design and its main themes.
Description check	✅ Passed	The description is detailed and directly matches the docs-only design workspace changes.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch docs/build-kit-tools-cleanup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

mmabrouk · 2026-07-03T15:48:09Z

@coderabbitai review

coderabbitai · 2026-07-03T15:48:20Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 6

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1a7e650a-8607-4b89-bc88-0029f55f7a86

📥 Commits

Reviewing files that changed from the base of the PR and between c2c7ade and 3a6dee9.

📒 Files selected for processing (9)

docs/design/agent-workflows/projects/build-kit-tools-cleanup/README.md
docs/design/agent-workflows/projects/build-kit-tools-cleanup/api-design.md
docs/design/agent-workflows/projects/build-kit-tools-cleanup/plan.md
docs/design/agent-workflows/projects/build-kit-tools-cleanup/skills-port.md
docs/design/agent-workflows/projects/build-kit-tools-cleanup/status.md
docs/design/agent-workflows/projects/build-kit-tools-cleanup/tool-home-options.md
docs/design/agent-workflows/projects/builder-agent-reliability/tools-review/README.md
docs/design/agent-workflows/projects/builder-agent-reliability/tools-review/part-1-external-tools.md
docs/design/agent-workflows/projects/builder-agent-reliability/tools-review/part-2-internal-tools.md

…ls review

mmabrouk · 2026-07-04T17:19:51Z

+  catalog entry. No new endpoint on the resource API.
+- (+) Credential-clean: `/tools/call` runs under the caller's project auth; the internal
+  invoke signs its own token server-side.
+- (-) **Semantics muddle.** "Gateway" means an external provider action executed through


I don't think this is true. Gateway here just means it uses the Agenta gateway to run things. It does not mean it's an external provider per se. At least that's what I meant when I designed it the first time.

Gateway are tools. Even if you kind of load the configuration of the agent locally around it, you still need Agenta to run them because they use the Agenta API. It was always meant to have Agenta as a type of this. Maybe the naming gateway is bad, and please propose a new one, but that was actually the meaning or the intent behind it.

You are right, and the doc now says so. Gateway = the call runs through the Agenta gateway; Agenta-implemented actions are one kind of it by design. I rewrote Option A on that definition (it is now A', the sanctioned gateway/agenta arm, not a semantics hack) and added the rename proposal you asked for: server recommended (it names where the call executes and mirrors the existing client value, so the vocabulary reads client / code / server), with hosted and platform_run considered and argued against. See tool-home-options.md, "What gateway means (corrected)" and "Rename proposal".

mmabrouk · 2026-07-04T17:20:36Z

+
+## Option B: runner-side composite ops
+
+**What it is.** The runner learns multi-step ops: on `test_run`, the runner itself


I'm not much in favor of having business logic in the runner. The runner should teach us to run in the agens.

Recorded as decided: no business logic in the runner; the runner just runs the agents. Option B is now marked REJECTED in tool-home-options.md with your verdict as the closing reason (it also matched the doc's recommendation), and status.md lists it under "Decided (do not relitigate)".

…ract into the test_run design

mmabrouk · 2026-07-04T17:25:51Z

+  view; the server-side path gets revision hydration, delta application, and internal
+  token signing for free (`_prepare_invoke`).
+
+## Option C: a first-class server-side executor for internal logic-bearing tools


OK, just to make sure I understand this correctly. Basically, you're saying we still implement the tool under tool calls, but what you want to do is, instead of having a desktop-type gateway, you want to have a desktop platform op. You want to change the shape of the schema of the platform op to add more or less where the endpoint lies, right?

Yeah, I think this is also an option. I would like to see kind of the changes that you would make to the interface and the logic in both option A and B so that they can compare. Ask also a codex for its opening. xhigh

Yes, you understood it exactly: still a tool under the tool-call plane, but the platform-op schema grows a mode that says where the logic lives (a handler) instead of which REST endpoint to wrap. I added a new section "A' vs C, concretely" to tool-home-options.md with the side-by-side you asked for: (a) the config the author commits, (b) the field-level schema change against the real dataclasses (GatewayToolConfig surgery for A' vs a handler mode on PlatformOp for C), (c) the wire spec the runner sees (identical in both), (d) a per-file change table, and (e) what stays identical (the Python handler, the $ctx binding, the credential pattern). Recommendation stays C: same wire and same server either way; A' spends its budget relaxing GatewayToolConfig and still reads the platform catalog for metadata. Codex's independent xhigh opinion is being posted as a separate PR comment.

mmabrouk · 2026-07-04T17:26:59Z

Heads up mid-review: JP's #5064 (merged to big-agents last night) changed the invoke contract, and I pushed a new commit (b961f90) to this PR folding it in. No amend; everything you already read is untouched except the three files below.

What changed upstream: batch (non-streaming) invoke now returns the full turn. data.outputs.messages carries the assistant text plus a role: "tool" entry for every tool call and result, in order, alongside stop_reason and pending_interaction when the run paused on a gate. Canonical spec: docs/designs/invoke-negotiations/specs.md.

What that simplifies here: the test_run handler in api-design.md no longer streams and drains. It makes one non-streaming invoke and digests the response body directly. Spans are queried (with retry) only for two things the body cannot settle: returned-output verification of gated writes, and the resolved config. The old "output may be empty on a tool-call ending" defect is fixed; the surviving rule is that the last message may be a tool message. part-2-internal-tools.md drops stream parsing from the why-it-cannot-be-composed list; status.md gets a dated note.

What did NOT change: the silent fallback on malformed parameters still needs #5002; resolved config still comes only from the trace; there is still no max-turns knob; invoke still lives outside /api. So the composition still exceeds a single-endpoint descriptor and the Option C conclusion stands. Batch also still blocks for the whole run, so the duration-cap reasoning holds.

Two side notes: the external build-agent skill is being updated for #5064 in parallel, and the sibling streaming-invoke project docs (#5003, different lane, not touched here) are now largely superseded by #5064; flagging only.

mmabrouk · 2026-07-04T17:27:15Z

+
+## The shape
+
+- **`agenta-getting-started` stays** as the forced baseline (unchanged scope: platform


Didn't you say before that you want to have all miles skating overall and not multiple ones? Why do you want to keep Agenta getting started here?

Fair catch: the goal is one skill and the first draft did not deliver that. Revised in skills-port.md, "The shape": the overlay now embeds ONLY the playbook, so the builder sees exactly one authoring skill. agenta-getting-started leaves the overlay and stays where it already lives, the harness-forced set: it is the platform-manners baseline for EVERY Agenta agent (deployed ones included), not build knowledge, and today the overlay was double-delivering it next to the harness force. Also flagged an alternative for you: its body is about twenty lines, thin enough to fold into the harness preamble and delete the skill entirely, which would leave the playbook as the only platform skill anywhere.

…ntics, A-vs-C comparison, single-skill overlay

mmabrouk

Round-1 revision landed as one new commit (f54b9e0, no amend). The three inline markers below point at the hunks that need your eyes.

mmabrouk · 2026-07-04T17:44:11Z

+
+| Candidate | Reading | Verdict |
+|---|---|---|
+| `server` | the call executes server-side on Agenta | **Recommended.** It names the execution locus, and it mirrors the existing `client` (browser-fulfilled) value, so the executor vocabulary reads as a set of places: `client` / `code` (sandbox) / `server`. Boring and predictable, which is the goal. |


Review this: the corrected gateway definition and the rename proposal (recommended: server, mirroring the existing client; hosted and platform_run argued against).

mmabrouk · 2026-07-04T17:44:11Z

+  (invoke + wait + digest + verdict) are tool semantics, not resource semantics.
+- (-) Grows the public API surface for something only the builder agent calls.
+
+## A' vs C, concretely


Review this: the concrete A'-vs-C comparison you asked for (author config, field-level schema change on the real dataclasses, identical wire, per-file table, what stays identical), ending in the C recommendation; Codex second opinion lands as a PR comment.

mmabrouk · 2026-07-04T17:44:11Z

+slugs (evidence: [research.md](research.md) gotcha 1; verify live). So this port is also
+an attach fix: the new playbook must actually ride the overlay.
+
+## The shape (revised, review round 1)


Review this: the single-skill decision. The overlay now embeds ONLY the playbook; getting-started leaves the overlay and stays harness-forced, with a fold-into-preamble alternative flagged.

mmabrouk · 2026-07-04T17:52:44Z

Codex second opinion (xhigh, read the real code), as referenced in tool-home-options.md:

Verdict: Option C. "A' is honest semantically but wrong structurally." GatewayToolConfig is provider/connection-shaped (required integration + connection, five-segment reference grammar); bolting a polymorphic internal-action arm onto it forces gateway model surgery and still needs the platform catalog for metadata. PlatformOp already IS the Agenta-owned internal op registry. C has the smaller blast radius: catalog, platform resolver, mirrored wire types/tests, relay callback branch, Python tools handler.

Refinements adopted into the implementation:

Spec binding field named contextBindings, not context (collides with trace propagation on AgentRunRequest).
Hidden target path becomes target.workflow_variant_id (routing/context role; test_run does not commit).
Explicit timeoutMs plumbing (real wire change; 30s callback / 60s relay ceilings exist today).
Approval untouched: emit readOnly: false only; the permission ladder consumes readOnlyHint as-is.
XOR mode is fine with guards: method+path XOR handler, handler allowlist, separate to_call() / to_call_ref() helpers; nested discriminated target only if handler-only fields multiply.

Also settled: the gateway->server rename is deferred out of this work (persisted wire kinds are callback/code/client; renaming config literals is unrelated blast radius; docs/UI labels only, later). Option E (static workflow via the existing workflow.* call_ref) rejected: it would bury deterministic product logic inside an agent workflow or need a new builtin workflow handler, a worse hidden mechanism.

Implementation is underway per Mahmoud's go-ahead (Codex xhigh authors, Fable reviews, GitButler lane, one draft PR).

…5059 + decisions and implementation status Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT

…haned-timeout, 30s runner abort, resolver flag)

…e map

vercel Bot deployed to Preview July 3, 2026 15:48 View deployment

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

docs(agent-workflows): build-kit tools cleanup design workspace + too…

9030eb2

…ls review

mmabrouk commented Jul 4, 2026

View reviewed changes

docs(build-kit-tools-cleanup): fold the #5064 invoke negotiation cont…

b961f90

…ract into the test_run design

mmabrouk commented Jul 4, 2026

View reviewed changes

mmabrouk force-pushed the docs/build-kit-tools-cleanup branch from 3a6dee9 to b961f90 Compare July 4, 2026 17:26

mmabrouk commented Jul 4, 2026

View reviewed changes

vercel Bot deployed to Preview July 4, 2026 17:27 View deployment

docs(build-kit-tools-cleanup): review round 1: corrected gateway sema…

f54b9e0

…ntics, A-vs-C comparison, single-skill overlay

vercel Bot deployed to Preview July 4, 2026 17:43 View deployment

mmabrouk commented Jul 4, 2026

View reviewed changes

docs(build-kit-tools-cleanup): phase-0 reconciliation vs #5041/#5064/#…

ecd1557

…5059 + decisions and implementation status Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT

vercel Bot deployed to Preview July 4, 2026 18:49 View deployment

docs(build-kit-tools-cleanup): 5a->5b contract (recursion header, orp…

ceb5495

…haned-timeout, 30s runner abort, resolver flag)

vercel Bot deployed to Preview July 4, 2026 20:29 View deployment

docs(build-kit-tools-cleanup): landed (server half) status + per-slic…

d68193f

…e map

mmabrouk mentioned this pull request Jul 4, 2026

feat(sdk,api): build-kit tools cleanup: discover renames, 12-op overlay, query_spans, one playbook skill, test_run server half #5068

Draft

vercel Bot deployed to Preview July 4, 2026 21:25 View deployment


		## Option B: runner-side composite ops

		What it is. The runner learns multi-step ops: on `test_run`, the runner itself


		## The shape

		- `agenta-getting-started` stays as the forced baseline (unchanged scope: platform

Uh oh!

Conversation

mmabrouk commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

What the docs propose (current state, after review round 1)

Scope / risk

Review history

Uh oh!

vercel Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mmabrouk commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

mmabrouk commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mmabrouk commented Jul 4, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mmabrouk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mmabrouk commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mmabrouk commented Jul 3, 2026 •

edited

Loading

vercel Bot commented Jul 3, 2026 •

edited

Loading

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading