docs(agent-workflows): build-kit tools cleanup design (tool homes, test_run, skills port)#5060
docs(agent-workflows): build-kit tools cleanup design (tool homes, test_run, skills port)#5060mmabrouk wants to merge 6 commits into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Review requested on: (1) tool-home-options.md — is Option C (handler mode on the tool-call plane) the right home for test_run? (2) plan.md slice ordering vs the approval-boundary lane; (3) the three open decisions in status.md (overlay scope, test_run sync+delta, query_spans now). |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds documentation-only design and review artifacts across two folders: build-kit-tools-cleanup (README, api-design, plan, skills-port, status, tool-home-options) and builder-agent-reliability/tools-review (README, part-1-external-tools, part-2-internal-tools). No source code is modified. ChangesBuild-kit tools cleanup design workspace
Builder-agent reliability tools review
Estimated code review effort: 2 (Simple) | ~15 minutes Sequence Diagram(s)sequenceDiagram
participant Agent
participant PlatformOpHandler
participant WorkflowRevision
participant InvokeAPI
participant SpansAPI
Agent->>PlatformOpHandler: test_run(inputs, delta, expectations)
PlatformOpHandler->>WorkflowRevision: resolve committed revision
PlatformOpHandler->>WorkflowRevision: apply delta in memory
PlatformOpHandler->>InvokeAPI: signed internal /invoke call
InvokeAPI-->>PlatformOpHandler: streamed run output
PlatformOpHandler->>SpansAPI: query spans (fallback)
SpansAPI-->>PlatformOpHandler: run evidence
PlatformOpHandler-->>Agent: output, tools, approvals, resolved, verdict
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 6
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 1a7e650a-8607-4b89-bc88-0029f55f7a86
📒 Files selected for processing (9)
docs/design/agent-workflows/projects/build-kit-tools-cleanup/README.mddocs/design/agent-workflows/projects/build-kit-tools-cleanup/api-design.mddocs/design/agent-workflows/projects/build-kit-tools-cleanup/plan.mddocs/design/agent-workflows/projects/build-kit-tools-cleanup/skills-port.mddocs/design/agent-workflows/projects/build-kit-tools-cleanup/status.mddocs/design/agent-workflows/projects/build-kit-tools-cleanup/tool-home-options.mddocs/design/agent-workflows/projects/builder-agent-reliability/tools-review/README.mddocs/design/agent-workflows/projects/builder-agent-reliability/tools-review/part-1-external-tools.mddocs/design/agent-workflows/projects/builder-agent-reliability/tools-review/part-2-internal-tools.md
| catalog entry. No new endpoint on the resource API. | ||
| - (+) Credential-clean: `/tools/call` runs under the caller's project auth; the internal | ||
| invoke signs its own token server-side. | ||
| - (-) **Semantics muddle.** "Gateway" means an external provider action executed through |
There was a problem hiding this comment.
I don't think this is true. Gateway here just means it uses the Agenta gateway to run things. It does not mean it's an external provider per se. At least that's what I meant when I designed it the first time.
Gateway are tools. Even if you kind of load the configuration of the agent locally around it, you still need Agenta to run them because they use the Agenta API. It was always meant to have Agenta as a type of this. Maybe the naming gateway is bad, and please propose a new one, but that was actually the meaning or the intent behind it.
There was a problem hiding this comment.
You are right, and the doc now says so. Gateway = the call runs through the Agenta gateway; Agenta-implemented actions are one kind of it by design. I rewrote Option A on that definition (it is now A', the sanctioned gateway/agenta arm, not a semantics hack) and added the rename proposal you asked for: server recommended (it names where the call executes and mirrors the existing client value, so the vocabulary reads client / code / server), with hosted and platform_run considered and argued against. See tool-home-options.md, "What gateway means (corrected)" and "Rename proposal".
|
|
||
| ## Option B: runner-side composite ops | ||
|
|
||
| **What it is.** The runner learns multi-step ops: on `test_run`, the runner itself |
There was a problem hiding this comment.
I'm not much in favor of having business logic in the runner. The runner should teach us to run in the agens.
There was a problem hiding this comment.
Recorded as decided: no business logic in the runner; the runner just runs the agents. Option B is now marked REJECTED in tool-home-options.md with your verdict as the closing reason (it also matched the doc's recommendation), and status.md lists it under "Decided (do not relitigate)".
…ract into the test_run design
| view; the server-side path gets revision hydration, delta application, and internal | ||
| token signing for free (`_prepare_invoke`). | ||
|
|
||
| ## Option C: a first-class server-side executor for internal logic-bearing tools |
There was a problem hiding this comment.
OK, just to make sure I understand this correctly. Basically, you're saying we still implement the tool under tool calls, but what you want to do is, instead of having a desktop-type gateway, you want to have a desktop platform op. You want to change the shape of the schema of the platform op to add more or less where the endpoint lies, right?
Yeah, I think this is also an option. I would like to see kind of the changes that you would make to the interface and the logic in both option A and B so that they can compare. Ask also a codex for its opening. xhigh
There was a problem hiding this comment.
Yes, you understood it exactly: still a tool under the tool-call plane, but the platform-op schema grows a mode that says where the logic lives (a handler) instead of which REST endpoint to wrap. I added a new section "A' vs C, concretely" to tool-home-options.md with the side-by-side you asked for: (a) the config the author commits, (b) the field-level schema change against the real dataclasses (GatewayToolConfig surgery for A' vs a handler mode on PlatformOp for C), (c) the wire spec the runner sees (identical in both), (d) a per-file change table, and (e) what stays identical (the Python handler, the $ctx binding, the credential pattern). Recommendation stays C: same wire and same server either way; A' spends its budget relaxing GatewayToolConfig and still reads the platform catalog for metadata. Codex's independent xhigh opinion is being posted as a separate PR comment.
3a6dee9 to
b961f90
Compare
|
Heads up mid-review: JP's #5064 (merged to big-agents last night) changed the invoke contract, and I pushed a new commit (b961f90) to this PR folding it in. No amend; everything you already read is untouched except the three files below. What changed upstream: batch (non-streaming) invoke now returns the full turn. What that simplifies here: the test_run handler in api-design.md no longer streams and drains. It makes one non-streaming invoke and digests the response body directly. Spans are queried (with retry) only for two things the body cannot settle: returned-output verification of gated writes, and the resolved config. The old "output may be empty on a tool-call ending" defect is fixed; the surviving rule is that the last message may be a tool message. part-2-internal-tools.md drops stream parsing from the why-it-cannot-be-composed list; status.md gets a dated note. What did NOT change: the silent fallback on malformed parameters still needs #5002; resolved config still comes only from the trace; there is still no max-turns knob; invoke still lives outside /api. So the composition still exceeds a single-endpoint descriptor and the Option C conclusion stands. Batch also still blocks for the whole run, so the duration-cap reasoning holds. Two side notes: the external build-agent skill is being updated for #5064 in parallel, and the sibling streaming-invoke project docs (#5003, different lane, not touched here) are now largely superseded by #5064; flagging only. |
|
|
||
| ## The shape | ||
|
|
||
| - **`agenta-getting-started` stays** as the forced baseline (unchanged scope: platform |
There was a problem hiding this comment.
Didn't you say before that you want to have all miles skating overall and not multiple ones? Why do you want to keep Agenta getting started here?
There was a problem hiding this comment.
Fair catch: the goal is one skill and the first draft did not deliver that. Revised in skills-port.md, "The shape": the overlay now embeds ONLY the playbook, so the builder sees exactly one authoring skill. agenta-getting-started leaves the overlay and stays where it already lives, the harness-forced set: it is the platform-manners baseline for EVERY Agenta agent (deployed ones included), not build knowledge, and today the overlay was double-delivering it next to the harness force. Also flagged an alternative for you: its body is about twenty lines, thin enough to fold into the harness preamble and delete the skill entirely, which would leave the playbook as the only platform skill anywhere.
…ntics, A-vs-C comparison, single-skill overlay
|
|
||
| | Candidate | Reading | Verdict | | ||
| |---|---|---| | ||
| | `server` | the call executes server-side on Agenta | **Recommended.** It names the execution locus, and it mirrors the existing `client` (browser-fulfilled) value, so the executor vocabulary reads as a set of places: `client` / `code` (sandbox) / `server`. Boring and predictable, which is the goal. | |
There was a problem hiding this comment.
Review this: the corrected gateway definition and the rename proposal (recommended: server, mirroring the existing client; hosted and platform_run argued against).
| (invoke + wait + digest + verdict) are tool semantics, not resource semantics. | ||
| - (-) Grows the public API surface for something only the builder agent calls. | ||
|
|
||
| ## A' vs C, concretely |
There was a problem hiding this comment.
Review this: the concrete A'-vs-C comparison you asked for (author config, field-level schema change on the real dataclasses, identical wire, per-file table, what stays identical), ending in the C recommendation; Codex second opinion lands as a PR comment.
| slugs (evidence: [research.md](research.md) gotcha 1; verify live). So this port is also | ||
| an attach fix: the new playbook must actually ride the overlay. | ||
|
|
||
| ## The shape (revised, review round 1) |
There was a problem hiding this comment.
Review this: the single-skill decision. The overlay now embeds ONLY the playbook; getting-started leaves the overlay and stays harness-forced, with a fold-into-preamble alternative flagged.
|
Codex second opinion (xhigh, read the real code), as referenced in tool-home-options.md: Verdict: Option C. "A' is honest semantically but wrong structurally." Refinements adopted into the implementation:
Also settled: the gateway->server rename is deferred out of this work (persisted wire kinds are callback/code/client; renaming config literals is unrelated blast radius; docs/UI labels only, later). Option E (static workflow via the existing workflow.* call_ref) rejected: it would bury deterministic product logic inside an agent workflow or need a new builtin workflow handler, a worse hidden mechanism. Implementation is underway per Mahmoud's go-ahead (Codex xhigh authors, Fable reviews, GitButler lane, one draft PR). |
…5059 + decisions and implementation status Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
…haned-timeout, 30s runner abort, resolver flag)
Context
The playground builder agent carries 19 tools. In practice it wanders between them, picks the wrong one, and cannot test the app it just built. The lab's outside kit proved that a small tool set plus a verified run loop produces reliable builds. On top of that, three of the four authoring skills appear to never have been delivered to the agent by the overlay.
What the docs propose (current state, after review round 1)
Docs only; no code changes. This PR adds the design workspace for the build-kit cleanup, plus the two-part tools review that motivated it.
tool-home-options.md): Option C. The platform-op catalog grows ahandlermode; the tool executes server-side on the existing tool-call plane via a reservedcallRef, with the relay injecting run context. The resource API stays atomic. Option B (runner-side logic) is rejected: no business logic in the runner. The two survivors, A' (a gateway-config arm with anagentaprovider) and C, are the same runtime design; the new "A' vs C, concretely" section compares them field by field against the real dataclasses and recommends C. Codex's independent opinion lands as a PR comment.tool-home-options.md): "gateway" means the call runs through the Agenta gateway; Agenta-implemented actions belong on that plane by design. Rename candidates for the executor term:server(recommended; mirrorsclient),hosted,platform_run.test_runcontract and thequery_spansstopgap (api-design.md): a headless self-test with verdicts ported from the lab'scheck-tools.sh;query_spansships first as a pure data add.find_capabilities->discover_tools,find_triggers->discover_triggers; no aliases.skills-port.md): one newbuild-an-agentskill replaces the three authoring skills, and the overlay embeds ONLY it.agenta-getting-startedleaves the overlay and stays harness-forced (universal baseline, not build knowledge). A thin-enough-to-fold-into-the-preamble alternative is flagged.plan.md), sequenced behind the approval-boundary lane onop_catalog.py.Open decisions live in
status.md: overlay scope, thetest_runshape (sync plus delta recommended), shippingquery_spansnow, the gateway rename, and the A'-vs-C pick.Scope / risk
Docs only; nothing here changes runtime behavior. Two workspace files (
context.md,research.md) are absent from this diff: a concurrent lane's sweep commit absorbed them (hand-off filed on the coordination board; readable in #5041's diff meanwhile).Review history
https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT