Skip to content

feat(sdk,api): build-kit tools cleanup: discover renames, 12-op overlay, query_spans, one playbook skill, test_run server half#5068

Draft
mmabrouk wants to merge 5 commits into
docs/build-kit-tools-cleanupfrom
feat/build-kit-tools-cleanup
Draft

feat(sdk,api): build-kit tools cleanup: discover renames, 12-op overlay, query_spans, one playbook skill, test_run server half#5068
mmabrouk wants to merge 5 commits into
docs/build-kit-tools-cleanupfrom
feat/build-kit-tools-cleanup

Conversation

@mmabrouk

@mmabrouk mmabrouk commented Jul 4, 2026

Copy link
Copy Markdown
Member

PR title

feat(sdk,api): build-kit tools cleanup + test_run server half

PR body

Context

The playground build kit is the tool set a fresh agent uses to build itself. It carried 19 tools plus three cross-referencing authoring skills, and in end-to-end tests the builder agent wandered: it guessed op names, drifted between overlapping skills, and had no way to test the agent it had just built. The agent-creation-lab kit proved the fix outside Agenta: a small ordered tool set, one playbook, and a self-verifying test step. The tools-review compared the 19 inside tools against the lab's 13 outside scripts, with evidence per verdict, and design PR #5060 turned that into this plan. This PR executes it: slices 1-5a, one commit per slice, on docs/design/agent-workflows/projects/build-kit-tools-cleanup/plan.md.

Changes

Slice 1: rename the discovery ops, hard. find_capabilities is now discover_tools and find_triggers is now discover_triggers. No aliases (decided; pre-production). The legacy reserved /tools/call route for tools.agenta.find_capabilities is deleted. A sweep script rewrites committed revisions that still carry the old keys (projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py).

Slice 2: an explicit default overlay. The overlay previously embedded every catalog op. It now embeds an explicit DEFAULT_BUILD_KIT_OPS list of 12 ops plus the request_connection client tool. Cut from the default (all stay in the catalog for opt-in): pause_schedule, resume_schedule, pause_subscription, resume_subscription, query_workflows, list_connections, list_subscriptions.

Slice 3: query_spans. A new read-only platform op over POST /api/spans/query, so the builder can read its own past runs and verify a tool actually fired. The op schema mirrors SpansQueryRequest; a drift contract test pins the two together.

Slice 4: one playbook skill. A single ordered build-an-agent skill (slug __ag__build_an_agent) replaces build-your-first-app, discover-and-wire-tools, and set-up-triggers. The overlay embeds only the playbook. getting-started stays harness-forced on pi_agenta, so it is delivered once instead of twice.

Slice 5a: test_run, server half only. The op that lets the builder test its own variant:

  • PlatformOp gains a handler mode: an op targets exactly one of method+path (endpoint) or handler (an allowlisted reserved tools.agenta.<op> call-ref).
  • The resolver emits handler ops as callRef + spec-level contextBindings + timeoutMs. This is flag-gated OFF (AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS) until the runner half lands.
  • /tools/call gains a reserved-ref registry: a registered tools.agenta.* ref dispatches to its Python handler; an unknown one fails loud with a 404.
  • The test_run handler hydrates the bound variant's committed revision, applies an optional in-memory delta (requires EDIT_WORKFLOWS), invokes it headless with a server-minted token, and returns a digest with a terminal-result-wins verdict, under a 120s ceiling. A recursion marker is emitted but inert until the runner half lands.
  • test_run is NOT in the overlay yet; it joins when the flag flips.

Plus one SDK resolver fix found live: an inline revision committed without parameters now seeds the default agent template, the same way a reference resolution does. Before the fix, test_run against a fresh playground app failed because the child invoke saw an empty config.

Assumed decisions (review these first)

Three calls proceeded on their recommendations as provisional defaults. They were flagged, not decided:

  1. Static overlay, including query_spans. The default is a fixed 12-op list, not a conditional event pack. Argued in projects/build-kit-tools-cleanup/research.md (overlay-scope).
  2. test_run is sync with an in-memory delta. 120s cap, per-op timeout_ms plumbing; test_id stays reserved so an async pair remains additive. Argued in api-design.md (shape-decision).
  3. query_spans ships now, not held for test_run. It stays useful after test_run lands (reading scheduled fires). Argued in api-design.md (query_spans stopgap).

Scope / risk

  • The runner half of test_run (slice 5b) is deferred. The flag (AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS, default off) protects exactly this: with it off, a handler-mode op fails resolution loudly instead of reaching a runner that cannot dispatch a reserved callRef, inject spec-level context, or honor timeoutMs. protocol.ts and the golden wire fixtures do not carry the new spec fields yet. Open-issues entry filed with the full 5b scope.
  • The gateway -> server rename (slice 6) is deferred. Codex review scoped the safe version to docs/UI labels; the stored type: "gateway" literal is a data migration. Open-issues entry filed.
  • The rename is a hard migration. Old committed revisions that reference find_capabilities / find_triggers fail loud (UnknownPlatformOpError) at resolve time until the sweep script runs against that database. Dev DBs are swept; a production deploy must run the same sweep. Acceptable pre-production, but it is a deploy step, not a no-op.
  • FE fixtures and generated clients are deferred. agentRequest.test.ts op literals and the CapabilitiesResult docstrings in the generated clients update at the next codegen.
  • Approval semantics are untouched: test_run sets read_only=False and nothing else, so the merged permission plan resolves it to ask by default.

How to QA

Prerequisites: local dev stack (load-env hosting/docker-compose/ee/.env.ee.dev + bash ./hosting/docker-compose/run.sh --ee --dev --build). For step 4, set AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS=true on the API and grab a project API key.

Steps:

  1. Create a fresh agent application in the playground. Open its config: the build-kit block must show the 12 default ops (discover_tools, commit_revision, annotate_trace, query_spans, discover_triggers, create_schedule, create_subscription, list_schedules, list_deliveries, test_subscription, remove_schedule, remove_subscription), the request_connection client tool, and exactly one skill embed, the build-an-agent playbook. No pause/resume ops, no query_workflows.

  2. Chat with the agent ("find me a tool that creates GitHub issues"). Watch it call discover_tools and come back with Agenta-shaped capabilities and connection state.

  3. Confirm the old names are gone: a config carrying {type:"platform", op:"find_capabilities"} fails resolution with UnknownPlatformOpError.

  4. Curl test_run directly (server half):

    curl -s -X POST "$API/tools/call" -H "Authorization: ApiKey $KEY" -H "Content-Type: application/json" -d '{
      "data": {"id": "call_qa1", "type": "function", "function": {
        "name": "tools__agenta__test_run",
        "arguments": {"target": {"workflow_variant_id": "<variant-id>"},
                      "inputs": {"messages": [{"role": "user", "content": "hi"}]}}}}}'
    
    • Happy path: 200; call.data.content is a JSON digest with a verdict.
    • Negative 404: same call with "name": "tools__agenta__nope" (unknown reserved ref).
    • Negative 400: arguments that fail the schema (e.g. drop inputs). A delta from a caller without EDIT_WORKFLOWS returns 403.
  5. Run the sweep script against the dev DB (dry run is the default; it reads POSTGRES_URI_CORE, falls back to DATABASE_URL) and confirm it reports zero remaining old-key tools or old-slug skill embeds: uv run --script docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py. Add --apply to commit rewrites.

Expected result: steps 1-2 show the leaner kit working end to end; steps 3-5 show the hard migration failing loud and the sweep closing it.

Automated tests:

cd api && uv run --no-sync python -m pytest oss/tests/pytest/unit/tools oss/tests/pytest/unit/applications oss/tests/pytest/unit/workflows oss/tests/pytest/unit/tracing
cd sdks/python && uv run --no-sync python -m pytest oss/tests/pytest/unit

Expected: 250 passed (api) and 1596 passed, 10 xfailed (sdk).

Edge cases: a revision committed before this PR with the old op keys (fails loud until swept, by design); a handler-mode op with the flag off (must fail resolution with a clear message, not resolve to a broken spec); the playbook embed on a fresh app (must resolve by artifact slug; a bare revision slug 500s).

Live verification

Phase-3 verification ran against the local dev stack: a fresh playground agent showed the 12-op block plus the playbook and resolved getting-started once (forced) and the playbook once (overlay); discover_tools / discover_triggers ran end to end in chat; query_spans returned model-digestible spans for a real trace; test_run returned a digest and verdict with the flag on, and the 404/400/403 negative paths behaved; the sweep dry-run came back clean after the dev sweep.

Two findings from that pass:

  • Fresh-app bug, found and fixed here. A fresh playground app commits an inline revision with no parameters, which broke test_run's child invoke. The SDK resolver now seeds the default template for parameters-less inline revisions, matching the reference path.
  • Known gap, not this PR. Child invokes through the local in-process -> sub-sidecar path return trace_id=None (the sidecar's persist/heartbeat 401s), leaving test_run's span digest empty on that path. Tracked as a tracing follow-up in the project workspace, not a build-kit bug.

https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT

mmabrouk added 5 commits July 4, 2026 23:23
… drop the legacy reserved call_ref route; revision sweep script

Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
…kill (slices 3-4)

- query_spans: read-only platform op over POST /api/spans/query, schema mirrors SpansQueryRequest, drift contract test pins schema<->endpoint agreement
- one ordered playbook skill replaces the three authoring skills; overlay embeds only the playbook (getting-started stays harness-forced); concrete query_spans verify recipe; approval stops on trigger creation
- sweep script extended to rewrite committed old-slug skill embeds

Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
- PlatformOp handler mode (method+path XOR handler), to_call/to_call_ref, allowlist
- resolver emits callRef + contextBindings + timeoutMs specs, flag-gated OFF (AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS) until the runner half lands
- /tools/call reserved-ref registry: registered refs dispatch, unknown fail loud 404
- tools-domain handler: hydrate revision (delta requires EDIT_WORKFLOWS, fetch-and-fail target validation), headless invoke via signed secret, spans-confirmed digest, terminal-result-wins verdict, recursion marker, 120s ceiling
- test_run NOT in the overlay yet (5b flips it with the runner dispatch)

Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
@vercel

vercel Bot commented Jul 4, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment Jul 4, 2026 9:25pm

Request Review

@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added a new build-kit starter experience centered on “build an agent,” with updated default tools and skill content.
    • Introduced support for a new test-run platform operation with richer execution details and tracing-aware results.
    • Expanded platform tool resolution to support both direct API calls and server-handled operations.
  • Bug Fixes

    • Renamed discovery and trigger terminology throughout the product for consistency.
    • Improved handling of tool callbacks, permissions, and error responses for reserved operations.

Walkthrough

This PR migrates platform tool discovery from find_capabilities/find_triggers to discover_tools/discover_triggers, introduces a handler-mode test_run platform tool with a server-side dispatch handler and tracing integration, replaces three legacy authoring skills with a single build-an-agent skill in the build-kit overlay and static catalog, updates related SDK contracts/tests, design docs, and adds a database migration sweep script.

Changes

Platform tool and handler migration

Layer / File(s) Summary
SDK PlatformOp contract
sdks/python/agenta/sdk/agents/platform/op_catalog.py, sdks/python/agenta/sdk/agents/tools/models.py, sdks/python/agenta/sdk/agents/wire_models.py
PlatformOp supports endpoint-mode and handler-mode ops with validation, timeout_ms, and to_call_ref(); discover_tools/discover_triggers replace find_capabilities/find_triggers; query_spans and test_run ops added; CallbackToolSpec/WireResolvedToolSpec gain context_bindings/timeout_ms.
SDK resolver and build-an-agent skill
sdks/python/agenta/sdk/agents/platform/platform_tools.py, sdks/python/agenta/sdk/agents/adapters/agenta_builtins.py
Resolver builds handler-mode callbacks gated by AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS; legacy authoring skills replaced with BUILD_AN_AGENT_SLUG/SKILL.
Server-side test_run handler
api/oss/src/core/tools/platform_handlers.py
New handle_test_run/dispatch_platform_tool_handler invoke a child workflow, digest responses (tools, approvals, spans), and compute a verdict.
Router reserved-call dispatch and renames
api/oss/src/apis/fastapi/tools/router.py, api/entrypoints/routers.py, api/oss/src/core/tools/discovery.py, dtos.py, service.py, api/oss/src/core/triggers/dtos.py, api/oss/src/apis/fastapi/tools/models.py
Router dispatches reserved calls including permission-gated test_run delta; wires tracing_service; comments renamed to discover_tools/discover_triggers.
Build-kit overlay and static catalog
api/oss/src/apis/fastapi/applications/overlay.py, api/oss/src/core/workflows/static_catalog.py
Overlay uses DEFAULT_BUILD_KIT_OPS and embeds BUILD_AN_AGENT_SLUG; static catalog replaces legacy skill entries.
Tests
api/oss/tests/pytest/unit/tools/*, sdks/python/oss/tests/pytest/unit/agents/*, api/oss/tests/pytest/unit/applications/test_build_kit_overlay.py, api/oss/tests/pytest/unit/workflows/test_static_catalog.py, api/oss/tests/manual/tools/tools.http
Extensive test coverage for handler/router dispatch, op catalog contracts, overlay contents, and static catalog behavior.

Estimated code review effort: 4 (Complex) | ~75 minutes

Design documentation and migration script

Layer / File(s) Summary
Design docs
docs/design/agent-workflows/documentation/*, docs/design/agent-workflows/interfaces/**
Docs updated to describe discover_tools/discover_triggers, handler-mode ops, build-kit defaults, and author-configurable skills.
Sweep script
docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py
New CLI script to rewrite stored workflow_revisions JSON, renaming ops and consolidating skill embeds, with dry-run/apply and verification.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant ToolsRouter
  participant PlatformHandlers
  participant WorkflowsService
  participant TracingService
  participant ChildWorkflow

  ToolsRouter->>PlatformHandlers: call_tool(tools.agenta.test_run)
  PlatformHandlers->>PlatformHandlers: check delta => require EDIT_WORKFLOWS
  PlatformHandlers->>WorkflowsService: ensure/prepare revision, resolve delta
  PlatformHandlers->>ChildWorkflow: POST /invoke (via httpx)
  ChildWorkflow-->>PlatformHandlers: workflow batch response
  PlatformHandlers->>TracingService: query spans by trace_id
  TracingService-->>PlatformHandlers: span observations
  PlatformHandlers-->>ToolsRouter: TestRunResponse (verdict, output)
Loading

Possibly related PRs

  • Agenta-AI/agenta#4860: Both modify ToolsRouter wiring/routing in api/entrypoints/routers.py and api/oss/src/apis/fastapi/tools/router.py.
  • Agenta-AI/agenta#4884: Directly refactors the reserved tools.agenta.find_capabilities dispatch path this PR replaces.
  • Agenta-AI/agenta#4929: Introduces the playground build-kit agent-template overlay this PR modifies in overlay.py.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main build-kit cleanup, including discovery renames, default overlay changes, query_spans, the playbook skill, and test_run server work.
Description check ✅ Passed The description is clearly related to the changeset and describes the same build-kit cleanup slices and implementation details.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/build-kit-tools-cleanup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@mmabrouk mmabrouk left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide: inline pointers to the key hunks follow.


has_direct_target = self.method is not None or self.path is not None
has_handler_target = self.handler is not None
if has_direct_target == has_handler_target:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide: the Option C seam: one catalog type, two modes, XOR-validated; to_call vs to_call_ref keep the modes from half-populating each other.

handler="tools.agenta.test_run",
input_schema=_TEST_RUN_INPUT_SCHEMA,
context_bindings={"target.workflow_variant_id": "$ctx.workflow.variant.id"},
read_only=False,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide: read_only=False is the ONLY approval surface (defaults to ask under the permission plan); target.workflow_variant_id is ctx-bound and stripped from the model schema.

)

# Cut ops stay catalog opt-ins.
DEFAULT_BUILD_KIT_OPS: tuple[str, ...] = (

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide: the static-13 assumed decision: 12 ops + request_connection today, test_run joins when the runner half (5b) lands.

# Slice 5 seam: replace the query_spans verification paragraph above with the test_run paragraph
# when that platform op ships.
BUILD_AN_AGENT_SKILL = SkillTemplate(
name="build-an-agent",

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide: the one-playbook port: numbered loop, decision table, query_spans verify recipe (test_run seam marked), prefer-wired-tools rule, approval stops.

data=WorkflowServiceRequestData(inputs={"messages": request.inputs.messages}),
)

if request.delta is not None:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide: security: delta requires EDIT_WORKFLOWS + fetch-and-fail target validation (closes the caller-authored-revision hole the adversarial review found).

if not _platform_handlers_enabled():
error = GatewayToolResolutionError(
f"Platform handler-mode op '{op.op}' requires "
f"{_ENABLE_PLATFORM_HANDLERS_ENV}=true before it can resolve.",

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide: handler-mode specs cannot reach a runner that does not dispatch them: flag-gated OFF until 5b; resolving without the flag fails loud.

@mmabrouk

mmabrouk commented Jul 4, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mmabrouk

mmabrouk commented Jul 4, 2026

Copy link
Copy Markdown
Member Author

What feedback is needed here:

  1. The three assumed decisions (see the PR body section): the static 12-op overlay, sync test_run with an in-memory delta, and shipping query_spans now. Each proceeded on its recommendation; a veto on any is cheap before the runner half lands.
  2. The playbook text in agenta_builtins.py: the numbered loop, the decision table, and the verify recipe are the agent's actual operating manual — read it as prose, not as code.
  3. The test_run security posture: the delta EDIT_WORKFLOWS gate, the flag gate (handler-mode ops fail resolution loudly with the flag off), and the recursion-marker contract.
  4. The deferred 5b scope (runner callRef dispatch, spec-level context injection, timeoutMs, overlay flip): confirm the cut line is right.

One correction to the PR body: the SDK resolver seed fix described in "Changes" and "Live verification" (parameters-less inline revisions seeding the default template) is implemented and live-verified but is NOT in this diff yet. Its resolver.py hunk is textually entangled with the parallel invoke-validation lane (#5002), which edits the same region; committing it here conflicts the GitButler workspace re-merge. It lands as a follow-up commit once #5002 merges (or the lanes are restacked). The 4-file fix (utils.py seed helper, resolver.py wiring, shape tests, a platform-handlers test tweak) sits intact in the working tree.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (8)
api/oss/src/apis/fastapi/applications/overlay.py (1)

7-11: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Importing a private module-level dict across module boundary.

_STATIC_WORKFLOWS is a leading-underscore (private) module attribute of static_catalog.py. Importing it directly into overlay.py for iteration (used later in _reserved_static_tool_embeds) couples this file to static_catalog's internal representation rather than its public StaticWorkflowCatalog API (e.g. a method to list slugs). Any internal refactor of that dict's shape in static_catalog.py risks silently breaking this file without a contract signal.

♻️ Suggested direction
-from oss.src.core.workflows.static_catalog import (
-    STATIC_SLUG_PREFIX,
-    StaticWorkflowCatalog,
-    _STATIC_WORKFLOWS,
-)
+from oss.src.core.workflows.static_catalog import (
+    STATIC_SLUG_PREFIX,
+    StaticWorkflowCatalog,
+)

Add a small public accessor on StaticWorkflowCatalog (e.g. list_slugs()) and use catalog.list_slugs() in _reserved_static_tool_embeds instead of iterating the private dict directly.

api/oss/tests/pytest/unit/workflows/test_static_catalog.py (1)

54-62: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Hardcoded "ag" prefix duplicates STATIC_SLUG_PREFIX.

_old_authoring_slug hardcodes the literal "__ag__" even though STATIC_SLUG_PREFIX is already imported and used elsewhere in this file (e.g. line 174). If the prefix constant ever changes, this helper silently drifts out of sync and the negative assertions (lines 112-114, 176-178) stop testing what they claim to.

♻️ Suggested fix
 def _old_authoring_slug(name: str) -> str:
-    return "__ag__" + name
+    return STATIC_SLUG_PREFIX + name
sdks/python/agenta/sdk/agents/adapters/agenta_builtins.py (1)

208-209: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Tracked seam note for the test_run migration.

This comment marks a deferred content swap (replace the query_spans verification paragraph with the test_run paragraph once that op ships in the overlay). Just confirming it's an intentional follow-up rather than a leftover; want me to open a tracking issue so it isn't lost?

sdks/python/agenta/sdk/agents/tools/models.py (1)

366-390: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Consider restricting context_bindings to call_ref-mode specs.

context_bindings is documented as handler-mode metadata that "carries run-context bindings at the spec level so the relay can inject them before posting to /tools/call", which only applies when call_ref is set (gateway dispatch). The _check_call_target validator only enforces the call/call_ref XOR invariant but doesn't prevent a caller from setting context_bindings alongside a direct call descriptor, which would be a semantically invalid combination (direct calls bypass /tools/call and have no relay to apply the bindings).

Consider extending the validator to reject context_bindings when call is set (or when call_ref is None).

🛡️ Proposed validator extension
     `@model_validator`(mode="after")
     def _check_call_target(self) -> "CallbackToolSpec":
         # A callback tool must have exactly one place to call: the gateway slug (``call_ref``) or
         # the direct descriptor (``call``). This encodes the design's ``call`` XOR ``call_ref``
         # rule and preserves the prior invariant that a callback spec always has a target (back
         # when ``call_ref`` was required).
         if (self.call_ref is None) == (self.call is None):
             raise ValueError(
                 "a callback tool spec must carry exactly one of `call_ref` (gateway) "
                 "or `call` (direct)"
             )
+        if self.context_bindings is not None and self.call_ref is None:
+            raise ValueError(
+                "`context_bindings` is only valid with `call_ref` (gateway dispatch)"
+            )
         return self
api/oss/src/apis/fastapi/tools/router.py (1)

1211-1220: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Delta permission gating is hardcoded to TEST_RUN_CALL_REF in the router.

The call_ref == TEST_RUN_CALL_REF check couples this generic reserved-tool dispatcher to test_run-specific policy. If a future handler also needs delta-style elevated permissions, this conditional will need another hardcoded branch. Consider carrying the required permission (or a "requires elevated permission when X" predicate) on PlatformToolHandlerRegistration in platform_handlers.py so the router stays generic.

api/oss/tests/pytest/unit/tools/test_platform_handlers.py (1)

1-594: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Solid coverage of happy path, delta, permissions, pauses, timeouts, and span retries.

One gap worth adding once the _tools_from_messages overwrite bug (flagged in platform_handlers.py) is fixed: a test where the same tool name is invoked twice in one run, once erroring and once succeeding, to lock in correct accumulation of error/returned.

api/oss/src/core/tools/platform_handlers.py (1)

197-219: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Avoid depending on WorkflowsService internals platform_handlers.py calls underscore-prefixed helpers on WorkflowsService (_ensure_request_revision, _resolve_revision_delta, _prepare_invoke, _coerce_invoke_response) across a module boundary, so this handler is coupled to an internal API that can change without notice. Expose a small public method or interface for revision resolution and invoke preparation instead.

docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py (1)

189-239: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate skill-consolidation logic between planning and execution.

_planned_skill_embed_changes and rewrite_skill_embeds both re-implement the same seen_playbook walk independently. If one is changed without the other, the printed change log and the actual rewritten data would silently diverge.

Consider having rewrite_skill_embeds compute and return the change list as it mutates, and have _planned_skill_embed_changes/old_skill_embed_paths call into a shared single-pass helper (or have rewrite_skill_embeds be the sole source of truth and drop the separate planning pass in rewrite_revision_data).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b55a246c-f0e2-4d6a-9b32-e49c4f468b98

📥 Commits

Reviewing files that changed from the base of the PR and between d68193f and 734acba.

📒 Files selected for processing (36)
  • api/entrypoints/routers.py
  • api/oss/src/apis/fastapi/applications/overlay.py
  • api/oss/src/apis/fastapi/tools/models.py
  • api/oss/src/apis/fastapi/tools/router.py
  • api/oss/src/core/tools/discovery.py
  • api/oss/src/core/tools/dtos.py
  • api/oss/src/core/tools/platform_handlers.py
  • api/oss/src/core/tools/service.py
  • api/oss/src/core/triggers/dtos.py
  • api/oss/src/core/workflows/static_catalog.py
  • api/oss/tests/manual/tools/tools.http
  • api/oss/tests/pytest/unit/applications/test_build_kit_overlay.py
  • api/oss/tests/pytest/unit/tools/test_discovery.py
  • api/oss/tests/pytest/unit/tools/test_platform_handlers.py
  • api/oss/tests/pytest/unit/tools/test_workflow_tool_call.py
  • api/oss/tests/pytest/unit/tracing/test_query_spans_op_contract.py
  • api/oss/tests/pytest/unit/triggers/test_triggers_discovery.py
  • api/oss/tests/pytest/unit/workflows/test_static_catalog.py
  • docs/design/agent-workflows/documentation/agent-configuration.md
  • docs/design/agent-workflows/documentation/tools.md
  • docs/design/agent-workflows/interfaces/README.md
  • docs/design/agent-workflows/interfaces/cross-service/runner-to-tool-callback.md
  • docs/design/agent-workflows/interfaces/in-service/tool-models-and-resolution.md
  • docs/design/agent-workflows/interfaces/public-edge/agent-config-schema.md
  • docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py
  • sdks/python/agenta/sdk/agents/adapters/agenta_builtins.py
  • sdks/python/agenta/sdk/agents/platform/op_catalog.py
  • sdks/python/agenta/sdk/agents/platform/platform_tools.py
  • sdks/python/agenta/sdk/agents/tools/models.py
  • sdks/python/agenta/sdk/agents/wire_models.py
  • sdks/python/oss/tests/pytest/unit/agents/platform/conftest.py
  • sdks/python/oss/tests/pytest/unit/agents/platform/test_op_catalog.py
  • sdks/python/oss/tests/pytest/unit/agents/tools/test_models.py
  • sdks/python/oss/tests/pytest/unit/agents/tools/test_parsing.py
  • sdks/python/oss/tests/pytest/unit/agents/tools/test_resolver.py
  • sdks/python/oss/tests/pytest/unit/test_skill_template_catalog.py

Comment on lines +1204 to +1247
async def _call_reserved_agenta_tool(
self,
*,
request: Request,
body: ToolCall,
call_ref: str,
) -> ToolCallResponse:
"""Run a reserved ``tools.agenta.*`` platform tool. v1 op: find_capabilities.

Routed here from ``call_tool`` by the ``tools.agenta.`` prefix (so the reserved
tool is out of the Composio 5-segment namespace). Project scope comes from the
run's caller auth, exactly like the gateway path.
"""
call_ref = body.data.function.name.replace("__", ".")
op = call_ref[len(AGENTA_TOOL_CALL_REF_PREFIX) :]
if op != FIND_CAPABILITIES_OP:
raise HTTPException(
status_code=404,
detail=f"Unknown Agenta tool: {call_ref}",
)

arguments = body.data.function.arguments
if isinstance(arguments, str):
try:
arguments = json.loads(arguments)
except json.JSONDecodeError as e:
log.warning(
"Failed to parse find_capabilities arguments as JSON: %s", e
)
arguments = {}
elif not isinstance(arguments, dict):
arguments = {}

use_cases, provider, limit_alternatives = parse_find_capabilities_arguments(
arguments
)
if not use_cases:
raise HTTPException(
status_code=400,
detail="find_capabilities requires at least one use_case",
if call_ref == TEST_RUN_CALL_REF and _test_run_arguments_include_delta(
body.data.function.arguments
):
has_permission = await check_action_access(
user_uid=request.state.user_id,
project_id=request.state.project_id,
permission=Permission.EDIT_WORKFLOWS,
)
if not has_permission:
raise FORBIDDEN_EXCEPTION

try:
result = await self.tools_service.discover_capabilities(
response = await dispatch_platform_tool_handler(
call_ref=call_ref,
arguments=body.data.function.arguments,
headers=request.headers,
project_id=UUID(request.state.project_id),
use_cases=use_cases,
provider_key=provider,
limit_alternatives=limit_alternatives,
user_id=UUID(request.state.user_id),
workflows_service=self.workflows_service,
tracing_service=self.tracing_service,
)
except DiscoveryUnsupportedError as e:
raise HTTPException(status_code=422, detail=e.message) from e
except PlatformToolHandlerError as e:
raise HTTPException(status_code=e.status_code, detail=e.message) from e

tool_result = ToolResult(
result = ToolResult(
id=uuid4(),
data=ToolResultData(
tool_call_id=body.data.id,
content=json.dumps(result.model_dump(mode="json", exclude_none=True)),
content=response.model_dump_json(exclude_none=True),
),
status=Status(
timestamp=datetime.now(timezone.utc),
code="STATUS_CODE_OK",
),
)
return ToolCallResponse(call=tool_result)

return ToolCallResponse(call=result)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

_call_reserved_agenta_tool always reports STATUS_CODE_OK, even on infra-level test_run failures.

Unlike the direct-call path in call_tool (line ~1377: "STATUS_CODE_OK" if successful else "STATUS_CODE_ERROR"), this always sets code="STATUS_CODE_OK" regardless of TestRunResponse.verdict. Since _digest_test_run_response (in platform_handlers.py) returns verdict="failed" both for business-level tool errors and for genuine infrastructure failures (workflow invoke timeout/502, non-2xx status), any consumer inspecting only the outer ToolResult.status.code (rather than parsing the JSON content) cannot distinguish an actual outage from a normal response.

Comment on lines +43 to +130
class PlatformToolHandlerError(Exception):
status_code = 400

def __init__(self, message: str):
self.message = message
super().__init__(message)


class PlatformToolHandlerNotFound(PlatformToolHandlerError):
status_code = 404


class PlatformToolHandlerUnavailable(PlatformToolHandlerError):
status_code = 501


class PlatformToolHandlerRefused(PlatformToolHandlerError):
status_code = 400


class TestRunTarget(BaseModel):
model_config = ConfigDict(extra="forbid")

workflow_variant_id: UUID


class TestRunInputs(BaseModel):
model_config = ConfigDict(extra="forbid")

messages: List[Dict[str, Any]]


class TestRunExpectations(BaseModel):
model_config = ConfigDict(extra="forbid")

terminal_tool: Optional[str] = None


class TestRunRequest(BaseModel):
model_config = ConfigDict(extra="forbid")

target: TestRunTarget
inputs: TestRunInputs
delta: Optional[WorkflowRevisionDelta] = None
expectations: Optional[TestRunExpectations] = None


class TestRunToolDigest(BaseModel):
name: str
called: bool = True
returned: bool = False
error: bool = Field(default=False, exclude=True)


class TestRunResolved(BaseModel):
harness: Optional[str] = None
model: Optional[str] = None
provider: Optional[str] = None
connection_mode: Optional[str] = None


TestRunVerdict = Literal["pass", "incomplete", "unconfirmed", "failed"]


class TestRunResponse(BaseModel):
output: str = ""
tools: List[TestRunToolDigest] = Field(default_factory=list)
approvals: List[str] = Field(default_factory=list)
resolved: TestRunResolved = Field(default_factory=TestRunResolved)
trace_id: Optional[str] = None
test_id: Optional[str] = None
verdict: TestRunVerdict
verdict_reason: Optional[str] = None


@dataclass(frozen=True)
class PlatformToolHandlerRegistration:
call_ref: str
timeout_ms: int


PLATFORM_TOOL_HANDLERS: Dict[str, PlatformToolHandlerRegistration] = {
TEST_RUN_CALL_REF: PlatformToolHandlerRegistration(
call_ref=TEST_RUN_CALL_REF,
timeout_ms=TEST_RUN_DEFAULT_TIMEOUT_MS,
)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Domain contracts/exceptions live in the handler module instead of dtos.py/types.py.

TestRunTarget/TestRunInputs/TestRunResponse/etc. and the PlatformToolHandlerError exception hierarchy are all defined directly in platform_handlers.py. Per the coding guidelines for this domain, request/response contracts belong in dtos.py/types.py and domain exceptions belong in the core layer's types.py/dtos.py with a base exception per domain, while platform_handlers.py should hold only the business orchestration. Consider splitting these definitions out into oss/src/core/tools/dtos.py (or a new types.py) to align with the established domain layout used elsewhere in core/tools.

As per coding guidelines, api/oss/src/core/**/{types.py,dtos.py}: "Define domain exceptions in the core layer, include structured context, and use a base exception per domain; do not raise HTTPException from services or DAOs," and api/oss/src/core/**/{dtos.py,types.py,interfaces.py,service.py}: "keep domain contracts in dtos.py or types.py... and business orchestration in service.py."

Source: Coding guidelines

Comment on lines +264 to +296
if request.delta is not None:
await workflows_service._ensure_request_revision(
project_id=project_id,
request=workflow_request,
)
if not workflow_request.data or not workflow_request.data.revision:
raise PlatformToolHandlerError(
"test_run could not resolve the target workflow variant revision."
)

resolved = await workflows_service._resolve_revision_delta(
project_id=project_id,
workflow_revision_commit=WorkflowRevisionCommit(
workflow_variant_id=request.target.workflow_variant_id,
delta=request.delta,
),
)
if resolved.data is None:
raise PlatformToolHandlerError(
"test_run could not resolve the revision delta."
)
workflow_request.data.revision = {"data": resolved.data.model_dump(mode="json")}
return workflow_request

await workflows_service._ensure_request_revision(
project_id=project_id,
request=workflow_request,
)
if not workflow_request.data or not workflow_request.data.revision:
raise PlatformToolHandlerError(
"test_run could not resolve the target workflow variant revision."
)
return workflow_request

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect WorkflowRevisionDelta and _resolve_revision_delta to confirm which
# fields (e.g. "url") a delta's "set" can mutate.
rg -n -A 30 'class WorkflowRevisionDelta' api/oss/src/core/workflows
rg -n -A 40 'def _resolve_revision_delta' api/oss/src/core/workflows
rg -n -A 30 'def _prepare_invoke' api/oss/src/core/workflows

Repository: Agenta-AI/agenta

Length of output: 7981


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the relevant models and invoke path.
ast-grep outline api/oss/src/core/workflows/dtos.py --match WorkflowRevisionData --view expanded
ast-grep outline api/oss/src/core/workflows/dtos.py --match WorkflowRevisionDelta --view expanded
ast-grep outline api/oss/src/core/workflows/service.py --match _prepare_invoke --view expanded
ast-grep outline api/oss/src/core/workflows/service.py --match _invoke_child_workflow --view expanded

# Read the focused sections with line numbers.
sed -n '1,220p' api/oss/src/core/workflows/dtos.py | cat -n
printf '\n----\n'
sed -n '2035,2105p' api/oss/src/core/workflows/service.py | cat -n
printf '\n----\n'
sed -n '2105,2195p' api/oss/src/core/workflows/service.py | cat -n

Repository: Agenta-AI/agenta

Length of output: 13712


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the service URL derivation and the workflow revision data schema.
rg -n -A 40 -B 10 '_get_service_url|service_url|has_url|url:' api/oss/src/core/workflows
rg -n -A 80 'class WorkflowRevisionData|class BaseWorkflowRevisionData|WorkflowRevisionData\(' -g '*.py' .

Repository: Agenta-AI/agenta

Length of output: 50373


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the exact workflow revision data schema and URL derivation logic.
rg -n -A 40 -B 10 'class WorkflowRevisionData|class BaseWorkflowRevisionData|url:|uri:|has_url' sdks/python/agenta/sdk/models/workflows.py api/oss/src/core/workflows/dtos.py
printf '\n----\n'
rg -n -A 40 -B 10 '_get_service_url|def _get_service_url|service_url =|has_url' api/oss/src/core/workflows/service.py

Repository: Agenta-AI/agenta

Length of output: 29850


Block url changes in revision deltas. WorkflowRevisionData explicitly allows url, and _resolve_revision_delta() deep-merges delta.set before _get_service_url() reads it. That lets an EDIT_WORKFLOWS caller point the server-side POST at an arbitrary HTTP(S) endpoint and send server-minted credentials there.

Comment on lines +432 to +453
def _tools_from_messages(messages: List[Any]) -> Dict[str, TestRunToolDigest]:
by_call_id: Dict[str, TestRunToolDigest] = {}
by_name: Dict[str, TestRunToolDigest] = {}
for message in messages:
if not isinstance(message, dict) or message.get("role") != "tool":
continue
call_id = message.get("tool_call_id") or message.get("toolCallId")
name = message.get("tool_name") or message.get("toolName")
if name:
digest = by_name.setdefault(name, TestRunToolDigest(name=name))
if call_id:
by_call_id[str(call_id)] = digest
if _has_tool_result_content(message):
digest.returned = True
digest.error = bool(message.get("is_error") or message.get("isError"))
continue
if call_id and str(call_id) in by_call_id:
digest = by_call_id[str(call_id)]
if _has_tool_result_content(message):
digest.returned = True
digest.error = bool(message.get("is_error") or message.get("isError"))
return by_name

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

_tools_from_messages overwrites, rather than accumulates, returned/error per tool name.

Since by_name.setdefault(name, ...) shares one TestRunToolDigest across all calls to the same tool name, a second call to the same tool overwrites digest.error/digest.returned set by an earlier call. If a tool is invoked twice in one run (e.g. once erroring, once succeeding), the earlier error is silently discarded and _verdict() (line 593-595) will not flag it — masking a real failure in what is meant to be a correctness-verifying test_run. Contrast with _merge_span_observations (lines 506-518), which correctly only sets flags to True and never resets them.

🐛 Proposed fix to accumulate instead of overwrite
         if name:
             digest = by_name.setdefault(name, TestRunToolDigest(name=name))
             if call_id:
                 by_call_id[str(call_id)] = digest
             if _has_tool_result_content(message):
-                digest.returned = True
-                digest.error = bool(message.get("is_error") or message.get("isError"))
+                digest.returned = True
+                digest.error = digest.error or bool(
+                    message.get("is_error") or message.get("isError")
+                )
             continue
         if call_id and str(call_id) in by_call_id:
             digest = by_call_id[str(call_id)]
             if _has_tool_result_content(message):
-                digest.returned = True
-                digest.error = bool(message.get("is_error") or message.get("isError"))
+                digest.returned = True
+                digest.error = digest.error or bool(
+                    message.get("is_error") or message.get("isError")
+                )
This is not covered by the current test suite (no test exercises the same tool name being called twice); consider adding a regression test alongside the fix.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _tools_from_messages(messages: List[Any]) -> Dict[str, TestRunToolDigest]:
by_call_id: Dict[str, TestRunToolDigest] = {}
by_name: Dict[str, TestRunToolDigest] = {}
for message in messages:
if not isinstance(message, dict) or message.get("role") != "tool":
continue
call_id = message.get("tool_call_id") or message.get("toolCallId")
name = message.get("tool_name") or message.get("toolName")
if name:
digest = by_name.setdefault(name, TestRunToolDigest(name=name))
if call_id:
by_call_id[str(call_id)] = digest
if _has_tool_result_content(message):
digest.returned = True
digest.error = bool(message.get("is_error") or message.get("isError"))
continue
if call_id and str(call_id) in by_call_id:
digest = by_call_id[str(call_id)]
if _has_tool_result_content(message):
digest.returned = True
digest.error = bool(message.get("is_error") or message.get("isError"))
return by_name
def _tools_from_messages(messages: List[Any]) -> Dict[str, TestRunToolDigest]:
by_call_id: Dict[str, TestRunToolDigest] = {}
by_name: Dict[str, TestRunToolDigest] = {}
for message in messages:
if not isinstance(message, dict) or message.get("role") != "tool":
continue
call_id = message.get("tool_call_id") or message.get("toolCallId")
name = message.get("tool_name") or message.get("toolName")
if name:
digest = by_name.setdefault(name, TestRunToolDigest(name=name))
if call_id:
by_call_id[str(call_id)] = digest
if _has_tool_result_content(message):
digest.returned = True
digest.error = digest.error or bool(
message.get("is_error") or message.get("isError")
)
continue
if call_id and str(call_id) in by_call_id:
digest = by_call_id[str(call_id)]
if _has_tool_result_content(message):
digest.returned = True
digest.error = digest.error or bool(
message.get("is_error") or message.get("isError")
)
return by_name

Comment on lines +254 to +281
def _select_candidates(
conn: psycopg.Connection[Any], project_id: str | None
) -> list[tuple[str, str, Any]]:
match_terms = {
"find_capabilities": "%find_capabilities%",
"find_triggers": "%find_triggers%",
"build_your_first_app": "%__ag__build_your_first_app%",
"discover_and_wire_tools": "%__ag__discover_and_wire_tools%",
"set_up_triggers": "%__ag__set_up_triggers%",
}
where = [
"data IS NOT NULL",
"(" + " OR ".join(f"data::text LIKE %({key})s" for key in match_terms) + ")",
]
params: dict[str, Any] = dict(match_terms)
if project_id is not None:
where.append("project_id = %(project_id)s::uuid")
params["project_id"] = project_id

query = f"""
SELECT project_id::text, id::text, data
FROM workflow_revisions
WHERE {" AND ".join(where)}
ORDER BY project_id, id
"""
with conn.cursor() as cur:
cur.execute(query, params)
return [(row[0], row[1], _json_data(row[2])) for row in cur.fetchall()]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file='docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py'

wc -l "$file"
printf '\n--- outline ---\n'
ast-grep outline "$file" --view expanded

printf '\n--- candidate selection and cleanup flow ---\n'
rg -n "def _select_candidates|def assert_clean|def main|uncovered|SystemExit|LIKE %\\(" "$file" -A 40 -B 20

Repository: Agenta-AI/agenta

Length of output: 10134


Narrow the candidate filter to the covered JSON paths

data::text LIKE can still match unrelated free-form text, and assert_clean turns those rows into a hard SystemExit as “matched but not rewritten (uncovered path)” even when no covered op/embed paths remain. Make uncovered matches non-fatal, or prefilter on the actual JSON paths instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant