feat(sdk,api): build-kit tools cleanup: discover renames, 12-op overlay, query_spans, one playbook skill, test_run server half#5068
Conversation
… drop the legacy reserved call_ref route; revision sweep script Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
…plicative ops, keep test_subscription) Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
…kill (slices 3-4) - query_spans: read-only platform op over POST /api/spans/query, schema mirrors SpansQueryRequest, drift contract test pins schema<->endpoint agreement - one ordered playbook skill replaces the three authoring skills; overlay embeds only the playbook (getting-started stays harness-forced); concrete query_spans verify recipe; approval stops on trigger creation - sweep script extended to rewrite committed old-slug skill embeds Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
- PlatformOp handler mode (method+path XOR handler), to_call/to_call_ref, allowlist - resolver emits callRef + contextBindings + timeoutMs specs, flag-gated OFF (AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS) until the runner half lands - /tools/call reserved-ref registry: registered refs dispatch, unknown fail loud 404 - tools-domain handler: hydrate revision (delta requires EDIT_WORKFLOWS, fetch-and-fail target validation), headless invoke via signed secret, spans-confirmed digest, terminal-result-wins verdict, recursion marker, 120s ceiling - test_run NOT in the overlay yet (5b flips it with the runner dispatch) Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
…eanup contracts Claude-Session: https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR migrates platform tool discovery from ChangesPlatform tool and handler migration
Estimated code review effort: 4 (Complex) | ~75 minutes Design documentation and migration script
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant ToolsRouter
participant PlatformHandlers
participant WorkflowsService
participant TracingService
participant ChildWorkflow
ToolsRouter->>PlatformHandlers: call_tool(tools.agenta.test_run)
PlatformHandlers->>PlatformHandlers: check delta => require EDIT_WORKFLOWS
PlatformHandlers->>WorkflowsService: ensure/prepare revision, resolve delta
PlatformHandlers->>ChildWorkflow: POST /invoke (via httpx)
ChildWorkflow-->>PlatformHandlers: workflow batch response
PlatformHandlers->>TracingService: query spans by trace_id
TracingService-->>PlatformHandlers: span observations
PlatformHandlers-->>ToolsRouter: TestRunResponse (verdict, output)
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
mmabrouk
left a comment
There was a problem hiding this comment.
Review guide: inline pointers to the key hunks follow.
|
|
||
| has_direct_target = self.method is not None or self.path is not None | ||
| has_handler_target = self.handler is not None | ||
| if has_direct_target == has_handler_target: |
There was a problem hiding this comment.
Review guide: the Option C seam: one catalog type, two modes, XOR-validated; to_call vs to_call_ref keep the modes from half-populating each other.
| handler="tools.agenta.test_run", | ||
| input_schema=_TEST_RUN_INPUT_SCHEMA, | ||
| context_bindings={"target.workflow_variant_id": "$ctx.workflow.variant.id"}, | ||
| read_only=False, |
There was a problem hiding this comment.
Review guide: read_only=False is the ONLY approval surface (defaults to ask under the permission plan); target.workflow_variant_id is ctx-bound and stripped from the model schema.
| ) | ||
|
|
||
| # Cut ops stay catalog opt-ins. | ||
| DEFAULT_BUILD_KIT_OPS: tuple[str, ...] = ( |
There was a problem hiding this comment.
Review guide: the static-13 assumed decision: 12 ops + request_connection today, test_run joins when the runner half (5b) lands.
| # Slice 5 seam: replace the query_spans verification paragraph above with the test_run paragraph | ||
| # when that platform op ships. | ||
| BUILD_AN_AGENT_SKILL = SkillTemplate( | ||
| name="build-an-agent", |
There was a problem hiding this comment.
Review guide: the one-playbook port: numbered loop, decision table, query_spans verify recipe (test_run seam marked), prefer-wired-tools rule, approval stops.
| data=WorkflowServiceRequestData(inputs={"messages": request.inputs.messages}), | ||
| ) | ||
|
|
||
| if request.delta is not None: |
There was a problem hiding this comment.
Review guide: security: delta requires EDIT_WORKFLOWS + fetch-and-fail target validation (closes the caller-authored-revision hole the adversarial review found).
| if not _platform_handlers_enabled(): | ||
| error = GatewayToolResolutionError( | ||
| f"Platform handler-mode op '{op.op}' requires " | ||
| f"{_ENABLE_PLATFORM_HANDLERS_ENV}=true before it can resolve.", |
There was a problem hiding this comment.
Review guide: handler-mode specs cannot reach a runner that does not dispatch them: flag-gated OFF until 5b; resolving without the flag fails loud.
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
What feedback is needed here:
One correction to the PR body: the SDK resolver seed fix described in "Changes" and "Live verification" (parameters-less inline revisions seeding the default template) is implemented and live-verified but is NOT in this diff yet. Its resolver.py hunk is textually entangled with the parallel invoke-validation lane (#5002), which edits the same region; committing it here conflicts the GitButler workspace re-merge. It lands as a follow-up commit once #5002 merges (or the lanes are restacked). The 4-file fix (utils.py seed helper, resolver.py wiring, shape tests, a platform-handlers test tweak) sits intact in the working tree. |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (8)
api/oss/src/apis/fastapi/applications/overlay.py (1)
7-11: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueImporting a private module-level dict across module boundary.
_STATIC_WORKFLOWSis a leading-underscore (private) module attribute ofstatic_catalog.py. Importing it directly intooverlay.pyfor iteration (used later in_reserved_static_tool_embeds) couples this file tostatic_catalog's internal representation rather than its publicStaticWorkflowCatalogAPI (e.g. a method to list slugs). Any internal refactor of that dict's shape instatic_catalog.pyrisks silently breaking this file without a contract signal.♻️ Suggested direction
-from oss.src.core.workflows.static_catalog import ( - STATIC_SLUG_PREFIX, - StaticWorkflowCatalog, - _STATIC_WORKFLOWS, -) +from oss.src.core.workflows.static_catalog import ( + STATIC_SLUG_PREFIX, + StaticWorkflowCatalog, +)Add a small public accessor on
StaticWorkflowCatalog(e.g.list_slugs()) and usecatalog.list_slugs()in_reserved_static_tool_embedsinstead of iterating the private dict directly.api/oss/tests/pytest/unit/workflows/test_static_catalog.py (1)
54-62: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winHardcoded "ag" prefix duplicates
STATIC_SLUG_PREFIX.
_old_authoring_slughardcodes the literal"__ag__"even thoughSTATIC_SLUG_PREFIXis already imported and used elsewhere in this file (e.g. line 174). If the prefix constant ever changes, this helper silently drifts out of sync and the negative assertions (lines 112-114, 176-178) stop testing what they claim to.♻️ Suggested fix
def _old_authoring_slug(name: str) -> str: - return "__ag__" + name + return STATIC_SLUG_PREFIX + namesdks/python/agenta/sdk/agents/adapters/agenta_builtins.py (1)
208-209: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueTracked seam note for the
test_runmigration.This comment marks a deferred content swap (replace the
query_spansverification paragraph with thetest_runparagraph once that op ships in the overlay). Just confirming it's an intentional follow-up rather than a leftover; want me to open a tracking issue so it isn't lost?sdks/python/agenta/sdk/agents/tools/models.py (1)
366-390: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winConsider restricting
context_bindingstocall_ref-mode specs.
context_bindingsis documented as handler-mode metadata that "carries run-context bindings at the spec level so the relay can inject them before posting to/tools/call", which only applies whencall_refis set (gateway dispatch). The_check_call_targetvalidator only enforces thecall/call_refXOR invariant but doesn't prevent a caller from settingcontext_bindingsalongside a directcalldescriptor, which would be a semantically invalid combination (direct calls bypass/tools/calland have no relay to apply the bindings).Consider extending the validator to reject
context_bindingswhencallis set (or whencall_refisNone).🛡️ Proposed validator extension
`@model_validator`(mode="after") def _check_call_target(self) -> "CallbackToolSpec": # A callback tool must have exactly one place to call: the gateway slug (``call_ref``) or # the direct descriptor (``call``). This encodes the design's ``call`` XOR ``call_ref`` # rule and preserves the prior invariant that a callback spec always has a target (back # when ``call_ref`` was required). if (self.call_ref is None) == (self.call is None): raise ValueError( "a callback tool spec must carry exactly one of `call_ref` (gateway) " "or `call` (direct)" ) + if self.context_bindings is not None and self.call_ref is None: + raise ValueError( + "`context_bindings` is only valid with `call_ref` (gateway dispatch)" + ) return selfapi/oss/src/apis/fastapi/tools/router.py (1)
1211-1220: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDelta permission gating is hardcoded to
TEST_RUN_CALL_REFin the router.The
call_ref == TEST_RUN_CALL_REFcheck couples this generic reserved-tool dispatcher to test_run-specific policy. If a future handler also needs delta-style elevated permissions, this conditional will need another hardcoded branch. Consider carrying the required permission (or a "requires elevated permission when X" predicate) onPlatformToolHandlerRegistrationinplatform_handlers.pyso the router stays generic.api/oss/tests/pytest/unit/tools/test_platform_handlers.py (1)
1-594: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winSolid coverage of happy path, delta, permissions, pauses, timeouts, and span retries.
One gap worth adding once the
_tools_from_messagesoverwrite bug (flagged inplatform_handlers.py) is fixed: a test where the same tool name is invoked twice in one run, once erroring and once succeeding, to lock in correct accumulation oferror/returned.api/oss/src/core/tools/platform_handlers.py (1)
197-219: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAvoid depending on
WorkflowsServiceinternalsplatform_handlers.pycalls underscore-prefixed helpers onWorkflowsService(_ensure_request_revision,_resolve_revision_delta,_prepare_invoke,_coerce_invoke_response) across a module boundary, so this handler is coupled to an internal API that can change without notice. Expose a small public method or interface for revision resolution and invoke preparation instead.docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py (1)
189-239: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicate skill-consolidation logic between planning and execution.
_planned_skill_embed_changesandrewrite_skill_embedsboth re-implement the sameseen_playbookwalk independently. If one is changed without the other, the printed change log and the actual rewritten data would silently diverge.Consider having
rewrite_skill_embedscompute and return the change list as it mutates, and have_planned_skill_embed_changes/old_skill_embed_pathscall into a shared single-pass helper (or haverewrite_skill_embedsbe the sole source of truth and drop the separate planning pass inrewrite_revision_data).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: b55a246c-f0e2-4d6a-9b32-e49c4f468b98
📒 Files selected for processing (36)
api/entrypoints/routers.pyapi/oss/src/apis/fastapi/applications/overlay.pyapi/oss/src/apis/fastapi/tools/models.pyapi/oss/src/apis/fastapi/tools/router.pyapi/oss/src/core/tools/discovery.pyapi/oss/src/core/tools/dtos.pyapi/oss/src/core/tools/platform_handlers.pyapi/oss/src/core/tools/service.pyapi/oss/src/core/triggers/dtos.pyapi/oss/src/core/workflows/static_catalog.pyapi/oss/tests/manual/tools/tools.httpapi/oss/tests/pytest/unit/applications/test_build_kit_overlay.pyapi/oss/tests/pytest/unit/tools/test_discovery.pyapi/oss/tests/pytest/unit/tools/test_platform_handlers.pyapi/oss/tests/pytest/unit/tools/test_workflow_tool_call.pyapi/oss/tests/pytest/unit/tracing/test_query_spans_op_contract.pyapi/oss/tests/pytest/unit/triggers/test_triggers_discovery.pyapi/oss/tests/pytest/unit/workflows/test_static_catalog.pydocs/design/agent-workflows/documentation/agent-configuration.mddocs/design/agent-workflows/documentation/tools.mddocs/design/agent-workflows/interfaces/README.mddocs/design/agent-workflows/interfaces/cross-service/runner-to-tool-callback.mddocs/design/agent-workflows/interfaces/in-service/tool-models-and-resolution.mddocs/design/agent-workflows/interfaces/public-edge/agent-config-schema.mddocs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.pysdks/python/agenta/sdk/agents/adapters/agenta_builtins.pysdks/python/agenta/sdk/agents/platform/op_catalog.pysdks/python/agenta/sdk/agents/platform/platform_tools.pysdks/python/agenta/sdk/agents/tools/models.pysdks/python/agenta/sdk/agents/wire_models.pysdks/python/oss/tests/pytest/unit/agents/platform/conftest.pysdks/python/oss/tests/pytest/unit/agents/platform/test_op_catalog.pysdks/python/oss/tests/pytest/unit/agents/tools/test_models.pysdks/python/oss/tests/pytest/unit/agents/tools/test_parsing.pysdks/python/oss/tests/pytest/unit/agents/tools/test_resolver.pysdks/python/oss/tests/pytest/unit/test_skill_template_catalog.py
| async def _call_reserved_agenta_tool( | ||
| self, | ||
| *, | ||
| request: Request, | ||
| body: ToolCall, | ||
| call_ref: str, | ||
| ) -> ToolCallResponse: | ||
| """Run a reserved ``tools.agenta.*`` platform tool. v1 op: find_capabilities. | ||
|
|
||
| Routed here from ``call_tool`` by the ``tools.agenta.`` prefix (so the reserved | ||
| tool is out of the Composio 5-segment namespace). Project scope comes from the | ||
| run's caller auth, exactly like the gateway path. | ||
| """ | ||
| call_ref = body.data.function.name.replace("__", ".") | ||
| op = call_ref[len(AGENTA_TOOL_CALL_REF_PREFIX) :] | ||
| if op != FIND_CAPABILITIES_OP: | ||
| raise HTTPException( | ||
| status_code=404, | ||
| detail=f"Unknown Agenta tool: {call_ref}", | ||
| ) | ||
|
|
||
| arguments = body.data.function.arguments | ||
| if isinstance(arguments, str): | ||
| try: | ||
| arguments = json.loads(arguments) | ||
| except json.JSONDecodeError as e: | ||
| log.warning( | ||
| "Failed to parse find_capabilities arguments as JSON: %s", e | ||
| ) | ||
| arguments = {} | ||
| elif not isinstance(arguments, dict): | ||
| arguments = {} | ||
|
|
||
| use_cases, provider, limit_alternatives = parse_find_capabilities_arguments( | ||
| arguments | ||
| ) | ||
| if not use_cases: | ||
| raise HTTPException( | ||
| status_code=400, | ||
| detail="find_capabilities requires at least one use_case", | ||
| if call_ref == TEST_RUN_CALL_REF and _test_run_arguments_include_delta( | ||
| body.data.function.arguments | ||
| ): | ||
| has_permission = await check_action_access( | ||
| user_uid=request.state.user_id, | ||
| project_id=request.state.project_id, | ||
| permission=Permission.EDIT_WORKFLOWS, | ||
| ) | ||
| if not has_permission: | ||
| raise FORBIDDEN_EXCEPTION | ||
|
|
||
| try: | ||
| result = await self.tools_service.discover_capabilities( | ||
| response = await dispatch_platform_tool_handler( | ||
| call_ref=call_ref, | ||
| arguments=body.data.function.arguments, | ||
| headers=request.headers, | ||
| project_id=UUID(request.state.project_id), | ||
| use_cases=use_cases, | ||
| provider_key=provider, | ||
| limit_alternatives=limit_alternatives, | ||
| user_id=UUID(request.state.user_id), | ||
| workflows_service=self.workflows_service, | ||
| tracing_service=self.tracing_service, | ||
| ) | ||
| except DiscoveryUnsupportedError as e: | ||
| raise HTTPException(status_code=422, detail=e.message) from e | ||
| except PlatformToolHandlerError as e: | ||
| raise HTTPException(status_code=e.status_code, detail=e.message) from e | ||
|
|
||
| tool_result = ToolResult( | ||
| result = ToolResult( | ||
| id=uuid4(), | ||
| data=ToolResultData( | ||
| tool_call_id=body.data.id, | ||
| content=json.dumps(result.model_dump(mode="json", exclude_none=True)), | ||
| content=response.model_dump_json(exclude_none=True), | ||
| ), | ||
| status=Status( | ||
| timestamp=datetime.now(timezone.utc), | ||
| code="STATUS_CODE_OK", | ||
| ), | ||
| ) | ||
| return ToolCallResponse(call=tool_result) | ||
|
|
||
| return ToolCallResponse(call=result) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
_call_reserved_agenta_tool always reports STATUS_CODE_OK, even on infra-level test_run failures.
Unlike the direct-call path in call_tool (line ~1377: "STATUS_CODE_OK" if successful else "STATUS_CODE_ERROR"), this always sets code="STATUS_CODE_OK" regardless of TestRunResponse.verdict. Since _digest_test_run_response (in platform_handlers.py) returns verdict="failed" both for business-level tool errors and for genuine infrastructure failures (workflow invoke timeout/502, non-2xx status), any consumer inspecting only the outer ToolResult.status.code (rather than parsing the JSON content) cannot distinguish an actual outage from a normal response.
| class PlatformToolHandlerError(Exception): | ||
| status_code = 400 | ||
|
|
||
| def __init__(self, message: str): | ||
| self.message = message | ||
| super().__init__(message) | ||
|
|
||
|
|
||
| class PlatformToolHandlerNotFound(PlatformToolHandlerError): | ||
| status_code = 404 | ||
|
|
||
|
|
||
| class PlatformToolHandlerUnavailable(PlatformToolHandlerError): | ||
| status_code = 501 | ||
|
|
||
|
|
||
| class PlatformToolHandlerRefused(PlatformToolHandlerError): | ||
| status_code = 400 | ||
|
|
||
|
|
||
| class TestRunTarget(BaseModel): | ||
| model_config = ConfigDict(extra="forbid") | ||
|
|
||
| workflow_variant_id: UUID | ||
|
|
||
|
|
||
| class TestRunInputs(BaseModel): | ||
| model_config = ConfigDict(extra="forbid") | ||
|
|
||
| messages: List[Dict[str, Any]] | ||
|
|
||
|
|
||
| class TestRunExpectations(BaseModel): | ||
| model_config = ConfigDict(extra="forbid") | ||
|
|
||
| terminal_tool: Optional[str] = None | ||
|
|
||
|
|
||
| class TestRunRequest(BaseModel): | ||
| model_config = ConfigDict(extra="forbid") | ||
|
|
||
| target: TestRunTarget | ||
| inputs: TestRunInputs | ||
| delta: Optional[WorkflowRevisionDelta] = None | ||
| expectations: Optional[TestRunExpectations] = None | ||
|
|
||
|
|
||
| class TestRunToolDigest(BaseModel): | ||
| name: str | ||
| called: bool = True | ||
| returned: bool = False | ||
| error: bool = Field(default=False, exclude=True) | ||
|
|
||
|
|
||
| class TestRunResolved(BaseModel): | ||
| harness: Optional[str] = None | ||
| model: Optional[str] = None | ||
| provider: Optional[str] = None | ||
| connection_mode: Optional[str] = None | ||
|
|
||
|
|
||
| TestRunVerdict = Literal["pass", "incomplete", "unconfirmed", "failed"] | ||
|
|
||
|
|
||
| class TestRunResponse(BaseModel): | ||
| output: str = "" | ||
| tools: List[TestRunToolDigest] = Field(default_factory=list) | ||
| approvals: List[str] = Field(default_factory=list) | ||
| resolved: TestRunResolved = Field(default_factory=TestRunResolved) | ||
| trace_id: Optional[str] = None | ||
| test_id: Optional[str] = None | ||
| verdict: TestRunVerdict | ||
| verdict_reason: Optional[str] = None | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class PlatformToolHandlerRegistration: | ||
| call_ref: str | ||
| timeout_ms: int | ||
|
|
||
|
|
||
| PLATFORM_TOOL_HANDLERS: Dict[str, PlatformToolHandlerRegistration] = { | ||
| TEST_RUN_CALL_REF: PlatformToolHandlerRegistration( | ||
| call_ref=TEST_RUN_CALL_REF, | ||
| timeout_ms=TEST_RUN_DEFAULT_TIMEOUT_MS, | ||
| ) | ||
| } | ||
|
|
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Domain contracts/exceptions live in the handler module instead of dtos.py/types.py.
TestRunTarget/TestRunInputs/TestRunResponse/etc. and the PlatformToolHandlerError exception hierarchy are all defined directly in platform_handlers.py. Per the coding guidelines for this domain, request/response contracts belong in dtos.py/types.py and domain exceptions belong in the core layer's types.py/dtos.py with a base exception per domain, while platform_handlers.py should hold only the business orchestration. Consider splitting these definitions out into oss/src/core/tools/dtos.py (or a new types.py) to align with the established domain layout used elsewhere in core/tools.
As per coding guidelines, api/oss/src/core/**/{types.py,dtos.py}: "Define domain exceptions in the core layer, include structured context, and use a base exception per domain; do not raise HTTPException from services or DAOs," and api/oss/src/core/**/{dtos.py,types.py,interfaces.py,service.py}: "keep domain contracts in dtos.py or types.py... and business orchestration in service.py."
Source: Coding guidelines
| if request.delta is not None: | ||
| await workflows_service._ensure_request_revision( | ||
| project_id=project_id, | ||
| request=workflow_request, | ||
| ) | ||
| if not workflow_request.data or not workflow_request.data.revision: | ||
| raise PlatformToolHandlerError( | ||
| "test_run could not resolve the target workflow variant revision." | ||
| ) | ||
|
|
||
| resolved = await workflows_service._resolve_revision_delta( | ||
| project_id=project_id, | ||
| workflow_revision_commit=WorkflowRevisionCommit( | ||
| workflow_variant_id=request.target.workflow_variant_id, | ||
| delta=request.delta, | ||
| ), | ||
| ) | ||
| if resolved.data is None: | ||
| raise PlatformToolHandlerError( | ||
| "test_run could not resolve the revision delta." | ||
| ) | ||
| workflow_request.data.revision = {"data": resolved.data.model_dump(mode="json")} | ||
| return workflow_request | ||
|
|
||
| await workflows_service._ensure_request_revision( | ||
| project_id=project_id, | ||
| request=workflow_request, | ||
| ) | ||
| if not workflow_request.data or not workflow_request.data.revision: | ||
| raise PlatformToolHandlerError( | ||
| "test_run could not resolve the target workflow variant revision." | ||
| ) | ||
| return workflow_request |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect WorkflowRevisionDelta and _resolve_revision_delta to confirm which
# fields (e.g. "url") a delta's "set" can mutate.
rg -n -A 30 'class WorkflowRevisionDelta' api/oss/src/core/workflows
rg -n -A 40 'def _resolve_revision_delta' api/oss/src/core/workflows
rg -n -A 30 'def _prepare_invoke' api/oss/src/core/workflowsRepository: Agenta-AI/agenta
Length of output: 7981
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the relevant models and invoke path.
ast-grep outline api/oss/src/core/workflows/dtos.py --match WorkflowRevisionData --view expanded
ast-grep outline api/oss/src/core/workflows/dtos.py --match WorkflowRevisionDelta --view expanded
ast-grep outline api/oss/src/core/workflows/service.py --match _prepare_invoke --view expanded
ast-grep outline api/oss/src/core/workflows/service.py --match _invoke_child_workflow --view expanded
# Read the focused sections with line numbers.
sed -n '1,220p' api/oss/src/core/workflows/dtos.py | cat -n
printf '\n----\n'
sed -n '2035,2105p' api/oss/src/core/workflows/service.py | cat -n
printf '\n----\n'
sed -n '2105,2195p' api/oss/src/core/workflows/service.py | cat -nRepository: Agenta-AI/agenta
Length of output: 13712
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find the service URL derivation and the workflow revision data schema.
rg -n -A 40 -B 10 '_get_service_url|service_url|has_url|url:' api/oss/src/core/workflows
rg -n -A 80 'class WorkflowRevisionData|class BaseWorkflowRevisionData|WorkflowRevisionData\(' -g '*.py' .Repository: Agenta-AI/agenta
Length of output: 50373
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the exact workflow revision data schema and URL derivation logic.
rg -n -A 40 -B 10 'class WorkflowRevisionData|class BaseWorkflowRevisionData|url:|uri:|has_url' sdks/python/agenta/sdk/models/workflows.py api/oss/src/core/workflows/dtos.py
printf '\n----\n'
rg -n -A 40 -B 10 '_get_service_url|def _get_service_url|service_url =|has_url' api/oss/src/core/workflows/service.pyRepository: Agenta-AI/agenta
Length of output: 29850
Block url changes in revision deltas. WorkflowRevisionData explicitly allows url, and _resolve_revision_delta() deep-merges delta.set before _get_service_url() reads it. That lets an EDIT_WORKFLOWS caller point the server-side POST at an arbitrary HTTP(S) endpoint and send server-minted credentials there.
| def _tools_from_messages(messages: List[Any]) -> Dict[str, TestRunToolDigest]: | ||
| by_call_id: Dict[str, TestRunToolDigest] = {} | ||
| by_name: Dict[str, TestRunToolDigest] = {} | ||
| for message in messages: | ||
| if not isinstance(message, dict) or message.get("role") != "tool": | ||
| continue | ||
| call_id = message.get("tool_call_id") or message.get("toolCallId") | ||
| name = message.get("tool_name") or message.get("toolName") | ||
| if name: | ||
| digest = by_name.setdefault(name, TestRunToolDigest(name=name)) | ||
| if call_id: | ||
| by_call_id[str(call_id)] = digest | ||
| if _has_tool_result_content(message): | ||
| digest.returned = True | ||
| digest.error = bool(message.get("is_error") or message.get("isError")) | ||
| continue | ||
| if call_id and str(call_id) in by_call_id: | ||
| digest = by_call_id[str(call_id)] | ||
| if _has_tool_result_content(message): | ||
| digest.returned = True | ||
| digest.error = bool(message.get("is_error") or message.get("isError")) | ||
| return by_name |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
_tools_from_messages overwrites, rather than accumulates, returned/error per tool name.
Since by_name.setdefault(name, ...) shares one TestRunToolDigest across all calls to the same tool name, a second call to the same tool overwrites digest.error/digest.returned set by an earlier call. If a tool is invoked twice in one run (e.g. once erroring, once succeeding), the earlier error is silently discarded and _verdict() (line 593-595) will not flag it — masking a real failure in what is meant to be a correctness-verifying test_run. Contrast with _merge_span_observations (lines 506-518), which correctly only sets flags to True and never resets them.
🐛 Proposed fix to accumulate instead of overwrite
if name:
digest = by_name.setdefault(name, TestRunToolDigest(name=name))
if call_id:
by_call_id[str(call_id)] = digest
if _has_tool_result_content(message):
- digest.returned = True
- digest.error = bool(message.get("is_error") or message.get("isError"))
+ digest.returned = True
+ digest.error = digest.error or bool(
+ message.get("is_error") or message.get("isError")
+ )
continue
if call_id and str(call_id) in by_call_id:
digest = by_call_id[str(call_id)]
if _has_tool_result_content(message):
- digest.returned = True
- digest.error = bool(message.get("is_error") or message.get("isError"))
+ digest.returned = True
+ digest.error = digest.error or bool(
+ message.get("is_error") or message.get("isError")
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _tools_from_messages(messages: List[Any]) -> Dict[str, TestRunToolDigest]: | |
| by_call_id: Dict[str, TestRunToolDigest] = {} | |
| by_name: Dict[str, TestRunToolDigest] = {} | |
| for message in messages: | |
| if not isinstance(message, dict) or message.get("role") != "tool": | |
| continue | |
| call_id = message.get("tool_call_id") or message.get("toolCallId") | |
| name = message.get("tool_name") or message.get("toolName") | |
| if name: | |
| digest = by_name.setdefault(name, TestRunToolDigest(name=name)) | |
| if call_id: | |
| by_call_id[str(call_id)] = digest | |
| if _has_tool_result_content(message): | |
| digest.returned = True | |
| digest.error = bool(message.get("is_error") or message.get("isError")) | |
| continue | |
| if call_id and str(call_id) in by_call_id: | |
| digest = by_call_id[str(call_id)] | |
| if _has_tool_result_content(message): | |
| digest.returned = True | |
| digest.error = bool(message.get("is_error") or message.get("isError")) | |
| return by_name | |
| def _tools_from_messages(messages: List[Any]) -> Dict[str, TestRunToolDigest]: | |
| by_call_id: Dict[str, TestRunToolDigest] = {} | |
| by_name: Dict[str, TestRunToolDigest] = {} | |
| for message in messages: | |
| if not isinstance(message, dict) or message.get("role") != "tool": | |
| continue | |
| call_id = message.get("tool_call_id") or message.get("toolCallId") | |
| name = message.get("tool_name") or message.get("toolName") | |
| if name: | |
| digest = by_name.setdefault(name, TestRunToolDigest(name=name)) | |
| if call_id: | |
| by_call_id[str(call_id)] = digest | |
| if _has_tool_result_content(message): | |
| digest.returned = True | |
| digest.error = digest.error or bool( | |
| message.get("is_error") or message.get("isError") | |
| ) | |
| continue | |
| if call_id and str(call_id) in by_call_id: | |
| digest = by_call_id[str(call_id)] | |
| if _has_tool_result_content(message): | |
| digest.returned = True | |
| digest.error = digest.error or bool( | |
| message.get("is_error") or message.get("isError") | |
| ) | |
| return by_name |
| def _select_candidates( | ||
| conn: psycopg.Connection[Any], project_id: str | None | ||
| ) -> list[tuple[str, str, Any]]: | ||
| match_terms = { | ||
| "find_capabilities": "%find_capabilities%", | ||
| "find_triggers": "%find_triggers%", | ||
| "build_your_first_app": "%__ag__build_your_first_app%", | ||
| "discover_and_wire_tools": "%__ag__discover_and_wire_tools%", | ||
| "set_up_triggers": "%__ag__set_up_triggers%", | ||
| } | ||
| where = [ | ||
| "data IS NOT NULL", | ||
| "(" + " OR ".join(f"data::text LIKE %({key})s" for key in match_terms) + ")", | ||
| ] | ||
| params: dict[str, Any] = dict(match_terms) | ||
| if project_id is not None: | ||
| where.append("project_id = %(project_id)s::uuid") | ||
| params["project_id"] = project_id | ||
|
|
||
| query = f""" | ||
| SELECT project_id::text, id::text, data | ||
| FROM workflow_revisions | ||
| WHERE {" AND ".join(where)} | ||
| ORDER BY project_id, id | ||
| """ | ||
| with conn.cursor() as cur: | ||
| cur.execute(query, params) | ||
| return [(row[0], row[1], _json_data(row[2])) for row in cur.fetchall()] |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file='docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py'
wc -l "$file"
printf '\n--- outline ---\n'
ast-grep outline "$file" --view expanded
printf '\n--- candidate selection and cleanup flow ---\n'
rg -n "def _select_candidates|def assert_clean|def main|uncovered|SystemExit|LIKE %\\(" "$file" -A 40 -B 20Repository: Agenta-AI/agenta
Length of output: 10134
Narrow the candidate filter to the covered JSON paths
data::text LIKE can still match unrelated free-form text, and assert_clean turns those rows into a hard SystemExit as “matched but not rewritten (uncovered path)” even when no covered op/embed paths remain. Make uncovered matches non-fatal, or prefilter on the actual JSON paths instead.
PR title
feat(sdk,api): build-kit tools cleanup + test_run server halfPR body
Context
The playground build kit is the tool set a fresh agent uses to build itself. It carried 19 tools plus three cross-referencing authoring skills, and in end-to-end tests the builder agent wandered: it guessed op names, drifted between overlapping skills, and had no way to test the agent it had just built. The agent-creation-lab kit proved the fix outside Agenta: a small ordered tool set, one playbook, and a self-verifying test step. The tools-review compared the 19 inside tools against the lab's 13 outside scripts, with evidence per verdict, and design PR #5060 turned that into this plan. This PR executes it: slices 1-5a, one commit per slice, on
docs/design/agent-workflows/projects/build-kit-tools-cleanup/plan.md.Changes
Slice 1: rename the discovery ops, hard.
find_capabilitiesis nowdiscover_toolsandfind_triggersis nowdiscover_triggers. No aliases (decided; pre-production). The legacy reserved/tools/callroute fortools.agenta.find_capabilitiesis deleted. A sweep script rewrites committed revisions that still carry the old keys (projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py).Slice 2: an explicit default overlay. The overlay previously embedded every catalog op. It now embeds an explicit
DEFAULT_BUILD_KIT_OPSlist of 12 ops plus therequest_connectionclient tool. Cut from the default (all stay in the catalog for opt-in):pause_schedule,resume_schedule,pause_subscription,resume_subscription,query_workflows,list_connections,list_subscriptions.Slice 3:
query_spans. A new read-only platform op overPOST /api/spans/query, so the builder can read its own past runs and verify a tool actually fired. The op schema mirrorsSpansQueryRequest; a drift contract test pins the two together.Slice 4: one playbook skill. A single ordered
build-an-agentskill (slug__ag__build_an_agent) replaces build-your-first-app, discover-and-wire-tools, and set-up-triggers. The overlay embeds only the playbook.getting-startedstays harness-forced onpi_agenta, so it is delivered once instead of twice.Slice 5a:
test_run, server half only. The op that lets the builder test its own variant:PlatformOpgains a handler mode: an op targets exactly one ofmethod+path(endpoint) orhandler(an allowlisted reservedtools.agenta.<op>call-ref).callRef+ spec-levelcontextBindings+timeoutMs. This is flag-gated OFF (AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS) until the runner half lands./tools/callgains a reserved-ref registry: a registeredtools.agenta.*ref dispatches to its Python handler; an unknown one fails loud with a 404.test_runhandler hydrates the bound variant's committed revision, applies an optional in-memorydelta(requiresEDIT_WORKFLOWS), invokes it headless with a server-minted token, and returns a digest with a terminal-result-wins verdict, under a 120s ceiling. A recursion marker is emitted but inert until the runner half lands.test_runis NOT in the overlay yet; it joins when the flag flips.Plus one SDK resolver fix found live: an inline revision committed without
parametersnow seeds the default agent template, the same way a reference resolution does. Before the fix,test_runagainst a fresh playground app failed because the child invoke saw an empty config.Assumed decisions (review these first)
Three calls proceeded on their recommendations as provisional defaults. They were flagged, not decided:
query_spans. The default is a fixed 12-op list, not a conditional event pack. Argued inprojects/build-kit-tools-cleanup/research.md(overlay-scope).test_runis sync with an in-memorydelta. 120s cap, per-optimeout_msplumbing;test_idstays reserved so an async pair remains additive. Argued inapi-design.md(shape-decision).query_spansships now, not held fortest_run. It stays useful aftertest_runlands (reading scheduled fires). Argued inapi-design.md(query_spans stopgap).Scope / risk
test_run(slice 5b) is deferred. The flag (AGENTA_AGENT_ENABLE_PLATFORM_HANDLERS, default off) protects exactly this: with it off, a handler-mode op fails resolution loudly instead of reaching a runner that cannot dispatch a reservedcallRef, inject spec-level context, or honortimeoutMs.protocol.tsand the golden wire fixtures do not carry the new spec fields yet. Open-issues entry filed with the full 5b scope.gateway->serverrename (slice 6) is deferred. Codex review scoped the safe version to docs/UI labels; the storedtype: "gateway"literal is a data migration. Open-issues entry filed.find_capabilities/find_triggersfail loud (UnknownPlatformOpError) at resolve time until the sweep script runs against that database. Dev DBs are swept; a production deploy must run the same sweep. Acceptable pre-production, but it is a deploy step, not a no-op.agentRequest.test.tsop literals and theCapabilitiesResultdocstrings in the generated clients update at the next codegen.test_runsetsread_only=Falseand nothing else, so the merged permission plan resolves it to ask by default.How to QA
Prerequisites: local dev stack (
load-env hosting/docker-compose/ee/.env.ee.dev+bash ./hosting/docker-compose/run.sh --ee --dev --build). For step 4, setAGENTA_AGENT_ENABLE_PLATFORM_HANDLERS=trueon the API and grab a project API key.Steps:
Create a fresh agent application in the playground. Open its config: the build-kit block must show the 12 default ops (
discover_tools,commit_revision,annotate_trace,query_spans,discover_triggers,create_schedule,create_subscription,list_schedules,list_deliveries,test_subscription,remove_schedule,remove_subscription), therequest_connectionclient tool, and exactly one skill embed, thebuild-an-agentplaybook. No pause/resume ops, noquery_workflows.Chat with the agent ("find me a tool that creates GitHub issues"). Watch it call
discover_toolsand come back with Agenta-shaped capabilities and connection state.Confirm the old names are gone: a config carrying
{type:"platform", op:"find_capabilities"}fails resolution withUnknownPlatformOpError.Curl
test_rundirectly (server half):call.data.contentis a JSON digest with a verdict."name": "tools__agenta__nope"(unknown reserved ref).inputs). Adeltafrom a caller withoutEDIT_WORKFLOWSreturns 403.Run the sweep script against the dev DB (dry run is the default; it reads
POSTGRES_URI_CORE, falls back toDATABASE_URL) and confirm it reports zero remaining old-key tools or old-slug skill embeds:uv run --script docs/design/agent-workflows/projects/build-kit-tools-cleanup/scripts/sweep_platform_op_renames.py. Add--applyto commit rewrites.Expected result: steps 1-2 show the leaner kit working end to end; steps 3-5 show the hard migration failing loud and the sweep closing it.
Automated tests:
Expected: 250 passed (api) and 1596 passed, 10 xfailed (sdk).
Edge cases: a revision committed before this PR with the old op keys (fails loud until swept, by design); a handler-mode op with the flag off (must fail resolution with a clear message, not resolve to a broken spec); the playbook embed on a fresh app (must resolve by artifact slug; a bare revision slug 500s).
Live verification
Phase-3 verification ran against the local dev stack: a fresh playground agent showed the 12-op block plus the playbook and resolved
getting-startedonce (forced) and the playbook once (overlay);discover_tools/discover_triggersran end to end in chat;query_spansreturned model-digestible spans for a real trace;test_runreturned a digest and verdict with the flag on, and the 404/400/403 negative paths behaved; the sweep dry-run came back clean after the dev sweep.Two findings from that pass:
parameters, which broketest_run's child invoke. The SDK resolver now seeds the default template for parameters-less inline revisions, matching the reference path.trace_id=None(the sidecar's persist/heartbeat 401s), leavingtest_run's span digest empty on that path. Tracked as a tracing follow-up in the project workspace, not a build-kit bug.https://claude.ai/code/session_014iPB7HL5PjgT9npyPHaFMT