feat(agent): agent playground — turn inspector, HITL dock, and tool I/O fidelity by ardaerzin · Pull Request #5054 · Agenta-AI/agenta

ardaerzin · 2026-07-02T23:29:49Z

What

A batch of agent-playground work layered on big-agents: builder-facing tooling for inspecting a turn, a persistent HITL approval surface, richer chat input, and a set of runner/SDK/FE fixes so tool inputs, outputs, and approvals render faithfully. Includes the end-to-end fix for the HITL approval resume loop.

Why

The agent playground could run an agent but gave builders little insight into what happened in a turn, and several streaming-fidelity bugs made tool activity hard to trust — tool inputs showed {}, and any tool requiring human approval got stuck in an infinite re-approval loop, never showing its output.

Changes

Turn Inspector (Build-mode tooling) — a per-session inline side panel (Timeline / Context / Raw) that reads the live useChat messages: the full round (user message → reasoning → tool I/O → response), the exact config + messages sent, and copyable raw payloads. Mode-adaptive empty state (agent-aware in Build, warm-minimal in Chat). Design + plan docs under docs/design/agent-workflows/projects/agent-turn-inspector/.

Build-mode step log — inline per-tool input/output/error blocks, individually collapsible.

HITL approval dock — a persistent, non-scrolling approval surface with a hardened queue release, replacing the inline-only gate.

Chat UX — rich chat input (links, code blocks), calmer composer, and scroll engineering (pin new turn, stop jump-to-top on stream-state change / tool collapse).

HITL approval resume loop — root cause + fix (runner · SDK · FE)

An approved tool kept re-parking forever and never showed output. Traced through the live runner logs to a chain of same-root issues — the cold-replay runner re-issues the approved tool under a fresh tool-call id, so its output never lands on the approved part:

Runner — the cross-turn approval key drifted. Claude-over-ACP names the same call differently across frames (the session/update tool_call titles it Terminal; the permission request titles it the full command), and neither carries a stable name/spec. The key now anchors on the recorded tool_call name (stamped as resolvedName), so the live re-raised key equals the stored key and the gate resolves. Kept the non-converging loop-breaker + [HITL] diagnostics as a fail-safe.
SDK — the vercel egress mirrors that anchor (resolvedName → spec name → title) and no longer lets a late arg-refresh downgrade the name; [HITL] egress/ingress logging for the round-trip.
FE (resume predicate) — agentShouldResumeAfterApproval re-sent after every completion because the answered approval-responded part lingers in the message; now guarded on "already resumed" (a step-start follows the approval), so it resumes exactly once.
FE (tool rendering) — the lingering answered gate rendered as a perpetual spinner with no output. AgentMessage collapses it into its executed sibling (same tool + input), ToolActivity treats approval-responded as resolved (approved, not running), and tool output (not just errors) has its markdown code fence stripped.
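
The "resumes exactly once" guard in the FE resume predicate can be sketched as below. This is a minimal illustration, not the actual agentShouldResumeAfterApproval: the `Part` shapes and the function name `shouldResumeAfterApproval` are assumptions, and the real message parts carry more fields.

```typescript
// Hypothetical, simplified part shapes for illustration only.
type Part =
    | {type: "step-start"}
    | {type: "text"; text: string}
    | {
          type: "tool"
          toolCallId: string
          state: "approval-requested" | "approval-responded" | "output-available"
      }

// Edge-triggered resume: an answered approval triggers a resume only while
// no step-start follows it. A later step-start means the resumed turn has
// already begun, so the lingering approval-responded part is ignored.
function shouldResumeAfterApproval(parts: Part[]): boolean {
    let lastAnswered = -1
    parts.forEach((part, index) => {
        if (part.type === "tool" && part.state === "approval-responded") {
            lastAnswered = index
        }
    })
    if (lastAnswered === -1) return false
    return !parts.slice(lastAnswered + 1).some((part) => part.type === "step-start")
}
```

A chained approval still resumes under this rule, because the second answered approval sits after the step-start that closed the first one.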

Tool inputs — emit the tool_call up front (so the FE part + HITL approval attach to it), then refresh its input when the real args arrive on a later tool_call_update; fixes the always-{} display for non-gated tools without breaking the emit-first invariant. Unique vercel-stream messageId per turn.
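
A rough sketch of the emit-first, refresh-later handling described above, assuming a simplified frame shape. `ToolCallFrame`, `ToolPart`, and `applyFrame` are hypothetical names for illustration, not the runner's or SDK's actual types.

```typescript
// Hypothetical frame and part shapes; the real stream carries more fields.
type ToolCallFrame =
    | {kind: "tool_call"; id: string; name: string; input?: Record<string, unknown>}
    | {kind: "tool_call_update"; id: string; input?: Record<string, unknown>}

interface ToolPart {
    id: string
    name: string
    input: Record<string, unknown>
}

function applyFrame(parts: Map<string, ToolPart>, frame: ToolCallFrame): void {
    if (frame.kind === "tool_call") {
        // Emit first, even with empty input, so the FE part and any HITL
        // approval have something to attach to.
        parts.set(frame.id, {id: frame.id, name: frame.name, input: frame.input ?? {}})
        return
    }
    const existing = parts.get(frame.id)
    // Refresh the input only when the update actually carries arguments,
    // so an empty late update cannot clobber real args.
    if (existing && frame.input && Object.keys(frame.input).length > 0) {
        existing.input = frame.input
    }
}
```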

Verification

Runner: tsc clean; unit suite green. Live-log confirmed: gate "Terminal" -> stored allow (resume matched) and the tool executes.
SDK: agents unit suite green + ruff clean; new egress tests (spec/resolved-name anchor, arg-refresh no-clobber, park-refresh).
Frontend: resume-predicate tests green (post-resolve guard + chained approval); tsc clean for the touched agent-chat files.

Merged latest big-agents in (resolved two cosmetic conflicts: copy-button "Copied" feedback in AgentMessage, and chat-input theme.ts quote style kept in lock-step with the message bubble's markdown.tsx). Manual browser verification of the inspector / dock / step-log and a park→approve→run pass is still worth doing before merge.

…ssion button Add min-w-0 to the agent playground Tabs so the session tab strip clamps to the available width and scrolls internally instead of pushing the search and history controls off-screen. Move the New session (+) button out of the scroll container into the fixed right-hand actions cluster so it no longer scrolls away with the tabs.

…overable The agent playground disables the splitter's collapse pills (Build/Chat lives in the header), so the drag handle was the only resize affordance and read as a bare hairline. Add a persistent centered grip on the divider — neutral at rest, accent-tinted (colorPrimary at reduced opacity, so it stays visible in dark without shouting) and taller on hover/drag — scoped to a new playground-splitter-agent class so prompt playgrounds keep antd's defaults.

…title row Extract the agent revision selector (variant picker + version/status chip) into a self-contained AgentRevisionSelector and render it next to the agent name in the page header. The config-panel header (ROW B) now shows a 'Configuration' title instead. Scoped to agent mode; prompt/evaluator and embedded surfaces are unchanged.

Add a railInfoLabel helper to RailField (label + inline info tooltip, so a field keeps its help text without a separate description line) and a disabled prop to SectionRail (for read-only revisions). Both back the config-drawer refactors that follow.

…Field rows Convert SandboxPermissionControl, ClaudePermissionsControl and McpServerFormView from stacked label-above LabeledField groups to flat [label | control] RailField rows using railInfoLabel for per-field help. Drops the redundant nested headers and inner borders (the 'form inside a form' look) so each knob is a peer row that shares the section rail.

…gAccordionSection Replace the hand-rolled collapsible card with the shared ConfigAccordionSection (toggle as the header extra, 'Removed on commit' as the status), and render the overlay groups as RailField rows. Add an optional enabledOverride so the Advanced drawer can buffer the build-kit toggle in its draft and only write the persisted atom on Save.

…nced UX Move the Model & harness and Advanced section drawers to a true scoped draft: edits are buffered and relayed to the entity only on Save, with Save gated on a real diff (a second useModelHarness instance holds the draft so the background accordion summaries keep reflecting the saved entity). Advanced is rebuilt on ConfigAccordionSection + RailField (auth as a SectionRail; sandbox/permissions flattened), and the 'Edit as JSON' escape hatch is removed. Model & harness uses the rail rhythm too; compatibility is now self-contained on the harness cards (the current card owns its own model-availability, and availability also matches on the model's provider family to avoid cross-harness id-namespace false negatives), the 'Current' badge tracks the saved harness (not the draft pick), and the redundant compatibility side panel is dropped in favour of version history — matching the Advanced drawer's shape.

…arity Add paste-a-link-over-selection and a plain code-block fence to the RichChatInput (new LinkPastePlugin + CodeFencePlugin, LinkNode/CodeNode registered), and harmonize link/code/blockquote styling between the composer theme and the message-bubble markdown so a block looks identical while typing and after sending.

…ease Move the tool-approval action out of the scrolling transcript into a persistent ApprovalDock pinned above the composer (neutral surface, tool + payload context, animated show/hide, inert while collapsed); the inline tool row now just marks 'Awaiting approval'. Harden the queue: a user stop voids the pending gate so a new message sends immediately instead of queuing, narrow isHitlPending to approval-requested (lockstep with the dock, avoids a queue-freeze trap), and release on a settled 'error' turn.
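
The hardened queue rules can be read as a small decision function. The sketch below is an assumption about how they combine: only `isHitlPending` and the approval-requested state come from the commit message, while `ToolState`, `TurnStatus`, and `shouldQueueMessage` are hypothetical names.

```typescript
// Hypothetical state names, apart from the two approval states.
type ToolState =
    | "input-available"
    | "approval-requested"
    | "approval-responded"
    | "output-available"

type TurnStatus = "streaming" | "ready" | "error"

// Narrowed check: only an unanswered approval counts as pending, which keeps
// the queue in lockstep with what the dock shows.
function isHitlPending(toolStates: ToolState[]): boolean {
    return toolStates.some((state) => state === "approval-requested")
}

// Decide whether a new user message should wait in the queue.
function shouldQueueMessage(
    status: TurnStatus,
    toolStates: ToolState[],
    userStopped: boolean,
): boolean {
    if (userStopped) return false // a user stop voids the pending gate
    if (status === "error") return false // a settled error turn releases the queue
    return status === "streaming" || isHitlPending(toolStates)
}
```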

…ubble The hover toolbar (metrics + copy/rewind/trace) is anchored to the bottom of a reserved lane below each message. The lane was pb-7 (28px) and the toolbar is ~28px tall, so it filled the lane and hugged the bubble text. Bump to pb-10 (40px) so the extra space falls between the bubble and the toolbar.

In Build mode the agent transcript renders each tool call as a full step — per-tool input and output/error as monospace blocks, expanded reasoning — gated on chatPanelMaximizedAtom; Chat mode keeps the calm collapsed summary.

… to target session The inspector read the settle-only sessionMessagesAtom, so it showed a stale or wrong turn and never updated while streaming; and because it was mounted per session, it opened a drawer for every tab. Feed it the live useChat messages + sessionId as props, open it only when it is the target session, and drop the AI SDK step-start/step-end boundary noise from the Timeline.

Each detailed tool step in Build mode now has its own caret toggle — click a step header to collapse/expand its input/output blocks (HeightCollapse), independent of the others. Default expanded.

…inimal in Chat Replace the bare 'Ask a question…' text with a mode-adaptive empty state: Chat mode shows a warm welcome (robot mark + prompt); Build mode shows an agent-aware card (name, model, tool/skill counts, a one-line summary from the instructions) plus curated starter prompts that send on click.

…hange and tool collapse Two reported jump-to-top bugs in the agent playground chat, plus perf hardening of the scroll handler:

- onScroll only re-arms follow on a real scroll-DOWN-to-edge. A content shrink (tool gutter collapsing to "Used N tools", reasoning folding) clamps scrollTop to the new bottom and fires a non-gesture scroll event; a clamp only decreases scrollTop, so `> prevTop` rejects it. Previously that silently re-enabled follow and the next token snapped the min-h-full active turn to the top.
- Coalesce the costly jump-pill measurement (querySelectorAll + getBoundingClientRect) to one rAF/frame; keep the follow decision and SC-3 anchor synchronous. Removes per-scroll-event and per-render forced reflows during streaming.
- Dedup the follow-pin: guarded scrollToBottom so the ResizeObserver and the follow effect don't both write scrollTop for the same growth.
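
The first rule, re-arming follow only on a real downward scroll that reaches the edge, can be sketched as a pure function. `ScrollSample`, `shouldRearmFollow`, and the tolerance value are hypothetical; only the `> prevTop` comparison comes from the commit message.

```typescript
// Snapshot of the scroll container at the time of a scroll event.
interface ScrollSample {
    scrollTop: number
    scrollHeight: number
    clientHeight: number
}

// Hypothetical tolerance for "at the bottom edge".
const EDGE_TOLERANCE_PX = 2

// Re-arm follow only when the user actually scrolled DOWN to the edge.
// A content shrink clamps scrollTop downward (it only ever decreases),
// so the `> prevTop` check rejects that non-gesture scroll event.
function shouldRearmFollow(prevTop: number, now: ScrollSample): boolean {
    const movedDown = now.scrollTop > prevTop
    const distanceToBottom = now.scrollHeight - now.scrollTop - now.clientHeight
    return movedDown && distanceToBottom <= EDGE_TOLERANCE_PX
}
```

With this shape, a collapse that shrinks the content and leaves the viewport pinned at the new bottom reports "at edge" but not "moved down", so follow stays off.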

The Vercel adapter keyed a parked approval on the ACP display fields (name -> title -> kind). A Claude tool has no ACP `name`, so the key was a drift-prone display string: between the park turn and the re-raise the harness could vary it, the cross-turn resume key silently stopped matching, and the gate re-parked every turn (the HITL resume loop).

- _approval_tool_name / _tool_spec_of: prefer the resolved spec's canonical `name` (stable across cold-replay turns), falling back to the old chain when no spec is resolved. Mirrors the runner's permissionToolName precedence so the persisted key and the live re-raised key agree.
- tool-input-available now prefers `rawInput` over the often-empty `input`, so every tool-call path shows the real args (approve-empty-input bug).
- [HITL] ingress/egress info logs to diff the persisted key against the runner's live gate identity.

Covered by test_vercel_stream_park.py.
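
The precedence described above, sketched in TypeScript for consistency with the frontend code rather than the SDK's Python. `ToolCallFields`, `ResolvedSpec`, `approvalToolName`, and `displayInput` are hypothetical names; the field order (spec name, then name -> title -> kind) and the `rawInput` preference are taken from the commit message.

```typescript
// Hypothetical subset of the ACP tool-call fields used for keying.
interface ToolCallFields {
    name?: string
    title?: string
    kind?: string
    input?: Record<string, unknown>
    rawInput?: Record<string, unknown>
}

interface ResolvedSpec {
    name: string
}

// Prefer the resolved spec's canonical name; fall back to the display
// fields (name -> title -> kind) only when no spec is resolved.
function approvalToolName(call: ToolCallFields, spec?: ResolvedSpec): string | undefined {
    return spec?.name || call.name || call.title || call.kind
}

// Prefer rawInput over the often-empty input when showing arguments.
function displayInput(call: ToolCallFields): Record<string, unknown> {
    const raw = call.rawInput
    if (raw && Object.keys(raw).length > 0) return raw
    return call.input ?? {}
}
```

With a resolved spec, a park-turn frame titled "Terminal" and a re-raise frame titled with the full command produce the same key; without one, the two titles yield different keys, which is the drift that caused the re-park loop.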

Runner side of the approval re-park loop, matched to the SDK egress fix.

- permissionToolName / specOf: resolve the gated tool's key from the resolved spec's canonical `name` first (stable across cold-replay turns), then the ACP display fields. This is the same precedence the SDK egress persists, so the stored decision key and the live re-raised key agree instead of drifting apart.
- nonConvergingToolNames + loop-breaker: when a tool's {approved:true} envelopes outnumber its real executions by a threshold, the resume key never matched; DENY the next gate for it (a clean terminal failure the model stops re-issuing) instead of parking forever. Fail-safe under the key fix above.
- [HITL] ground-truth logging across permissions.ts / responder.ts / sandbox_agent.ts (ACP permission, gate hit/miss/park, stored resume state) to diff the persisted keys against the live gate identity field-by-field.

Covered by responder.test.ts.
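
A minimal sketch of the loop-breaker idea, under stated assumptions: the counting structure (`ToolHistory`), the threshold value, and `decideGate` are hypothetical, and this `nonConvergingToolNames` only illustrates the "approvals outnumber executions by a threshold" rule, not the runner's actual implementation.

```typescript
// Hypothetical per-tool counters derived from the session history.
interface ToolHistory {
    approvedEnvelopes: number // {approved:true} envelopes seen for the tool
    executions: number // real executions of the tool
}

// Hypothetical threshold; the commit message does not state the value.
const NON_CONVERGING_THRESHOLD = 2

// Tools whose approvals outnumber their executions by the threshold:
// the resume key evidently never matched for them.
function nonConvergingToolNames(
    history: Map<string, ToolHistory>,
    threshold: number = NON_CONVERGING_THRESHOLD,
): Set<string> {
    const names = new Set<string>()
    history.forEach((counts, name) => {
        if (counts.approvedEnvelopes - counts.executions >= threshold) names.add(name)
    })
    return names
}

type GateDecision = "allow" | "park" | "deny"

// Stored approval wins; otherwise deny a non-converging tool instead of
// parking it again, and park everything else for human approval.
function decideGate(
    toolName: string,
    storedAllow: boolean,
    nonConverging: Set<string>,
): GateDecision {
    if (storedAllow) return "allow"
    if (nonConverging.has(toolName)) return "deny"
    return "park"
}
```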

…treatment with semantic elevation tokens Introduce a surface ladder (app/gutter/raised/canvas/card/inset/chat) that separates the Build view's two workspaces (Configuration authoring panel vs Chat observing canvas) via elevation and containment instead of flat hue. The relationship inverts by theme: raised surfaces are lighter than the canvas in dark mode, white over soft-grey in light. Define semantic CSS classes (ag-panel-raised

# Conflicts: # web/oss/src/components/AgentChatSlice/components/AgentMessage.tsx # web/packages/agenta-ui/src/RichChatInput/assets/theme.ts

The live approve-loop is diagnosed: a constant stream messageId plus a level-triggered resume predicate on the frontend (new finding M7), compounded by tool-name drift across ACP frames breaking the decision key (M2's observed form, not argument drift). Updated the explainer's live-warning section, reframed M2 and added M7 in the code review, settled the fix direction in the plan (direct replay of the approved call; absorb #5054's message-id fix and edge-trigger guard; supersede its resolvedName patch and loop-breaker), and recorded the #5054 recommendation in status. Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2

…e target path Global policy becomes four explicit modes (allow|ask|deny|allow_reads): read-only-allow is a policy choice, not a hidden per-tool default, and needs_approval is deleted from the model. 'Disposition' renamed to 'effective permission' everywhere. New 'target path' section shows the clean end-state flow; resume is redesigned to replay the approved call directly. Corrected the session-id story (the playground sends a stable per-conversation id). Added the Pi-builtins explanation (selection is Pi's only native control). Plan gains the stacked-on-#5054 baseline (keep the message-id fix and resume guard; delete resolvedName and the loop-breaker) and updated phases/deltas. Status consolidated for final review. Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2

ardaerzin · 2026-07-03T12:53:00Z

@coderabbitai review

coderabbitai · 2026-07-03T12:53:25Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

ardaerzin added 30 commits July 2, 2026 09:57

Merge remote-tracking branch 'origin/big-agents' into big-agents-work

cc17ab0

Merge remote-tracking branch 'origin/big-agents' into big-agents-work

99462f9

docs(agent): Turn Inspector (Build-mode tooling) design spec

4888081

docs(agent): Turn Inspector implementation plan

84f39b9

feat(frontend): build-mode step log for agent tool calls

d9294a3

In Build mode the agent transcript renders each tool call as a full step — per-tool input and output/error as monospace blocks, expanded reasoning — gated on chatPanelMaximizedAtom; Chat mode keeps the calm collapsed summary.

feat(frontend): turn-inspector open-state atom

5647c76

feat(frontend): turn-inspector Timeline tab

b4c0647

feat(frontend): turn-inspector drawer shell

9d8f9f1

feat(frontend): mount turn inspector + inspect-turn affordance

d47778a

feat(playground): per-turn request capture + correlation helpers

81c727c

feat(frontend): session-scoped turn-capture store

0a56cd0

feat(frontend): capture outgoing agent request per send

a62c380

feat(frontend): turn-inspector Context tab (config + messages sent)

e1c504c

feat(frontend): turn-inspector Raw tab (copyable payloads)

bfd3689

feat(frontend): collapsible individual steps in the build-mode step log

d4fe0d9

Each detailed tool step in Build mode now has its own caret toggle — click a step header to collapse/expand its input/output blocks (HeightCollapse), independent of the others. Default expanded.

vercel Bot deployed to Preview July 3, 2026 08:48 View deployment

Merge remote-tracking branch 'origin/big-agents' into big-agents-work

07b5078

# Conflicts: # web/oss/src/components/AgentChatSlice/components/AgentMessage.tsx # web/packages/agenta-ui/src/RichChatInput/assets/theme.ts

vercel Bot deployed to Preview July 3, 2026 09:53 View deployment

mmabrouk mentioned this pull request Jul 3, 2026

[feat] Approval boundary: pause on authored intent, not session ids #5041

Merged

Merge branch 'big-agents' into big-agents-work

8ab3070

vercel Bot deployed to Preview July 3, 2026 10:44 View deployment

vercel Bot deployed to Preview July 3, 2026 12:31 View deployment

mmabrouk force-pushed the big-agents-work branch from 4bd038a to 8ab3070 Compare July 3, 2026 12:33

vercel Bot deployed to Preview July 3, 2026 12:37 View deployment

vercel Bot deployed to Preview July 3, 2026 12:52 View deployment

mmabrouk force-pushed the big-agents-work branch 2 times, most recently from 3253f42 to 8ab3070 Compare July 3, 2026 13:16

bekossy approved these changes Jul 3, 2026

View reviewed changes

dosubot Bot added the lgtm This PR has been approved by a maintainer label Jul 3, 2026

vercel Bot deployed to Preview July 3, 2026 14:04 View deployment

vercel Bot deployed to Preview July 3, 2026 14:15 View deployment

mmabrouk force-pushed the big-agents-work branch from 68fc038 to 8ab3070 Compare July 3, 2026 14:17

vercel Bot deployed to Preview July 3, 2026 14:39 View deployment

mmabrouk force-pushed the big-agents-work branch from a11b58c to 8ab3070 Compare July 3, 2026 14:40

ardaerzin closed this Jul 3, 2026

ardaerzin wants to merge 45 commits into big-agents from big-agents-work
