feat(agent): agent playground — turn inspector, HITL dock, and tool I/O fidelity#5054
Closed
ardaerzin wants to merge 45 commits into
…ssion button Add min-w-0 to the agent playground Tabs so the session tab strip clamps to the available width and scrolls internally instead of pushing the search and history controls off-screen. Move the New session (+) button out of the scroll container into the fixed right-hand actions cluster so it no longer scrolls away with the tabs.
…overable The agent playground disables the splitter's collapse pills (Build/Chat lives in the header), so the drag handle was the only resize affordance and read as a bare hairline. Add a persistent centered grip on the divider — neutral at rest, accent-tinted (colorPrimary at reduced opacity, so it stays visible in dark without shouting) and taller on hover/drag — scoped to a new playground-splitter-agent class so prompt playgrounds keep antd's defaults.
…title row Extract the agent revision selector (variant picker + version/status chip) into a self-contained AgentRevisionSelector and render it next to the agent name in the page header. The config-panel header (ROW B) now shows a 'Configuration' title instead. Scoped to agent mode; prompt/evaluator and embedded surfaces are unchanged.
Add a railInfoLabel helper to RailField (label + inline info tooltip, so a field keeps its help text without a separate description line) and a disabled prop to SectionRail (for read-only revisions). Both back the config-drawer refactors that follow.
…Field rows Convert SandboxPermissionControl, ClaudePermissionsControl and McpServerFormView from stacked label-above LabeledField groups to flat [label | control] RailField rows using railInfoLabel for per-field help. Drops the redundant nested headers and inner borders (the 'form inside a form' look) so each knob is a peer row that shares the section rail.
…gAccordionSection Replace the hand-rolled collapsible card with the shared ConfigAccordionSection (toggle as the header extra, 'Removed on commit' as the status), and render the overlay groups as RailField rows. Add an optional enabledOverride so the Advanced drawer can buffer the build-kit toggle in its draft and only write the persisted atom on Save.
…nced UX Move the Model & harness and Advanced section drawers to a true scoped draft: edits are buffered and relayed to the entity only on Save, with Save gated on a real diff (a second useModelHarness instance holds the draft so the background accordion summaries keep reflecting the saved entity). Advanced is rebuilt on ConfigAccordionSection + RailField (auth as a SectionRail; sandbox/permissions flattened), and the 'Edit as JSON' escape hatch is removed. Model & harness uses the rail rhythm too; compatibility is now self-contained on the harness cards (the current card owns its own model-availability, and availability also matches on the model's provider family to avoid cross-harness id-namespace false negatives), the 'Current' badge tracks the saved harness (not the draft pick), and the redundant compatibility side panel is dropped in favour of version history — matching the Advanced drawer's shape.
…arity Add paste-a-link-over-selection and a plain code-block fence to the RichChatInput (new LinkPastePlugin + CodeFencePlugin, LinkNode/CodeNode registered), and harmonize link/code/blockquote styling between the composer theme and the message-bubble markdown so a block looks identical while typing and after sending.
…ease Move the tool-approval action out of the scrolling transcript into a persistent ApprovalDock pinned above the composer (neutral surface, tool + payload context, animated show/hide, inert while collapsed); the inline tool row now just marks 'Awaiting approval'. Harden the queue: a user stop voids the pending gate so a new message sends immediately instead of queuing, narrow isHitlPending to approval-requested (lockstep with the dock, avoids a queue-freeze trap), and release on a settled 'error' turn.
…ubble The hover toolbar (metrics + copy/rewind/trace) is anchored to the bottom of a reserved lane below each message. The lane was pb-7 (28px) and the toolbar is ~28px tall, so it filled the lane and hugged the bubble text. Bump to pb-10 (40px) so the extra space falls between the bubble and the toolbar.
In Build mode the agent transcript renders each tool call as a full step — per-tool input and output/error as monospace blocks, expanded reasoning — gated on chatPanelMaximizedAtom; Chat mode keeps the calm collapsed summary.
… to target session The inspector read the settle-only sessionMessagesAtom, so it showed a stale/wrong turn and never updated while streaming; and being mounted per session it popped a drawer per tab. Feed it the live useChat messages + sessionId as props, open only when it's the target session, and drop the AI SDK step-start/step-end boundary noise from the Timeline.
Each detailed tool step in Build mode now has its own caret toggle — click a step header to collapse/expand its input/output blocks (HeightCollapse), independent of the others. Default expanded.
…inimal in Chat Replace the bare 'Ask a question…' text with a mode-adaptive empty state: Chat mode shows a warm welcome (robot mark + prompt); Build mode shows an agent-aware card (name, model, tool/skill counts, a one-line summary from the instructions) plus curated starter prompts that send on click.
…hange and tool collapse
Two reported jump-to-top bugs in the agent playground chat, plus perf hardening of the scroll handler:
- onScroll only re-arms follow on a real scroll-DOWN-to-edge. A content shrink (tool gutter collapsing to "Used N tools", reasoning folding) clamps scrollTop to the new bottom and fires a non-gesture scroll event; a clamp only decreases scrollTop, so `> prevTop` rejects it. Previously that silently re-enabled follow and the next token snapped the min-h-full active turn to the top.
- Coalesce the costly jump-pill measurement (querySelectorAll + getBoundingClientRect) to one rAF/frame; keep the follow decision and SC-3 anchor synchronous. Removes per-scroll-event and per-render forced reflows during streaming.
- Dedup the follow-pin: guarded scrollToBottom so the ResizeObserver and the follow effect don't both write scrollTop for the same growth.
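The re-arm rule in the first bullet can be sketched as a pure predicate. This is a minimal illustration, not the playground's actual handler: `ScrollSample`, `shouldRearmFollow`, and the 2px edge tolerance are hypothetical names and values chosen for the example.

```typescript
// Hypothetical shape: the three numbers a scroll container exposes.
interface ScrollSample {
  scrollTop: number;
  scrollHeight: number;
  clientHeight: number;
}

// Assumed tolerance for "at the bottom edge"; the real value is not stated in the commit.
const EDGE_TOLERANCE_PX = 2;

// Re-arm follow only when the position moved DOWN and ended at the bottom edge.
// A content shrink clamps scrollTop downward in value (it decreases), so the
// `> prevTop` check rejects it even though the container ends up at the bottom.
function shouldRearmFollow(prevTop: number, now: ScrollSample): boolean {
  const movedDown = now.scrollTop > prevTop;
  const distanceToBottom = now.scrollHeight - now.clientHeight - now.scrollTop;
  return movedDown && distanceToBottom <= EDGE_TOLERANCE_PX;
}
```

Keeping the decision in a pure function like this would let the follow logic stay synchronous while the more expensive DOM measurement is deferred, which is the split the second bullet describes.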
The Vercel adapter keyed a parked approval on the ACP display fields (name -> title -> kind). A Claude tool has no ACP `name`, so the key was a drift-prone display string: between the park turn and the re-raise the harness could vary it, the cross-turn resume key silently stopped matching, and the gate re-parked every turn (the HITL resume loop).
- _approval_tool_name / _tool_spec_of: prefer the resolved spec's canonical `name` (stable across cold-replay turns), falling back to the old chain when no spec is resolved. Mirrors the runner's permissionToolName precedence so the persisted key and the live re-raised key agree.
- tool-input-available now prefers `rawInput` over the often-empty `input`, so every tool-call path shows the real args (approve-empty-input bug).
- [HITL] ingress/egress info logs to diff the persisted key against the runner's live gate identity.
Covered by test_vercel_stream_park.py.
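The `rawInput`-over-`input` preference from the second bullet might look like the sketch below. The adapter itself is presumably Python (its test file is `test_vercel_stream_park.py`); TypeScript is used here only for illustration, and `ToolCallFrame`, `isNonEmpty`, and `displayInput` are hypothetical names, not the adapter's API.

```typescript
type ToolArgs = Record<string, unknown>;

// Hypothetical frame shape: both fields are optional and `input` is often empty.
interface ToolCallFrame {
  input?: ToolArgs | null;
  rawInput?: ToolArgs | null;
}

// True only for an object that carries at least one key.
function isNonEmpty(args: ToolArgs | null | undefined): args is ToolArgs {
  return args != null && Object.keys(args).length > 0;
}

// Prefer rawInput, fall back to input, and only then show an empty object.
function displayInput(frame: ToolCallFrame): ToolArgs {
  if (isNonEmpty(frame.rawInput)) return frame.rawInput;
  if (isNonEmpty(frame.input)) return frame.input;
  return {};
}
```

Treating an empty object the same as a missing one is an assumption of this sketch; it is what keeps an empty `input` from masking real arguments in `rawInput`.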
Runner side of the approval re-park loop, matched to the SDK egress fix.
- permissionToolName / specOf: resolve the gated tool's key from the resolved
spec's canonical `name` first (stable across cold-replay turns), then the ACP
display fields. This is the same precedence the SDK egress persists, so the
stored decision key and the live re-raised key agree instead of drifting apart.
- nonConvergingToolNames + loop-breaker: when a tool's {approved:true} envelopes
outnumber its real executions by a threshold, the resume key never matched;
DENY the next gate for it (a clean terminal failure the model stops re-issuing)
instead of parking forever. Fail-safe under the key fix above.
- [HITL] ground-truth logging across permissions.ts / responder.ts /
sandbox_agent.ts (ACP permission, gate hit/miss/park, stored resume state) to
diff the persisted keys against the live gate identity field-by-field.
Covered by responder.test.ts.
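The key precedence and the loop-breaker described above can be sketched together. Everything here is a hedged illustration: `GatedTool`, `ToolHistory`, `permissionKey`, `decideGate`, and the threshold value are hypothetical stand-ins, and the runner's real `permissionToolName` / `nonConvergingToolNames` may differ in shape and detail.

```typescript
// Hypothetical view of a gated tool: an optional resolved spec plus ACP display fields.
interface GatedTool {
  spec?: { name?: string };
  name?: string;
  title?: string;
  kind?: string;
}

// Canonical spec name first (stable across turns), then the ACP display chain.
function permissionKey(tool: GatedTool): string {
  return tool.spec?.name ?? tool.name ?? tool.title ?? tool.kind ?? "unknown";
}

// Hypothetical per-tool counters derived from the transcript.
interface ToolHistory {
  approvals: number;  // {approved: true} envelopes seen
  executions: number; // real executions seen
}

// Assumed value; the commit only says "by a threshold".
const ASSUMED_THRESHOLD = 2;

// Tools whose approvals outnumber executions by at least the threshold.
function nonConvergingToolNames(
  history: Record<string, ToolHistory>,
  threshold: number = ASSUMED_THRESHOLD,
): string[] {
  return Object.keys(history).filter((name) => {
    const h = history[name];
    return h !== undefined && h.approvals - h.executions >= threshold;
  });
}

// Deny the next gate for a non-converging tool instead of parking it again.
function decideGate(key: string, nonConverging: string[]): "park" | "deny" {
  return nonConverging.indexOf(key) !== -1 ? "deny" : "park";
}
```

With the key fix in place the loop-breaker should rarely fire, which matches its description as a fail-safe rather than the primary fix.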
…treatment with semantic elevation tokens Introduce a surface ladder (app/gutter/raised/canvas/card/inset/chat) that separates the Build view's two workspaces (Configuration authoring panel vs Chat observing canvas) via elevation and containment instead of flat hue. The relationship inverts by theme: raised surfaces are lighter than the canvas in dark mode, white over soft-grey in light. Define semantic CSS classes (ag-panel-raised
# Conflicts:
# web/oss/src/components/AgentChatSlice/components/AgentMessage.tsx
# web/packages/agenta-ui/src/RichChatInput/assets/theme.ts
mmabrouk
added a commit
that referenced
this pull request
Jul 3, 2026
The live approve-loop is diagnosed: a constant stream messageId plus a level-triggered resume predicate on the frontend (new finding M7), compounded by tool-name drift across ACP frames breaking the decision key (M2's observed form, not argument drift). Updated the explainer's live-warning section, reframed M2 and added M7 in the code review, settled the fix direction in the plan (direct replay of the approved call; absorb #5054's message-id fix and edge-trigger guard; supersede its resolvedName patch and loop-breaker), and recorded the #5054 recommendation in status. Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
mmabrouk
added a commit
that referenced
this pull request
Jul 3, 2026
…e target path Global policy becomes four explicit modes (allow|ask|deny|allow_reads): read-only-allow is a policy choice, not a hidden per-tool default, and needs_approval is deleted from the model. 'Disposition' renamed to 'effective permission' everywhere. New 'target path' section shows the clean end-state flow; resume is redesigned to replay the approved call directly. Corrected the session-id story (the playground sends a stable per-conversation id). Added the Pi-builtins explanation (selection is Pi's only native control). Plan gains the stacked-on-#5054 baseline (keep the message-id fix and resume guard; delete resolvedName and the loop-breaker) and updated phases/deltas. Status consolidated for final review. Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
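The four global policy modes named in this commit message could be modeled as below. This is only a sketch: the type and function names are hypothetical, and the behavior of `allow_reads` for tools that are not read-only (falling back to `ask`) is an assumption, since the commit message does not state it.

```typescript
// The four explicit modes listed in the commit message.
type PolicyMode = "allow" | "ask" | "deny" | "allow_reads";

// What a single tool call ends up with once the mode is applied.
type EffectivePermission = "allow" | "ask" | "deny";

function effectivePermission(mode: PolicyMode, toolIsReadOnly: boolean): EffectivePermission {
  switch (mode) {
    case "allow_reads":
      // ASSUMPTION: tools that are not read-only fall back to "ask".
      return toolIsReadOnly ? "allow" : "ask";
    default:
      // "allow", "ask", and "deny" apply uniformly to every tool.
      return mode;
  }
}
```

Making read-only-allow a mode rather than a per-tool flag means the resolver needs only the mode and one fact about the tool, which is the simplification the commit message points at.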
Contributor Author
@coderabbitai review
✅ Action performed: review finished.
3253f42 to 8ab3070
bekossy
approved these changes
Jul 3, 2026
mmabrouk
added a commit
that referenced
this pull request
Jul 4, 2026
The live approve-loop is diagnosed: a constant stream messageId plus a level-triggered resume predicate on the frontend (new finding M7), compounded by tool-name drift across ACP frames breaking the decision key (M2's observed form, not argument drift). Updated the explainer's live-warning section, reframed M2 and added M7 in the code review, settled the fix direction in the plan (direct replay of the approved call; absorb #5054's message-id fix and edge-trigger guard; supersede its resolvedName patch and loop-breaker), and recorded the #5054 recommendation in status. Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
mmabrouk
added a commit
that referenced
this pull request
Jul 4, 2026
…e target path Global policy becomes four explicit modes (allow|ask|deny|allow_reads): read-only-allow is a policy choice, not a hidden per-tool default, and needs_approval is deleted from the model. 'Disposition' renamed to 'effective permission' everywhere. New 'target path' section shows the clean end-state flow; resume is redesigned to replay the approved call directly. Corrected the session-id story (the playground sends a stable per-conversation id). Added the Pi-builtins explanation (selection is Pi's only native control). Plan gains the stacked-on-#5054 baseline (keep the message-id fix and resume guard; delete resolvedName and the loop-breaker) and updated phases/deltas. Status consolidated for final review. Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
What
A batch of agent-playground work layered on `big-agents`: builder-facing tooling for inspecting a turn, a persistent HITL approval surface, richer chat input, and a set of runner/SDK/FE fixes so tool inputs, outputs, and approvals render faithfully. Includes the end-to-end fix for the HITL approval resume loop.
Why
The agent playground could run an agent but gave builders little insight into what happened in a turn, and several streaming-fidelity bugs made tool activity hard to trust: tool inputs showed `{}`, and any tool requiring human approval got stuck in an infinite re-approval loop, never showing its output.
Changes
- Turn Inspector (Build-mode tooling): a per-session inline side panel (Timeline / Context / Raw) that reads the live `useChat` messages and shows the full round (user message → reasoning → tool I/O → response), the exact config + messages sent, and copyable raw payloads. Mode-adaptive empty state (agent-aware in Build, warm-minimal in Chat). Design + plan docs under `docs/design/agent-workflows/projects/agent-turn-inspector/`.
- Build-mode step log: inline per-tool input/output/error blocks, individually collapsible.
- HITL approval dock: a persistent, non-scrolling approval surface with a hardened queue release, replacing the inline-only gate.
- Chat UX: rich chat input (links, code blocks), calmer composer, and scroll engineering (pin new turn, stop jump-to-top on stream-state change / tool collapse).
HITL approval resume loop: root cause + fix (runner · SDK · FE)
An approved tool kept re-parking forever and never showed output. Traced through the live runner logs to a chain of same-root issues; the cold-replay runner re-issues the approved tool under a fresh tool-call id, so its output never lands on the approved part:
- … `session/update` `tool_call` titles it `Terminal`; the permission request titles it the full command), and neither carries a stable `name`/spec. The key now anchors on the recorded tool_call name (stamped as `resolvedName`), so the live re-raised key equals the stored key and the gate resolves. Kept the non-converging loop-breaker + `[HITL]` diagnostics as a fail-safe.
- … `resolvedName` → spec name → title) and no longer lets a late arg-refresh downgrade the name; `[HITL]` egress/ingress logging for the round-trip.
- … `agentShouldResumeAfterApproval` re-sent after every completion because the answered `approval-responded` part lingers in the message; now guarded on "already resumed" (a `step-start` follows the approval), so it resumes exactly once.
- … `AgentMessage` collapses it into its executed sibling (same tool + input), `ToolActivity` treats `approval-responded` as resolved (approved, not running), and tool output (not just errors) has its markdown code fence stripped.
Tool inputs: emit the `tool_call` up front (so the FE part + HITL approval attach to it), then refresh its input when the real args arrive on a later `tool_call_update`; fixes the always-`{}` display for non-gated tools without breaking the emit-first invariant. Unique vercel-stream `messageId` per turn.
Verification
- … `tsc` clean; unit suite green. Live-log confirmed: `gate "Terminal" -> stored allow (resume matched)` and the tool executes.
- … `tsc` clean for the touched agent-chat files.
Merged latest `big-agents` in (resolved two cosmetic conflicts: copy-button "Copied" feedback in `AgentMessage`, and chat-input `theme.ts` quote style kept in lock-step with the message bubble's `markdown.tsx`). Manual browser verification of the inspector / dock / step-log and a park→approve→run pass is still worth doing before merge.
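The "resumes exactly once" guard from the HITL approval resume loop section can be sketched as an edge check over a message's parts. This is an illustration under assumptions: `MessagePart` is a simplified, hypothetical shape, and `shouldResumeAfterApproval` is a stand-in for, not a copy of, `agentShouldResumeAfterApproval`.

```typescript
// Hypothetical, simplified part shape: a type tag plus an optional tool state.
interface MessagePart {
  type: string;
  state?: string;
}

// Resume only if an answered approval exists and no step-start follows it.
// A step-start after the approval means the resume was already sent, so the
// lingering approval-responded part must not trigger a second send.
function shouldResumeAfterApproval(parts: MessagePart[]): boolean {
  let lastAnswered = -1;
  parts.forEach((part, index) => {
    if (part.state === "approval-responded") lastAnswered = index;
  });
  if (lastAnswered === -1) return false;

  const alreadyResumed = parts
    .slice(lastAnswered + 1)
    .some((part) => part.type === "step-start");
  return !alreadyResumed;
}
```

For example, a message holding only an answered approval would resume, while the same message after the next `step-start` has arrived would not, even though the `approval-responded` part is still present.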