feat(agent): agent playground — turn inspector, HITL dock, and tool I/O fidelity #5054

Closed
ardaerzin wants to merge 45 commits into big-agents from big-agents-work

@ardaerzin ardaerzin commented Jul 2, 2026

What

A batch of agent-playground work layered on big-agents: builder-facing tooling for inspecting a turn, a persistent HITL approval surface, richer chat input, and a set of runner/SDK/FE fixes so tool inputs, outputs, and approvals render faithfully. Includes the end-to-end fix for the HITL approval resume loop.

Why

The agent playground could run an agent but gave builders little insight into what happened in a turn, and several streaming-fidelity bugs made tool activity hard to trust — tool inputs showed {}, and any tool requiring human approval got stuck in an infinite re-approval loop, never showing its output.

Changes

Turn Inspector (Build-mode tooling) — a per-session inline side panel (Timeline / Context / Raw) that reads the live useChat messages: the full round (user message → reasoning → tool I/O → response), the exact config + messages sent, and copyable raw payloads. Mode-adaptive empty state (agent-aware in Build, warm-minimal in Chat). Design + plan docs under docs/design/agent-workflows/projects/agent-turn-inspector/.

Build-mode step log — inline per-tool input/output/error blocks, individually collapsible.

HITL approval dock — a persistent, non-scrolling approval surface with a hardened queue release, replacing the inline-only gate.

Chat UX — rich chat input (links, code blocks), calmer composer, and scroll engineering (pin new turn, stop jump-to-top on stream-state change / tool collapse).

HITL approval resume loop — root cause + fix (runner · SDK · FE)

An approved tool kept re-parking forever and never showed output. Traced through the live runner logs to a chain of same-root issues — the cold-replay runner re-issues the approved tool under a fresh tool-call id, so its output never lands on the approved part:

  • Runner — the cross-turn approval key drifted. Claude-over-ACP names the same call differently across frames (the session/update tool_call titles it Terminal; the permission request titles it the full command), and neither carries a stable name/spec. The key now anchors on the recorded tool_call name (stamped as resolvedName), so the live re-raised key equals the stored key and the gate resolves. Kept the non-converging loop-breaker + [HITL] diagnostics as a fail-safe.
  • SDK — the vercel egress mirrors that anchor (resolvedName → spec name → title) and no longer lets a late arg-refresh downgrade the name; [HITL] egress/ingress logging for the round-trip.
  • FE (resume predicate) — agentShouldResumeAfterApproval re-sent after every completion because the answered approval-responded part lingers in the message; now guarded on "already resumed" (a step-start follows the approval), so it resumes exactly once.
  • FE (tool rendering) — the lingering answered gate rendered as a perpetual spinner with no output. AgentMessage collapses it into its executed sibling (same tool + input), ToolActivity treats approval-responded as resolved (approved, not running), and tool output (not just errors) has its markdown code fence stripped.
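
As a rough illustration of the "already resumed" guard described in the resume-predicate bullet, the sketch below decides whether to re-send based on the order of parts in the last assistant message. The `Part` shape and the function name `shouldResumeAfterApproval` are simplified assumptions for this example, not the actual agentShouldResumeAfterApproval implementation or the full AI SDK part types.

```typescript
// Hypothetical, simplified part shapes; real useChat message parts carry more fields.
type Part =
  | { type: "step-start" }
  | { type: "text"; text: string }
  | {
      type: "tool";
      toolCallId: string;
      state: "approval-requested" | "approval-responded" | "output-available";
    };

// Resume only when the latest answered approval has no step-start after it.
function shouldResumeAfterApproval(parts: Part[]): boolean {
  // Find the most recent approval the user has already answered.
  let lastAnswered = -1;
  parts.forEach((part, index) => {
    if (part.type === "tool" && part.state === "approval-responded") {
      lastAnswered = index;
    }
  });
  if (lastAnswered === -1) return false; // nothing answered, nothing to resume

  // A step-start after the answered approval means the resume already ran.
  const alreadyResumed = parts
    .slice(lastAnswered + 1)
    .some((part) => part.type === "step-start");
  return !alreadyResumed;
}
```

Because the check looks at what follows the latest answered approval rather than at whether one exists anywhere in the message, a lingering approval-responded part stops triggering re-sends, while a second (chained) approval answered later in the same message still resumes once.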

Tool inputs — emit the tool_call up front (so the FE part + HITL approval attach to it), then refresh its input when the real args arrive on a later tool_call_update; fixes the always-{} display for non-gated tools without breaking the emit-first invariant. Unique vercel-stream messageId per turn.
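
The emit-first, refresh-later flow can be sketched roughly as follows. The `Frame` and `ToolPart` shapes and the `applyFrame` helper are hypothetical simplifications for illustration; the real stream events and part types are richer.

```typescript
// Hypothetical, simplified frame and part shapes for illustration only.
type Frame =
  | { kind: "tool_call"; id: string; name: string; input?: Record<string, unknown> }
  | { kind: "tool_call_update"; id: string; input?: Record<string, unknown> };

interface ToolPart {
  id: string;
  name: string;
  input: Record<string, unknown>;
}

function applyFrame(parts: Map<string, ToolPart>, frame: Frame): void {
  if (frame.kind === "tool_call") {
    // Emit the part right away, even when the args are not known yet,
    // so an approval request has something to attach to.
    parts.set(frame.id, { id: frame.id, name: frame.name, input: frame.input ?? {} });
    return;
  }
  const existing = parts.get(frame.id);
  if (!existing) return; // update for an unknown call: ignored in this sketch
  // Refresh the input only when the update actually carries args.
  // The name is never touched, so a late refresh cannot downgrade it.
  if (frame.input && Object.keys(frame.input).length > 0) {
    existing.input = frame.input;
  }
}
```

With this shape, a call first renders with `{}` and is upgraded in place once a tool_call_update brings the real arguments, and an empty update leaves the previously received arguments alone.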

Verification

  • Runner: tsc clean; unit suite green. Live-log confirmed: gate "Terminal" -> stored allow (resume matched) and the tool executes.
  • SDK: agents unit suite green + ruff clean; new egress tests (spec/resolved-name anchor, arg-refresh no-clobber, park-refresh).
  • Frontend: resume-predicate tests green (post-resolve guard + chained approval); tsc clean for the touched agent-chat files.

Merged latest big-agents in (resolved two cosmetic conflicts: copy-button "Copied" feedback in AgentMessage, and chat-input theme.ts quote style kept in lock-step with the message bubble's markdown.tsx). Manual browser verification of the inspector / dock / step-log and a park→approve→run pass is still worth doing before merge.

ardaerzin added 30 commits July 2, 2026 09:57
…ssion button

Add min-w-0 to the agent playground Tabs so the session tab strip clamps to
the available width and scrolls internally instead of pushing the search and
history controls off-screen. Move the New session (+) button out of the scroll
container into the fixed right-hand actions cluster so it no longer scrolls
away with the tabs.
…overable

The agent playground disables the splitter's collapse pills (Build/Chat lives
in the header), so the drag handle was the only resize affordance and read as
a bare hairline. Add a persistent centered grip on the divider — neutral at
rest, accent-tinted (colorPrimary at reduced opacity, so it stays visible in
dark without shouting) and taller on hover/drag — scoped to a new
playground-splitter-agent class so prompt playgrounds keep antd's defaults.
…title row

Extract the agent revision selector (variant picker + version/status chip)
into a self-contained AgentRevisionSelector and render it next to the agent
name in the page header. The config-panel header (ROW B) now shows a
'Configuration' title instead. Scoped to agent mode; prompt/evaluator and
embedded surfaces are unchanged.
Add a railInfoLabel helper to RailField (label + inline info tooltip, so a
field keeps its help text without a separate description line) and a disabled
prop to SectionRail (for read-only revisions). Both back the config-drawer
refactors that follow.
…Field rows

Convert SandboxPermissionControl, ClaudePermissionsControl and McpServerFormView
from stacked label-above LabeledField groups to flat [label | control] RailField
rows using railInfoLabel for per-field help. Drops the redundant nested headers
and inner borders (the 'form inside a form' look) so each knob is a peer row that
shares the section rail.
…gAccordionSection

Replace the hand-rolled collapsible card with the shared ConfigAccordionSection
(toggle as the header extra, 'Removed on commit' as the status), and render the
overlay groups as RailField rows. Add an optional enabledOverride so the Advanced
drawer can buffer the build-kit toggle in its draft and only write the persisted
atom on Save.
…nced UX

Move the Model & harness and Advanced section drawers to a true scoped draft:
edits are buffered and relayed to the entity only on Save, with Save gated on a
real diff (a second useModelHarness instance holds the draft so the background
accordion summaries keep reflecting the saved entity). Advanced is rebuilt on
ConfigAccordionSection + RailField (auth as a SectionRail; sandbox/permissions
flattened), and the 'Edit as JSON' escape hatch is removed. Model & harness uses
the rail rhythm too; compatibility is now self-contained on the harness cards
(the current card owns its own model-availability, and availability also matches
on the model's provider family to avoid cross-harness id-namespace false
negatives), the 'Current' badge tracks the saved harness (not the draft pick),
and the redundant compatibility side panel is dropped in favour of version
history — matching the Advanced drawer's shape.
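
A minimal sketch of the scoped-draft idea (buffer edits, relay on Save, gate Save on a real diff) is shown below. `ScopedDraft` and `deepEqual` are hypothetical names for this example; the actual drawers hold the draft in a second useModelHarness instance rather than in a class like this.

```typescript
// Structural comparison, so key order or a re-created object does not count as a diff.
function deepEqual(a: unknown, b: unknown): boolean {
  if (Object.is(a, b)) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const left = a as Record<string, unknown>;
  const right = b as Record<string, unknown>;
  const keys = Object.keys(left);
  if (keys.length !== Object.keys(right).length) return false;
  return keys.every(
    (key) => Object.prototype.hasOwnProperty.call(right, key) && deepEqual(left[key], right[key]),
  );
}

// Hypothetical draft holder: edits stay local until save() relays them.
class ScopedDraft<T> {
  private saved: T;
  private draft: T;

  constructor(initial: T) {
    this.saved = initial;
    this.draft = initial;
  }

  edit(next: T): void {
    this.draft = next; // buffered only; the saved value is untouched
  }

  canSave(): boolean {
    return !deepEqual(this.saved, this.draft); // Save is gated on a real diff
  }

  save(commit: (value: T) => void): boolean {
    if (!this.canSave()) return false;
    commit(this.draft); // relay the draft to the entity
    this.saved = this.draft;
    return true;
  }
}
```

Keeping the saved value separate from the draft is also what lets background summaries keep reflecting the saved entity while a drawer is open.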
…arity

Add paste-a-link-over-selection and a plain code-block fence to the RichChatInput (new LinkPastePlugin + CodeFencePlugin, LinkNode/CodeNode registered), and harmonize link/code/blockquote styling between the composer theme and the message-bubble markdown so a block looks identical while typing and after sending.
…ease

Move the tool-approval action out of the scrolling transcript into a persistent ApprovalDock pinned above the composer (neutral surface, tool + payload context, animated show/hide, inert while collapsed); the inline tool row now just marks 'Awaiting approval'. Harden the queue: a user stop voids the pending gate so a new message sends immediately instead of queuing, narrow isHitlPending to approval-requested (lockstep with the dock, avoids a queue-freeze trap), and release on a settled 'error' turn.
…ubble

The hover toolbar (metrics + copy/rewind/trace) is anchored to the bottom of a reserved lane below each message. The lane was pb-7 (28px) and the toolbar is ~28px tall, so it filled the lane and hugged the bubble text. Bump to pb-10 (40px) so the extra space falls between the bubble and the toolbar.
In Build mode the agent transcript renders each tool call as a full step — per-tool input and output/error as monospace blocks, expanded reasoning — gated on chatPanelMaximizedAtom; Chat mode keeps the calm collapsed summary.
… to target session

The inspector read the settle-only sessionMessagesAtom, so it showed a stale/wrong turn and never updated while streaming; and being mounted per session it popped a drawer per tab. Feed it the live useChat messages + sessionId as props, open only when it's the target session, and drop the AI SDK step-start/step-end boundary noise from the Timeline.
Each detailed tool step in Build mode now has its own caret toggle — click a step header to collapse/expand its input/output blocks (HeightCollapse), independent of the others. Default expanded.
…inimal in Chat

Replace the bare 'Ask a question…' text with a mode-adaptive empty state: Chat mode shows a warm welcome (robot mark + prompt); Build mode shows an agent-aware card (name, model, tool/skill counts, a one-line summary from the instructions) plus curated starter prompts that send on click.
…hange and tool collapse

Two reported jump-to-top bugs in the agent playground chat, plus perf hardening
of the scroll handler:

- onScroll only re-arms follow on a real scroll-DOWN-to-edge. A content shrink
  (tool gutter collapsing to "Used N tools", reasoning folding) clamps scrollTop
  to the new bottom and fires a non-gesture scroll event; a clamp only decreases
  scrollTop, so `> prevTop` rejects it. Previously that silently re-enabled follow
  and the next token snapped the min-h-full active turn to the top.
- Coalesce the costly jump-pill measurement (querySelectorAll + getBoundingClientRect)
  to one rAF/frame; keep the follow decision and SC-3 anchor synchronous. Removes
  per-scroll-event and per-render forced reflows during streaming.
- Dedup the follow-pin: guarded scrollToBottom so the ResizeObserver and the follow
  effect don't both write scrollTop for the same growth.
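
The two scroll rules above (re-arm follow only on a real scroll down to the edge, and coalesce the expensive measurement to one run per frame) can be sketched as pure helpers. The names `shouldRearmFollow`, `coalesce`, and the `EDGE_PX` tolerance are assumptions for this example, not the playground's actual code.

```typescript
interface ScrollSnapshot {
  scrollTop: number;
  scrollHeight: number;
  clientHeight: number;
}

const EDGE_PX = 4; // assumed tolerance for "at the bottom edge"

// Re-arm follow only when the user actually scrolled DOWN and reached the edge.
// A content shrink clamps scrollTop downward, so `scrollTop > prevTop` rejects it.
function shouldRearmFollow(prevTop: number, now: ScrollSnapshot): boolean {
  const atBottom = now.scrollHeight - now.scrollTop - now.clientHeight <= EDGE_PX;
  return atBottom && now.scrollTop > prevTop;
}

// Coalesce many triggers into a single scheduled run.
// `schedule` is injected so the helper works outside a browser as well.
function coalesce(run: () => void, schedule: (cb: () => void) => void): () => void {
  let queued = false;
  return () => {
    if (queued) return; // a run is already pending for this frame
    queued = true;
    schedule(() => {
      queued = false;
      run();
    });
  };
}
```

In a browser the scheduler would typically be `(cb) => requestAnimationFrame(cb)`, which keeps the measurement to at most one run per frame regardless of how many scroll events or renders fire.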
The Vercel adapter keyed a parked approval on the ACP display fields
(name -> title -> kind). A Claude tool has no ACP `name`, so the key was a
drift-prone display string: between the park turn and the re-raise the harness
could vary it, the cross-turn resume key silently stopped matching, and the gate
re-parked every turn (the HITL resume loop).

- _approval_tool_name / _tool_spec_of: prefer the resolved spec's canonical
  `name` (stable across cold-replay turns), falling back to the old chain when no
  spec is resolved. Mirrors the runner's permissionToolName precedence so the
  persisted key and the live re-raised key agree.
- tool-input-available now prefers `rawInput` over the often-empty `input`, so
  every tool-call path shows the real args (approve-empty-input bug).
- [HITL] ingress/egress info logs to diff the persisted key against the runner's
  live gate identity.

Covered by test_vercel_stream_park.py.
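
The key precedence described above can be sketched as follows. The SDK itself is Python; this TypeScript version is only an illustration, and the `ToolCallFields` shape, the function name `approvalToolName`, and the spec name "Bash" in the usage note are assumptions.

```typescript
// Hypothetical subset of the fields available when keying an approval.
interface ToolCallFields {
  name?: string;
  title?: string;
  kind?: string;
  spec?: { name?: string }; // resolved tool spec, when one was resolved
}

// Prefer the resolved spec's canonical name, then fall back to the
// display-field chain (name -> title -> kind).
function approvalToolName(call: ToolCallFields): string | undefined {
  const candidates = [call.spec?.name, call.name, call.title, call.kind];
  return candidates.find(
    (value) => typeof value === "string" && value.trim().length > 0,
  );
}
```

For example, a park-turn frame titled "Terminal" and a re-raised frame titled with the full command both key to "Bash" if the same spec with that name is resolved for each, so the stored decision and the live gate agree; without a resolved spec the helper falls back to the drift-prone display fields.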
Runner side of the approval re-park loop, matched to the SDK egress fix.

- permissionToolName / specOf: resolve the gated tool's key from the resolved
  spec's canonical `name` first (stable across cold-replay turns), then the ACP
  display fields. This is the same precedence the SDK egress persists, so the
  stored decision key and the live re-raised key agree instead of drifting apart.
- nonConvergingToolNames + loop-breaker: when a tool's {approved:true} envelopes
  outnumber its real executions by a threshold, the resume key never matched;
  DENY the next gate for it (a clean terminal failure the model stops re-issuing)
  instead of parking forever. Fail-safe under the key fix above.
- [HITL] ground-truth logging across permissions.ts / responder.ts /
  sandbox_agent.ts (ACP permission, gate hit/miss/park, stored resume state) to
  diff the persisted keys against the live gate identity field-by-field.

Covered by responder.test.ts.
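
A rough sketch of the loop-breaker logic follows. The counting structure, the threshold value, and the decision order (stored decision first, then the non-converging check, then park) are assumptions for illustration; only the names nonConvergingToolNames and the deny-instead-of-park behaviour come from the commit message.

```typescript
type Decision = "allow" | "deny" | "park";

// Hypothetical per-tool counters derived from the session history.
interface ToolCounts {
  approvedEnvelopes: number; // {approved:true} envelopes seen for the tool
  executions: number; // real executions of the tool
}

const DEFAULT_THRESHOLD = 2; // assumed value; the real threshold is not stated

// Tools whose approvals outnumber their executions by at least the threshold:
// the resume key evidently never matched for them.
function nonConvergingToolNames(
  counts: Map<string, ToolCounts>,
  threshold: number = DEFAULT_THRESHOLD,
): Set<string> {
  const names = new Set<string>();
  counts.forEach((count, name) => {
    if (count.approvedEnvelopes - count.executions >= threshold) names.add(name);
  });
  return names;
}

// Decide what to do when a gate is raised for `toolName`.
function gateDecision(
  toolName: string,
  stored: Map<string, "allow" | "deny">,
  nonConverging: Set<string>,
): Decision {
  const match = stored.get(toolName);
  if (match) return match; // resume key matched: honour the stored decision
  if (nonConverging.has(toolName)) return "deny"; // break the loop with a terminal failure
  return "park"; // first time: wait for a human
}
```

The deny branch is a fail-safe only: when the key matches, the stored decision is honoured before the non-converging check is ever consulted.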
…treatment with semantic elevation tokens

Introduce a surface ladder (app/gutter/raised/canvas/card/inset/chat) that separates the Build view's two workspaces (Configuration authoring panel vs Chat observing canvas) via elevation and containment instead of flat hue. The relationship inverts by theme: raised surfaces are lighter than the canvas in dark mode, white over soft-grey in light. Define semantic CSS classes (ag-panel-raised
# Conflicts:
#	web/oss/src/components/AgentChatSlice/components/AgentMessage.tsx
#	web/packages/agenta-ui/src/RichChatInput/assets/theme.ts
mmabrouk added a commit that referenced this pull request Jul 3, 2026
The live approve-loop is diagnosed: a constant stream messageId plus a
level-triggered resume predicate on the frontend (new finding M7), compounded
by tool-name drift across ACP frames breaking the decision key (M2's observed
form, not argument drift). Updated the explainer's live-warning section,
reframed M2 and added M7 in the code review, settled the fix direction in the
plan (direct replay of the approved call; absorb #5054's message-id fix and
edge-trigger guard; supersede its resolvedName patch and loop-breaker), and
recorded the #5054 recommendation in status.

Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
mmabrouk added a commit that referenced this pull request Jul 3, 2026
…e target path

Global policy becomes four explicit modes (allow|ask|deny|allow_reads):
read-only-allow is a policy choice, not a hidden per-tool default, and
needs_approval is deleted from the model. 'Disposition' renamed to 'effective
permission' everywhere. New 'target path' section shows the clean end-state
flow; resume is redesigned to replay the approved call directly. Corrected the
session-id story (the playground sends a stable per-conversation id). Added
the Pi-builtins explanation (selection is Pi's only native control). Plan
gains the stacked-on-#5054 baseline (keep the message-id fix and resume guard;
delete resolvedName and the loop-breaker) and updated phases/deltas. Status
consolidated for final review.

Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
@ardaerzin

@coderabbitai review

coderabbitai Bot commented Jul 3, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mmabrouk mmabrouk force-pushed the big-agents-work branch 2 times, most recently from 3253f42 to 8ab3070 July 3, 2026 13:16
@dosubot dosubot Bot added the lgtm label Jul 3, 2026
@ardaerzin ardaerzin closed this Jul 3, 2026
mmabrouk added a commit that referenced this pull request Jul 4, 2026
The live approve-loop is diagnosed: a constant stream messageId plus a
level-triggered resume predicate on the frontend (new finding M7), compounded
by tool-name drift across ACP frames breaking the decision key (M2's observed
form, not argument drift). Updated the explainer's live-warning section,
reframed M2 and added M7 in the code review, settled the fix direction in the
plan (direct replay of the approved call; absorb #5054's message-id fix and
edge-trigger guard; supersede its resolvedName patch and loop-breaker), and
recorded the #5054 recommendation in status.

Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
mmabrouk added a commit that referenced this pull request Jul 4, 2026
…e target path

Global policy becomes four explicit modes (allow|ask|deny|allow_reads):
read-only-allow is a policy choice, not a hidden per-tool default, and
needs_approval is deleted from the model. 'Disposition' renamed to 'effective
permission' everywhere. New 'target path' section shows the clean end-state
flow; resume is redesigned to replay the approved call directly. Corrected the
session-id story (the playground sends a stable per-conversation id). Added
the Pi-builtins explanation (selection is Pi's only native control). Plan
gains the stacked-on-#5054 baseline (keep the message-id fix and resume guard;
delete resolvedName and the loop-breaker) and updated phases/deltas. Status
consolidated for final review.

Claude-Session: https://claude.ai/code/session_01DGj7GKafjkZeQXMsryWhb2
Labels

feature, Frontend, lgtm, size:XXL
