Skip to content

fix(copilot): SDK streaming reliability, parallel tools, incremental saves, frontend reconnection#12173

Merged
majdyz merged 27 commits into
devfrom
fix/messed-up-copilot
Feb 20, 2026
Merged

fix(copilot): SDK streaming reliability, parallel tools, incremental saves, frontend reconnection#12173
majdyz merged 27 commits into
devfrom
fix/messed-up-copilot

Conversation

@majdyz
Copy link
Copy Markdown
Contributor

@majdyz majdyz commented Feb 20, 2026

Summary

Fixes multiple reliability issues in the copilot's Claude Agent SDK streaming pipeline — tool outputs getting stuck, parallel tool calls flushing prematurely, messages lost on page refresh, and SSE reconnection failures.

Changes

Backend: Streaming loop rewrite (sdk/service.py)

  • Non-cancelling heartbeat pattern: Replace asyncio.timeout() with asyncio.wait() for SDK message iteration. The old approach corrupted the SDK's internal anyio memory stream when timeouts fired mid-__anext__(), causing StopAsyncIteration on the next call and silently dropping all in-flight tool results.
  • Hook synchronization: Add wait_for_stash() before convert_message() — the SDK fires PostToolUse hooks via start_soon() (fire-and-forget), so the next message can arrive before the hook stashes its output. The new asyncio.Event-based mechanism bridges this gap without arbitrary sleeps.
  • Error handling: Add asyncio.CancelledError handling at both inner (streaming loop) and outer (session) levels, plus pending task cleanup in finally block to prevent leaked coroutines. Catch Exception from done.pop().result() for SDK error messages.
  • Safety-net flush: After streaming loop ends, flush any remaining unresolved tool calls so the frontend stops showing spinners even if the stream drops unexpectedly.
  • StreamFinish fallback: Emit StreamFinishStep + StreamFinish when stream ends without ResultMessage (StopAsyncIteration) so the frontend transitions to "ready" state.
  • Incremental session saves: Save session to PostgreSQL after each tool input/output event (not just at stream end), so page refresh and other devices see recent messages.
  • Enhanced logging: All log lines now include session_id[:12] prefix and tool call resolution state (unresolved/current/resolved counts).

Backend: Response adapter (sdk/response_adapter.py)

  • Parallel tool call support: Skip _flush_unresolved_tool_calls() when an AssistantMessage contains only ToolUseBlocks (parallel continuation) — the prior tools are still executing concurrently and haven't finished yet.
  • Duplicate output prevention: Skip already-resolved tool results in both UserMessage (ToolResultBlock) and parent_tool_use_id handling to prevent duplicate StreamToolOutputAvailable events.
  • has_unresolved_tool_calls property: Used by the streaming loop to decide whether to wait for PostToolUse hooks.
  • session_id parameter: Passed through for structured logging.

Backend: Hook synchronization (sdk/tool_adapter.py)

  • _stash_event ContextVar: asyncio.Event signaled by stash_pending_tool_output() whenever a PostToolUse hook stashes output.
  • wait_for_stash(): Awaits the event with configurable timeout — replaces the racy "hope the hook finished" approach.

Backend: Security hooks (sdk/security_hooks.py)

  • Enhanced logging in post_tool_use_hook — log whether tool is built-in, preview of stashed output, warning when tool_response is None.

Backend: Incremental save optimization (model.py)

  • existing_message_count parameter on upsert_chat_session: Skip the DB query to count existing messages when the caller already tracks this (streaming loop).
  • skip_existence_check parameter on _save_session_to_db: Skip the get_chat_session existence query when we know the session row already exists. Reduces from 4 DB round trips to 2 per incremental save.

Backend: SDK version bump (pyproject.toml, poetry.lock)

  • Bump claude-agent-sdk from ^0.1.0 to ^0.1.39.

Backend: New tests

  • sdk_compat_test.py (new file): SDK compatibility tests — verify the installed SDK exposes every class, attribute, and method the copilot integration relies on. Catches SDK upgrade breakage immediately.
  • response_adapter_test.py: 9 new tests covering flush-at-ResultMessage, flush-at-next-AssistantMessage, stashed output flush, wait_for_stash signaling/timeout/fast-path, parallel tool call non-premature-flush, text-message flush of prior tools, and already-resolved tool skip in UserMessage.

Frontend: Session hydration (convertChatSessionToUiMessages.ts)

  • isComplete option: When session has no active stream, mark dangling tool calls (no output in DB) as output-available with empty output — stops stale spinners after page refresh.

Frontend: Chat session hook (useChatSession.ts)

  • Reorder hasActiveStream memo before hydratedMessages so isComplete flag is available.
  • Pass { isComplete: !hasActiveStream } to convertChatSessionMessagesToUiMessages.

Frontend: Copilot page hook (useCopilotPage.ts)

  • Cache invalidation on stream end: Invalidate React Query session cache when stream transitions active → idle, so next hydration fetches fresh messages from backend (staleTime: Infinity otherwise keeps stale data).
  • Resume ref reset: Reset hasResumedRef on stream end to allow re-resume if SSE drops but backend task is still running.
  • Remove old resolveInProgressTools effect: Replaced by backend-side safety-net flush + hydration-time isComplete marking.

Test plan

  • Existing copilot tests pass (pytest backend/copilot/ -x -q)
  • SDK compat tests pass (pytest backend/copilot/sdk/sdk_compat_test.py -v)
  • Tool outputs (bash_exec, web_fetch, WebSearch) appear in the UI instead of getting stuck
  • Parallel tool calls (e.g. multiple WebSearch) complete and display results without premature flush
  • Page refresh during active stream reconnects and recovers messages
  • Opening session from another device shows recent tool results
  • SSE drop → automatic reconnection without losing messages
  • Long-running tools (create_agent) still delegate to background infrastructure

- Block Task tool's run_in_background param in security hooks — background
  agents stall the SSE stream and get killed when the main agent exits
- Add heartbeats (15s interval) to SDK streaming loop so proxies/LBs don't
  close idle SSE connections during long tool execution
- Fix summarization prompt that forced LLM to fabricate content for all 9
  mandatory sections; now sections are optional and hallucination is
  explicitly prohibited
- Include tool error/denial outcomes in conversation context formatting —
  previously all tool messages were dropped, so the agent couldn't see
  that security denials blocked its file writes and hallucinated success
…or detection

- Move run_in_background check before task_spawn_count increment so
  denied background Tasks don't consume a subtask slot
- Replace asyncio.wait_for() with asyncio.timeout() for heartbeat loop
  to avoid leaving the async generator in a broken state
- Tighten _is_tool_error_or_denial: remove overly broad markers
  ("error", "failed", "not found") that cause false positives; add
  markers for actual denial messages ("not supported", "maximum")
- Fix sandbox process kill: use start_new_session + os.killpg to kill
  the entire bwrap process group on timeout (proc.kill alone only kills
  the parent, leaving children running until natural completion)
- Add StreamToolInputAvailable/StreamToolOutputAvailable to publish_chunk
  logging filter so tool events are visible in Docker logs
- Add system prompt instruction telling Claude not to use
  run_in_background on Task tool (gets denied by security hooks)
- Add tool event debug logging in SDK streaming loop for tracing
  tool execution visibility issues
… tests

- Move task_spawn_count increment after limit check so denied spawns
  don't consume a slot (greptile feedback)
- Add "failed" marker to _is_tool_error_or_denial to catch internal
  tool execution failures from _mcp_error (coderabbit feedback)
- Add 17 unit tests for _is_tool_error_or_denial covering all markers,
  denial messages, and false-positive scenarios
- Replace brittle line-count check (< 3) in read_transcript_file with
  proper validate_transcript() which checks for actual user/assistant
  entries — avoids rejecting valid short transcripts while still
  filtering metadata-only files
- Add debug logging for transcript source selection and fallback path
  to aid diagnosing resume issues in Docker
- Make test_sdk_resume_multi_turn skip gracefully when the CLI doesn't
  produce usable transcripts (environment-dependent: CLI version,
  platform) instead of hard-failing
…sses

Catch OSError broadly (not just ProcessLookupError) when calling
os.killpg so that EPERM or other errors don't skip the subsequent
await proc.communicate(), which would leave the subprocess un-reaped.
…le spinner safety nets

- Elevate flush logging from debug to info/warning with structured messages
  showing tool names and IDs for production diagnostics
- Capture raw SDK output for transcript instead of relying on Stop hook
  file path (CLI doesn't write JSONL in SDK mode)
- Add _build_transcript() to reconstruct JSONL from captured entries
- Add isComplete option to hydration conversion — marks dangling tool calls
  as completed when session has no active stream (fixes stale spinners on
  page refresh)
- Add resolveInProgressTools safety net on streaming→ready transition
  (catches tool parts the backend didn't emit output for)
- Add 3 new tests for flush mechanism (ResultMessage, AssistantMessage,
  stashed output)
@majdyz majdyz requested a review from a team as a code owner February 20, 2026 02:13
@majdyz majdyz requested review from 0ubbe and kcze and removed request for a team February 20, 2026 02:13
@github-project-automation github-project-automation Bot moved this to 🆕 Needs initial review in AutoGPT development kanban Feb 20, 2026
@github-actions github-actions Bot added platform/frontend AutoGPT Platform - Front end platform/backend AutoGPT Platform - Back end conflicts Automatically applied to PRs with merge conflicts labels Feb 20, 2026
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR enhances the Copilot SDK's response handling and service layer by introducing session tracking, improving unresolved tool-call flushing with better logging, implementing asynchronous stash coordination for tool outputs, adding context compression and injection during conversations, and including comprehensive test coverage and SDK compatibility verification.

Changes

Cohort / File(s) Summary
Response Adapter Core
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
Added optional session_id parameter to SDKResponseAdapter; introduced has_unresolved_tool_calls property; enhanced _flush_unresolved_tool_calls with early-return guard, improved logging (INFO/WARNING levels), and unresolved tool pair computation.
Response Adapter Tests
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
Added six new unit tests covering flush behavior at ResultMessage and next AssistantMessage, stashed output handling, and wait_for_stash synchronization scenarios including timeout and pre-existing stash cases.
Service Layer Enhancements
autogpt_platform/backend/backend/copilot/sdk/service.py
Refactored _compress_messages to accept message lists; updated SDKResponseAdapter instantiation with session_id; introduced context injection flow with compression and approach logging; added race-condition mitigation via wait_for_stash; enhanced tool event logging and added StreamHeartbeat behavior.
Tool Adapter Stash Coordination
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py
Introduced asyncio-based stash signaling mechanism with ContextVar _stash_event and new public async function wait_for_stash(timeout) for coordinating tool-output availability across hook boundaries.
Frontend UI Updates
autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts, autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
Added hasActiveStream memo to detect active streams; updated convertChatSessionMessagesToUiMessages to accept isComplete flag, pushing output-available state for tool calls without output when session is complete.
SDK Dependencies & Compatibility
autogpt_platform/backend/pyproject.toml, autogpt_platform/backend/backend/copilot/sdk/sdk_compat_test.py
Updated claude-agent-sdk dependency from ^0.1.0 to ^0.1.35; added comprehensive SDK compatibility verification tests covering public API surface (types, methods, fields, hooks).

Sequence Diagram

sequenceDiagram
    participant Client
    participant Service as Service Layer
    participant Adapter as SDKResponseAdapter
    participant ToolAdapter as Tool Adapter
    participant Hook as PostToolUse Hook

    Client->>Service: Send query with session
    Service->>ToolAdapter: set_execution_context (init stash event)
    Service->>Adapter: Create SDKResponseAdapter(session_id)
    
    Note over ToolAdapter: Tool execution begins
    Hook->>ToolAdapter: stash_pending_tool_output()
    ToolAdapter->>ToolAdapter: Signal _stash_event
    
    Note over Service: Waiting for output
    Service->>ToolAdapter: wait_for_stash(timeout=0.5)
    ToolAdapter-->>Service: Event signaled (True)
    
    Service->>Adapter: _flush_unresolved_tool_calls()
    Adapter->>Adapter: Check has_unresolved_tool_calls
    Adapter->>Adapter: Emit StreamToolOutputAvailable
    Adapter->>Client: Return resolved stream events
    
    alt Timeout occurred
        Service->>ToolAdapter: wait_for_stash times out
        ToolAdapter-->>Service: False (timeout)
        Service->>Service: Log race condition warning
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

review-effort-4-complex

Suggested reviewers

  • 0ubbe
  • Pwuts
  • ntindle

Poem

🐰 Hoppy stashing streams so fine,
Tool calls flush in perfect time,
Context flows with heartbeat's beat,
Sessions tracked make flows complete,
Race conditions fade away,
SDK shines bright today!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: fixing SDK streaming reliability issues with tool output handling, parallel tools, incremental saves, and frontend reconnection.
Description check ✅ Passed The description comprehensively documents the changeset, covering backend streaming loop rewrites, hook synchronization, response adapter enhancements, frontend hydration fixes, dependency updates, and new tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/messed-up-copilot

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 20, 2026

Greptile Summary

Fixed critical bugs in SDK mode transcript capture and improved reliability of the Claude Code copilot integration:

  • Transcript capture refactored: The Claude CLI doesn't write JSONL files in SDK mode (--output-format stream-json). Previously attempted to read non-existent files. Now captures raw JSON from CLI stdout during streaming, constructs synthetic user entries, and concatenates previous transcripts for multi-turn --resume.
  • Heartbeat mechanism added: Implemented 15-second timeout with StreamHeartbeat during SDK streaming to keep SSE connections alive through proxies/load balancers during long tool execution.
  • Security hook counter fix: Moved task_spawn_count increment after the limit check so denied calls don't consume subtask slots. Background Task execution now blocked separately before counter increment.
  • Tool error detection improved: Added _is_tool_error_or_denial() to include tool errors/denials in conversation context compression, preventing agent from hallucinating success on blocked operations.
  • Response adapter logging enhanced: Better visibility into unresolved SDK tool call flushing with info/warning log levels and pre-computed unresolved lists.
  • Stale spinner fixes (frontend): Resolves in-progress tool parts when stream transitions to ready, marks dangling tool calls as completed on session hydration when no active stream present.
  • Sandbox cleanup improved: Kills entire process group with os.killpg() instead of just bwrap parent to prevent zombie child processes. Added OSError catch.
  • Summarization prompt refined: Reduced hallucination by making sections optional (9→7), adding anti-fabrication instructions, removing redundant sections.

Confidence Score: 4/5

  • This PR is safe to merge with low risk
  • Well-tested changes (59 passing tests) addressing critical bugs with clear implementation. Concerns include accessing private SDK internals (client._query) which could break with SDK updates, and the use of assert in production code that gets removed with -O optimization. The heartbeat mechanism and security hook fixes are solid improvements.
  • Pay close attention to autogpt_platform/backend/backend/copilot/sdk/service.py due to private API access and production assert usage

Sequence Diagram

sequenceDiagram
    participant Frontend
    participant Backend
    participant SDKService
    participant ClaudeCLI
    participant SecurityHooks
    participant TranscriptStore

    Frontend->>Backend: POST /chat (SSE)
    Backend->>SDKService: stream_chat_completion_sdk()
    
    SDKService->>TranscriptStore: download_transcript() (resume)
    TranscriptStore-->>SDKService: previous transcript JSONL
    
    SDKService->>ClaudeCLI: spawn CLI --output-format stream-json
    SDKService->>SecurityHooks: create_security_hooks()
    
    loop Streaming Loop
        ClaudeCLI-->>SDKService: raw JSON message
        SDKService->>SDKService: capture to raw_entries[]
        SDKService->>SDKService: parse & adapt to UI protocol
        
        alt Tool execution starts
            ClaudeCLI-->>SDKService: ToolUseBlock
            SDKService->>SecurityHooks: PreToolUse hook
            
            alt Background Task blocked
                SecurityHooks-->>SDKService: deny (no counter increment)
            else Subtask limit reached
                SecurityHooks-->>SDKService: deny (counter >= max)
            else Allowed
                SecurityHooks-->>SDKService: allow
                SecurityHooks->>SecurityHooks: task_spawn_count++
            end
        end
        
        alt 15s timeout (no message)
            SDKService->>Frontend: StreamHeartbeat
        end
        
        SDKService->>Frontend: StreamTextDelta/ToolInput/ToolOutput
    end
    
    ClaudeCLI-->>SDKService: StreamFinish
    
    SDKService->>SDKService: _build_transcript(raw_entries, user_msg, prev)
    SDKService->>TranscriptStore: upload_transcript() (for next resume)
    
    SDKService-->>Backend: complete
    Backend-->>Frontend: SSE close
Loading

Last reviewed commit: b1c5000

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py Outdated
Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
autogpt_platform/backend/backend/copilot/tools/sandbox.py (1)

249-272: Process-group kill is correct; comment could be more precise about --die-with-parent

start_new_session=True creates a new process group for bwrap (pgid = bwrap.pid), and os.killpg(proc.pid, SIGKILL) correctly sends SIGKILL to that group. However, bwrap's --new-session flag (in _build_bwrap_command, line 191) calls setsid() for the sandboxed command, giving it its own separate process group — so killpg(bwrap.pid) doesn't reach the sandboxed process directly. It is --die-with-parent (via prctl PR_SET_PDEATHSIG) that sends SIGKILL to the sandboxed process when bwrap dies. The comment at line 260–262 saying "bwrap + all children" implies killpg covers the children, which it doesn't in this setup.

📝 Suggested comment update
-            # Kill entire process group (bwrap + all children).
-            # proc.kill() alone only kills the bwrap parent, leaving
-            # children running until they finish naturally.
+            # Kill bwrap's process group. The sandboxed command runs in its
+            # own session (bwrap --new-session) so it is not in this group,
+            # but --die-with-parent (prctl PR_SET_PDEATHSIG) ensures it
+            # receives SIGKILL when bwrap dies.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/tools/sandbox.py` around lines 249 -
272, Update the misleading comment around the process-kill path: clarify that
start_new_session=True creates a new session for bwrap (see start_new_session)
and that os.killpg(proc.pid, signal.SIGKILL) sends SIGKILL to that bwrap process
group, but sandboxed child processes are placed in their own group by bwrap's
--new-session (see _build_bwrap_command) and thus are terminated via bwrap's
--die-with-parent/prctl(PR_SET_PDEATHSIG) rather than by killpg; replace the
"bwrap + all children" wording with a precise note that killpg targets bwrap's
group and that child processes rely on --die-with-parent to get killed when
bwrap exits.
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (1)

208-215: _hooks fixture couples to HookMatcher internal attribute — guard against SDK absence

The fixture accesses hooks["PreToolUse"][0].hooks[0], relying on HookMatcher having a public .hooks attribute. If claude_agent_sdk is unavailable, create_security_hooks returns {} and the "PreToolUse" key lookup raises KeyError. This is safe today because every caller of _hooks is guarded by @pytest.mark.skipif(not _sdk_available(), …), but adding a defensive guard inside the fixture would prevent confusing failures if a future test accidentally drops that decorator.

🛡️ Optional guard
 `@pytest.fixture`()
 def _hooks():
     """Create security hooks and return the PreToolUse handler."""
+    if not _sdk_available():
+        pytest.skip("claude_agent_sdk not installed")
     from .security_hooks import create_security_hooks

     hooks = create_security_hooks(user_id="u1", sdk_cwd=SDK_CWD, max_subtasks=2)
     pre = hooks["PreToolUse"][0].hooks[0]
     return pre
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py` around
lines 208 - 215, The _hooks fixture currently assumes create_security_hooks(...)
returns a dict with "PreToolUse" and that its first item exposes a .hooks list;
change _hooks to defensively handle SDK absence by checking the result of
create_security_hooks (e.g., hooks = create_security_hooks(...); if not hooks or
"PreToolUse" not in hooks: pytest.skip("claude_agent_sdk not available") or
return a no-op handler), and when accessing the matcher use safe access like
first_matcher = hooks.get("PreToolUse", []) and pre = getattr(first_matcher[0],
"hooks", [None])[0] if present, otherwise skip/return a no-op; reference
create_security_hooks and the _hooks fixture to locate and update the code.
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts (1)

173-179: Consider resolving in-progress tools on stream error, not only on "ready"

The current effect only fires on the "streaming" → "ready" transition. If the stream ends with status === "error" (e.g. network drop, backend exception), any tool parts still in "input-streaming" or "input-available" state will continue showing spinners until the user reloads the session.

♻️ Proposed fix
-    if (prev === "streaming" && status === "ready") {
+    if (prev === "streaming" && (status === "ready" || status === "error")) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/frontend/src/app/`(platform)/copilot/useCopilotPage.ts
around lines 173 - 179, The effect currently only resolves in-progress tool
parts when prevStatusRef transitions from "streaming" to "ready"; update the
condition in the useEffect (referencing useEffect, prevStatusRef, status,
setMessages, resolveInProgressTools) to also handle the "streaming" → "error"
transition (e.g., change the if to check prev === "streaming" && (status ===
"ready" || status === "error") or otherwise detect prev === "streaming" &&
status !== "streaming") so that resolveInProgressTools is invoked when the
stream fails and spinners are cleared.
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)

245-288: flushed is now always True — consider simplifying.

Since unresolved is guaranteed non-empty (early return on line 236), and every iteration unconditionally sets flushed = True, the variable is always True after the loop. You can drop it and simplify the tail.

♻️ Proposed simplification
-        flushed = False
         for tool_id, tool_name in unresolved:
             output = pop_pending_tool_output(tool_name)
             if output is not None:
@@ ..
                 self.resolved_tool_calls.add(tool_id)
-                flushed = True
                 logger.info(...)
             else:
@@ ..
                 self.resolved_tool_calls.add(tool_id)
-                flushed = True
                 logger.warning(...)

-        if flushed and self.step_open:
+        if self.step_open:
             responses.append(StreamFinishStep())
             self.step_open = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter.py` around
lines 245 - 288, The loop always sets flushed=True, so remove the redundant
flushed variable and its uses: in the block iterating over unresolved (and using
pop_pending_tool_output, StreamToolOutputAvailable, self.resolved_tool_calls),
drop flushed assignments and the final if flushed and self.step_open check;
instead, after the loop if unresolved was non-empty (the caller already
guarantees this) and self.step_open is True, append StreamFinishStep() and set
self.step_open = False; ensure all current behavior (adding
StreamToolOutputAvailable entries and logging) remains unchanged.
autogpt_platform/backend/backend/copilot/sdk/service.py (1)

421-443: Markers in _is_tool_error_or_denial are intentionally broad and tested against real denial messages.

The markers like "failed", "maximum", and "not supported" are designed to detect system-generated denial and error messages from security hooks (e.g., "Maximum {N} sub-tasks per session..." and "Background task execution is not supported...") and tool execution failures. While they could theoretically match unrelated content, the function is applied to tool outputs (structured MCP data or natural language), and false positives only add extra context lines to conversation history—they don't break functionality. The test suite (TestIsToolErrorOrDenial) validates these markers against real denial messages and includes negative tests to ensure benign content is correctly excluded.

If higher precision is desired, refactoring to anchor broad markers to more specific patterns (e.g., "is not supported" or "execution failed") could reduce noise, but this is optional given the acceptable trade-off between recall and precision for the use case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 421 -
443, The current broad substring markers in _is_tool_error_or_denial (the tuple
literal containing "failed", "maximum", "not supported", etc.) can produce false
positives; tighten them by replacing generic tokens with more specific anchored
patterns (e.g., change "not supported" -> "is not supported" or "background task
execution is not supported", "failed" -> "execution failed" or '"iserror": true'
stays as-is, and "maximum" -> patterns like "maximum .* sub-task" or "maximum .*
sub-tasks per session") and update the membership check to test these more
specific phrases against content.lower() so the function still operates on tool
outputs but reduces accidental matches.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0818cd6 and b1c5000.

📒 Files selected for processing (13)
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/stream_registry.py
  • autogpt_platform/backend/backend/copilot/tools/sandbox.py
  • autogpt_platform/backend/backend/util/prompt.py
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
🧰 Additional context used
📓 Path-based instructions (15)
autogpt_platform/frontend/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

autogpt_platform/frontend/**/*.{ts,tsx,js,jsx}: Use Node.js 21+ with pnpm package manager for frontend development
Always run 'pnpm format' for formatting and linting code in frontend development

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{tsx,ts}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

autogpt_platform/frontend/**/*.{tsx,ts}: Use function declarations for components and handlers (not arrow functions) in React components
Only use arrow functions for small inline lambdas (map, filter, etc.) in React components
Use PascalCase for component names and camelCase with 'use' prefix for hook names in React
Use Tailwind CSS utilities only for styling in frontend components
Use design system components from 'src/components/' (atoms, molecules, organisms) in frontend development
Never use 'src/components/legacy/' in frontend code
Only use Phosphor Icons (@phosphor-icons/react) for icons in frontend components
Use generated API hooks from '@/app/api/generated/endpoints/' instead of deprecated 'BackendAPI' or 'src/lib/autogpt-server-api/
'
Use React Query for server state (via generated hooks) in frontend development
Default to client components ('use client') in Next.js; only use server components for SEO or extreme TTFB needs
Use '' component for rendering errors in frontend UI; use toast notifications for mutation errors; use 'Sentry.captureException()' for manual exceptions
Separate render logic from data/behavior in React components; keep comments minimal (code should be self-documenting)

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

autogpt_platform/frontend/**/*.{ts,tsx}: No barrel files or 'index.ts' re-exports in frontend code
Regenerate API hooks with 'pnpm generate:api' after backend OpenAPI spec changes in frontend development

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (autogpt_platform/frontend/CLAUDE.md)

autogpt_platform/frontend/src/**/*.{ts,tsx}: Fully capitalize acronyms in symbols, e.g. graphID, useBackendAPI
Use function declarations (not arrow functions) for components and handlers
Separate render logic (.tsx) from business logic (use*.ts hooks)
Use shadcn/ui (Radix UI primitives) with Tailwind CSS styling for UI components
Use Phosphor Icons only for icons
Use ErrorCard for render errors, toast for mutations, and Sentry for exceptions
Use design system components from src/components/ (atoms, molecules, organisms)
Never use src/components/__legacy__/* components
Use generated API hooks from @/app/api/__generated__/endpoints/ with pattern use{Method}{Version}{OperationName}
Use Tailwind CSS only for styling, with design tokens
Do not use useCallback or useMemo unless asked to optimize a given function
Never type with any unless a variable/attribute can ACTUALLY be of any type

autogpt_platform/frontend/src/**/*.{ts,tsx}: Structure components as ComponentName/ComponentName.tsx + useComponentName.ts + helpers.ts and use design system components from src/components/ (atoms, molecules, organisms)
Use generated API hooks from @/app/api/__generated__/endpoints/ with pattern use{Method}{Version}{OperationName} and regenerate with pnpm generate:api
Use function declarations (not arrow functions) for components and handlers
Separate render logic from business logic with component.tsx + useComponent.ts + helpers.ts structure
Colocate state when possible, avoid creating large components, use sub-components in local /components folder
Avoid large hooks, abstract logic into helpers.ts files when sensible
Use arrow functions only for callbacks, not for component declarations
Avoid comments at all times unless the code is very complex
Do not use useCallback or useMemo unless asked to optimize a given function

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/src/**/*use*.ts

📄 CodeRabbit inference engine (autogpt_platform/frontend/CLAUDE.md)

Do not type hook returns, let TypeScript infer as much as possible

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{js,jsx,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

autogpt_platform/frontend/**/*.{js,jsx,ts,tsx}: Format frontend code using pnpm format
Never use components from src/components/__legacy__/*

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{js,jsx,ts,tsx,css}

📄 CodeRabbit inference engine (AGENTS.md)

Use Tailwind CSS only for styling, use design tokens, and use Phosphor Icons only

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Do not type hook returns, let Typescript infer as much as possible

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Never type with any, if no types available use unknown

Files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts
  • autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/backend/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development

Files:

  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/stream_registry.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/tools/sandbox.py
  • autogpt_platform/backend/backend/util/prompt.py
autogpt_platform/backend/**/*.{py,txt}

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use poetry run prefix for all Python commands, including testing, linting, formatting, and migrations

Files:

  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/stream_registry.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/tools/sandbox.py
  • autogpt_platform/backend/backend/util/prompt.py
autogpt_platform/backend/**/*_test.py

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

autogpt_platform/backend/**/*_test.py: Always review snapshot changes with git diff before committing when updating snapshots with poetry run pytest --snapshot-update
Use pytest with snapshot testing for API responses in test files
Colocate test files with source files using the *_test.py naming convention

Files:

  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
autogpt_platform/backend/backend/**/*.py

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings

Files:

  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/stream_registry.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/tools/sandbox.py
  • autogpt_platform/backend/backend/util/prompt.py
autogpt_platform/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Format Python code with poetry run format

Files:

  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/stream_registry.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/tools/sandbox.py
  • autogpt_platform/backend/backend/util/prompt.py
autogpt_platform/backend/**/*test*.py

📄 CodeRabbit inference engine (AGENTS.md)

Run poetry run test for backend testing (runs pytest with docker based postgres + prisma)

Files:

  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧠 Learnings (2)
📚 Learning: 2026-02-04T16:50:33.615Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/frontend/CLAUDE.md:0-0
Timestamp: 2026-02-04T16:50:33.615Z
Learning: Use React Query for server state, co-located UI state in components/hooks

Applied to files:

  • autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/test/**/*.py : Use snapshot testing with '--snapshot-update' flag in backend tests when output changes; always review with 'git diff'

Applied to files:

  • autogpt_platform/backend/backend/copilot/service_test.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧬 Code graph analysis (8)
autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts (1)
autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts (1)
  • convertChatSessionMessagesToUiMessages (58-139)
autogpt_platform/backend/backend/copilot/service_test.py (1)
autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
  • download_transcript (402-453)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)
autogpt_platform/backend/backend/util/logging.py (2)
  • info (41-43)
  • warning (45-47)
autogpt_platform/backend/backend/copilot/tools/base.py (1)
  • name (20-22)
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (1)
autogpt_platform/backend/backend/util/logging.py (1)
  • info (41-43)
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (2)
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (3)
  • _validate_tool_access (83-118)
  • _validate_user_isolation (121-142)
  • create_security_hooks (145-333)
autogpt_platform/backend/backend/copilot/sdk/service.py (1)
  • _is_tool_error_or_denial (421-443)
autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
autogpt_platform/backend/backend/util/logging.py (1)
  • debug (53-55)
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (3)
autogpt_platform/backend/backend/copilot/response_model.py (2)
  • StreamBaseResponse (43-50)
  • StreamToolOutputAvailable (150-173)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)
  • convert_message (64-205)
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (1)
  • stash_pending_tool_output (116-136)
autogpt_platform/backend/backend/copilot/tools/sandbox.py (1)
autogpt_platform/backend/backend/util/logging.py (1)
  • warning (45-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Seer Code Review
  • GitHub Check: Greptile Review
🔇 Additional comments (14)
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (1)

269-334: TestIsToolErrorOrDenial — thorough coverage, LGTM

The class exercises every marker in _is_tool_error_or_denial including the case-insensitive JSON field ('"isError": true' → lowercased match), benign look-alikes ("0 errors found", "permission_level", "file-not-found-handler.py"), and both None/empty inputs. Coverage is complete.

autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts (1)

108-117: Stale-spinner fix is clean and consistent with resolveInProgressTools

The output: "" sentinel for completed-but-missing tool outputs matches the value used in useCopilotPage.ts's resolveInProgressTools. The 3-way branch (output present → use output; session complete → empty output-available; streaming → input-available) correctly represents all reachable states.

autogpt_platform/backend/backend/util/prompt.py (1)

538-565: Anti-fabrication instructions and streamlined section list are good improvements

The anti-hallucination guard ("CRITICAL: Only include…EXPLICITLY present") combined with "skip sections with nothing to report" should meaningfully reduce fabricated summaries. Collapsing 9 → 7 sections and replacing the separate "Current Work"/"Next Steps" with a single "Current State" avoids speculative forward-looking content.

autogpt_platform/backend/backend/copilot/stream_registry.py (1)

228-238: LGTM — tool-event timing logs are consistent with _reconstruct_chunk class names

The two added type names (StreamToolInputAvailable, StreamToolOutputAvailable) match the imports in _reconstruct_chunk and align with the new tool-output streaming introduced in the response adapter.

autogpt_platform/backend/backend/copilot/service_test.py (1)

147-151: Graceful skip for missing transcript is correct for environment-dependent test

return pytest.skip(...) mid-test is the standard pytest idiom when a skip condition can only be determined at runtime. The 5 s polling window is reasonable. The test still fully validates both turns when a transcript is present.

autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts (1)

20-30: resolveInProgressTools — clean implementation

Function declaration ✓, guard with "state" in part before accessing part.state ✓, output: "" consistent with the isComplete path in convertChatSessionToUiMessages.ts ✓.

autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)

140-147: Semantic transcript validation is a clear improvement over the old line-count heuristic

Checking for at least one "user" and one "assistant" entry correctly rejects metadata-only files (e.g., only queue-operation entries) regardless of line count. The safe-fail behavior on any json.JSONDecodeError is intentional — a truncated last line from a mid-write SIGKILL is safer to discard than to resume with.

autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts (1)

43-62: Looks good — hydration correctly marks dangling tool calls as complete.

The hasActiveStream derivation and isComplete flag flow correctly to convertChatSessionMessagesToUiMessages, resolving stale spinners when a session has no active stream. The dependency arrays are correct (ESLint exhaustive-deps would require hasActiveStream even though it's derived from sessionQuery.data).

Minor note per coding guidelines: the comments on lines 43–45 and 53–54 are helpful for explaining why the isComplete flag exists, though the guidelines suggest keeping comments minimal. Given this is a non-obvious behavioral fix, they're justified here.

autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (1)

190-214: Counter-order fix and background-task gating look correct.

The reordering — block background first, check limit second, increment only on success — ensures denied calls never consume a subtask slot. Each session gets its own closure, so the non-atomic task_spawn_count is safe within the single-session async flow.

autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)

231-237: Good improvement — early return and pre-computed unresolved list.

The early return avoids unnecessary logging and iteration. The list comprehension cleanly identifies only unresolved calls.

autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (1)

372-514: Solid test coverage for the three flush scenarios.

The tests exercise the key paths: flush at ResultMessage, flush at next AssistantMessage, and flush with stashed output. Event sequences and assertions are correct.

autogpt_platform/backend/backend/copilot/sdk/service.py (3)

855-913: _build_transcript — clean reconstruction logic.

The concatenation of previous transcript + synthetic user entry + new entries produces valid JSONL, and the validate_transcript guard prevents uploading corrupt data. The synthetic user entry structure with optional parentUuid is a pragmatic approach.

One observation: previous_transcript.rstrip("\n") on line 886 ensures no double newlines when joining, which is correct for JSONL.


86-87: Heartbeat interval constant and integration look correct.

15 seconds is a reasonable SSE keep-alive interval that falls well under typical proxy/LB timeouts (usually 30–60s). The asyncio.timeout + StreamHeartbeat pattern on lines 690–694 is clean.


66-81: CapturedTranscript refactored for raw entry capture — looks good.

The shift from file-path-based transcript tracking to in-memory raw_entries collection is the right approach given the CLI doesn't write JSONL in SDK mode.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 686-687: Replace the bare assert on client._query with an explicit
runtime check: verify client._query is not None before using it and raise a
descriptive exception (e.g., ValueError or RuntimeError) that explains the SDK
internal state is invalid and suggests calling connect() first; then proceed to
set msg_iter = client._query.receive_messages().__aiter__() as before. Ensure
the check references client._query and that the error message clearly identifies
the failing condition and preferred caller action.
- Around line 682-707: The dependency constraint for claude-agent-sdk is too
loose given we rely on private internals (client._query and
claude_agent_sdk._internal.message_parser.parse_message); update pyproject.toml
to pin the package to the exact tested release (e.g., 0.1.35) or a very narrow
range (e.g., ==0.1.35 or >=0.1.35,<0.1.36) and add a short comment in the file
referencing client._query and parse_message to mark this as tech debt requiring
audit when bumping the pinned version.

---

Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter.py`:
- Around line 245-288: The loop always sets flushed=True, so remove the
redundant flushed variable and its uses: in the block iterating over unresolved
(and using pop_pending_tool_output, StreamToolOutputAvailable,
self.resolved_tool_calls), drop flushed assignments and the final if flushed and
self.step_open check; instead, after the loop if unresolved was non-empty (the
caller already guarantees this) and self.step_open is True, append
StreamFinishStep() and set self.step_open = False; ensure all current behavior
(adding StreamToolOutputAvailable entries and logging) remains unchanged.

In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py`:
- Around line 208-215: The _hooks fixture currently assumes
create_security_hooks(...) returns a dict with "PreToolUse" and that its first
item exposes a .hooks list; change _hooks to defensively handle SDK absence by
checking the result of create_security_hooks (e.g., hooks =
create_security_hooks(...); if not hooks or "PreToolUse" not in hooks:
pytest.skip("claude_agent_sdk not available") or return a no-op handler), and
when accessing the matcher use safe access like first_matcher =
hooks.get("PreToolUse", []) and pre = getattr(first_matcher[0], "hooks",
[None])[0] if present, otherwise skip/return a no-op; reference
create_security_hooks and the _hooks fixture to locate and update the code.

In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 421-443: The current broad substring markers in
_is_tool_error_or_denial (the tuple literal containing "failed", "maximum", "not
supported", etc.) can produce false positives; tighten them by replacing generic
tokens with more specific anchored patterns (e.g., change "not supported" -> "is
not supported" or "background task execution is not supported", "failed" ->
"execution failed" or '"iserror": true' stays as-is, and "maximum" -> patterns
like "maximum .* sub-task" or "maximum .* sub-tasks per session") and update the
membership check to test these more specific phrases against content.lower() so
the function still operates on tool outputs but reduces accidental matches.

In `@autogpt_platform/backend/backend/copilot/tools/sandbox.py`:
- Around line 249-272: Update the misleading comment around the process-kill
path: clarify that start_new_session=True creates a new session for bwrap (see
start_new_session) and that os.killpg(proc.pid, signal.SIGKILL) sends SIGKILL to
that bwrap process group, but sandboxed child processes are placed in their own
group by bwrap's --new-session (see _build_bwrap_command) and thus are
terminated via bwrap's --die-with-parent/prctl(PR_SET_PDEATHSIG) rather than by
killpg; replace the "bwrap + all children" wording with a precise note that
killpg targets bwrap's group and that child processes rely on --die-with-parent
to get killed when bwrap exits.

In `@autogpt_platform/frontend/src/app/`(platform)/copilot/useCopilotPage.ts:
- Around line 173-179: The effect currently only resolves in-progress tool parts
when prevStatusRef transitions from "streaming" to "ready"; update the condition
in the useEffect (referencing useEffect, prevStatusRef, status, setMessages,
resolveInProgressTools) to also handle the "streaming" → "error" transition
(e.g., change the if to check prev === "streaming" && (status === "ready" ||
status === "error") or otherwise detect prev === "streaming" && status !==
"streaming") so that resolveInProgressTools is invoked when the stream fails and
spinners are cleared.

Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py Outdated
Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py Outdated
… logging

- Add 100ms yield before flush when unresolved tool calls exist, giving
  PostToolUse hooks time to complete before the stash is checked. This
  mitigates the race condition in claude_agent_sdk where hooks are
  fire-and-forget (start_soon) while messages arrive immediately.
- Add has_unresolved_tool_calls property to SDKResponseAdapter
- Differentiate empty flush warnings to flag likely race conditions
- Add session_id to all SDK log messages ([SDK] [<session>] ...)
- Log session approach (resume/compression/single-turn) with context sizes
- Elevate session save log from debug to info
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
autogpt_platform/backend/backend/copilot/sdk/service.py (2)

732-740: Timing-based race mitigation is inherently fragile.

The 100ms asyncio.sleep is a best-effort workaround for the hook race, and won't reliably cover cases where the PostToolUse hook takes longer (e.g., network latency in the stash path). The adapter's fallback (emit empty output + WARNING) provides a safety net, so this isn't a correctness issue, but it may produce noisy logs under load.

Consider a more deterministic approach if this becomes frequent — e.g., an asyncio.Event set by the hook that the flush path awaits with a short timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 732 -
740, The current 100ms asyncio.sleep in Query._read_messages to mitigate a race
with PostToolUse hooks is fragile; replace it with a deterministic wait: add an
asyncio.Event (e.g., adapter.post_tool_use_done_event) that the PostToolUse hook
sets when its stash/output is complete, then in the flush path (where
adapter.has_unresolved_tool_calls and isinstance(sdk_msg, (AssistantMessage,
ResultMessage))) await that event with a short timeout using asyncio.wait_for
(fallback to current behavior if timed out) so the flush deterministically waits
for the hook to finish instead of relying on sleep.

855-860: Minor logging inconsistency: full session_id vs truncated.

Other log lines in this function consistently use session_id[:12] for brevity, but this warning uses the full session_id. Consider truncating for consistency.

                     logger.warning(
-                        "[SDK] No transcript to upload for %s " "(%d captured entries)",
-                        session_id,
+                        "[SDK] [%s] No transcript to upload (%d captured entries)",
+                        session_id[:12],
                         len(captured_transcript.raw_entries),
                     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 855 -
860, The warning log is inconsistent because it logs the full session_id while
other logs use a truncated form; update the logger.warning call that references
session_id and captured_transcript.raw_entries to use session_id[:12] (same
truncation used elsewhere in this function) so the message reads e.g.
logger.warning("[SDK] No transcript to upload for %s (%d captured entries)",
session_id[:12], len(captured_transcript.raw_entries)).
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b1c5000 and e476185.

📒 Files selected for processing (2)
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
🧰 Additional context used
📓 Path-based instructions (4)
autogpt_platform/backend/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development

Files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/**/*.{py,txt}

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use poetry run prefix for all Python commands, including testing, linting, formatting, and migrations

Files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/backend/**/*.py

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings

Files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Format Python code with poetry run format

Files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
🧬 Code graph analysis (1)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)
autogpt_platform/backend/backend/util/logging.py (2)
  • info (41-43)
  • warning (45-47)
autogpt_platform/backend/backend/copilot/tools/base.py (1)
  • name (20-22)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Seer Code Review
🔇 Additional comments (8)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)

50-68: Clean additions for session awareness and unresolved-call tracking.

The session_id parameter and has_unresolved_tool_calls property are well-designed. Using dict.keys() - set for the set difference is idiomatic and efficient.


237-300: Good early-return guard and improved logging granularity.

The pre-computed unresolved list avoids repeated set-difference checks inside the loop, and the early return when nothing to flush is a nice short-circuit. INFO vs. WARNING distinction correctly differentiates normal flushes from race-condition symptoms.

One minor note: after the early-return guard, flushed (line 253) is now always set to True by at least one loop iteration, making the if flushed check on line 298 technically redundant. Not worth changing — it's a safe defensive guard.

autogpt_platform/backend/backend/copilot/sdk/service.py (6)

66-88: Clean refactor of CapturedTranscript to in-memory capture.

Switching from file-path-based transcript handling to in-memory raw_entries is a good simplification that avoids filesystem I/O race conditions. The 15-second heartbeat interval is a reasonable default for keeping SSE alive through typical proxy/LB timeouts (often 30–60s).


597-598: LGTM — adapter instantiation correctly passes session_id.

Aligns with the updated SDKResponseAdapter.__init__ signature in response_adapter.py.


705-713: Heartbeat streaming loop is well-structured.

Using asyncio.timeout per-iteration with an explicit async iterator is the correct pattern for Python 3.11+. The comment explaining why asyncio.timeout is preferred over asyncio.wait_for (Task cancellation leaving the generator broken) is a valuable implementation note.


894-952: _build_transcript is well-guarded with validation.

The function correctly handles edge cases (empty entries → None, invalid transcript → None with warning). Concatenating previous transcript + synthetic user entry + new entries produces a proper JSONL sequence. The trailing-newline handling (rstrip("\n") on previous + final + "\n") ensures consistent formatting.


930-937: The synthetic user entry format is correct; concerns about format-related resume failures are not founded.

The validate_transcript() function only checks for the presence of type: "user" and type: "assistant" entries—it does not enforce strict field schema validation. Extra fields like session_id are safely ignored. Additionally, the codebase explicitly handles empty parentUuid values (defaulting to "" via .get("parentUuid", "") throughout transcript.py), and the reparenting logic in strip_progress_entries() correctly walks up chains with empty strings. A comment at service.py:929 confirms "The uuid/parentUuid fields are optional for --resume," indicating this design is intentional.

Likely an incorrect or invalid review comment.


421-443: The function already handles the false positive scenarios you're concerned about. The test suite explicitly validates that benign content like "0 errors found" or filenames like "file-not-found-handler.py" return False, while legitimate errors and denials return True. The markers ("failed", "maximum", "not supported") appear only in actual error/denial messages from the security hooks, not in arbitrary tool output. No refactoring is needed.

Likely an incorrect or invalid review comment.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 698-704: The code illegally depends on private internals: remove
the import from claude_agent_sdk._internal.message_parser and the direct access
to client._query; instead use the SDK's public APIs (e.g., call the public
receive_messages/stream method exposed on the client or client.query object and
the public message parsing/constructors such as AssistantMessage/ResultMessage
or any public parse_message API) to get an async iterator and parse messages.
Replace uses of _parse_sdk_msg and client._query with the SDK's documented
methods (e.g., client.receive_messages() or client.query.receive_messages() and
AssistantMessage/ResultMessage parsing utilities) so the code no longer
references _parse_sdk_msg or client._query.
- Around line 703-704: Replace the bare assert on client._query with an explicit
runtime check: if client._query is None raise a clear exception (e.g.,
RuntimeError or a custom error) with a message indicating that connect() must be
called before iterating messages; then proceed to call
client._query.receive_messages().__aiter__() to obtain msg_iter. This change
should occur where client._query is referenced so the error is descriptive and
not stripped by -O.

---

Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 732-740: The current 100ms asyncio.sleep in Query._read_messages
to mitigate a race with PostToolUse hooks is fragile; replace it with a
deterministic wait: add an asyncio.Event (e.g.,
adapter.post_tool_use_done_event) that the PostToolUse hook sets when its
stash/output is complete, then in the flush path (where
adapter.has_unresolved_tool_calls and isinstance(sdk_msg, (AssistantMessage,
ResultMessage))) await that event with a short timeout using asyncio.wait_for
(fallback to current behavior if timed out) so the flush deterministically waits
for the hook to finish instead of relying on sleep.
- Around line 855-860: The warning log is inconsistent because it logs the full
session_id while other logs use a truncated form; update the logger.warning call
that references session_id and captured_transcript.raw_entries to use
session_id[:12] (same truncation used elsewhere in this function) so the message
reads e.g. logger.warning("[SDK] No transcript to upload for %s (%d captured
entries)", session_id[:12], len(captured_transcript.raw_entries)).

… pin, event-based stash

- Replace bare `assert client._query` with proper RuntimeError check
- Add TECH DEBT comments on private SDK internal usage
- Pin claude-agent-sdk to ~0.1.35 (tighter constraint for private API access)
- Replace sleep(0.1) with event-based wait_for_stash() for race-condition fix
- Add wait_for_stash synchronisation tests
@majdyz majdyz changed the title fix(copilot): SDK transcript capture, security hooks, and stale spinner fixes fix(copilot): SDK reliability — transcript capture, race-condition fix, stale spinner resolution Feb 20, 2026
Resolve service.py conflicts: take dev's file-based transcript approach
(CapturedTranscript.path + read_transcript_file) and public client API,
layer our fixes on top (wait_for_stash race-condition fix, session_id
logging, approach logging).
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (1)

525-545: Add try/finally to async test cleanup to prevent ContextVar state leakage on failure.

All three async wait-for-stash tests reset _pto and _stash_event at the end of the function body, but this cleanup is skipped if an assertion fails before it's reached. A failed test could leave _pto set to {} (instead of the None default) and _stash_event set to a live Event, which may affect sync tests that follow.

♻️ Proposed fix — wrap body in try/finally for each async test
 `@pytest.mark.asyncio`
 async def test_wait_for_stash_signaled():
     """wait_for_stash returns True when stash_pending_tool_output signals."""
     _pto.set({})
     event = asyncio.Event()
     _stash_event.set(event)

-    # Simulate a PostToolUse hook that stashes output after a short delay
-    async def delayed_stash():
-        await asyncio.sleep(0.01)
-        _stash("WebSearch", "result data")
-
-    asyncio.create_task(delayed_stash())
-    result = await wait_for_stash(timeout=1.0)
-
-    assert result is True
-    assert _pto.get({}).get("WebSearch") == ["result data"]
-
-    # Cleanup
-    _pto.set({})  # type: ignore[arg-type]
-    _stash_event.set(None)
+    try:
+        async def delayed_stash():
+            await asyncio.sleep(0.01)
+            _stash("WebSearch", "result data")
+
+        asyncio.create_task(delayed_stash())
+        result = await wait_for_stash(timeout=1.0)
+
+        assert result is True
+        assert _pto.get({}).get("WebSearch") == ["result data"]
+    finally:
+        _pto.set({})  # type: ignore[arg-type]
+        _stash_event.set(None)

Apply the same try/finally pattern to test_wait_for_stash_timeout and test_wait_for_stash_already_stashed.

Also applies to: 548-560, 563-587

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py` around
lines 525 - 545, The async tests test_wait_for_stash_signaled,
test_wait_for_stash_timeout, and test_wait_for_stash_already_stashed leak
ContextVar state on failure because they perform cleanup at the end of the test
body; wrap each test's main body in a try/finally and move the cleanup into the
finally block so _pto and _stash_event are always reset (use _pto.set(None) and
_stash_event.set(None) to restore the original defaults) and keep the existing
simulated stash logic (delayed_stash/task creation) and assertions inside the
try.
autogpt_platform/backend/backend/copilot/sdk/service.py (1)

67-82: sdk_session_id field in CapturedTranscript is never read after assignment.

The field is set in the _on_stop callback (line 551), but no code path reads it afterwards. The _build_transcript method uses a separate session_id parameter instead. Remove this unused field to reduce confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 67 -
82, CapturedTranscript defines an unused field sdk_session_id which is set in
the _on_stop callback but never read (the _build_transcript path uses its
session_id parameter instead); remove the sdk_session_id attribute from the
CapturedTranscript dataclass and delete any assignments to it in the _on_stop
callback to avoid dead state, making sure to keep _build_transcript usage
unchanged and verify no other code references sdk_session_id.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e476185 and 78b52b9.

⛔ Files ignored due to path filters (1)
  • autogpt_platform/backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py
  • autogpt_platform/backend/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (6)
autogpt_platform/backend/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development

Files:

  • autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/**/*.{py,txt}

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use poetry run prefix for all Python commands, including testing, linting, formatting, and migrations

Files:

  • autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/backend/**/*.py

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings

Files:

  • autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Format Python code with poetry run format

Files:

  • autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py
  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/**/*_test.py

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

autogpt_platform/backend/**/*_test.py: Always review snapshot changes with git diff before committing when updating snapshots with poetry run pytest --snapshot-update
Use pytest with snapshot testing for API responses in test files
Colocate test files with source files using the *_test.py naming convention

Files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
autogpt_platform/backend/**/*test*.py

📄 CodeRabbit inference engine (AGENTS.md)

Run poetry run test for backend testing (runs pytest with docker based postgres + prisma)

Files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧠 Learnings (7)
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/*.py : Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development

Applied to files:

  • autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-02-04T16:50:20.508Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/backend/CLAUDE.md:0-0
Timestamp: 2026-02-04T16:50:20.508Z
Learning: Applies to autogpt_platform/backend/**/*.{py,txt} : Use `poetry run` prefix for all Python commands, including testing, linting, formatting, and migrations

Applied to files:

  • autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-01-31T18:44:56.328Z
Learnt from: Bentlybro
Repo: Significant-Gravitas/AutoGPT PR: 0
File: :0-0
Timestamp: 2026-01-31T18:44:56.328Z
Learning: In the AutoGPT backend Docker build with Poetry path dependencies, `autogpt_libs` requires a double-copy pattern: first COPY from builder brings the installed package state (so venv references work), then second COPY from build context overwrites with latest source (to ensure freshness). Both copies are necessary for correct functionality.

Applied to files:

  • autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/schema.prisma : Run database migrations with 'poetry run prisma migrate dev' and 'poetry run prisma generate' after schema changes in backend

Applied to files:

  • autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-01-23T19:58:10.520Z
Learnt from: ntindle
Repo: Significant-Gravitas/AutoGPT PR: 11796
File: autogpt_platform/backend/backend/blocks/video/loop.py:80-87
Timestamp: 2026-01-23T19:58:10.520Z
Learning: Ensure MoviePy is constrained to version ^2.1.2 (2.x) in pyproject.toml files where MoviePy is declared, so the backend video processing relies on a compatible API. This should cover all relevant pyproject.toml files (e.g., autogpt_platform/backend/pyproject.toml) to maintain consistency.

Applied to files:

  • autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/test/**/*.py : Use snapshot testing with '--snapshot-update' flag in backend tests when output changes; always review with 'git diff'

Applied to files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
📚 Learning: 2026-02-04T16:50:20.508Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/backend/CLAUDE.md:0-0
Timestamp: 2026-02-04T16:50:20.508Z
Learning: Applies to autogpt_platform/backend/**/*_test.py : Always review snapshot changes with `git diff` before committing when updating snapshots with `poetry run pytest --snapshot-update`

Applied to files:

  • autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧬 Code graph analysis (2)
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (3)
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (2)
  • stash_pending_tool_output (126-150)
  • wait_for_stash (153-179)
autogpt_platform/backend/backend/copilot/response_model.py (2)
  • StreamBaseResponse (43-50)
  • StreamToolOutputAvailable (150-173)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)
  • convert_message (70-211)
autogpt_platform/backend/backend/copilot/sdk/service.py (4)
autogpt_platform/backend/backend/copilot/response_model.py (1)
  • StreamHeartbeat (211-225)
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (1)
  • wait_for_stash (153-179)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)
  • SDKResponseAdapter (43-300)
  • has_unresolved_tool_calls (66-68)
autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
  • validate_transcript (248-276)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Seer Code Review
🔇 Additional comments (3)
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (1)

48-55: Stash-event mechanism is well-designed.

The _stash_event ContextVar, its initialisation in set_execution_context, and the event.set() call in stash_pending_tool_output correctly bridge the PostToolUse hook timing gap described in the comment. The intentional "clear-before-wait" semantics in wait_for_stash (verified by test_wait_for_stash_already_stashed) are appropriately documented.

Also applies to: 88-88, 147-150

autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (1)

379-429: Flush tests correctly validate the _flush_unresolved_tool_calls behavior.

test_flush_unresolved_at_result_message and test_flush_unresolved_at_next_assistant_message exercise both flush triggers (ResultMessage and the next AssistantMessage) and assert exact stream-type sequences, including that flushed output defaults to "" when no stash exists.

Also applies to: 432-474

autogpt_platform/backend/backend/copilot/sdk/service.py (1)

688-760: Heartbeat streaming loop is correctly implemented.

The explicit __aiter__ + asyncio.timeout pattern avoids the wait_for task-cancellation pitfall documented in the comment, the StopAsyncIteration break is clean, and the wait_for_stash integration before convert_message correctly bridges the PostToolUse hook timing race without requiring an arbitrary sleep.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 422-444: The _is_tool_error_or_denial function currently treats
the substring "failed" as an error marker which is too broad; update
_is_tool_error_or_denial to remove the bare "failed" marker and instead check
for narrower, unambiguous patterns such as "error:", "execution failed", "tool
error:", or the MCP flag '"iserror": true' (or add a distinct sentinel like
"tool error:" when producing denial messages) so benign outputs like "3 tests
failed" aren't misclassified—modify the marker tuple in _is_tool_error_or_denial
accordingly and ensure any code that stores denial messages uses the same
sentinel convention if added.
- Around line 914-972: The synthetic user entry in _build_transcript sets
"parentUuid": "" which can be rejected by the CLI; change how user_entry is
constructed in _build_transcript so that parentUuid is omitted (or set to
None/null) for root-level user messages instead of an empty string, e.g. only
include parentUuid when you have a non-empty parent id; ensure
validate_transcript still runs against the resulting raw JSONL and keep
captured_entries and previous_transcript handling unchanged.

In `@autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py`:
- Around line 153-179: The function wait_for_stash uses the Python 3.11-only
context manager asyncio.timeout; replace that with asyncio.wait_for to maintain
compatibility with Python 3.10: call await asyncio.wait_for(event.wait(),
timeout) instead of using async with asyncio.timeout(timeout): await
event.wait(), and update the exception handling to catch asyncio.TimeoutError
(or alias TimeoutError to asyncio.TimeoutError) so the timeout branch still
returns False; apply this change in wait_for_stash (and mirror the same
replacement in other uses like in service.py if you touch those locations).

---

Duplicate comments:
In `@autogpt_platform/backend/pyproject.toml`:
- Line 19: The dependency entry for claude-agent-sdk is loosely pinned with
"~0.1.35", which allows patch upgrades (>=0.1.35, <0.2.0) and risks breaking
usage of private internals like client._query and
client._internal.message_parser; change the version specifier to a strict pin
(e.g., "0.1.35" or "==0.1.35") in the pyproject.toml dependency line to lock the
exact version and ensure reproducible behavior when relying on those private
APIs.

---

Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py`:
- Around line 525-545: The async tests test_wait_for_stash_signaled,
test_wait_for_stash_timeout, and test_wait_for_stash_already_stashed leak
ContextVar state on failure because they perform cleanup at the end of the test
body; wrap each test's main body in a try/finally and move the cleanup into the
finally block so _pto and _stash_event are always reset (use _pto.set(None) and
_stash_event.set(None) to restore the original defaults) and keep the existing
simulated stash logic (delayed_stash/task creation) and assertions inside the
try.

In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 67-82: CapturedTranscript defines an unused field sdk_session_id
which is set in the _on_stop callback but never read (the _build_transcript path
uses its session_id parameter instead); remove the sdk_session_id attribute from
the CapturedTranscript dataclass and delete any assignments to it in the
_on_stop callback to avoid dead state, making sure to keep _build_transcript
usage unchanged and verify no other code references sdk_session_id.

Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py
Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py Outdated
Comment thread autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py
@majdyz majdyz enabled auto-merge February 20, 2026 08:24
@majdyz majdyz added this pull request to the merge queue Feb 20, 2026
Merged via the queue into dev with commit fc64f83 Feb 20, 2026
27 checks passed
@majdyz majdyz deleted the fix/messed-up-copilot branch February 20, 2026 08:41
@github-project-automation github-project-automation Bot moved this to Done in Frontend Feb 20, 2026
@github-project-automation github-project-automation Bot moved this from 👍🏼 Mergeable to ✅ Done in AutoGPT development kanban Feb 20, 2026
@majdyz majdyz restored the fix/messed-up-copilot branch February 20, 2026 08:52
github-merge-queue Bot pushed a commit that referenced this pull request Feb 20, 2026
…urrent message saves (#12177)

Requested by @majdyz

Concurrent writers (incremental streaming saves from PR #12173 and
long-running tool callbacks) can race to persist messages with the same
`(sessionId, sequence)` pair, causing unique constraint violations on
`ChatMessage`.

**Root cause:** The streaming loop tracks `saved_msg_count` in-memory,
but the long-running tool callback (`_build_long_running_callback`) also
appends messages and calls `upsert_chat_session` independently — without
coordinating sequence numbers. When the streaming loop does its next
incremental save with the stale `saved_msg_count`, it tries to insert at
a sequence that already exists.

**Fix:** Multi-layered defense-in-depth approach:

1. **Collision detection with retry** (db.py): `add_chat_messages_batch`
uses `create_many()` in a transaction. On `UniqueViolationError`,
queries `MAX(sequence)+1` from DB and retries with the correct offset
(max 5 attempts).

2. **Robust sequence tracking** (db.py): `get_next_sequence()` uses
indexed `find_first` with `order={"sequence": "desc"}` for O(1) MAX
lookup, immune to deleted messages.

3. **Session-based counter** (model.py): Added `saved_message_count`
field to `ChatSession`. `upsert_chat_session` returns the session with
updated count, eliminating tuple returns throughout the codebase.

4. **MessageCounter dataclass** (sdk/service.py): Replaced list[int]
mutable reference pattern with a clean `MessageCounter` dataclass for
shared state between streaming loop and long-running callbacks.

5. **Session locking** (sdk/service.py): Prevent concurrent streams on
the same session using Redis `SET NX EX` distributed locks with TTL
refresh on heartbeats (config.stream_ttl = 3600s).

6. **Atomic operations** (db.py): Single timestamp for all messages and
session update in batch operations for consistency. Parallel queries
with `asyncio.gather` for lower latency.

7. **Config-based TTL** (sdk/service.py, config.py): Consolidated all
TTL constants to use `config.stream_ttl` (3600s) with lock refresh on
heartbeats.

### Key implementation details

- **create_many**: Uses `sessionId` directly (not nested
`Session.connect`) as `create_many` doesn't support nested creates
- **Type narrowing**: Added explicit `assert session is not None`
statements for pyright type checking in async contexts
- **Parallel operations**: Use `asyncio.gather` for independent DB
operations (create_many + session update)
- **Single timestamp**: All messages in a batch share the same
`createdAt` timestamp for atomicity

### Changes
- `backend/copilot/db.py`: Collision detection with `create_many` +
retry, indexed sequence lookup, single timestamp, parallel queries
- `backend/copilot/model.py`: Added `saved_message_count` field,
simplified return types
- `backend/copilot/sdk/service.py`: MessageCounter dataclass, session
locking with refresh, config-based TTL, type narrowing
- `backend/copilot/service.py`: Updated all callers to handle new return
types
- `backend/copilot/config.py`: Increased long_running_operation_ttl to
3600s with clarified docstring
- `backend/copilot/*_test.py`: Tests updated for new signatures

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
jayvenn21 pushed a commit to jayvenn21/AutoGPT that referenced this pull request Feb 20, 2026
…saves, frontend reconnection (Significant-Gravitas#12173)

## Summary

Fixes multiple reliability issues in the copilot's Claude Agent SDK
streaming pipeline — tool outputs getting stuck, parallel tool calls
flushing prematurely, messages lost on page refresh, and SSE
reconnection failures.

## Changes

### Backend: Streaming loop rewrite (`sdk/service.py`)
- **Non-cancelling heartbeat pattern**: Replace `asyncio.timeout()` with
`asyncio.wait()` for SDK message iteration. The old approach corrupted
the SDK's internal anyio memory stream when timeouts fired
mid-`__anext__()`, causing `StopAsyncIteration` on the next call and
silently dropping all in-flight tool results.
- **Hook synchronization**: Add `wait_for_stash()` before
`convert_message()` — the SDK fires PostToolUse hooks via `start_soon()`
(fire-and-forget), so the next message can arrive before the hook
stashes its output. The new asyncio.Event-based mechanism bridges this
gap without arbitrary sleeps.
- **Error handling**: Add `asyncio.CancelledError` handling at both
inner (streaming loop) and outer (session) levels, plus pending task
cleanup in `finally` block to prevent leaked coroutines. Catch
`Exception` from `done.pop().result()` for SDK error messages.
- **Safety-net flush**: After streaming loop ends, flush any remaining
unresolved tool calls so the frontend stops showing spinners even if the
stream drops unexpectedly.
- **StreamFinish fallback**: Emit `StreamFinishStep` + `StreamFinish`
when stream ends without `ResultMessage` (StopAsyncIteration) so the
frontend transitions to "ready" state.
- **Incremental session saves**: Save session to PostgreSQL after each
tool input/output event (not just at stream end), so page refresh and
other devices see recent messages.
- **Enhanced logging**: All log lines now include `session_id[:12]`
prefix and tool call resolution state (unresolved/current/resolved
counts).

### Backend: Response adapter (`sdk/response_adapter.py`)
- **Parallel tool call support**: Skip `_flush_unresolved_tool_calls()`
when an AssistantMessage contains only ToolUseBlocks (parallel
continuation) — the prior tools are still executing concurrently and
haven't finished yet.
- **Duplicate output prevention**: Skip already-resolved tool results in
both UserMessage (ToolResultBlock) and parent_tool_use_id handling to
prevent duplicate `StreamToolOutputAvailable` events.
- **`has_unresolved_tool_calls` property**: Used by the streaming loop
to decide whether to wait for PostToolUse hooks.
- **`session_id` parameter**: Passed through for structured logging.

### Backend: Hook synchronization (`sdk/tool_adapter.py`)
- **`_stash_event` ContextVar**: asyncio.Event signaled by
`stash_pending_tool_output()` whenever a PostToolUse hook stashes
output.
- **`wait_for_stash()`**: Awaits the event with configurable timeout —
replaces the racy "hope the hook finished" approach.

### Backend: Security hooks (`sdk/security_hooks.py`)
- Enhanced logging in `post_tool_use_hook` — log whether tool is
built-in, preview of stashed output, warning when `tool_response` is
None.

### Backend: Incremental save optimization (`model.py`)
- **`existing_message_count` parameter** on `upsert_chat_session`: Skip
the DB query to count existing messages when the caller already tracks
this (streaming loop).
- **`skip_existence_check` parameter** on `_save_session_to_db`: Skip
the `get_chat_session` existence query when we know the session row
already exists. Reduces from 4 DB round trips to 2 per incremental save.

### Backend: SDK version bump (`pyproject.toml`, `poetry.lock`)
- Bump `claude-agent-sdk` from `^0.1.0` to `^0.1.39`.

### Backend: New tests
- **`sdk_compat_test.py`** (new file): SDK compatibility tests — verify
the installed SDK exposes every class, attribute, and method the copilot
integration relies on. Catches SDK upgrade breakage immediately.
- **`response_adapter_test.py`**: 9 new tests covering
flush-at-ResultMessage, flush-at-next-AssistantMessage, stashed output
flush, wait_for_stash signaling/timeout/fast-path, parallel tool call
non-premature-flush, text-message flush of prior tools, and
already-resolved tool skip in UserMessage.

### Frontend: Session hydration (`convertChatSessionToUiMessages.ts`)
- **`isComplete` option**: When session has no active stream, mark
dangling tool calls (no output in DB) as `output-available` with empty
output — stops stale spinners after page refresh.

### Frontend: Chat session hook (`useChatSession.ts`)
- Reorder `hasActiveStream` memo before `hydratedMessages` so
`isComplete` flag is available.
- Pass `{ isComplete: !hasActiveStream }` to
`convertChatSessionMessagesToUiMessages`.

### Frontend: Copilot page hook (`useCopilotPage.ts`)
- **Cache invalidation on stream end**: Invalidate React Query session
cache when stream transitions active → idle, so next hydration fetches
fresh messages from backend (staleTime: Infinity otherwise keeps stale
data).
- **Resume ref reset**: Reset `hasResumedRef` on stream end to allow
re-resume if SSE drops but backend task is still running.
- **Remove old `resolveInProgressTools` effect**: Replaced by
backend-side safety-net flush + hydration-time `isComplete` marking.

## Test plan
- [ ] Existing copilot tests pass (`pytest backend/copilot/ -x -q`)
- [ ] SDK compat tests pass (`pytest
backend/copilot/sdk/sdk_compat_test.py -v`)
- [ ] Tool outputs (bash_exec, web_fetch, WebSearch) appear in the UI
instead of getting stuck
- [ ] Parallel tool calls (e.g. multiple WebSearch) complete and display
results without premature flush
- [ ] Page refresh during active stream reconnects and recovers messages
- [ ] Opening session from another device shows recent tool results
- [ ] SSE drop → automatic reconnection without losing messages
- [ ] Long-running tools (create_agent) still delegate to background
infrastructure
jayvenn21 pushed a commit to jayvenn21/AutoGPT that referenced this pull request Feb 20, 2026
…urrent message saves (Significant-Gravitas#12177)

Requested by @majdyz

Concurrent writers (incremental streaming saves from PR Significant-Gravitas#12173 and
long-running tool callbacks) can race to persist messages with the same
`(sessionId, sequence)` pair, causing unique constraint violations on
`ChatMessage`.

**Root cause:** The streaming loop tracks `saved_msg_count` in-memory,
but the long-running tool callback (`_build_long_running_callback`) also
appends messages and calls `upsert_chat_session` independently — without
coordinating sequence numbers. When the streaming loop does its next
incremental save with the stale `saved_msg_count`, it tries to insert at
a sequence that already exists.

**Fix:** Multi-layered defense-in-depth approach:

1. **Collision detection with retry** (db.py): `add_chat_messages_batch`
uses `create_many()` in a transaction. On `UniqueViolationError`,
queries `MAX(sequence)+1` from DB and retries with the correct offset
(max 5 attempts).

2. **Robust sequence tracking** (db.py): `get_next_sequence()` uses
indexed `find_first` with `order={"sequence": "desc"}` for O(1) MAX
lookup, immune to deleted messages.

3. **Session-based counter** (model.py): Added `saved_message_count`
field to `ChatSession`. `upsert_chat_session` returns the session with
updated count, eliminating tuple returns throughout the codebase.

4. **MessageCounter dataclass** (sdk/service.py): Replaced list[int]
mutable reference pattern with a clean `MessageCounter` dataclass for
shared state between streaming loop and long-running callbacks.

5. **Session locking** (sdk/service.py): Prevent concurrent streams on
the same session using Redis `SET NX EX` distributed locks with TTL
refresh on heartbeats (config.stream_ttl = 3600s).

6. **Atomic operations** (db.py): Single timestamp for all messages and
session update in batch operations for consistency. Parallel queries
with `asyncio.gather` for lower latency.

7. **Config-based TTL** (sdk/service.py, config.py): Consolidated all
TTL constants to use `config.stream_ttl` (3600s) with lock refresh on
heartbeats.

### Key implementation details

- **create_many**: Uses `sessionId` directly (not nested
`Session.connect`) as `create_many` doesn't support nested creates
- **Type narrowing**: Added explicit `assert session is not None`
statements for pyright type checking in async contexts
- **Parallel operations**: Use `asyncio.gather` for independent DB
operations (create_many + session update)
- **Single timestamp**: All messages in a batch share the same
`createdAt` timestamp for atomicity

### Changes
- `backend/copilot/db.py`: Collision detection with `create_many` +
retry, indexed sequence lookup, single timestamp, parallel queries
- `backend/copilot/model.py`: Added `saved_message_count` field,
simplified return types
- `backend/copilot/sdk/service.py`: MessageCounter dataclass, session
locking with refresh, config-based TTL, type narrowing
- `backend/copilot/service.py`: Updated all callers to handle new return
types
- `backend/copilot/config.py`: Increased long_running_operation_ttl to
3600s with clarified docstring
- `backend/copilot/*_test.py`: Tests updated for new signatures

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
@sentry
Copy link
Copy Markdown

sentry Bot commented Feb 27, 2026

Issues attributed to commits in this pull request

This pull request was merged and Sentry observed the following issues:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

platform/backend AutoGPT Platform - Back end platform/frontend AutoGPT Platform - Front end size/l size/xl

Projects

Status: ✅ Done
Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants