fix(copilot): SDK streaming reliability, parallel tools, incremental saves, frontend reconnection#12173
Conversation
- Block Task tool's run_in_background param in security hooks — background agents stall the SSE stream and get killed when the main agent exits - Add heartbeats (15s interval) to SDK streaming loop so proxies/LBs don't close idle SSE connections during long tool execution - Fix summarization prompt that forced LLM to fabricate content for all 9 mandatory sections; now sections are optional and hallucination is explicitly prohibited - Include tool error/denial outcomes in conversation context formatting — previously all tool messages were dropped, so the agent couldn't see that security denials blocked its file writes and hallucinated success
…x/messed-up-copilot
…or detection
- Move run_in_background check before task_spawn_count increment so
denied background Tasks don't consume a subtask slot
- Replace asyncio.wait_for() with asyncio.timeout() for heartbeat loop
to avoid leaving the async generator in a broken state
- Tighten _is_tool_error_or_denial: remove overly broad markers
("error", "failed", "not found") that cause false positives; add
markers for actual denial messages ("not supported", "maximum")
- Fix sandbox process kill: use start_new_session + os.killpg to kill the entire bwrap process group on timeout (proc.kill alone only kills the parent, leaving children running until natural completion) - Add StreamToolInputAvailable/StreamToolOutputAvailable to publish_chunk logging filter so tool events are visible in Docker logs - Add system prompt instruction telling Claude not to use run_in_background on Task tool (gets denied by security hooks) - Add tool event debug logging in SDK streaming loop for tracing tool execution visibility issues
… tests - Move task_spawn_count increment after limit check so denied spawns don't consume a slot (greptile feedback) - Add "failed" marker to _is_tool_error_or_denial to catch internal tool execution failures from _mcp_error (coderabbit feedback) - Add 17 unit tests for _is_tool_error_or_denial covering all markers, denial messages, and false-positive scenarios
- Replace brittle line-count check (< 3) in read_transcript_file with proper validate_transcript() which checks for actual user/assistant entries — avoids rejecting valid short transcripts while still filtering metadata-only files - Add debug logging for transcript source selection and fallback path to aid diagnosing resume issues in Docker - Make test_sdk_resume_multi_turn skip gracefully when the CLI doesn't produce usable transcripts (environment-dependent: CLI version, platform) instead of hard-failing
…sses Catch OSError broadly (not just ProcessLookupError) when calling os.killpg so that EPERM or other errors don't skip the subsequent await proc.communicate(), which would leave the subprocess un-reaped.
…le spinner safety nets - Elevate flush logging from debug to info/warning with structured messages showing tool names and IDs for production diagnostics - Capture raw SDK output for transcript instead of relying on Stop hook file path (CLI doesn't write JSONL in SDK mode) - Add _build_transcript() to reconstruct JSONL from captured entries - Add isComplete option to hydration conversion — marks dangling tool calls as completed when session has no active stream (fixes stale spinners on page refresh) - Add resolveInProgressTools safety net on streaming→ready transition (catches tool parts the backend didn't emit output for) - Add 3 new tests for flush mechanism (ResultMessage, AssistantMessage, stashed output)
|
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR enhances the Copilot SDK's response handling and service layer by introducing session tracking, improving unresolved tool-call flushing with better logging, implementing asynchronous stash coordination for tool outputs, adding context compression and injection during conversations, and including comprehensive test coverage and SDK compatibility verification. Changes
Sequence DiagramsequenceDiagram
participant Client
participant Service as Service Layer
participant Adapter as SDKResponseAdapter
participant ToolAdapter as Tool Adapter
participant Hook as PostToolUse Hook
Client->>Service: Send query with session
Service->>ToolAdapter: set_execution_context (init stash event)
Service->>Adapter: Create SDKResponseAdapter(session_id)
Note over ToolAdapter: Tool execution begins
Hook->>ToolAdapter: stash_pending_tool_output()
ToolAdapter->>ToolAdapter: Signal _stash_event
Note over Service: Waiting for output
Service->>ToolAdapter: wait_for_stash(timeout=0.5)
ToolAdapter-->>Service: Event signaled (True)
Service->>Adapter: _flush_unresolved_tool_calls()
Adapter->>Adapter: Check has_unresolved_tool_calls
Adapter->>Adapter: Emit StreamToolOutputAvailable
Adapter->>Client: Return resolved stream events
alt Timeout occurred
Service->>ToolAdapter: wait_for_stash times out
ToolAdapter-->>Service: False (timeout)
Service->>Service: Log race condition warning
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (5)
autogpt_platform/backend/backend/copilot/tools/sandbox.py (1)
249-272: Process-group kill is correct; comment could be more precise about--die-with-parent
start_new_session=Truecreates a new process group for bwrap (pgid = bwrap.pid), andos.killpg(proc.pid, SIGKILL)correctly sends SIGKILL to that group. However, bwrap's--new-sessionflag (in_build_bwrap_command, line 191) callssetsid()for the sandboxed command, giving it its own separate process group — sokillpg(bwrap.pid)doesn't reach the sandboxed process directly. It is--die-with-parent(viaprctl PR_SET_PDEATHSIG) that sends SIGKILL to the sandboxed process when bwrap dies. The comment at line 260–262 saying "bwrap + all children" implieskillpgcovers the children, which it doesn't in this setup.📝 Suggested comment update
- # Kill entire process group (bwrap + all children). - # proc.kill() alone only kills the bwrap parent, leaving - # children running until they finish naturally. + # Kill bwrap's process group. The sandboxed command runs in its + # own session (bwrap --new-session) so it is not in this group, + # but --die-with-parent (prctl PR_SET_PDEATHSIG) ensures it + # receives SIGKILL when bwrap dies.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/tools/sandbox.py` around lines 249 - 272, Update the misleading comment around the process-kill path: clarify that start_new_session=True creates a new session for bwrap (see start_new_session) and that os.killpg(proc.pid, signal.SIGKILL) sends SIGKILL to that bwrap process group, but sandboxed child processes are placed in their own group by bwrap's --new-session (see _build_bwrap_command) and thus are terminated via bwrap's --die-with-parent/prctl(PR_SET_PDEATHSIG) rather than by killpg; replace the "bwrap + all children" wording with a precise note that killpg targets bwrap's group and that child processes rely on --die-with-parent to get killed when bwrap exits.autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (1)
208-215:_hooksfixture couples toHookMatcherinternal attribute — guard against SDK absenceThe fixture accesses
hooks["PreToolUse"][0].hooks[0], relying onHookMatcherhaving a public.hooksattribute. Ifclaude_agent_sdkis unavailable,create_security_hooksreturns{}and the"PreToolUse"key lookup raisesKeyError. This is safe today because every caller of_hooksis guarded by@pytest.mark.skipif(not _sdk_available(), …), but adding a defensive guard inside the fixture would prevent confusing failures if a future test accidentally drops that decorator.🛡️ Optional guard
`@pytest.fixture`() def _hooks(): """Create security hooks and return the PreToolUse handler.""" + if not _sdk_available(): + pytest.skip("claude_agent_sdk not installed") from .security_hooks import create_security_hooks hooks = create_security_hooks(user_id="u1", sdk_cwd=SDK_CWD, max_subtasks=2) pre = hooks["PreToolUse"][0].hooks[0] return pre🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py` around lines 208 - 215, The _hooks fixture currently assumes create_security_hooks(...) returns a dict with "PreToolUse" and that its first item exposes a .hooks list; change _hooks to defensively handle SDK absence by checking the result of create_security_hooks (e.g., hooks = create_security_hooks(...); if not hooks or "PreToolUse" not in hooks: pytest.skip("claude_agent_sdk not available") or return a no-op handler), and when accessing the matcher use safe access like first_matcher = hooks.get("PreToolUse", []) and pre = getattr(first_matcher[0], "hooks", [None])[0] if present, otherwise skip/return a no-op; reference create_security_hooks and the _hooks fixture to locate and update the code.autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts (1)
173-179: Consider resolving in-progress tools on stream error, not only on"ready"The current effect only fires on the
"streaming" → "ready"transition. If the stream ends withstatus === "error"(e.g. network drop, backend exception), any tool parts still in"input-streaming"or"input-available"state will continue showing spinners until the user reloads the session.♻️ Proposed fix
- if (prev === "streaming" && status === "ready") { + if (prev === "streaming" && (status === "ready" || status === "error")) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/frontend/src/app/`(platform)/copilot/useCopilotPage.ts around lines 173 - 179, The effect currently only resolves in-progress tool parts when prevStatusRef transitions from "streaming" to "ready"; update the condition in the useEffect (referencing useEffect, prevStatusRef, status, setMessages, resolveInProgressTools) to also handle the "streaming" → "error" transition (e.g., change the if to check prev === "streaming" && (status === "ready" || status === "error") or otherwise detect prev === "streaming" && status !== "streaming") so that resolveInProgressTools is invoked when the stream fails and spinners are cleared.autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)
245-288:flushedis now alwaysTrue— consider simplifying.Since
unresolvedis guaranteed non-empty (early return on line 236), and every iteration unconditionally setsflushed = True, the variable is alwaysTrueafter the loop. You can drop it and simplify the tail.♻️ Proposed simplification
- flushed = False for tool_id, tool_name in unresolved: output = pop_pending_tool_output(tool_name) if output is not None: @@ .. self.resolved_tool_calls.add(tool_id) - flushed = True logger.info(...) else: @@ .. self.resolved_tool_calls.add(tool_id) - flushed = True logger.warning(...) - if flushed and self.step_open: + if self.step_open: responses.append(StreamFinishStep()) self.step_open = False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter.py` around lines 245 - 288, The loop always sets flushed=True, so remove the redundant flushed variable and its uses: in the block iterating over unresolved (and using pop_pending_tool_output, StreamToolOutputAvailable, self.resolved_tool_calls), drop flushed assignments and the final if flushed and self.step_open check; instead, after the loop if unresolved was non-empty (the caller already guarantees this) and self.step_open is True, append StreamFinishStep() and set self.step_open = False; ensure all current behavior (adding StreamToolOutputAvailable entries and logging) remains unchanged.autogpt_platform/backend/backend/copilot/sdk/service.py (1)
421-443: Markers in_is_tool_error_or_denialare intentionally broad and tested against real denial messages.The markers like
"failed","maximum", and"not supported"are designed to detect system-generated denial and error messages from security hooks (e.g.,"Maximum {N} sub-tasks per session..."and"Background task execution is not supported...") and tool execution failures. While they could theoretically match unrelated content, the function is applied to tool outputs (structured MCP data or natural language), and false positives only add extra context lines to conversation history—they don't break functionality. The test suite (TestIsToolErrorOrDenial) validates these markers against real denial messages and includes negative tests to ensure benign content is correctly excluded.If higher precision is desired, refactoring to anchor broad markers to more specific patterns (e.g.,
"is not supported"or"execution failed") could reduce noise, but this is optional given the acceptable trade-off between recall and precision for the use case.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 421 - 443, The current broad substring markers in _is_tool_error_or_denial (the tuple literal containing "failed", "maximum", "not supported", etc.) can produce false positives; tighten them by replacing generic tokens with more specific anchored patterns (e.g., change "not supported" -> "is not supported" or "background task execution is not supported", "failed" -> "execution failed" or '"iserror": true' stays as-is, and "maximum" -> patterns like "maximum .* sub-task" or "maximum .* sub-tasks per session") and update the membership check to test these more specific phrases against content.lower() so the function still operates on tool outputs but reduces accidental matches.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks_test.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/stream_registry.pyautogpt_platform/backend/backend/copilot/tools/sandbox.pyautogpt_platform/backend/backend/util/prompt.pyautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.tsautogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
🧰 Additional context used
📓 Path-based instructions (15)
autogpt_platform/frontend/**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/frontend/**/*.{ts,tsx,js,jsx}: Use Node.js 21+ with pnpm package manager for frontend development
Always run 'pnpm format' for formatting and linting code in frontend development
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{tsx,ts}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/frontend/**/*.{tsx,ts}: Use function declarations for components and handlers (not arrow functions) in React components
Only use arrow functions for small inline lambdas (map, filter, etc.) in React components
Use PascalCase for component names and camelCase with 'use' prefix for hook names in React
Use Tailwind CSS utilities only for styling in frontend components
Use design system components from 'src/components/' (atoms, molecules, organisms) in frontend development
Never use 'src/components/legacy/' in frontend code
Only use Phosphor Icons (@phosphor-icons/react) for icons in frontend components
Use generated API hooks from '@/app/api/generated/endpoints/' instead of deprecated 'BackendAPI' or 'src/lib/autogpt-server-api/'
Use React Query for server state (via generated hooks) in frontend development
Default to client components ('use client') in Next.js; only use server components for SEO or extreme TTFB needs
Use '' component for rendering errors in frontend UI; use toast notifications for mutation errors; use 'Sentry.captureException()' for manual exceptions
Separate render logic from data/behavior in React components; keep comments minimal (code should be self-documenting)
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/frontend/**/*.{ts,tsx}: No barrel files or 'index.ts' re-exports in frontend code
Regenerate API hooks with 'pnpm generate:api' after backend OpenAPI spec changes in frontend development
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (autogpt_platform/frontend/CLAUDE.md)
autogpt_platform/frontend/src/**/*.{ts,tsx}: Fully capitalize acronyms in symbols, e.g.graphID,useBackendAPI
Use function declarations (not arrow functions) for components and handlers
Separate render logic (.tsx) from business logic (use*.tshooks)
Use shadcn/ui (Radix UI primitives) with Tailwind CSS styling for UI components
Use Phosphor Icons only for icons
Use ErrorCard for render errors, toast for mutations, and Sentry for exceptions
Use design system components fromsrc/components/(atoms, molecules, organisms)
Never usesrc/components/__legacy__/*components
Use generated API hooks from@/app/api/__generated__/endpoints/with patternuse{Method}{Version}{OperationName}
Use Tailwind CSS only for styling, with design tokens
Do not useuseCallbackoruseMemounless asked to optimize a given function
Never type withanyunless a variable/attribute can ACTUALLY be of any type
autogpt_platform/frontend/src/**/*.{ts,tsx}: Structure components asComponentName/ComponentName.tsx+useComponentName.ts+helpers.tsand use design system components fromsrc/components/(atoms, molecules, organisms)
Use generated API hooks from@/app/api/__generated__/endpoints/with patternuse{Method}{Version}{OperationName}and regenerate withpnpm generate:api
Use function declarations (not arrow functions) for components and handlers
Separate render logic from business logic with component.tsx + useComponent.ts + helpers.ts structure
Colocate state when possible, avoid creating large components, use sub-components in local/componentsfolder
Avoid large hooks, abstract logic intohelpers.tsfiles when sensible
Use arrow functions only for callbacks, not for component declarations
Avoid comments at all times unless the code is very complex
Do not useuseCallbackoruseMemounless asked to optimize a given function
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/src/**/*use*.ts
📄 CodeRabbit inference engine (autogpt_platform/frontend/CLAUDE.md)
Do not type hook returns, let TypeScript infer as much as possible
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{js,jsx,ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
autogpt_platform/frontend/**/*.{js,jsx,ts,tsx}: Format frontend code usingpnpm format
Never use components fromsrc/components/__legacy__/*
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/**/*.{js,jsx,ts,tsx,css}
📄 CodeRabbit inference engine (AGENTS.md)
Use Tailwind CSS only for styling, use design tokens, and use Phosphor Icons only
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/frontend/src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
Do not type hook returns, let Typescript infer as much as possible
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Never type with
any, if no types available useunknown
Files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.tsautogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.tsautogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts
autogpt_platform/backend/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development
Files:
autogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks_test.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/stream_registry.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/tools/sandbox.pyautogpt_platform/backend/backend/util/prompt.py
autogpt_platform/backend/**/*.{py,txt}
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use
poetry runprefix for all Python commands, including testing, linting, formatting, and migrations
Files:
autogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks_test.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/stream_registry.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/tools/sandbox.pyautogpt_platform/backend/backend/util/prompt.py
autogpt_platform/backend/**/*_test.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
autogpt_platform/backend/**/*_test.py: Always review snapshot changes withgit diffbefore committing when updating snapshots withpoetry run pytest --snapshot-update
Use pytest with snapshot testing for API responses in test files
Colocate test files with source files using the*_test.pynaming convention
Files:
autogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks_test.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
autogpt_platform/backend/backend/**/*.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings
Files:
autogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks_test.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/stream_registry.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/tools/sandbox.pyautogpt_platform/backend/backend/util/prompt.py
autogpt_platform/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Format Python code with
poetry run format
Files:
autogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks_test.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/stream_registry.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/tools/sandbox.pyautogpt_platform/backend/backend/util/prompt.py
autogpt_platform/backend/**/*test*.py
📄 CodeRabbit inference engine (AGENTS.md)
Run
poetry run testfor backend testing (runs pytest with docker based postgres + prisma)
Files:
autogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks_test.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧠 Learnings (2)
📚 Learning: 2026-02-04T16:50:33.615Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/frontend/CLAUDE.md:0-0
Timestamp: 2026-02-04T16:50:33.615Z
Learning: Use React Query for server state, co-located UI state in components/hooks
Applied to files:
autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/test/**/*.py : Use snapshot testing with '--snapshot-update' flag in backend tests when output changes; always review with 'git diff'
Applied to files:
autogpt_platform/backend/backend/copilot/service_test.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧬 Code graph analysis (8)
autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts (1)
autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts (1)
convertChatSessionMessagesToUiMessages(58-139)
autogpt_platform/backend/backend/copilot/service_test.py (1)
autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
download_transcript(402-453)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)
autogpt_platform/backend/backend/util/logging.py (2)
info(41-43)warning(45-47)autogpt_platform/backend/backend/copilot/tools/base.py (1)
name(20-22)
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (1)
autogpt_platform/backend/backend/util/logging.py (1)
info(41-43)
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (2)
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (3)
_validate_tool_access(83-118)_validate_user_isolation(121-142)create_security_hooks(145-333)autogpt_platform/backend/backend/copilot/sdk/service.py (1)
_is_tool_error_or_denial(421-443)
autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
autogpt_platform/backend/backend/util/logging.py (1)
debug(53-55)
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (3)
autogpt_platform/backend/backend/copilot/response_model.py (2)
StreamBaseResponse(43-50)StreamToolOutputAvailable(150-173)autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)
convert_message(64-205)autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (1)
stash_pending_tool_output(116-136)
autogpt_platform/backend/backend/copilot/tools/sandbox.py (1)
autogpt_platform/backend/backend/util/logging.py (1)
warning(45-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Seer Code Review
- GitHub Check: Greptile Review
🔇 Additional comments (14)
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (1)
269-334:TestIsToolErrorOrDenial— thorough coverage, LGTMThe class exercises every marker in
_is_tool_error_or_denialincluding the case-insensitive JSON field ('"isError": true'→ lowercased match), benign look-alikes ("0 errors found","permission_level","file-not-found-handler.py"), and bothNone/empty inputs. Coverage is complete.autogpt_platform/frontend/src/app/(platform)/copilot/helpers/convertChatSessionToUiMessages.ts (1)
108-117: Stale-spinner fix is clean and consistent withresolveInProgressToolsThe
output: ""sentinel for completed-but-missing tool outputs matches the value used inuseCopilotPage.ts'sresolveInProgressTools. The 3-way branch (output present → use output; session complete → emptyoutput-available; streaming →input-available) correctly represents all reachable states.autogpt_platform/backend/backend/util/prompt.py (1)
538-565: Anti-fabrication instructions and streamlined section list are good improvementsThe anti-hallucination guard ("CRITICAL: Only include…EXPLICITLY present") combined with "skip sections with nothing to report" should meaningfully reduce fabricated summaries. Collapsing 9 → 7 sections and replacing the separate "Current Work"/"Next Steps" with a single "Current State" avoids speculative forward-looking content.
autogpt_platform/backend/backend/copilot/stream_registry.py (1)
228-238: LGTM — tool-event timing logs are consistent with_reconstruct_chunkclass namesThe two added type names (
StreamToolInputAvailable,StreamToolOutputAvailable) match the imports in_reconstruct_chunkand align with the new tool-output streaming introduced in the response adapter.autogpt_platform/backend/backend/copilot/service_test.py (1)
147-151: Graceful skip for missing transcript is correct for environment-dependent test
return pytest.skip(...)mid-test is the standard pytest idiom when a skip condition can only be determined at runtime. The 5 s polling window is reasonable. The test still fully validates both turns when a transcript is present.autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotPage.ts (1)
20-30:resolveInProgressTools— clean implementationFunction declaration ✓, guard with
"state" in partbefore accessingpart.state✓,output: ""consistent with theisCompletepath inconvertChatSessionToUiMessages.ts✓.autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
140-147: Semantic transcript validation is a clear improvement over the old line-count heuristicChecking for at least one
"user"and one"assistant"entry correctly rejects metadata-only files (e.g., onlyqueue-operationentries) regardless of line count. The safe-fail behavior on anyjson.JSONDecodeErroris intentional — a truncated last line from a mid-write SIGKILL is safer to discard than to resume with.autogpt_platform/frontend/src/app/(platform)/copilot/useChatSession.ts (1)
43-62: Looks good — hydration correctly marks dangling tool calls as complete.The
hasActiveStreamderivation andisCompleteflag flow correctly toconvertChatSessionMessagesToUiMessages, resolving stale spinners when a session has no active stream. The dependency arrays are correct (ESLint exhaustive-deps would requirehasActiveStreameven though it's derived fromsessionQuery.data).Minor note per coding guidelines: the comments on lines 43–45 and 53–54 are helpful for explaining why the
isCompleteflag exists, though the guidelines suggest keeping comments minimal. Given this is a non-obvious behavioral fix, they're justified here.autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (1)
190-214: Counter-order fix and background-task gating look correct.The reordering — block background first, check limit second, increment only on success — ensures denied calls never consume a subtask slot. Each session gets its own closure, so the non-atomic
task_spawn_countis safe within the single-session async flow.autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)
231-237: Good improvement — early return and pre-computed unresolved list.The early return avoids unnecessary logging and iteration. The list comprehension cleanly identifies only unresolved calls.
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (1)
372-514: Solid test coverage for the three flush scenarios.The tests exercise the key paths: flush at
ResultMessage, flush at nextAssistantMessage, and flush with stashed output. Event sequences and assertions are correct.autogpt_platform/backend/backend/copilot/sdk/service.py (3)
855-913:_build_transcript— clean reconstruction logic.The concatenation of previous transcript + synthetic user entry + new entries produces valid JSONL, and the
validate_transcriptguard prevents uploading corrupt data. The synthetic user entry structure with optionalparentUuidis a pragmatic approach.One observation:
previous_transcript.rstrip("\n")on line 886 ensures no double newlines when joining, which is correct for JSONL.
86-87: Heartbeat interval constant and integration look correct.15 seconds is a reasonable SSE keep-alive interval that falls well under typical proxy/LB timeouts (usually 30–60s). The
asyncio.timeout+StreamHeartbeatpattern on lines 690–694 is clean.
66-81: CapturedTranscript refactored for raw entry capture — looks good.The shift from file-path-based transcript tracking to in-memory
raw_entriescollection is the right approach given the CLI doesn't write JSONL in SDK mode.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 686-687: Replace the bare assert on client._query with an explicit
runtime check: verify client._query is not None before using it and raise a
descriptive exception (e.g., ValueError or RuntimeError) that explains the SDK
internal state is invalid and suggests calling connect() first; then proceed to
set msg_iter = client._query.receive_messages().__aiter__() as before. Ensure
the check references client._query and that the error message clearly identifies
the failing condition and preferred caller action.
- Around line 682-707: The dependency constraint for claude-agent-sdk is too
loose given we rely on private internals (client._query and
claude_agent_sdk._internal.message_parser.parse_message); update pyproject.toml
to pin the package to the exact tested release (e.g., 0.1.35) or a very narrow
range (e.g., ==0.1.35 or >=0.1.35,<0.1.36) and add a short comment in the file
referencing client._query and parse_message to mark this as tech debt requiring
audit when bumping the pinned version.
---
Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter.py`:
- Around line 245-288: The loop always sets flushed=True, so remove the
redundant flushed variable and its uses: in the block iterating over unresolved
(and using pop_pending_tool_output, StreamToolOutputAvailable,
self.resolved_tool_calls), drop flushed assignments and the final if flushed and
self.step_open check; instead, after the loop if unresolved was non-empty (the
caller already guarantees this) and self.step_open is True, append
StreamFinishStep() and set self.step_open = False; ensure all current behavior
(adding StreamToolOutputAvailable entries and logging) remains unchanged.
In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py`:
- Around line 208-215: The _hooks fixture currently assumes
create_security_hooks(...) returns a dict with "PreToolUse" and that its first
item exposes a .hooks list; change _hooks to defensively handle SDK absence by
checking the result of create_security_hooks (e.g., hooks =
create_security_hooks(...); if not hooks or "PreToolUse" not in hooks:
pytest.skip("claude_agent_sdk not available") or return a no-op handler), and
when accessing the matcher use safe access like first_matcher =
hooks.get("PreToolUse", []) and pre = getattr(first_matcher[0], "hooks",
[None])[0] if present, otherwise skip/return a no-op; reference
create_security_hooks and the _hooks fixture to locate and update the code.
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 421-443: The current broad substring markers in
_is_tool_error_or_denial (the tuple literal containing "failed", "maximum", "not
supported", etc.) can produce false positives; tighten them by replacing generic
tokens with more specific anchored patterns (e.g., change "not supported" -> "is
not supported" or "background task execution is not supported", "failed" ->
"execution failed" or '"iserror": true' stays as-is, and "maximum" -> patterns
like "maximum .* sub-task" or "maximum .* sub-tasks per session") and update the
membership check to test these more specific phrases against content.lower() so
the function still operates on tool outputs but reduces accidental matches.
In `@autogpt_platform/backend/backend/copilot/tools/sandbox.py`:
- Around line 249-272: Update the misleading comment around the process-kill
path: clarify that start_new_session=True creates a new session for bwrap (see
start_new_session) and that os.killpg(proc.pid, signal.SIGKILL) sends SIGKILL to
that bwrap process group, but sandboxed child processes are placed in their own
group by bwrap's --new-session (see _build_bwrap_command) and thus are
terminated via bwrap's --die-with-parent/prctl(PR_SET_PDEATHSIG) rather than by
killpg; replace the "bwrap + all children" wording with a precise note that
killpg targets bwrap's group and that child processes rely on --die-with-parent
to get killed when bwrap exits.
In `@autogpt_platform/frontend/src/app/`(platform)/copilot/useCopilotPage.ts:
- Around line 173-179: The effect currently only resolves in-progress tool parts
when prevStatusRef transitions from "streaming" to "ready"; update the condition
in the useEffect (referencing useEffect, prevStatusRef, status, setMessages,
resolveInProgressTools) to also handle the "streaming" → "error" transition
(e.g., change the if to check prev === "streaming" && (status === "ready" ||
status === "error") or otherwise detect prev === "streaming" && status !==
"streaming") so that resolveInProgressTools is invoked when the stream fails and
spinners are cleared.
… logging - Add 100ms yield before flush when unresolved tool calls exist, giving PostToolUse hooks time to complete before the stash is checked. This mitigates the race condition in claude_agent_sdk where hooks are fire-and-forget (start_soon) while messages arrive immediately. - Add has_unresolved_tool_calls property to SDKResponseAdapter - Differentiate empty flush warnings to flag likely race conditions - Add session_id to all SDK log messages ([SDK] [<session>] ...) - Log session approach (resume/compression/single-turn) with context sizes - Elevate session save log from debug to info
There was a problem hiding this comment.
🧹 Nitpick comments (2)
autogpt_platform/backend/backend/copilot/sdk/service.py (2)
732-740: Timing-based race mitigation is inherently fragile.The 100ms
asyncio.sleepis a best-effort workaround for the hook race, and won't reliably cover cases where thePostToolUsehook takes longer (e.g., network latency in the stash path). The adapter's fallback (emit empty output + WARNING) provides a safety net, so this isn't a correctness issue, but it may produce noisy logs under load.Consider a more deterministic approach if this becomes frequent — e.g., an
asyncio.Eventset by the hook that the flush path awaits with a short timeout.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 732 - 740, The current 100ms asyncio.sleep in Query._read_messages to mitigate a race with PostToolUse hooks is fragile; replace it with a deterministic wait: add an asyncio.Event (e.g., adapter.post_tool_use_done_event) that the PostToolUse hook sets when its stash/output is complete, then in the flush path (where adapter.has_unresolved_tool_calls and isinstance(sdk_msg, (AssistantMessage, ResultMessage))) await that event with a short timeout using asyncio.wait_for (fallback to current behavior if timed out) so the flush deterministically waits for the hook to finish instead of relying on sleep.
855-860: Minor logging inconsistency: fullsession_idvs truncated.Other log lines in this function consistently use
session_id[:12]for brevity, but this warning uses the fullsession_id. Consider truncating for consistency.logger.warning( - "[SDK] No transcript to upload for %s " "(%d captured entries)", - session_id, + "[SDK] [%s] No transcript to upload (%d captured entries)", + session_id[:12], len(captured_transcript.raw_entries), )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 855 - 860, The warning log is inconsistent because it logs the full session_id while other logs use a truncated form; update the logger.warning call that references session_id and captured_transcript.raw_entries to use session_id[:12] (same truncation used elsewhere in this function) so the message reads e.g. logger.warning("[SDK] No transcript to upload for %s (%d captured entries)", session_id[:12], len(captured_transcript.raw_entries)).
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/service.py
🧰 Additional context used
📓 Path-based instructions (4)
autogpt_platform/backend/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development
Files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/**/*.{py,txt}
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use
poetry runprefix for all Python commands, including testing, linting, formatting, and migrations
Files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/backend/**/*.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings
Files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Format Python code with
poetry run format
Files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter.pyautogpt_platform/backend/backend/copilot/sdk/service.py
🧬 Code graph analysis (1)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)
autogpt_platform/backend/backend/util/logging.py (2)
info(41-43)warning(45-47)autogpt_platform/backend/backend/copilot/tools/base.py (1)
name(20-22)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Seer Code Review
🔇 Additional comments (8)
autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)
50-68: Clean additions for session awareness and unresolved-call tracking.The
session_idparameter andhas_unresolved_tool_callsproperty are well-designed. Usingdict.keys() - setfor the set difference is idiomatic and efficient.
237-300: Good early-return guard and improved logging granularity.The pre-computed
unresolvedlist avoids repeated set-difference checks inside the loop, and the early return when nothing to flush is a nice short-circuit. INFO vs. WARNING distinction correctly differentiates normal flushes from race-condition symptoms.One minor note: after the early-return guard,
flushed(line 253) is now always set toTrueby at least one loop iteration, making theif flushedcheck on line 298 technically redundant. Not worth changing — it's a safe defensive guard.autogpt_platform/backend/backend/copilot/sdk/service.py (6)
66-88: Clean refactor ofCapturedTranscriptto in-memory capture.Switching from file-path-based transcript handling to in-memory
raw_entriesis a good simplification that avoids filesystem I/O race conditions. The 15-second heartbeat interval is a reasonable default for keeping SSE alive through typical proxy/LB timeouts (often 30–60s).
597-598: LGTM — adapter instantiation correctly passessession_id.Aligns with the updated
SDKResponseAdapter.__init__signature inresponse_adapter.py.
705-713: Heartbeat streaming loop is well-structured.Using
asyncio.timeoutper-iteration with an explicit async iterator is the correct pattern for Python 3.11+. The comment explaining whyasyncio.timeoutis preferred overasyncio.wait_for(Task cancellation leaving the generator broken) is a valuable implementation note.
894-952:_build_transcriptis well-guarded with validation.The function correctly handles edge cases (empty entries →
None, invalid transcript →Nonewith warning). Concatenating previous transcript + synthetic user entry + new entries produces a proper JSONL sequence. The trailing-newline handling (rstrip("\n")on previous + final+ "\n") ensures consistent formatting.
930-937: The synthetic user entry format is correct; concerns about format-related resume failures are not founded.The
validate_transcript()function only checks for the presence oftype: "user"andtype: "assistant"entries—it does not enforce strict field schema validation. Extra fields likesession_idare safely ignored. Additionally, the codebase explicitly handles emptyparentUuidvalues (defaulting to""via.get("parentUuid", "")throughouttranscript.py), and the reparenting logic instrip_progress_entries()correctly walks up chains with empty strings. A comment at service.py:929 confirms "The uuid/parentUuid fields are optional for --resume," indicating this design is intentional.Likely an incorrect or invalid review comment.
421-443: The function already handles the false positive scenarios you're concerned about. The test suite explicitly validates that benign content like"0 errors found"or filenames like"file-not-found-handler.py"returnFalse, while legitimate errors and denials returnTrue. The markers ("failed","maximum","not supported") appear only in actual error/denial messages from the security hooks, not in arbitrary tool output. No refactoring is needed.Likely an incorrect or invalid review comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 698-704: The code illegally depends on private internals: remove
the import from claude_agent_sdk._internal.message_parser and the direct access
to client._query; instead use the SDK's public APIs (e.g., call the public
receive_messages/stream method exposed on the client or client.query object and
the public message parsing/constructors such as AssistantMessage/ResultMessage
or any public parse_message API) to get an async iterator and parse messages.
Replace uses of _parse_sdk_msg and client._query with the SDK's documented
methods (e.g., client.receive_messages() or client.query.receive_messages() and
AssistantMessage/ResultMessage parsing utilities) so the code no longer
references _parse_sdk_msg or client._query.
- Around line 703-704: Replace the bare assert on client._query with an explicit
runtime check: if client._query is None raise a clear exception (e.g.,
RuntimeError or a custom error) with a message indicating that connect() must be
called before iterating messages; then proceed to call
client._query.receive_messages().__aiter__() to obtain msg_iter. This change
should occur where client._query is referenced so the error is descriptive and
not stripped by -O.
---
Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 732-740: The current 100ms asyncio.sleep in Query._read_messages
to mitigate a race with PostToolUse hooks is fragile; replace it with a
deterministic wait: add an asyncio.Event (e.g.,
adapter.post_tool_use_done_event) that the PostToolUse hook sets when its
stash/output is complete, then in the flush path (where
adapter.has_unresolved_tool_calls and isinstance(sdk_msg, (AssistantMessage,
ResultMessage))) await that event with a short timeout using asyncio.wait_for
(fallback to current behavior if timed out) so the flush deterministically waits
for the hook to finish instead of relying on sleep.
- Around line 855-860: The warning log is inconsistent because it logs the full
session_id while other logs use a truncated form; update the logger.warning call
that references session_id and captured_transcript.raw_entries to use
session_id[:12] (same truncation used elsewhere in this function) so the message
reads e.g. logger.warning("[SDK] No transcript to upload for %s (%d captured
entries)", session_id[:12], len(captured_transcript.raw_entries)).
… pin, event-based stash - Replace bare `assert client._query` with proper RuntimeError check - Add TECH DEBT comments on private SDK internal usage - Pin claude-agent-sdk to ~0.1.35 (tighter constraint for private API access) - Replace sleep(0.1) with event-based wait_for_stash() for race-condition fix - Add wait_for_stash synchronisation tests
Resolve service.py conflicts: take dev's file-based transcript approach (CapturedTranscript.path + read_transcript_file) and public client API, layer our fixes on top (wait_for_stash race-condition fix, session_id logging, approach logging).
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (1)
525-545: Addtry/finallyto async test cleanup to prevent ContextVar state leakage on failure.All three async wait-for-stash tests reset
_ptoand_stash_eventat the end of the function body, but this cleanup is skipped if an assertion fails before it's reached. A failed test could leave_ptoset to{}(instead of theNonedefault) and_stash_eventset to a liveEvent, which may affect sync tests that follow.♻️ Proposed fix — wrap body in try/finally for each async test
`@pytest.mark.asyncio` async def test_wait_for_stash_signaled(): """wait_for_stash returns True when stash_pending_tool_output signals.""" _pto.set({}) event = asyncio.Event() _stash_event.set(event) - # Simulate a PostToolUse hook that stashes output after a short delay - async def delayed_stash(): - await asyncio.sleep(0.01) - _stash("WebSearch", "result data") - - asyncio.create_task(delayed_stash()) - result = await wait_for_stash(timeout=1.0) - - assert result is True - assert _pto.get({}).get("WebSearch") == ["result data"] - - # Cleanup - _pto.set({}) # type: ignore[arg-type] - _stash_event.set(None) + try: + async def delayed_stash(): + await asyncio.sleep(0.01) + _stash("WebSearch", "result data") + + asyncio.create_task(delayed_stash()) + result = await wait_for_stash(timeout=1.0) + + assert result is True + assert _pto.get({}).get("WebSearch") == ["result data"] + finally: + _pto.set({}) # type: ignore[arg-type] + _stash_event.set(None)Apply the same
try/finallypattern totest_wait_for_stash_timeoutandtest_wait_for_stash_already_stashed.Also applies to: 548-560, 563-587
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py` around lines 525 - 545, The async tests test_wait_for_stash_signaled, test_wait_for_stash_timeout, and test_wait_for_stash_already_stashed leak ContextVar state on failure because they perform cleanup at the end of the test body; wrap each test's main body in a try/finally and move the cleanup into the finally block so _pto and _stash_event are always reset (use _pto.set(None) and _stash_event.set(None) to restore the original defaults) and keep the existing simulated stash logic (delayed_stash/task creation) and assertions inside the try.autogpt_platform/backend/backend/copilot/sdk/service.py (1)
67-82:sdk_session_idfield inCapturedTranscriptis never read after assignment.The field is set in the
_on_stopcallback (line 551), but no code path reads it afterwards. The_build_transcriptmethod uses a separatesession_idparameter instead. Remove this unused field to reduce confusion.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/service.py` around lines 67 - 82, CapturedTranscript defines an unused field sdk_session_id which is set in the _on_stop callback but never read (the _build_transcript path uses its session_id parameter instead); remove the sdk_session_id attribute from the CapturedTranscript dataclass and delete any assignments to it in the _on_stop callback to avoid dead state, making sure to keep _build_transcript usage unchanged and verify no other code references sdk_session_id.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
autogpt_platform/backend/poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/tool_adapter.pyautogpt_platform/backend/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (6)
autogpt_platform/backend/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development
Files:
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/**/*.{py,txt}
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use
poetry runprefix for all Python commands, including testing, linting, formatting, and migrations
Files:
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/backend/**/*.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings
Files:
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Format Python code with
poetry run format
Files:
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.pyautogpt_platform/backend/backend/copilot/sdk/response_adapter_test.pyautogpt_platform/backend/backend/copilot/sdk/service.py
autogpt_platform/backend/**/*_test.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
autogpt_platform/backend/**/*_test.py: Always review snapshot changes withgit diffbefore committing when updating snapshots withpoetry run pytest --snapshot-update
Use pytest with snapshot testing for API responses in test files
Colocate test files with source files using the*_test.pynaming convention
Files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
autogpt_platform/backend/**/*test*.py
📄 CodeRabbit inference engine (AGENTS.md)
Run
poetry run testfor backend testing (runs pytest with docker based postgres + prisma)
Files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧠 Learnings (7)
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/*.py : Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Applied to files:
autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-02-04T16:50:20.508Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/backend/CLAUDE.md:0-0
Timestamp: 2026-02-04T16:50:20.508Z
Learning: Applies to autogpt_platform/backend/**/*.{py,txt} : Use `poetry run` prefix for all Python commands, including testing, linting, formatting, and migrations
Applied to files:
autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-01-31T18:44:56.328Z
Learnt from: Bentlybro
Repo: Significant-Gravitas/AutoGPT PR: 0
File: :0-0
Timestamp: 2026-01-31T18:44:56.328Z
Learning: In the AutoGPT backend Docker build with Poetry path dependencies, `autogpt_libs` requires a double-copy pattern: first COPY from builder brings the installed package state (so venv references work), then second COPY from build context overwrites with latest source (to ensure freshness). Both copies are necessary for correct functionality.
Applied to files:
autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/schema.prisma : Run database migrations with 'poetry run prisma migrate dev' and 'poetry run prisma generate' after schema changes in backend
Applied to files:
autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-01-23T19:58:10.520Z
Learnt from: ntindle
Repo: Significant-Gravitas/AutoGPT PR: 11796
File: autogpt_platform/backend/backend/blocks/video/loop.py:80-87
Timestamp: 2026-01-23T19:58:10.520Z
Learning: Ensure MoviePy is constrained to version ^2.1.2 (2.x) in pyproject.toml files where MoviePy is declared, so the backend video processing relies on a compatible API. This should cover all relevant pyproject.toml files (e.g., autogpt_platform/backend/pyproject.toml) to maintain consistency.
Applied to files:
autogpt_platform/backend/pyproject.toml
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/test/**/*.py : Use snapshot testing with '--snapshot-update' flag in backend tests when output changes; always review with 'git diff'
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
📚 Learning: 2026-02-04T16:50:20.508Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/backend/CLAUDE.md:0-0
Timestamp: 2026-02-04T16:50:20.508Z
Learning: Applies to autogpt_platform/backend/**/*_test.py : Always review snapshot changes with `git diff` before committing when updating snapshots with `poetry run pytest --snapshot-update`
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py
🧬 Code graph analysis (2)
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (3)
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (2)
stash_pending_tool_output(126-150)wait_for_stash(153-179)autogpt_platform/backend/backend/copilot/response_model.py (2)
StreamBaseResponse(43-50)StreamToolOutputAvailable(150-173)autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (1)
convert_message(70-211)
autogpt_platform/backend/backend/copilot/sdk/service.py (4)
autogpt_platform/backend/backend/copilot/response_model.py (1)
StreamHeartbeat(211-225)autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (1)
wait_for_stash(153-179)autogpt_platform/backend/backend/copilot/sdk/response_adapter.py (2)
SDKResponseAdapter(43-300)has_unresolved_tool_calls(66-68)autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
validate_transcript(248-276)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Seer Code Review
🔇 Additional comments (3)
autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py (1)
48-55: Stash-event mechanism is well-designed.The
_stash_eventContextVar, its initialisation inset_execution_context, and theevent.set()call instash_pending_tool_outputcorrectly bridge the PostToolUse hook timing gap described in the comment. The intentional "clear-before-wait" semantics inwait_for_stash(verified bytest_wait_for_stash_already_stashed) are appropriately documented.Also applies to: 88-88, 147-150
autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py (1)
379-429: Flush tests correctly validate the_flush_unresolved_tool_callsbehavior.
test_flush_unresolved_at_result_messageandtest_flush_unresolved_at_next_assistant_messageexercise both flush triggers (ResultMessage and the next AssistantMessage) and assert exact stream-type sequences, including that flushed output defaults to""when no stash exists.Also applies to: 432-474
autogpt_platform/backend/backend/copilot/sdk/service.py (1)
688-760: Heartbeat streaming loop is correctly implemented.The explicit
__aiter__+asyncio.timeoutpattern avoids thewait_fortask-cancellation pitfall documented in the comment, theStopAsyncIterationbreak is clean, and thewait_for_stashintegration beforeconvert_messagecorrectly bridges the PostToolUse hook timing race without requiring an arbitrary sleep.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 422-444: The _is_tool_error_or_denial function currently treats
the substring "failed" as an error marker which is too broad; update
_is_tool_error_or_denial to remove the bare "failed" marker and instead check
for narrower, unambiguous patterns such as "error:", "execution failed", "tool
error:", or the MCP flag '"iserror": true' (or add a distinct sentinel like
"tool error:" when producing denial messages) so benign outputs like "3 tests
failed" aren't misclassified—modify the marker tuple in _is_tool_error_or_denial
accordingly and ensure any code that stores denial messages uses the same
sentinel convention if added.
- Around line 914-972: The synthetic user entry in _build_transcript sets
"parentUuid": "" which can be rejected by the CLI; change how user_entry is
constructed in _build_transcript so that parentUuid is omitted (or set to
None/null) for root-level user messages instead of an empty string, e.g. only
include parentUuid when you have a non-empty parent id; ensure
validate_transcript still runs against the resulting raw JSONL and keep
captured_entries and previous_transcript handling unchanged.
In `@autogpt_platform/backend/backend/copilot/sdk/tool_adapter.py`:
- Around line 153-179: The function wait_for_stash uses the Python 3.11-only
context manager asyncio.timeout; replace that with asyncio.wait_for to maintain
compatibility with Python 3.10: call await asyncio.wait_for(event.wait(),
timeout) instead of using async with asyncio.timeout(timeout): await
event.wait(), and update the exception handling to catch asyncio.TimeoutError
(or alias TimeoutError to asyncio.TimeoutError) so the timeout branch still
returns False; apply this change in wait_for_stash (and mirror the same
replacement in other uses like in service.py if you touch those locations).
---
Duplicate comments:
In `@autogpt_platform/backend/pyproject.toml`:
- Line 19: The dependency entry for claude-agent-sdk is loosely pinned with
"~0.1.35", which allows patch upgrades (>=0.1.35, <0.2.0) and risks breaking
usage of private internals like client._query and
client._internal.message_parser; change the version specifier to a strict pin
(e.g., "0.1.35" or "==0.1.35") in the pyproject.toml dependency line to lock the
exact version and ensure reproducible behavior when relying on those private
APIs.
---
Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/response_adapter_test.py`:
- Around line 525-545: The async tests test_wait_for_stash_signaled,
test_wait_for_stash_timeout, and test_wait_for_stash_already_stashed leak
ContextVar state on failure because they perform cleanup at the end of the test
body; wrap each test's main body in a try/finally and move the cleanup into the
finally block so _pto and _stash_event are always reset (use _pto.set(None) and
_stash_event.set(None) to restore the original defaults) and keep the existing
simulated stash logic (delayed_stash/task creation) and assertions inside the
try.
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 67-82: CapturedTranscript defines an unused field sdk_session_id
which is set in the _on_stop callback but never read (the _build_transcript path
uses its session_id parameter instead); remove the sdk_session_id attribute from
the CapturedTranscript dataclass and delete any assignments to it in the
_on_stop callback to avoid dead state, making sure to keep _build_transcript
usage unchanged and verify no other code references sdk_session_id.
…urrent message saves (#12177) Requested by @majdyz Concurrent writers (incremental streaming saves from PR #12173 and long-running tool callbacks) can race to persist messages with the same `(sessionId, sequence)` pair, causing unique constraint violations on `ChatMessage`. **Root cause:** The streaming loop tracks `saved_msg_count` in-memory, but the long-running tool callback (`_build_long_running_callback`) also appends messages and calls `upsert_chat_session` independently — without coordinating sequence numbers. When the streaming loop does its next incremental save with the stale `saved_msg_count`, it tries to insert at a sequence that already exists. **Fix:** Multi-layered defense-in-depth approach: 1. **Collision detection with retry** (db.py): `add_chat_messages_batch` uses `create_many()` in a transaction. On `UniqueViolationError`, queries `MAX(sequence)+1` from DB and retries with the correct offset (max 5 attempts). 2. **Robust sequence tracking** (db.py): `get_next_sequence()` uses indexed `find_first` with `order={"sequence": "desc"}` for O(1) MAX lookup, immune to deleted messages. 3. **Session-based counter** (model.py): Added `saved_message_count` field to `ChatSession`. `upsert_chat_session` returns the session with updated count, eliminating tuple returns throughout the codebase. 4. **MessageCounter dataclass** (sdk/service.py): Replaced list[int] mutable reference pattern with a clean `MessageCounter` dataclass for shared state between streaming loop and long-running callbacks. 5. **Session locking** (sdk/service.py): Prevent concurrent streams on the same session using Redis `SET NX EX` distributed locks with TTL refresh on heartbeats (config.stream_ttl = 3600s). 6. **Atomic operations** (db.py): Single timestamp for all messages and session update in batch operations for consistency. Parallel queries with `asyncio.gather` for lower latency. 7. **Config-based TTL** (sdk/service.py, config.py): Consolidated all TTL constants to use `config.stream_ttl` (3600s) with lock refresh on heartbeats. ### Key implementation details - **create_many**: Uses `sessionId` directly (not nested `Session.connect`) as `create_many` doesn't support nested creates - **Type narrowing**: Added explicit `assert session is not None` statements for pyright type checking in async contexts - **Parallel operations**: Use `asyncio.gather` for independent DB operations (create_many + session update) - **Single timestamp**: All messages in a batch share the same `createdAt` timestamp for atomicity ### Changes - `backend/copilot/db.py`: Collision detection with `create_many` + retry, indexed sequence lookup, single timestamp, parallel queries - `backend/copilot/model.py`: Added `saved_message_count` field, simplified return types - `backend/copilot/sdk/service.py`: MessageCounter dataclass, session locking with refresh, config-based TTL, type narrowing - `backend/copilot/service.py`: Updated all callers to handle new return types - `backend/copilot/config.py`: Increased long_running_operation_ttl to 3600s with clarified docstring - `backend/copilot/*_test.py`: Tests updated for new signatures --------- Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
…saves, frontend reconnection (Significant-Gravitas#12173) ## Summary Fixes multiple reliability issues in the copilot's Claude Agent SDK streaming pipeline — tool outputs getting stuck, parallel tool calls flushing prematurely, messages lost on page refresh, and SSE reconnection failures. ## Changes ### Backend: Streaming loop rewrite (`sdk/service.py`) - **Non-cancelling heartbeat pattern**: Replace `asyncio.timeout()` with `asyncio.wait()` for SDK message iteration. The old approach corrupted the SDK's internal anyio memory stream when timeouts fired mid-`__anext__()`, causing `StopAsyncIteration` on the next call and silently dropping all in-flight tool results. - **Hook synchronization**: Add `wait_for_stash()` before `convert_message()` — the SDK fires PostToolUse hooks via `start_soon()` (fire-and-forget), so the next message can arrive before the hook stashes its output. The new asyncio.Event-based mechanism bridges this gap without arbitrary sleeps. - **Error handling**: Add `asyncio.CancelledError` handling at both inner (streaming loop) and outer (session) levels, plus pending task cleanup in `finally` block to prevent leaked coroutines. Catch `Exception` from `done.pop().result()` for SDK error messages. - **Safety-net flush**: After streaming loop ends, flush any remaining unresolved tool calls so the frontend stops showing spinners even if the stream drops unexpectedly. - **StreamFinish fallback**: Emit `StreamFinishStep` + `StreamFinish` when stream ends without `ResultMessage` (StopAsyncIteration) so the frontend transitions to "ready" state. - **Incremental session saves**: Save session to PostgreSQL after each tool input/output event (not just at stream end), so page refresh and other devices see recent messages. - **Enhanced logging**: All log lines now include `session_id[:12]` prefix and tool call resolution state (unresolved/current/resolved counts). ### Backend: Response adapter (`sdk/response_adapter.py`) - **Parallel tool call support**: Skip `_flush_unresolved_tool_calls()` when an AssistantMessage contains only ToolUseBlocks (parallel continuation) — the prior tools are still executing concurrently and haven't finished yet. - **Duplicate output prevention**: Skip already-resolved tool results in both UserMessage (ToolResultBlock) and parent_tool_use_id handling to prevent duplicate `StreamToolOutputAvailable` events. - **`has_unresolved_tool_calls` property**: Used by the streaming loop to decide whether to wait for PostToolUse hooks. - **`session_id` parameter**: Passed through for structured logging. ### Backend: Hook synchronization (`sdk/tool_adapter.py`) - **`_stash_event` ContextVar**: asyncio.Event signaled by `stash_pending_tool_output()` whenever a PostToolUse hook stashes output. - **`wait_for_stash()`**: Awaits the event with configurable timeout — replaces the racy "hope the hook finished" approach. ### Backend: Security hooks (`sdk/security_hooks.py`) - Enhanced logging in `post_tool_use_hook` — log whether tool is built-in, preview of stashed output, warning when `tool_response` is None. ### Backend: Incremental save optimization (`model.py`) - **`existing_message_count` parameter** on `upsert_chat_session`: Skip the DB query to count existing messages when the caller already tracks this (streaming loop). - **`skip_existence_check` parameter** on `_save_session_to_db`: Skip the `get_chat_session` existence query when we know the session row already exists. Reduces from 4 DB round trips to 2 per incremental save. ### Backend: SDK version bump (`pyproject.toml`, `poetry.lock`) - Bump `claude-agent-sdk` from `^0.1.0` to `^0.1.39`. ### Backend: New tests - **`sdk_compat_test.py`** (new file): SDK compatibility tests — verify the installed SDK exposes every class, attribute, and method the copilot integration relies on. Catches SDK upgrade breakage immediately. - **`response_adapter_test.py`**: 9 new tests covering flush-at-ResultMessage, flush-at-next-AssistantMessage, stashed output flush, wait_for_stash signaling/timeout/fast-path, parallel tool call non-premature-flush, text-message flush of prior tools, and already-resolved tool skip in UserMessage. ### Frontend: Session hydration (`convertChatSessionToUiMessages.ts`) - **`isComplete` option**: When session has no active stream, mark dangling tool calls (no output in DB) as `output-available` with empty output — stops stale spinners after page refresh. ### Frontend: Chat session hook (`useChatSession.ts`) - Reorder `hasActiveStream` memo before `hydratedMessages` so `isComplete` flag is available. - Pass `{ isComplete: !hasActiveStream }` to `convertChatSessionMessagesToUiMessages`. ### Frontend: Copilot page hook (`useCopilotPage.ts`) - **Cache invalidation on stream end**: Invalidate React Query session cache when stream transitions active → idle, so next hydration fetches fresh messages from backend (staleTime: Infinity otherwise keeps stale data). - **Resume ref reset**: Reset `hasResumedRef` on stream end to allow re-resume if SSE drops but backend task is still running. - **Remove old `resolveInProgressTools` effect**: Replaced by backend-side safety-net flush + hydration-time `isComplete` marking. ## Test plan - [ ] Existing copilot tests pass (`pytest backend/copilot/ -x -q`) - [ ] SDK compat tests pass (`pytest backend/copilot/sdk/sdk_compat_test.py -v`) - [ ] Tool outputs (bash_exec, web_fetch, WebSearch) appear in the UI instead of getting stuck - [ ] Parallel tool calls (e.g. multiple WebSearch) complete and display results without premature flush - [ ] Page refresh during active stream reconnects and recovers messages - [ ] Opening session from another device shows recent tool results - [ ] SSE drop → automatic reconnection without losing messages - [ ] Long-running tools (create_agent) still delegate to background infrastructure
…urrent message saves (Significant-Gravitas#12177) Requested by @majdyz Concurrent writers (incremental streaming saves from PR Significant-Gravitas#12173 and long-running tool callbacks) can race to persist messages with the same `(sessionId, sequence)` pair, causing unique constraint violations on `ChatMessage`. **Root cause:** The streaming loop tracks `saved_msg_count` in-memory, but the long-running tool callback (`_build_long_running_callback`) also appends messages and calls `upsert_chat_session` independently — without coordinating sequence numbers. When the streaming loop does its next incremental save with the stale `saved_msg_count`, it tries to insert at a sequence that already exists. **Fix:** Multi-layered defense-in-depth approach: 1. **Collision detection with retry** (db.py): `add_chat_messages_batch` uses `create_many()` in a transaction. On `UniqueViolationError`, queries `MAX(sequence)+1` from DB and retries with the correct offset (max 5 attempts). 2. **Robust sequence tracking** (db.py): `get_next_sequence()` uses indexed `find_first` with `order={"sequence": "desc"}` for O(1) MAX lookup, immune to deleted messages. 3. **Session-based counter** (model.py): Added `saved_message_count` field to `ChatSession`. `upsert_chat_session` returns the session with updated count, eliminating tuple returns throughout the codebase. 4. **MessageCounter dataclass** (sdk/service.py): Replaced list[int] mutable reference pattern with a clean `MessageCounter` dataclass for shared state between streaming loop and long-running callbacks. 5. **Session locking** (sdk/service.py): Prevent concurrent streams on the same session using Redis `SET NX EX` distributed locks with TTL refresh on heartbeats (config.stream_ttl = 3600s). 6. **Atomic operations** (db.py): Single timestamp for all messages and session update in batch operations for consistency. Parallel queries with `asyncio.gather` for lower latency. 7. **Config-based TTL** (sdk/service.py, config.py): Consolidated all TTL constants to use `config.stream_ttl` (3600s) with lock refresh on heartbeats. ### Key implementation details - **create_many**: Uses `sessionId` directly (not nested `Session.connect`) as `create_many` doesn't support nested creates - **Type narrowing**: Added explicit `assert session is not None` statements for pyright type checking in async contexts - **Parallel operations**: Use `asyncio.gather` for independent DB operations (create_many + session update) - **Single timestamp**: All messages in a batch share the same `createdAt` timestamp for atomicity ### Changes - `backend/copilot/db.py`: Collision detection with `create_many` + retry, indexed sequence lookup, single timestamp, parallel queries - `backend/copilot/model.py`: Added `saved_message_count` field, simplified return types - `backend/copilot/sdk/service.py`: MessageCounter dataclass, session locking with refresh, config-based TTL, type narrowing - `backend/copilot/service.py`: Updated all callers to handle new return types - `backend/copilot/config.py`: Increased long_running_operation_ttl to 3600s with clarified docstring - `backend/copilot/*_test.py`: Tests updated for new signatures --------- Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Issues attributed to commits in this pull requestThis pull request was merged and Sentry observed the following issues:
|
Summary
Fixes multiple reliability issues in the copilot's Claude Agent SDK streaming pipeline — tool outputs getting stuck, parallel tool calls flushing prematurely, messages lost on page refresh, and SSE reconnection failures.
Changes
Backend: Streaming loop rewrite (
sdk/service.py)asyncio.timeout()withasyncio.wait()for SDK message iteration. The old approach corrupted the SDK's internal anyio memory stream when timeouts fired mid-__anext__(), causingStopAsyncIterationon the next call and silently dropping all in-flight tool results.wait_for_stash()beforeconvert_message()— the SDK fires PostToolUse hooks viastart_soon()(fire-and-forget), so the next message can arrive before the hook stashes its output. The new asyncio.Event-based mechanism bridges this gap without arbitrary sleeps.asyncio.CancelledErrorhandling at both inner (streaming loop) and outer (session) levels, plus pending task cleanup infinallyblock to prevent leaked coroutines. CatchExceptionfromdone.pop().result()for SDK error messages.StreamFinishStep+StreamFinishwhen stream ends withoutResultMessage(StopAsyncIteration) so the frontend transitions to "ready" state.session_id[:12]prefix and tool call resolution state (unresolved/current/resolved counts).Backend: Response adapter (
sdk/response_adapter.py)_flush_unresolved_tool_calls()when an AssistantMessage contains only ToolUseBlocks (parallel continuation) — the prior tools are still executing concurrently and haven't finished yet.StreamToolOutputAvailableevents.has_unresolved_tool_callsproperty: Used by the streaming loop to decide whether to wait for PostToolUse hooks.session_idparameter: Passed through for structured logging.Backend: Hook synchronization (
sdk/tool_adapter.py)_stash_eventContextVar: asyncio.Event signaled bystash_pending_tool_output()whenever a PostToolUse hook stashes output.wait_for_stash(): Awaits the event with configurable timeout — replaces the racy "hope the hook finished" approach.Backend: Security hooks (
sdk/security_hooks.py)post_tool_use_hook— log whether tool is built-in, preview of stashed output, warning whentool_responseis None.Backend: Incremental save optimization (
model.py)existing_message_countparameter onupsert_chat_session: Skip the DB query to count existing messages when the caller already tracks this (streaming loop).skip_existence_checkparameter on_save_session_to_db: Skip theget_chat_sessionexistence query when we know the session row already exists. Reduces from 4 DB round trips to 2 per incremental save.Backend: SDK version bump (
pyproject.toml,poetry.lock)claude-agent-sdkfrom^0.1.0to^0.1.39.Backend: New tests
sdk_compat_test.py(new file): SDK compatibility tests — verify the installed SDK exposes every class, attribute, and method the copilot integration relies on. Catches SDK upgrade breakage immediately.response_adapter_test.py: 9 new tests covering flush-at-ResultMessage, flush-at-next-AssistantMessage, stashed output flush, wait_for_stash signaling/timeout/fast-path, parallel tool call non-premature-flush, text-message flush of prior tools, and already-resolved tool skip in UserMessage.Frontend: Session hydration (
convertChatSessionToUiMessages.ts)isCompleteoption: When session has no active stream, mark dangling tool calls (no output in DB) asoutput-availablewith empty output — stops stale spinners after page refresh.Frontend: Chat session hook (
useChatSession.ts)hasActiveStreammemo beforehydratedMessagessoisCompleteflag is available.{ isComplete: !hasActiveStream }toconvertChatSessionMessagesToUiMessages.Frontend: Copilot page hook (
useCopilotPage.ts)hasResumedRefon stream end to allow re-resume if SSE drops but backend task is still running.resolveInProgressToolseffect: Replaced by backend-side safety-net flush + hydration-timeisCompletemarking.Test plan
pytest backend/copilot/ -x -q)pytest backend/copilot/sdk/sdk_compat_test.py -v)