fix(platform): try-compact-retry for prompt-too-long errors in CoPilot SDK#12413
Conversation
…reserve compaction The TranscriptBuilder accumulates all raw SDK stream messages including pre-compaction content. When the CLI compacts mid-stream, the uploaded transcript still contains the full uncompacted messages, causing "Prompt is too long" errors on the next --resume turn. Fix: - Read the CLI's own session file (~/.claude/projects/<cwd>/*.jsonl) which reflects mid-stream compaction, instead of TranscriptBuilder - Extract _cli_project_dir() helper, refactor cleanup_cli_project_dir - On "Prompt is too long" error with --resume, delete the oversized transcript so the next turn falls back to compression-based context
…sion Use skip_transcript_upload instead of reusing use_resume to gate transcript uploads. use_resume starts False and only becomes True after a successful download, so gating on it prevented first-turn transcripts from ever being uploaded (bootstrap paradox).
Move flag assignment before the await to prevent re-upload if cancellation interrupts delete_transcript.
…l protection - Replace `import glob` with `pathlib.Path.glob()` in `read_cli_session_file` - Add symlink path traversal validation on glob results using `is_relative_to` - Add unit tests for `read_cli_session_file` and `_cli_project_dir`
…n_file Wrap p.resolve() and stat() calls in try/except to prevent unhandled OSError/RuntimeError from propagating and silently dropping transcript uploads when the caller falls back to TranscriptBuilder.
- Move `import shutil` to module top-level (HIGH) - Extract `_safe_glob_jsonl` helper to shorten `read_cli_session_file` (MEDIUM) - Inline single-use `err_str` variable (LOW) - Use lazy `%s` formatting in all logger calls (LOW) - Fix implicit string concatenation in log format string (NIT) - Compute accurate entry_count when using CLI transcript path (WARNING)
TranscriptBuilder accumulated ALL messages (pre + post compaction), making it larger than the CLI's active context after mid-stream compaction. This caused "Prompt is too long" errors on resume. Now when the PreCompact hook fires, CompactionTracker captures the transcript_path. When compaction ends (next message arrives), we read the CLI session file, find the isCompactSummary entry, and replace TranscriptBuilder's entries with the compacted entries. This ensures TranscriptBuilder always mirrors the CLI's active context. Removes the racy read_cli_session_file fallback from the finally block - TranscriptBuilder is now the single source of truth for uploads.
…ath validation, tests - Make compaction_just_ended a one-shot flag that resets after being read, fixing multiple mid-stream compactions not triggering replace_entries - Add path validation to read_compacted_entries() to reject paths outside the CLI projects directory - Reset _transcript_path in reset_for_query() for consistency - Add tests for read_compacted_entries and TranscriptBuilder.replace_entries - Preserve isCompactSummary entries during STRIPPABLE_TYPES filtering
… last-summary, dedup helpers - Replace side-effect @Property with CompactionResult dataclass for TOCTOU safety - Use build-then-swap in replace_entries to prevent data loss on corrupt input - Fix read_compacted_entries to use LAST isCompactSummary (not first) - Guard isCompactSummary in strip_progress_entries and TranscriptBuilder - Extract _projects_base(), _build_path_from_parts(), _build_meta_storage_path() - Extract TranscriptBuilder._parse_entry() static method - Sanitize transcript_path in pre_compact_hook before logging - Update compaction_test.py and transcript_test.py for new behaviors
Add isCompactSummary field to TranscriptEntry model so compaction summaries survive the export→load_previous roundtrip. Without this, exported summaries with type="summary" were stripped on reload since "summary" is in STRIPPABLE_TYPES. Also add integration tests simulating the full compaction flow (load → append → compact → replace → export → reload).
…stream compaction When compaction ends, replace_entries loads the CLI session file which already contains the current message. Skip append_assistant and append_tool_result for that iteration to avoid duplicates that could cause 'prompt is too long' errors on subsequent turns.
Exercises the full service.py compaction flow end-to-end: TranscriptBuilder load → CompactionTracker state machine → read_compacted_entries → replace_entries → export → roundtrip.
When the transcript exceeds the context window, the SDK now retries with escalating strategies: 1. Use transcript as-is (normal path) 2. Compact transcript via LLM summarization, retry with --resume 3. Drop transcript entirely, fall back to DB-reconstructed context - Add _is_prompt_too_long detector with specific patterns - Add compact_transcript + helpers (_transcript_to_messages, _messages_to_transcript, _flatten_*) to transcript.py - Remove dead delete_transcript function and its tests - Guard finally-block transcript upload with transcript_caused_error to prevent re-uploading a problematic transcript - Add 90 tests covering detection, helpers, and compaction
|
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds prompt-too-long detection, a per-attempt retry flow that can compact transcripts (LLM-backed) and retry, multimodal attachment handling in the streaming path, transcript flattening/compaction utilities, and extensive tests covering detection, conversion, compaction, and retry scenarios. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Service as Service (stream_chat_completion_sdk)
participant Claude as Claude API
participant Transcript as Transcript Manager
participant Compressor as OpenAI Compressor
Client->>Service: stream_chat_completion_sdk(query, context)
activate Service
rect rgba(100,150,200,0.5)
Note over Service: Retry loop with resume/transcript handling
Service->>Claude: send_request(query, transcript?)
end
Claude-->>Service: ✗ prompt_too_long error
alt Prompt Too Long Detected
Service->>Transcript: _transcript_to_messages(transcript)
Transcript-->>Service: messages[]
Service->>Compressor: _run_compression(messages, model, cfg)
activate Compressor
Compressor->>Compressor: compress via OpenAI
Compressor-->>Service: CompressResult
deactivate Compressor
Service->>Transcript: _messages_to_transcript(compressed_messages)
Transcript-->>Service: new_transcript
Service->>Service: Rebuild query with compacted transcript
Service->>Claude: Retry request (next attempt)
Claude-->>Service: ✓ response | ✗ other error
else Success or Non-Prompt Error
Service->>Service: Continue with response
end
Service->>Transcript: Persist final transcript
Service-->>Client: stream response
deactivate Service
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)
672-691: Consider adding a docstring note about the fallback behavior.The function falls back to truncation-based compression when the LLM call fails, which is good resilience. However, the behavior difference between LLM compression and truncation might be worth documenting for maintainers.
📝 Suggested docstring enhancement
async def _run_compression( messages: list[dict], model: str, cfg: "ChatConfig", log_prefix: str, ) -> "CompressResult": - """Run LLM-based compression with truncation fallback.""" + """Run LLM-based compression with truncation fallback. + + If the LLM call fails (API error, timeout, network), falls back to + truncation-based compression which drops older messages without + summarization. + """ import openai🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/transcript.py` around lines 672 - 691, The _run_compression function's docstring should mention its fallback behavior: when openai.AsyncOpenAI calls in _run_compression raise openai.APIError, openai.APITimeoutError, or OSError the function logs a warning and calls compress_context with client=None to perform truncation-based compression; update the docstring above _run_compression to describe this distinction (LLM-based compression vs truncation fallback), list the exceptions that trigger the fallback, and note that the fallback will call compress_context(messages, model, client=None) so maintainers understand the behavior and implications.autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py (1)
341-462: Consider using a pytest fixture for the ChatConfig mock.The mock pattern for
ChatConfigis repeated across multiple tests. A fixture could reduce duplication.♻️ Suggested fixture extraction
`@pytest.fixture` def mock_chat_config(): """Mock ChatConfig for compact_transcript tests.""" with patch( "backend.copilot.config.ChatConfig", return_value=type( "Cfg", (), {"model": "m", "api_key": "k", "base_url": "u"} )(), ) as mock: yield mock class TestCompactTranscript: `@pytest.mark.asyncio` async def test_too_few_messages_returns_none(self, mock_chat_config): transcript = _build_transcript([("user", "Hello")]) result = await compact_transcript(transcript) assert result is None # ... other tests using mock_chat_config fixture🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py` around lines 341 - 462, The tests in TestCompactTranscript repeat the same ChatConfig patch; extract that into a pytest fixture (e.g., mock_chat_config) that patches backend.copilot.config.ChatConfig and yields the mock config object, then add the fixture as a parameter to each async test (test_too_few_messages_returns_none, test_returns_content_when_not_compacted, test_returns_compacted_transcript, test_returns_none_on_compression_failure) and remove the inline patch blocks so compact_transcript tests reuse the fixture.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 991-1002: The fallback branch currently clears the
TranscriptBuilder but still allows the finally block to upload a transcript with
message_count=len(session.messages), which misrepresents the stored transcript
as a full prefix; update the code in the branch that sets use_resume = False /
resume_file = None / transcript_builder = TranscriptBuilder() (and the analogous
block around lines 1574-1603) to either (A) mark a flag (e.g.,
skip_transcript_upload) so the finally upload is skipped when no-resume fallback
is used, or (B) regenerate a true transcript that covers the actual prefix
before letting the finally block write it (ensure message_count matches
transcript_builder.length), and ensure _build_query_message() will see the
correct resume state rather than a falsely complete transcript.
- Around line 944-949: The current empty-prompt check (using
current_message.strip()) runs after the code path that appends the incoming
message to the session and persists it in the finally block, so whitespace-only
messages are being stored; move the trim/empty validation so it executes before
any mutation or append to the session (i.e., validate message/current_message
and yield StreamError("Message cannot be empty.", code="empty_prompt") and
return) OR, if you prefer minimal change, ensure you remove the just-appended
message from the session (undo the append/persist) before returning; locate the
append/persist calls (e.g., session.add_message, session.messages.append, or any
save_session/save_conversation calls) in this handler and either validate
earlier or explicitly remove that entry before exiting.
- Around line 962-999: The retry logic can re-enter compaction on the final
attempt because it keys off transcript_content; update the loop so the last
attempt (_query_attempt == _MAX_QUERY_ATTEMPTS - 1) unconditionally disables
resume and falls back to DB context. Concretely, inside the for loop around
_query_attempt, when handling the "Prompt too long" branch, check if this is the
final retry and if so set use_resume = False, resume_file = None,
transcript_content = "" and reset transcript_builder (e.g., new
TranscriptBuilder()) regardless of transcript_content/compaction result;
otherwise keep the existing compact_transcript flow that may attempt compaction
and set use_resume based on its outcome. Ensure the check uses the existing
symbols _query_attempt, _MAX_QUERY_ATTEMPTS, use_resume, resume_file,
transcript_content, TranscriptBuilder and compact_transcript so the last attempt
always disables resume and triggers the DB fallback.
---
Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py`:
- Around line 341-462: The tests in TestCompactTranscript repeat the same
ChatConfig patch; extract that into a pytest fixture (e.g., mock_chat_config)
that patches backend.copilot.config.ChatConfig and yields the mock config
object, then add the fixture as a parameter to each async test
(test_too_few_messages_returns_none, test_returns_content_when_not_compacted,
test_returns_compacted_transcript, test_returns_none_on_compression_failure) and
remove the inline patch blocks so compact_transcript tests reuse the fixture.
In `@autogpt_platform/backend/backend/copilot/sdk/transcript.py`:
- Around line 672-691: The _run_compression function's docstring should mention
its fallback behavior: when openai.AsyncOpenAI calls in _run_compression raise
openai.APIError, openai.APITimeoutError, or OSError the function logs a warning
and calls compress_context with client=None to perform truncation-based
compression; update the docstring above _run_compression to describe this
distinction (LLM-based compression vs truncation fallback), list the exceptions
that trigger the fallback, and note that the fallback will call
compress_context(messages, model, client=None) so maintainers understand the
behavior and implications.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a785c0d0-ea4e-4238-9542-641e8256eabc
📒 Files selected for processing (9)
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Seer Code Review
🧰 Additional context used
📓 Path-based instructions (6)
autogpt_platform/backend/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development
Files:
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/**/*.{py,txt}
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use
poetry runprefix for all Python commands, including testing, linting, formatting, and migrations
Files:
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/backend/**/*.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings
Files:
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Format Python code with
poetry run format
Files:
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/**/*_test.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
autogpt_platform/backend/**/*_test.py: Always review snapshot changes withgit diffbefore committing when updating snapshots withpoetry run pytest --snapshot-update
Use pytest with snapshot testing for API responses in test files
Colocate test files with source files using the*_test.pynaming convention
Files:
autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/**/*test*.py
📄 CodeRabbit inference engine (AGENTS.md)
Run
poetry run testfor backend testing (runs pytest with docker based postgres + prisma)
Files:
autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
🧠 Learnings (4)
📚 Learning: 2026-02-26T17:02:22.448Z
Learnt from: Pwuts
Repo: Significant-Gravitas/AutoGPT PR: 12211
File: .pre-commit-config.yaml:160-179
Timestamp: 2026-02-26T17:02:22.448Z
Learning: Keep the pre-commit hook pattern broad for autogpt_platform/backend to ensure OpenAPI schema changes are captured. Do not narrow to backend/api/ alone, since the generated schema depends on Pydantic models across multiple directories (backend/data/, backend/blocks/, backend/copilot/, backend/integrations/, backend/util/). Narrowing could miss schema changes and cause frontend type desynchronization.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
📚 Learning: 2026-03-04T08:04:35.881Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12273
File: autogpt_platform/backend/backend/copilot/tools/workspace_files.py:216-220
Timestamp: 2026-03-04T08:04:35.881Z
Learning: In the AutoGPT Copilot backend, ensure that SVG images are not treated as vision image types by excluding 'image/svg+xml' from INLINEABLE_MIME_TYPES and MULTIMODAL_TYPES in tool_adapter.py; the Claude API supports PNG, JPEG, GIF, and WebP for vision. SVGs (XML text) should be handled via the text path instead, not the vision path.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
📚 Learning: 2026-03-05T15:42:08.207Z
Learnt from: ntindle
Repo: Significant-Gravitas/AutoGPT PR: 12297
File: .claude/skills/backend-check/SKILL.md:14-16
Timestamp: 2026-03-05T15:42:08.207Z
Learning: In Python files under autogpt_platform/backend (recursively), rely on poetry run format to perform formatting (Black + isort) and linting (ruff). Do not run poetry run lint as a separate step after poetry run format, since format already includes linting checks.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/compaction.pyautogpt_platform/backend/backend/copilot/sdk/service.pyautogpt_platform/backend/backend/copilot/sdk/transcript_builder.pyautogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.pyautogpt_platform/backend/backend/copilot/sdk/security_hooks.pyautogpt_platform/backend/backend/copilot/sdk/transcript.pyautogpt_platform/backend/backend/copilot/sdk/transcript_test.pyautogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.pyautogpt_platform/backend/backend/copilot/sdk/compaction_test.py
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/test/**/*.py : Use snapshot testing with '--snapshot-update' flag in backend tests when output changes; always review with 'git diff'
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
🔇 Additional comments (17)
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (1)
305-319: LGTM! Good security practice with log sanitization.The newline/carriage-return stripping prevents log injection attacks from untrusted
transcript_pathinput. The updated callback signature properly propagates the transcript path through the compaction flow.autogpt_platform/backend/backend/copilot/sdk/compaction.py (2)
30-41: LGTM! Well-designed dataclass to prevent TOCTOU races.The
CompactionResultproperly bundlesjust_ended,events, andtranscript_pathtogether, eliminating the need for separate flag checks that could introduce race conditions between the emit call and state inspection.
195-208: LGTM! Good defensive logging on transcript_path overwrite.The warning when
_transcript_pathis overwritten helps debug unexpected multiple compaction events within a single cycle.autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py (2)
57-73: LGTM! Good extraction of entry parsing logic.The
_parse_entrymethod properly filters strippable types while preserving entries withisCompactSummary=True, enabling compact summaries to survive load/replace operations. Reusing this method in bothload_previousandreplace_entriesfollows DRY principles.
175-210: LGTM! Defensive validation prevents data loss.The non-empty check before swapping entries is a good safeguard against corrupt input wiping conversation history. The logging provides visibility into compaction outcomes.
autogpt_platform/backend/backend/copilot/sdk/compaction_test.py (2)
307-378: LGTM! Thorough testing of multi-compaction behavior.The tests clearly document the expected behavior: within a single query,
_done=Trueblocks subsequent compactions untilreset_for_queryis called. The inline comments explaining the state machine behavior are helpful for maintainability.
379-407: LGTM! Good coverage of transcript_path lifecycle.Tests verify the complete lifecycle: storage via
on_compact, return viaemit_end_if_ready, and clearing after emission. This ensures the path doesn't leak into subsequent non-compaction calls.autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py (1)
201-369: LGTM! Excellent end-to-end test coverage.This test comprehensively simulates the service.py compaction flow across 13 steps, including transcript loading, message appending, PreCompact hook handling, CLI session reading, entry replacement, and roundtrip verification. The detailed assertions on parent UUID chains and isCompactSummary preservation provide confidence in the integration.
autogpt_platform/backend/backend/copilot/sdk/transcript.py (4)
156-176: LGTM! Proper path traversal protection.The
_cli_project_dirfunction validates that the resolved path stays within the projects base directory, preventing symlink-based path traversal attacks. ReturningNonefor escaped paths is the right defensive approach.
179-199: LGTM! Safe glob implementation filtering symlink escapes.The
_safe_glob_jsonlfunction properly validates each candidate file's resolved path stays within the project directory, catching both symlink escapes and filesystem errors gracefully.
202-263: LGTM! Correctly finds the last compaction summary.The implementation properly scans for the LAST
isCompactSummaryentry (not breaking on first match), which handles multiple compactions correctly. The path boundary check prevents reading files outside the projects directory.
95-99: LGTM! Proper preservation of compact summaries.The
isCompactSummarycheck ensures that summary entries marked as compaction results are preserved even thoughsummaryis inSTRIPPABLE_TYPES. This is essential for maintaining compacted conversation state across turns.Also applies to: 123-125
autogpt_platform/backend/backend/copilot/sdk/transcript_test.py (3)
319-340: LGTM! Good security test for symlink escape prevention.This test verifies that symlinks pointing outside the project directory are correctly filtered out, preventing path traversal attacks via symbolic links.
604-691: LGTM! Comprehensive integration test simulating real service flow.The test accurately simulates the service.py compaction flow: loading previous transcript, appending messages, reading compacted entries from CLI session, replacing builder state, and verifying the parent chain integrity. This provides confidence in the end-to-end integration.
753-781: LGTM! Excellent test for compact summary preservation during stripping.This test verifies that
strip_progress_entriescorrectly preserves entries withisCompactSummary=Truewhile still stripping regular summary entries. This is a critical behavior for maintaining compacted state.autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py (2)
27-76: LGTM! Thorough error detection testing.The parametrized tests cover a good range of prompt-too-long error patterns including case variations, while explicitly testing that similar but non-matching patterns (like partial matches) don't trigger false positives.
291-303: LGTM! Essential roundtrip validation.The roundtrip test verifies that
messages → transcript → messagespreserves both role and content, which is critical for the compaction flow where transcripts are converted back and forth.
Resolve merge conflicts in service.py, transcript.py, and transcript_test.py. Incorporates dev's compaction tracking (emit_end_if_ready, entries_replaced, read_compacted_entries) into the retry loop structure.
🔍 PR Overlap DetectionThis check compares your PR against all other open PRs targeting the same branch to detect potential merge conflicts early. 🔴 Merge Conflicts DetectedThe following PRs have been tested and will have merge conflicts if merged after this PR. Consider coordinating with the authors.
🟡 Medium Risk — Some Line OverlapThese PRs have some overlapping changes:
🟢 Low Risk — File Overlap OnlyThese PRs touch the same files but different sections (click to expand)
Summary: 4 conflict(s), 2 medium risk, 11 low risk (out of 17 PRs with file overlap) Auto-generated on push. Ignores: |
…backticks Revert f-string to %-format logging conversions in security_hooks.py, tool_adapter.py, and response_adapter.py — purely cosmetic and outside this PR's scope. Replace rST double backticks with single backticks in service.py docstrings for codebase consistency.
Issues attributed to commits in this pull requestThis pull request was merged and Sentry observed the following issues:
|
…aware compression, and truncated tool call recovery (#12625) ## Why CoPilot has several context management issues that degrade long sessions: 1. "Prompt is too long" errors crash the session instead of triggering retry/compaction 2. Stale thinking blocks bloat transcripts, causing unnecessary compaction every turn 3. Compression target is hardcoded regardless of model context window size 4. Truncated tool calls (empty `{}` args from max_tokens) kill the session instead of guiding the model to self-correct ## What **Fix 1: Prompt-too-long retry bypass (SENTRY-1207)** The SDK surfaces "prompt too long" via `AssistantMessage.error` and `ResultMessage.result` — neither triggered the retry/compaction loop (only Python exceptions did). Now both paths are intercepted and re-raised. **Fix 2: Strip stale thinking blocks before upload** Thinking/redacted_thinking blocks in non-last assistant entries are 10-50K tokens each but only needed for API signature verification in the *last* message. Stripping before upload reduces transcript size and prevents per-turn compaction. **Fix 3: Model-aware compression target** `compress_context()` now computes `target_tokens` from the model's context window (e.g. 140K for Opus 200K) instead of a hardcoded 120K default. Larger models retain more history; smaller models compress more aggressively. **Fix 4: Self-correcting truncated tool calls** When the model's response exceeds max_tokens, tool call inputs get silently truncated to `{}`. Previously this tripped a circuit breaker after 3 attempts. Now the MCP wrapper detects empty args and returns guidance: "write in chunks with `cat >>`, pass via `@@agptfile:filename`". The model can self-correct instead of the session dying. ## How - **service.py**: `_is_prompt_too_long` checks in both `AssistantMessage.error` and `ResultMessage` error handlers. Circuit breaker limit raised from 3→5. - **transcript.py**: `strip_stale_thinking_blocks()` reverse-scans for last assistant `message.id`, strips thinking blocks from all others. Called in `upload_transcript()`. - **prompt.py**: `get_compression_target(model)` computes `context_window - 60K overhead`. `compress_context()` uses it when `target_tokens` is None. - **tool_adapter.py**: `_truncating` wrapper intercepts empty args on tools with required params, returns actionable guidance instead of failing. ## Related - Fixes SENTRY-1207 - Sessions: `d2f7cba3` (repeated compaction), `08b807d4` (prompt too long), `130d527c` (truncated tool calls) - Extends #12413, consolidates #12626 ## Test plan - [x] 6 unit tests for `strip_stale_thinking_blocks` - [x] 1 integration test for ResultMessage prompt-too-long → compaction retry - [x] Pyright clean (0 errors), all pre-commit hooks pass - [ ] E2E: Load transcripts from affected sessions and verify behavior
…t SDK (#12413) ## Summary When the Claude SDK returns a prompt-too-long error (e.g. transcript + query exceeds the model's context window), the streaming loop now retries with escalating fallbacks instead of failing immediately: 1. **Attempt 1**: Use the transcript as-is (normal path) 2. **Attempt 2**: Compact the transcript via LLM summarization (`compact_transcript`) and retry 3. **Attempt 3**: Drop the transcript entirely and fall back to DB-reconstructed context (`_build_query_message`) If all 3 attempts fail, a `StreamError(code="prompt_too_long")` is yielded to the frontend. ### Key changes **`service.py`** - Add `_is_prompt_too_long(err)` — pattern-matches SDK exceptions for prompt-length errors (`prompt is too long`, `prompt_too_long`, `context_length_exceeded`, `request too large`) - Wrap `async with ClaudeSDKClient` in a 3-attempt retry `for` loop with compaction/fallback logic - Move `current_message`, `_build_query_message`, and `_prepare_file_attachments` before the retry loop (computed once, reused) - Skip transcript upload in `finally` when `transcript_caused_error` (avoids persisting a broken/empty transcript) - Reset `stream_completed` between retry iterations - Document outer-scope variable contract in `_run_stream_attempt` closure (which variables are reassigned between retries vs read-only) **`transcript.py`** - Add `compact_transcript(content, log_prefix, model)` — converts JSONL → messages → `compress_context` (LLM summarization with truncation fallback) → JSONL - Add helpers: `_flatten_assistant_content`, `_flatten_tool_result_content`, `_transcript_to_messages`, `_messages_to_transcript`, `_run_compression` - Returns `None` when compaction fails or transcript is already within budget (signals caller to fall through to DB fallback) - Truncation fallback wrapped in 30s timeout to prevent unbounded CPU time on large transcripts - Accepts `model` parameter to avoid creating a new `ChatConfig()` on every call **`util/prompt.py`** - Fix `_truncate_middle_tokens` edge case: returns empty string when `max_tok < 1`, properly handles `max_tok < 3` **`config.py`** - E2B sandbox timeout raised from 5 min to 15 min to accommodate compaction retries **`prompt_too_long_test.py`** (new, 45 tests) - `_is_prompt_too_long` positive/negative patterns, case sensitivity, BaseException handling - Flatten helpers for assistant/tool_result content blocks - `_transcript_to_messages` / `_messages_to_transcript` roundtrip, strippable types, empty content - `compact_transcript` async tests: too few messages, not compacted, successful compaction, compression failure **`retry_scenarios_test.py`** (new, 27 tests) - Full retry state machine simulation covering all 8 scenarios: 1. Normal flow (no retry) 2. Compact succeeds → retry succeeds 3. Compact fails → DB fallback succeeds 4. No transcript → DB fallback succeeds 5. Double fail → DB fallback on attempt 3 6. All 3 attempts exhausted 7. Non-prompt-too-long error (no retry) 8. Compaction returns identical content → DB fallback - Edge cases: nested exceptions, case insensitivity, unicode content, large transcripts, resume-after-compaction flow **Shared test fixtures** (`conftest.py`) - Extracted `build_test_transcript` helper used across 3 test files to eliminate duplication ## Test plan - [x] `_is_prompt_too_long` correctly identifies prompt-length errors (8 positive, 5 negative patterns) - [x] `compact_transcript` compacts oversized transcripts via LLM summarization - [x] `compact_transcript` returns `None` on failure or when already within budget - [x] Retry loop state machine: all 8 scenarios verified with state assertions - [x] `TranscriptBuilder` works correctly after loading compacted transcripts - [x] `_messages_to_transcript` roundtrip preserves content including unicode - [x] `transcript_caused_error` prevents stale transcript upload - [x] Truncation timeout prevents unbounded CPU time - [x] All 139 unit tests pass locally - [x] CI green (tests 3.11/3.12/3.13, types, CodeQL, linting)
…t SDK (Significant-Gravitas#12413) ## Summary When the Claude SDK returns a prompt-too-long error (e.g. transcript + query exceeds the model's context window), the streaming loop now retries with escalating fallbacks instead of failing immediately: 1. **Attempt 1**: Use the transcript as-is (normal path) 2. **Attempt 2**: Compact the transcript via LLM summarization (`compact_transcript`) and retry 3. **Attempt 3**: Drop the transcript entirely and fall back to DB-reconstructed context (`_build_query_message`) If all 3 attempts fail, a `StreamError(code="prompt_too_long")` is yielded to the frontend. ### Key changes **`service.py`** - Add `_is_prompt_too_long(err)` — pattern-matches SDK exceptions for prompt-length errors (`prompt is too long`, `prompt_too_long`, `context_length_exceeded`, `request too large`) - Wrap `async with ClaudeSDKClient` in a 3-attempt retry `for` loop with compaction/fallback logic - Move `current_message`, `_build_query_message`, and `_prepare_file_attachments` before the retry loop (computed once, reused) - Skip transcript upload in `finally` when `transcript_caused_error` (avoids persisting a broken/empty transcript) - Reset `stream_completed` between retry iterations - Document outer-scope variable contract in `_run_stream_attempt` closure (which variables are reassigned between retries vs read-only) **`transcript.py`** - Add `compact_transcript(content, log_prefix, model)` — converts JSONL → messages → `compress_context` (LLM summarization with truncation fallback) → JSONL - Add helpers: `_flatten_assistant_content`, `_flatten_tool_result_content`, `_transcript_to_messages`, `_messages_to_transcript`, `_run_compression` - Returns `None` when compaction fails or transcript is already within budget (signals caller to fall through to DB fallback) - Truncation fallback wrapped in 30s timeout to prevent unbounded CPU time on large transcripts - Accepts `model` parameter to avoid creating a new `ChatConfig()` on every call **`util/prompt.py`** - Fix `_truncate_middle_tokens` edge case: returns empty string when `max_tok < 1`, properly handles `max_tok < 3` **`config.py`** - E2B sandbox timeout raised from 5 min to 15 min to accommodate compaction retries **`prompt_too_long_test.py`** (new, 45 tests) - `_is_prompt_too_long` positive/negative patterns, case sensitivity, BaseException handling - Flatten helpers for assistant/tool_result content blocks - `_transcript_to_messages` / `_messages_to_transcript` roundtrip, strippable types, empty content - `compact_transcript` async tests: too few messages, not compacted, successful compaction, compression failure **`retry_scenarios_test.py`** (new, 27 tests) - Full retry state machine simulation covering all 8 scenarios: 1. Normal flow (no retry) 2. Compact succeeds → retry succeeds 3. Compact fails → DB fallback succeeds 4. No transcript → DB fallback succeeds 5. Double fail → DB fallback on attempt 3 6. All 3 attempts exhausted 7. Non-prompt-too-long error (no retry) 8. Compaction returns identical content → DB fallback - Edge cases: nested exceptions, case insensitivity, unicode content, large transcripts, resume-after-compaction flow **Shared test fixtures** (`conftest.py`) - Extracted `build_test_transcript` helper used across 3 test files to eliminate duplication ## Test plan - [x] `_is_prompt_too_long` correctly identifies prompt-length errors (8 positive, 5 negative patterns) - [x] `compact_transcript` compacts oversized transcripts via LLM summarization - [x] `compact_transcript` returns `None` on failure or when already within budget - [x] Retry loop state machine: all 8 scenarios verified with state assertions - [x] `TranscriptBuilder` works correctly after loading compacted transcripts - [x] `_messages_to_transcript` roundtrip preserves content including unicode - [x] `transcript_caused_error` prevents stale transcript upload - [x] Truncation timeout prevents unbounded CPU time - [x] All 139 unit tests pass locally - [x] CI green (tests 3.11/3.12/3.13, types, CodeQL, linting)
Summary
When the Claude SDK returns a prompt-too-long error (e.g. transcript + query exceeds the model's context window), the streaming loop now retries with escalating fallbacks instead of failing immediately:
compact_transcript) and retry_build_query_message)If all 3 attempts fail, a
StreamError(code="prompt_too_long")is yielded to the frontend.Key changes
service.py_is_prompt_too_long(err)— pattern-matches SDK exceptions for prompt-length errors (prompt is too long,prompt_too_long,context_length_exceeded,request too large)async with ClaudeSDKClientin a 3-attempt retryforloop with compaction/fallback logiccurrent_message,_build_query_message, and_prepare_file_attachmentsbefore the retry loop (computed once, reused)finallywhentranscript_caused_error(avoids persisting a broken/empty transcript)stream_completedbetween retry iterations_run_stream_attemptclosure (which variables are reassigned between retries vs read-only)transcript.pycompact_transcript(content, log_prefix, model)— converts JSONL → messages →compress_context(LLM summarization with truncation fallback) → JSONL_flatten_assistant_content,_flatten_tool_result_content,_transcript_to_messages,_messages_to_transcript,_run_compressionNonewhen compaction fails or transcript is already within budget (signals caller to fall through to DB fallback)modelparameter to avoid creating a newChatConfig()on every callutil/prompt.py_truncate_middle_tokensedge case: returns empty string whenmax_tok < 1, properly handlesmax_tok < 3config.pyprompt_too_long_test.py(new, 45 tests)_is_prompt_too_longpositive/negative patterns, case sensitivity, BaseException handling_transcript_to_messages/_messages_to_transcriptroundtrip, strippable types, empty contentcompact_transcriptasync tests: too few messages, not compacted, successful compaction, compression failureretry_scenarios_test.py(new, 27 tests)Shared test fixtures (
conftest.py)build_test_transcripthelper used across 3 test files to eliminate duplicationTest plan
_is_prompt_too_longcorrectly identifies prompt-length errors (8 positive, 5 negative patterns)compact_transcriptcompacts oversized transcripts via LLM summarizationcompact_transcriptreturnsNoneon failure or when already within budgetTranscriptBuilderworks correctly after loading compacted transcripts_messages_to_transcriptroundtrip preserves content including unicodetranscript_caused_errorprevents stale transcript upload