Skip to content

fix(platform): try-compact-retry for prompt-too-long errors in CoPilot SDK#12413

Merged
majdyz merged 80 commits into
devfrom
fix/copilot-transcript-compaction
Mar 18, 2026
Merged

fix(platform): try-compact-retry for prompt-too-long errors in CoPilot SDK#12413
majdyz merged 80 commits into
devfrom
fix/copilot-transcript-compaction

Conversation

@majdyz
Copy link
Copy Markdown
Contributor

@majdyz majdyz commented Mar 13, 2026

Summary

When the Claude SDK returns a prompt-too-long error (e.g. transcript + query exceeds the model's context window), the streaming loop now retries with escalating fallbacks instead of failing immediately:

  1. Attempt 1: Use the transcript as-is (normal path)
  2. Attempt 2: Compact the transcript via LLM summarization (compact_transcript) and retry
  3. Attempt 3: Drop the transcript entirely and fall back to DB-reconstructed context (_build_query_message)

If all 3 attempts fail, a StreamError(code="prompt_too_long") is yielded to the frontend.

Key changes

service.py

  • Add _is_prompt_too_long(err) — pattern-matches SDK exceptions for prompt-length errors (prompt is too long, prompt_too_long, context_length_exceeded, request too large)
  • Wrap async with ClaudeSDKClient in a 3-attempt retry for loop with compaction/fallback logic
  • Move current_message, _build_query_message, and _prepare_file_attachments before the retry loop (computed once, reused)
  • Skip transcript upload in finally when transcript_caused_error (avoids persisting a broken/empty transcript)
  • Reset stream_completed between retry iterations
  • Document outer-scope variable contract in _run_stream_attempt closure (which variables are reassigned between retries vs read-only)

transcript.py

  • Add compact_transcript(content, log_prefix, model) — converts JSONL → messages → compress_context (LLM summarization with truncation fallback) → JSONL
  • Add helpers: _flatten_assistant_content, _flatten_tool_result_content, _transcript_to_messages, _messages_to_transcript, _run_compression
  • Returns None when compaction fails or transcript is already within budget (signals caller to fall through to DB fallback)
  • Truncation fallback wrapped in 30s timeout to prevent unbounded CPU time on large transcripts
  • Accepts model parameter to avoid creating a new ChatConfig() on every call

util/prompt.py

  • Fix _truncate_middle_tokens edge case: returns empty string when max_tok < 1, properly handles max_tok < 3

config.py

  • E2B sandbox timeout raised from 5 min to 15 min to accommodate compaction retries

prompt_too_long_test.py (new, 45 tests)

  • _is_prompt_too_long positive/negative patterns, case sensitivity, BaseException handling
  • Flatten helpers for assistant/tool_result content blocks
  • _transcript_to_messages / _messages_to_transcript roundtrip, strippable types, empty content
  • compact_transcript async tests: too few messages, not compacted, successful compaction, compression failure

retry_scenarios_test.py (new, 27 tests)

  • Full retry state machine simulation covering all 8 scenarios:
    1. Normal flow (no retry)
    2. Compact succeeds → retry succeeds
    3. Compact fails → DB fallback succeeds
    4. No transcript → DB fallback succeeds
    5. Double fail → DB fallback on attempt 3
    6. All 3 attempts exhausted
    7. Non-prompt-too-long error (no retry)
    8. Compaction returns identical content → DB fallback
  • Edge cases: nested exceptions, case insensitivity, unicode content, large transcripts, resume-after-compaction flow

Shared test fixtures (conftest.py)

  • Extracted build_test_transcript helper used across 3 test files to eliminate duplication

Test plan

  • _is_prompt_too_long correctly identifies prompt-length errors (8 positive, 5 negative patterns)
  • compact_transcript compacts oversized transcripts via LLM summarization
  • compact_transcript returns None on failure or when already within budget
  • Retry loop state machine: all 8 scenarios verified with state assertions
  • TranscriptBuilder works correctly after loading compacted transcripts
  • _messages_to_transcript roundtrip preserves content including unicode
  • transcript_caused_error prevents stale transcript upload
  • Truncation timeout prevents unbounded CPU time
  • All 139 unit tests pass locally
  • CI green (tests 3.11/3.12/3.13, types, CodeQL, linting)

majdyz added 15 commits March 13, 2026 16:52
…reserve compaction

The TranscriptBuilder accumulates all raw SDK stream messages including
pre-compaction content. When the CLI compacts mid-stream, the uploaded
transcript still contains the full uncompacted messages, causing
"Prompt is too long" errors on the next --resume turn.

Fix:
- Read the CLI's own session file (~/.claude/projects/<cwd>/*.jsonl)
  which reflects mid-stream compaction, instead of TranscriptBuilder
- Extract _cli_project_dir() helper, refactor cleanup_cli_project_dir
- On "Prompt is too long" error with --resume, delete the oversized
  transcript so the next turn falls back to compression-based context
…sion

Use skip_transcript_upload instead of reusing use_resume to gate
transcript uploads. use_resume starts False and only becomes True
after a successful download, so gating on it prevented first-turn
transcripts from ever being uploaded (bootstrap paradox).
Move flag assignment before the await to prevent re-upload if
cancellation interrupts delete_transcript.
…l protection

- Replace `import glob` with `pathlib.Path.glob()` in `read_cli_session_file`
- Add symlink path traversal validation on glob results using `is_relative_to`
- Add unit tests for `read_cli_session_file` and `_cli_project_dir`
…n_file

Wrap p.resolve() and stat() calls in try/except to prevent unhandled
OSError/RuntimeError from propagating and silently dropping transcript
uploads when the caller falls back to TranscriptBuilder.
- Move `import shutil` to module top-level (HIGH)
- Extract `_safe_glob_jsonl` helper to shorten `read_cli_session_file` (MEDIUM)
- Inline single-use `err_str` variable (LOW)
- Use lazy `%s` formatting in all logger calls (LOW)
- Fix implicit string concatenation in log format string (NIT)
- Compute accurate entry_count when using CLI transcript path (WARNING)
TranscriptBuilder accumulated ALL messages (pre + post compaction),
making it larger than the CLI's active context after mid-stream
compaction. This caused "Prompt is too long" errors on resume.

Now when the PreCompact hook fires, CompactionTracker captures the
transcript_path. When compaction ends (next message arrives), we read
the CLI session file, find the isCompactSummary entry, and replace
TranscriptBuilder's entries with the compacted entries. This ensures
TranscriptBuilder always mirrors the CLI's active context.

Removes the racy read_cli_session_file fallback from the finally block
- TranscriptBuilder is now the single source of truth for uploads.
…ath validation, tests

- Make compaction_just_ended a one-shot flag that resets after being read,
  fixing multiple mid-stream compactions not triggering replace_entries
- Add path validation to read_compacted_entries() to reject paths outside
  the CLI projects directory
- Reset _transcript_path in reset_for_query() for consistency
- Add tests for read_compacted_entries and TranscriptBuilder.replace_entries
- Preserve isCompactSummary entries during STRIPPABLE_TYPES filtering
… last-summary, dedup helpers

- Replace side-effect @Property with CompactionResult dataclass for TOCTOU safety
- Use build-then-swap in replace_entries to prevent data loss on corrupt input
- Fix read_compacted_entries to use LAST isCompactSummary (not first)
- Guard isCompactSummary in strip_progress_entries and TranscriptBuilder
- Extract _projects_base(), _build_path_from_parts(), _build_meta_storage_path()
- Extract TranscriptBuilder._parse_entry() static method
- Sanitize transcript_path in pre_compact_hook before logging
- Update compaction_test.py and transcript_test.py for new behaviors
Add isCompactSummary field to TranscriptEntry model so compaction
summaries survive the export→load_previous roundtrip. Without this,
exported summaries with type="summary" were stripped on reload since
"summary" is in STRIPPABLE_TYPES.

Also add integration tests simulating the full compaction flow
(load → append → compact → replace → export → reload).
…stream compaction

When compaction ends, replace_entries loads the CLI session file which
already contains the current message. Skip append_assistant and
append_tool_result for that iteration to avoid duplicates that could
cause 'prompt is too long' errors on subsequent turns.
Exercises the full service.py compaction flow end-to-end:
TranscriptBuilder load → CompactionTracker state machine →
read_compacted_entries → replace_entries → export → roundtrip.
When the transcript exceeds the context window, the SDK now retries
with escalating strategies:
1. Use transcript as-is (normal path)
2. Compact transcript via LLM summarization, retry with --resume
3. Drop transcript entirely, fall back to DB-reconstructed context

- Add _is_prompt_too_long detector with specific patterns
- Add compact_transcript + helpers (_transcript_to_messages,
  _messages_to_transcript, _flatten_*) to transcript.py
- Remove dead delete_transcript function and its tests
- Guard finally-block transcript upload with transcript_caused_error
  to prevent re-uploading a problematic transcript
- Add 90 tests covering detection, helpers, and compaction
@majdyz majdyz requested a review from a team as a code owner March 13, 2026 22:56
@majdyz majdyz requested review from 0ubbe and Swiftyos and removed request for a team March 13, 2026 22:56
@github-project-automation github-project-automation Bot moved this to 🆕 Needs initial review in AutoGPT development kanban Mar 13, 2026
@github-actions github-actions Bot added platform/backend AutoGPT Platform - Back end conflicts Automatically applied to PRs with merge conflicts labels Mar 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds prompt-too-long detection, a per-attempt retry flow that can compact transcripts (LLM-backed) and retry, multimodal attachment handling in the streaming path, transcript flattening/compaction utilities, and extensive tests covering detection, conversion, compaction, and retry scenarios.

Changes

Cohort / File(s) Summary
Tests
autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py, autogpt_platform/backend/backend/copilot/sdk/retry_scenarios_test.py
New comprehensive tests for prompt-too-long detection, flattening helpers, transcript ↔ messages round-trips, compact_transcript behavior, and end-to-end retry/compaction scenarios with heavy parametrization and mocking.
Transcript utilities
autogpt_platform/backend/backend/copilot/sdk/transcript.py
Adds compaction subsystem: constants (STOP_REASON_END_TURN, COMPACT_MSG_ID_PREFIX, ENTRY_TYPE_MESSAGE), flattening helpers (_flatten_assistant_content, _flatten_tool_result_content), conversion utilities (_transcript_to_messages, _messages_to_transcript), _run_compression, and compact_transcript with validation, logging, and OpenAI-based compression pathway.
Service integration
autogpt_platform/backend/backend/copilot/sdk/service.py
Introduces _PROMPT_TOO_LONG_PATTERNS and _is_prompt_too_long; implements per-attempt retry loop invoking compaction, rebuilds query/message state on retry, handles multimodal attachments, aggregates tool results into transcripts, enhances streaming/error lifecycle handling, and adds transcript upload safeguards.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Service as Service (stream_chat_completion_sdk)
    participant Claude as Claude API
    participant Transcript as Transcript Manager
    participant Compressor as OpenAI Compressor

    Client->>Service: stream_chat_completion_sdk(query, context)
    activate Service

    rect rgba(100,150,200,0.5)
    Note over Service: Retry loop with resume/transcript handling
    Service->>Claude: send_request(query, transcript?)
    end

    Claude-->>Service: ✗ prompt_too_long error

    alt Prompt Too Long Detected
        Service->>Transcript: _transcript_to_messages(transcript)
        Transcript-->>Service: messages[]
        Service->>Compressor: _run_compression(messages, model, cfg)
        activate Compressor
        Compressor->>Compressor: compress via OpenAI
        Compressor-->>Service: CompressResult
        deactivate Compressor
        Service->>Transcript: _messages_to_transcript(compressed_messages)
        Transcript-->>Service: new_transcript
        Service->>Service: Rebuild query with compacted transcript
        Service->>Claude: Retry request (next attempt)
        Claude-->>Service: ✓ response | ✗ other error
    else Success or Non-Prompt Error
        Service->>Service: Continue with response
    end

    Service->>Transcript: Persist final transcript
    Service-->>Client: stream response
    deactivate Service
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Review effort 4/5

Suggested reviewers

  • Swiftyos
  • 0ubbe
  • Pwuts

Poem

🐇 I nudge the transcript, nibble tokens small,
Flatten the chatter, tuck tools in a sprawl.
Compress and retry, then onward we hop —
Messages stitched, no token will stop.
Hooray for compacted lanes on top!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and specifically summarizes the main change: implementing retry-with-compaction logic for prompt-too-long errors in the CoPilot SDK.
Description check ✅ Passed The PR description clearly outlines the prompt-too-long retry mechanism with three escalating fallback attempts, detailing key changes across multiple files and new test coverage.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/copilot-transcript-compaction
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
autogpt_platform/backend/backend/copilot/sdk/transcript.py (1)

672-691: Consider adding a docstring note about the fallback behavior.

The function falls back to truncation-based compression when the LLM call fails, which is good resilience. However, the behavior difference between LLM compression and truncation might be worth documenting for maintainers.

📝 Suggested docstring enhancement
 async def _run_compression(
     messages: list[dict],
     model: str,
     cfg: "ChatConfig",
     log_prefix: str,
 ) -> "CompressResult":
-    """Run LLM-based compression with truncation fallback."""
+    """Run LLM-based compression with truncation fallback.
+
+    If the LLM call fails (API error, timeout, network), falls back to
+    truncation-based compression which drops older messages without
+    summarization.
+    """
     import openai
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/transcript.py` around lines 672
- 691, The _run_compression function's docstring should mention its fallback
behavior: when openai.AsyncOpenAI calls in _run_compression raise
openai.APIError, openai.APITimeoutError, or OSError the function logs a warning
and calls compress_context with client=None to perform truncation-based
compression; update the docstring above _run_compression to describe this
distinction (LLM-based compression vs truncation fallback), list the exceptions
that trigger the fallback, and note that the fallback will call
compress_context(messages, model, client=None) so maintainers understand the
behavior and implications.
autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py (1)

341-462: Consider using a pytest fixture for the ChatConfig mock.

The mock pattern for ChatConfig is repeated across multiple tests. A fixture could reduce duplication.

♻️ Suggested fixture extraction
`@pytest.fixture`
def mock_chat_config():
    """Mock ChatConfig for compact_transcript tests."""
    with patch(
        "backend.copilot.config.ChatConfig",
        return_value=type(
            "Cfg", (), {"model": "m", "api_key": "k", "base_url": "u"}
        )(),
    ) as mock:
        yield mock


class TestCompactTranscript:
    `@pytest.mark.asyncio`
    async def test_too_few_messages_returns_none(self, mock_chat_config):
        transcript = _build_transcript([("user", "Hello")])
        result = await compact_transcript(transcript)
        assert result is None
    # ... other tests using mock_chat_config fixture
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py` around
lines 341 - 462, The tests in TestCompactTranscript repeat the same ChatConfig
patch; extract that into a pytest fixture (e.g., mock_chat_config) that patches
backend.copilot.config.ChatConfig and yields the mock config object, then add
the fixture as a parameter to each async test
(test_too_few_messages_returns_none, test_returns_content_when_not_compacted,
test_returns_compacted_transcript, test_returns_none_on_compression_failure) and
remove the inline patch blocks so compact_transcript tests reuse the fixture.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 991-1002: The fallback branch currently clears the
TranscriptBuilder but still allows the finally block to upload a transcript with
message_count=len(session.messages), which misrepresents the stored transcript
as a full prefix; update the code in the branch that sets use_resume = False /
resume_file = None / transcript_builder = TranscriptBuilder() (and the analogous
block around lines 1574-1603) to either (A) mark a flag (e.g.,
skip_transcript_upload) so the finally upload is skipped when no-resume fallback
is used, or (B) regenerate a true transcript that covers the actual prefix
before letting the finally block write it (ensure message_count matches
transcript_builder.length), and ensure _build_query_message() will see the
correct resume state rather than a falsely complete transcript.
- Around line 944-949: The current empty-prompt check (using
current_message.strip()) runs after the code path that appends the incoming
message to the session and persists it in the finally block, so whitespace-only
messages are being stored; move the trim/empty validation so it executes before
any mutation or append to the session (i.e., validate message/current_message
and yield StreamError("Message cannot be empty.", code="empty_prompt") and
return) OR, if you prefer minimal change, ensure you remove the just-appended
message from the session (undo the append/persist) before returning; locate the
append/persist calls (e.g., session.add_message, session.messages.append, or any
save_session/save_conversation calls) in this handler and either validate
earlier or explicitly remove that entry before exiting.
- Around line 962-999: The retry logic can re-enter compaction on the final
attempt because it keys off transcript_content; update the loop so the last
attempt (_query_attempt == _MAX_QUERY_ATTEMPTS - 1) unconditionally disables
resume and falls back to DB context. Concretely, inside the for loop around
_query_attempt, when handling the "Prompt too long" branch, check if this is the
final retry and if so set use_resume = False, resume_file = None,
transcript_content = "" and reset transcript_builder (e.g., new
TranscriptBuilder()) regardless of transcript_content/compaction result;
otherwise keep the existing compact_transcript flow that may attempt compaction
and set use_resume based on its outcome. Ensure the check uses the existing
symbols _query_attempt, _MAX_QUERY_ATTEMPTS, use_resume, resume_file,
transcript_content, TranscriptBuilder and compact_transcript so the last attempt
always disables resume and triggers the DB fallback.

---

Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py`:
- Around line 341-462: The tests in TestCompactTranscript repeat the same
ChatConfig patch; extract that into a pytest fixture (e.g., mock_chat_config)
that patches backend.copilot.config.ChatConfig and yields the mock config
object, then add the fixture as a parameter to each async test
(test_too_few_messages_returns_none, test_returns_content_when_not_compacted,
test_returns_compacted_transcript, test_returns_none_on_compression_failure) and
remove the inline patch blocks so compact_transcript tests reuse the fixture.

In `@autogpt_platform/backend/backend/copilot/sdk/transcript.py`:
- Around line 672-691: The _run_compression function's docstring should mention
its fallback behavior: when openai.AsyncOpenAI calls in _run_compression raise
openai.APIError, openai.APITimeoutError, or OSError the function logs a warning
and calls compress_context with client=None to perform truncation-based
compression; update the docstring above _run_compression to describe this
distinction (LLM-based compression vs truncation fallback), list the exceptions
that trigger the fallback, and note that the fallback will call
compress_context(messages, model, client=None) so maintainers understand the
behavior and implications.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a785c0d0-ea4e-4238-9542-641e8256eabc

📥 Commits

Reviewing files that changed from the base of the PR and between cfe22e5 and 41591fd.

📒 Files selected for processing (9)
  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Seer Code Review
🧰 Additional context used
📓 Path-based instructions (6)
autogpt_platform/backend/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development

Files:

  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/**/*.{py,txt}

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use poetry run prefix for all Python commands, including testing, linting, formatting, and migrations

Files:

  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/backend/**/*.py

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

Use Prisma ORM for database operations in PostgreSQL with pgvector for embeddings

Files:

  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Format Python code with poetry run format

Files:

  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/**/*_test.py

📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)

autogpt_platform/backend/**/*_test.py: Always review snapshot changes with git diff before committing when updating snapshots with poetry run pytest --snapshot-update
Use pytest with snapshot testing for API responses in test files
Colocate test files with source files using the *_test.py naming convention

Files:

  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
autogpt_platform/backend/**/*test*.py

📄 CodeRabbit inference engine (AGENTS.md)

Run poetry run test for backend testing (runs pytest with docker based postgres + prisma)

Files:

  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
🧠 Learnings (4)
📚 Learning: 2026-02-26T17:02:22.448Z
Learnt from: Pwuts
Repo: Significant-Gravitas/AutoGPT PR: 12211
File: .pre-commit-config.yaml:160-179
Timestamp: 2026-02-26T17:02:22.448Z
Learning: Keep the pre-commit hook pattern broad for autogpt_platform/backend to ensure OpenAPI schema changes are captured. Do not narrow to backend/api/ alone, since the generated schema depends on Pydantic models across multiple directories (backend/data/, backend/blocks/, backend/copilot/, backend/integrations/, backend/util/). Narrowing could miss schema changes and cause frontend type desynchronization.

Applied to files:

  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
📚 Learning: 2026-03-04T08:04:35.881Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12273
File: autogpt_platform/backend/backend/copilot/tools/workspace_files.py:216-220
Timestamp: 2026-03-04T08:04:35.881Z
Learning: In the AutoGPT Copilot backend, ensure that SVG images are not treated as vision image types by excluding 'image/svg+xml' from INLINEABLE_MIME_TYPES and MULTIMODAL_TYPES in tool_adapter.py; the Claude API supports PNG, JPEG, GIF, and WebP for vision. SVGs (XML text) should be handled via the text path instead, not the vision path.

Applied to files:

  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
📚 Learning: 2026-03-05T15:42:08.207Z
Learnt from: ntindle
Repo: Significant-Gravitas/AutoGPT PR: 12297
File: .claude/skills/backend-check/SKILL.md:14-16
Timestamp: 2026-03-05T15:42:08.207Z
Learning: In Python files under autogpt_platform/backend (recursively), rely on poetry run format to perform formatting (Black + isort) and linting (ruff). Do not run poetry run lint as a separate step after poetry run format, since format already includes linting checks.

Applied to files:

  • autogpt_platform/backend/backend/copilot/sdk/compaction.py
  • autogpt_platform/backend/backend/copilot/sdk/service.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py
  • autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py
  • autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript.py
  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
  • autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py
  • autogpt_platform/backend/backend/copilot/sdk/compaction_test.py
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/backend/**/test/**/*.py : Use snapshot testing with '--snapshot-update' flag in backend tests when output changes; always review with 'git diff'

Applied to files:

  • autogpt_platform/backend/backend/copilot/sdk/transcript_test.py
🔇 Additional comments (17)
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (1)

305-319: LGTM! Good security practice with log sanitization.

The newline/carriage-return stripping prevents log injection attacks from untrusted transcript_path input. The updated callback signature properly propagates the transcript path through the compaction flow.

autogpt_platform/backend/backend/copilot/sdk/compaction.py (2)

30-41: LGTM! Well-designed dataclass to prevent TOCTOU races.

The CompactionResult properly bundles just_ended, events, and transcript_path together, eliminating the need for separate flag checks that could introduce race conditions between the emit call and state inspection.


195-208: LGTM! Good defensive logging on transcript_path overwrite.

The warning when _transcript_path is overwritten helps debug unexpected multiple compaction events within a single cycle.

autogpt_platform/backend/backend/copilot/sdk/transcript_builder.py (2)

57-73: LGTM! Good extraction of entry parsing logic.

The _parse_entry method properly filters strippable types while preserving entries with isCompactSummary=True, enabling compact summaries to survive load/replace operations. Reusing this method in both load_previous and replace_entries follows DRY principles.


175-210: LGTM! Defensive validation prevents data loss.

The non-empty check before swapping entries is a good safeguard against corrupt input wiping conversation history. The logging provides visibility into compaction outcomes.

autogpt_platform/backend/backend/copilot/sdk/compaction_test.py (2)

307-378: LGTM! Thorough testing of multi-compaction behavior.

The tests clearly document the expected behavior: within a single query, _done=True blocks subsequent compactions until reset_for_query is called. The inline comments explaining the state machine behavior are helpful for maintainability.


379-407: LGTM! Good coverage of transcript_path lifecycle.

Tests verify the complete lifecycle: storage via on_compact, return via emit_end_if_ready, and clearing after emission. This ensures the path doesn't leak into subsequent non-compaction calls.

autogpt_platform/backend/backend/copilot/sdk/e2e_compaction_test.py (1)

201-369: LGTM! Excellent end-to-end test coverage.

This test comprehensively simulates the service.py compaction flow across 13 steps, including transcript loading, message appending, PreCompact hook handling, CLI session reading, entry replacement, and roundtrip verification. The detailed assertions on parent UUID chains and isCompactSummary preservation provide confidence in the integration.

autogpt_platform/backend/backend/copilot/sdk/transcript.py (4)

156-176: LGTM! Proper path traversal protection.

The _cli_project_dir function validates that the resolved path stays within the projects base directory, preventing symlink-based path traversal attacks. Returning None for escaped paths is the right defensive approach.


179-199: LGTM! Safe glob implementation filtering symlink escapes.

The _safe_glob_jsonl function properly validates each candidate file's resolved path stays within the project directory, catching both symlink escapes and filesystem errors gracefully.


202-263: LGTM! Correctly finds the last compaction summary.

The implementation properly scans for the LAST isCompactSummary entry (not breaking on first match), which handles multiple compactions correctly. The path boundary check prevents reading files outside the projects directory.


95-99: LGTM! Proper preservation of compact summaries.

The isCompactSummary check ensures that summary entries marked as compaction results are preserved even though summary is in STRIPPABLE_TYPES. This is essential for maintaining compacted conversation state across turns.

Also applies to: 123-125

autogpt_platform/backend/backend/copilot/sdk/transcript_test.py (3)

319-340: LGTM! Good security test for symlink escape prevention.

This test verifies that symlinks pointing outside the project directory are correctly filtered out, preventing path traversal attacks via symbolic links.


604-691: LGTM! Comprehensive integration test simulating real service flow.

The test accurately simulates the service.py compaction flow: loading previous transcript, appending messages, reading compacted entries from CLI session, replacing builder state, and verifying the parent chain integrity. This provides confidence in the end-to-end integration.


753-781: LGTM! Excellent test for compact summary preservation during stripping.

This test verifies that strip_progress_entries correctly preserves entries with isCompactSummary=True while still stripping regular summary entries. This is a critical behavior for maintaining compacted state.

autogpt_platform/backend/backend/copilot/sdk/prompt_too_long_test.py (2)

27-76: LGTM! Thorough error detection testing.

The parametrized tests cover a good range of prompt-too-long error patterns including case variations, while explicitly testing that similar but non-matching patterns (like partial matches) don't trigger false positives.


291-303: LGTM! Essential roundtrip validation.

The roundtrip test verifies that messages → transcript → messages preserves both role and content, which is critical for the compaction flow where transcripts are converted back and forth.

Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py
Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py Outdated
Comment thread autogpt_platform/backend/backend/copilot/sdk/service.py Outdated
Resolve merge conflicts in service.py, transcript.py, and transcript_test.py.
Incorporates dev's compaction tracking (emit_end_if_ready, entries_replaced,
read_compacted_entries) into the retry loop structure.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 13, 2026

🔍 PR Overlap Detection

This check compares your PR against all other open PRs targeting the same branch to detect potential merge conflicts early.

🔴 Merge Conflicts Detected

The following PRs have been tested and will have merge conflicts if merged after this PR. Consider coordinating with the authors.

🟡 Medium Risk — Some Line Overlap

These PRs have some overlapping changes:

🟢 Low Risk — File Overlap Only

These PRs touch the same files but different sections (click to expand)

Summary: 4 conflict(s), 2 medium risk, 11 low risk (out of 17 PRs with file overlap)


Auto-generated on push. Ignores: openapi.json, lock files.

@github-actions github-actions Bot removed the conflicts Automatically applied to PRs with merge conflicts label Mar 13, 2026
Comment thread autogpt_platform/backend/backend/copilot/sdk/transcript.py
…backticks

Revert f-string to %-format logging conversions in security_hooks.py,
tool_adapter.py, and response_adapter.py — purely cosmetic and outside
this PR's scope. Replace rST double backticks with single backticks in
service.py docstrings for codebase consistency.
@majdyz majdyz added this pull request to the merge queue Mar 18, 2026
Merged via the queue into dev with commit 5a29de0 Mar 18, 2026
21 checks passed
@github-project-automation github-project-automation Bot moved this from 👍🏼 Mergeable to ✅ Done in AutoGPT development kanban Mar 18, 2026
@github-project-automation github-project-automation Bot moved this to Done in Frontend Mar 18, 2026
@majdyz majdyz deleted the fix/copilot-transcript-compaction branch March 18, 2026 10:42
@sentry
Copy link
Copy Markdown

sentry Bot commented Mar 26, 2026

Issues attributed to commits in this pull request

This pull request was merged and Sentry observed the following issues:

github-merge-queue Bot pushed a commit that referenced this pull request Apr 1, 2026
…aware compression, and truncated tool call recovery (#12625)

## Why

CoPilot has several context management issues that degrade long
sessions:
1. "Prompt is too long" errors crash the session instead of triggering
retry/compaction
2. Stale thinking blocks bloat transcripts, causing unnecessary
compaction every turn
3. Compression target is hardcoded regardless of model context window
size
4. Truncated tool calls (empty `{}` args from max_tokens) kill the
session instead of guiding the model to self-correct

## What

**Fix 1: Prompt-too-long retry bypass (SENTRY-1207)**
The SDK surfaces "prompt too long" via `AssistantMessage.error` and
`ResultMessage.result` — neither triggered the retry/compaction loop
(only Python exceptions did). Now both paths are intercepted and
re-raised.

**Fix 2: Strip stale thinking blocks before upload**
Thinking/redacted_thinking blocks in non-last assistant entries are
10-50K tokens each but only needed for API signature verification in the
*last* message. Stripping before upload reduces transcript size and
prevents per-turn compaction.

**Fix 3: Model-aware compression target**
`compress_context()` now computes `target_tokens` from the model's
context window (e.g. 140K for Opus 200K) instead of a hardcoded 120K
default. Larger models retain more history; smaller models compress more
aggressively.

**Fix 4: Self-correcting truncated tool calls**
When the model's response exceeds max_tokens, tool call inputs get
silently truncated to `{}`. Previously this tripped a circuit breaker
after 3 attempts. Now the MCP wrapper detects empty args and returns
guidance: "write in chunks with `cat >>`, pass via
`@@agptfile:filename`". The model can self-correct instead of the
session dying.

## How

- **service.py**: `_is_prompt_too_long` checks in both
`AssistantMessage.error` and `ResultMessage` error handlers. Circuit
breaker limit raised from 3→5.
- **transcript.py**: `strip_stale_thinking_blocks()` reverse-scans for
last assistant `message.id`, strips thinking blocks from all others.
Called in `upload_transcript()`.
- **prompt.py**: `get_compression_target(model)` computes
`context_window - 60K overhead`. `compress_context()` uses it when
`target_tokens` is None.
- **tool_adapter.py**: `_truncating` wrapper intercepts empty args on
tools with required params, returns actionable guidance instead of
failing.

## Related

- Fixes SENTRY-1207
- Sessions: `d2f7cba3` (repeated compaction), `08b807d4` (prompt too
long), `130d527c` (truncated tool calls)
- Extends #12413, consolidates #12626

## Test plan

- [x] 6 unit tests for `strip_stale_thinking_blocks`
- [x] 1 integration test for ResultMessage prompt-too-long → compaction
retry
- [x] Pyright clean (0 errors), all pre-commit hooks pass
- [ ] E2E: Load transcripts from affected sessions and verify behavior
Bentlybro pushed a commit that referenced this pull request Apr 4, 2026
…t SDK (#12413)

## Summary

When the Claude SDK returns a prompt-too-long error (e.g. transcript +
query exceeds the model's context window), the streaming loop now
retries with escalating fallbacks instead of failing immediately:

1. **Attempt 1**: Use the transcript as-is (normal path)
2. **Attempt 2**: Compact the transcript via LLM summarization
(`compact_transcript`) and retry
3. **Attempt 3**: Drop the transcript entirely and fall back to
DB-reconstructed context (`_build_query_message`)

If all 3 attempts fail, a `StreamError(code="prompt_too_long")` is
yielded to the frontend.

### Key changes

**`service.py`**
- Add `_is_prompt_too_long(err)` — pattern-matches SDK exceptions for
prompt-length errors (`prompt is too long`, `prompt_too_long`,
`context_length_exceeded`, `request too large`)
- Wrap `async with ClaudeSDKClient` in a 3-attempt retry `for` loop with
compaction/fallback logic
- Move `current_message`, `_build_query_message`, and
`_prepare_file_attachments` before the retry loop (computed once,
reused)
- Skip transcript upload in `finally` when `transcript_caused_error`
(avoids persisting a broken/empty transcript)
- Reset `stream_completed` between retry iterations
- Document outer-scope variable contract in `_run_stream_attempt`
closure (which variables are reassigned between retries vs read-only)

**`transcript.py`**
- Add `compact_transcript(content, log_prefix, model)` — converts JSONL
→ messages → `compress_context` (LLM summarization with truncation
fallback) → JSONL
- Add helpers: `_flatten_assistant_content`,
`_flatten_tool_result_content`, `_transcript_to_messages`,
`_messages_to_transcript`, `_run_compression`
- Returns `None` when compaction fails or transcript is already within
budget (signals caller to fall through to DB fallback)
- Truncation fallback wrapped in 30s timeout to prevent unbounded CPU
time on large transcripts
- Accepts `model` parameter to avoid creating a new `ChatConfig()` on
every call

**`util/prompt.py`**
- Fix `_truncate_middle_tokens` edge case: returns empty string when
`max_tok < 1`, properly handles `max_tok < 3`

**`config.py`**
- E2B sandbox timeout raised from 5 min to 15 min to accommodate
compaction retries

**`prompt_too_long_test.py`** (new, 45 tests)
- `_is_prompt_too_long` positive/negative patterns, case sensitivity,
BaseException handling
- Flatten helpers for assistant/tool_result content blocks
- `_transcript_to_messages` / `_messages_to_transcript` roundtrip,
strippable types, empty content
- `compact_transcript` async tests: too few messages, not compacted,
successful compaction, compression failure

**`retry_scenarios_test.py`** (new, 27 tests)
- Full retry state machine simulation covering all 8 scenarios:
  1. Normal flow (no retry)
  2. Compact succeeds → retry succeeds
  3. Compact fails → DB fallback succeeds
  4. No transcript → DB fallback succeeds
  5. Double fail → DB fallback on attempt 3
  6. All 3 attempts exhausted
  7. Non-prompt-too-long error (no retry)
  8. Compaction returns identical content → DB fallback
- Edge cases: nested exceptions, case insensitivity, unicode content,
large transcripts, resume-after-compaction flow

**Shared test fixtures** (`conftest.py`)
- Extracted `build_test_transcript` helper used across 3 test files to
eliminate duplication

## Test plan

- [x] `_is_prompt_too_long` correctly identifies prompt-length errors (8
positive, 5 negative patterns)
- [x] `compact_transcript` compacts oversized transcripts via LLM
summarization
- [x] `compact_transcript` returns `None` on failure or when already
within budget
- [x] Retry loop state machine: all 8 scenarios verified with state
assertions
- [x] `TranscriptBuilder` works correctly after loading compacted
transcripts
- [x] `_messages_to_transcript` roundtrip preserves content including
unicode
- [x] `transcript_caused_error` prevents stale transcript upload
- [x] Truncation timeout prevents unbounded CPU time
- [x] All 139 unit tests pass locally
- [x] CI green (tests 3.11/3.12/3.13, types, CodeQL, linting)
Bentlybro pushed a commit to Bentlybro/AutoGPT that referenced this pull request Apr 4, 2026
…t SDK (Significant-Gravitas#12413)

## Summary

When the Claude SDK returns a prompt-too-long error (e.g. transcript +
query exceeds the model's context window), the streaming loop now
retries with escalating fallbacks instead of failing immediately:

1. **Attempt 1**: Use the transcript as-is (normal path)
2. **Attempt 2**: Compact the transcript via LLM summarization
(`compact_transcript`) and retry
3. **Attempt 3**: Drop the transcript entirely and fall back to
DB-reconstructed context (`_build_query_message`)

If all 3 attempts fail, a `StreamError(code="prompt_too_long")` is
yielded to the frontend.

### Key changes

**`service.py`**
- Add `_is_prompt_too_long(err)` — pattern-matches SDK exceptions for
prompt-length errors (`prompt is too long`, `prompt_too_long`,
`context_length_exceeded`, `request too large`)
- Wrap `async with ClaudeSDKClient` in a 3-attempt retry `for` loop with
compaction/fallback logic
- Move `current_message`, `_build_query_message`, and
`_prepare_file_attachments` before the retry loop (computed once,
reused)
- Skip transcript upload in `finally` when `transcript_caused_error`
(avoids persisting a broken/empty transcript)
- Reset `stream_completed` between retry iterations
- Document outer-scope variable contract in `_run_stream_attempt`
closure (which variables are reassigned between retries vs read-only)

**`transcript.py`**
- Add `compact_transcript(content, log_prefix, model)` — converts JSONL
→ messages → `compress_context` (LLM summarization with truncation
fallback) → JSONL
- Add helpers: `_flatten_assistant_content`,
`_flatten_tool_result_content`, `_transcript_to_messages`,
`_messages_to_transcript`, `_run_compression`
- Returns `None` when compaction fails or transcript is already within
budget (signals caller to fall through to DB fallback)
- Truncation fallback wrapped in 30s timeout to prevent unbounded CPU
time on large transcripts
- Accepts `model` parameter to avoid creating a new `ChatConfig()` on
every call

**`util/prompt.py`**
- Fix `_truncate_middle_tokens` edge case: returns empty string when
`max_tok < 1`, properly handles `max_tok < 3`

**`config.py`**
- E2B sandbox timeout raised from 5 min to 15 min to accommodate
compaction retries

**`prompt_too_long_test.py`** (new, 45 tests)
- `_is_prompt_too_long` positive/negative patterns, case sensitivity,
BaseException handling
- Flatten helpers for assistant/tool_result content blocks
- `_transcript_to_messages` / `_messages_to_transcript` roundtrip,
strippable types, empty content
- `compact_transcript` async tests: too few messages, not compacted,
successful compaction, compression failure

**`retry_scenarios_test.py`** (new, 27 tests)
- Full retry state machine simulation covering all 8 scenarios:
  1. Normal flow (no retry)
  2. Compact succeeds → retry succeeds
  3. Compact fails → DB fallback succeeds
  4. No transcript → DB fallback succeeds
  5. Double fail → DB fallback on attempt 3
  6. All 3 attempts exhausted
  7. Non-prompt-too-long error (no retry)
  8. Compaction returns identical content → DB fallback
- Edge cases: nested exceptions, case insensitivity, unicode content,
large transcripts, resume-after-compaction flow

**Shared test fixtures** (`conftest.py`)
- Extracted `build_test_transcript` helper used across 3 test files to
eliminate duplication

## Test plan

- [x] `_is_prompt_too_long` correctly identifies prompt-length errors (8
positive, 5 negative patterns)
- [x] `compact_transcript` compacts oversized transcripts via LLM
summarization
- [x] `compact_transcript` returns `None` on failure or when already
within budget
- [x] Retry loop state machine: all 8 scenarios verified with state
assertions
- [x] `TranscriptBuilder` works correctly after loading compacted
transcripts
- [x] `_messages_to_transcript` roundtrip preserves content including
unicode
- [x] `transcript_caused_error` prevents stale transcript upload
- [x] Truncation timeout prevents unbounded CPU time
- [x] All 139 unit tests pass locally
- [x] CI green (tests 3.11/3.12/3.13, types, CodeQL, linting)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

platform/backend AutoGPT Platform - Back end size/xl

Projects

Status: ✅ Done
Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants