Skip to content

Comments

fix: prevent Groq message mutation in structured output fallback (#562)#591

Open
Serhan-Asad wants to merge 3 commits intopromptdriven:mainfrom
Serhan-Asad:fix/issue-562
Open

fix: prevent Groq message mutation in structured output fallback (#562)#591
Serhan-Asad wants to merge 3 commits intopromptdriven:mainfrom
Serhan-Asad:fix/issue-562

Conversation

@Serhan-Asad
Copy link
Collaborator

Summary

Adds failing tests that detect the bug reported in #562: Groq structured output path mutates shared formatted_messages, corrupting fallback models.

Test Files

  • Unit tests: tests/test_llm_invoke_groq_message_mutation.py (5 tests)
  • E2E tests: tests/test_llm_invoke_groq_message_mutation_e2e.py (3 tests)

What This PR Contains

  • 8 failing tests that reproduce the reported bug from two mutation vectors:
    • List-level: .insert(0, ...) injects a Groq schema system message into the shared formatted_messages list
    • Dict-level: messages_list[0]["content"] = schema + "\n\n" + ... overwrites the system message content in-place
  • Tests are verified to fail on current code and will pass once the fix is applied
  • No API keys or network calls needed — pure Python reference/mutation issue

Root Cause

llm_invoke.py:1966 passes formatted_messages by reference (not copy) into litellm_kwargs. The Groq structured output workaround at lines 2125-2129 then mutates this shared reference in-place via .insert(0, ...) and dict content overwrite. Since all variables point to the same Python objects, every subsequent fallback model candidate receives corrupted messages containing Groq's JSON schema instruction.

Next Steps

  1. Implement the fix: copy.deepcopy(formatted_messages) at line 1966
  2. Verify the 5 unit tests pass
  3. Verify the 3 E2E tests pass
  4. Run full test suite for regressions
  5. Mark PR as ready for review

Fixes #562


Generated by PDD agentic bug workflow

Add unit tests and E2E tests that detect the bug where the Groq
structured output code path in llm_invoke.py mutates the shared
formatted_messages list in-place, corrupting fallback model messages.

- 5 unit tests covering both mutation vectors (list .insert() and
  dict content overwrite) plus cumulative corruption
- 3 E2E tests exercising real CSV loading and model selection with
  only litellm.completion mocked

All 8 tests fail on current code and will pass once the fix
(copy.deepcopy at line 1966) is applied.

Fixes promptdriven#562
…utation corrupting fallback models (promptdriven#562)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Serhan-Asad Serhan-Asad marked this pull request as ready for review February 24, 2026 04:04
@gltanaka gltanaka requested a review from Copilot February 24, 2026 04:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds comprehensive test coverage for issue #562, where Groq's structured output implementation mutates the shared formatted_messages list, causing fallback models to receive corrupted messages containing Groq-specific JSON schema instructions. The PR includes both unit tests and E2E tests that verify the fix.

Changes:

  • Added copy.deepcopy() fix at line 1967 to prevent message mutation
  • Added 5 unit tests in test_llm_invoke_groq_message_mutation.py
  • Added 3 E2E tests in test_llm_invoke_groq_message_mutation_e2e.py

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
pdd/llm_invoke.py Added import copy and changed line 1967 to use copy.deepcopy(formatted_messages) instead of passing by reference
tests/test_llm_invoke_groq_message_mutation.py Added 5 unit tests covering both mutation vectors (list insert and dict overwrite) with mocked dependencies
tests/test_llm_invoke_groq_message_mutation_e2e.py Added 3 E2E tests exercising real code paths with only litellm.completion mocked

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Serhan-Asad Serhan-Asad changed the title test: add failing tests for Groq message mutation bug (#562) fix: prevent Groq message mutation in structured output fallback (#562) Feb 24, 2026
Copy link
Contributor

@gltanaka gltanaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work investigating and reproducing this bug! The root cause analysis is spot-on — the shared formatted_messages reference getting mutated by the Groq structured output path is a real issue, and copy.deepcopy is the right fix.

A few suggestions to make this even better:

1. Consider splitting into two PRs

This PR currently bundles three things: tests for #562, the deepcopy fix for #562, and the context window validation feature for #559. Splitting the context window validation into its own PR would make each easier to review and merge independently.

2. Use litellm's built-in context window lookup

Rather than maintaining a hardcoded MODEL_CONTEXT_LIMITS dict in token_counter.py, consider using litellm.get_max_tokens(model) — it already tracks context windows for hundreds of models across all providers, handles Bedrock/Vertex naming conventions, and stays current as litellm is updated. This would eliminate the need for the custom dict and the tricky default value question.

Similarly, litellm.token_counter(model, messages=messages) picks the correct tokenizer per model, which would be more accurate than always using tiktoken's cl100k_base.

The only special case you'd need to handle is Claude's 1M context override (since llm_invoke.py already adds the anthropic-beta: context-1m-2025-08-07 header).

3. Default context limit (128K → 1M)

The change of MODEL_CONTEXT_LIMITS["default"] from 128K to 1M effectively turns off validation for any model not in the dict. Using litellm.get_max_tokens() as suggested above would make this a non-issue, since you'd only need a fallback for truly unknown models (where skipping validation entirely is probably the right call anyway).

4. Test files could be trimmed

The two new test files (~770 lines) are thorough but quite large for a 1-line fix. Consider adding a few focused regression tests to the existing test_llm_invoke.py instead — it already has all the mock infrastructure set up.


Overall, nice detective work on the mutation bug and solid test coverage. The deepcopy fix itself is clean and correct. 👍

…nvoke.py (promptdriven#562)

Replace 770 lines across 2 standalone test files with 3 focused regression
tests (~80 lines) in the existing test_llm_invoke.py, per PR promptdriven#591 review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Groq structured output path mutates shared formatted_messages, corrupting fallback models

2 participants