fix: prevent Groq message mutation in structured output fallback (#562) #591
Serhan-Asad wants to merge 3 commits into promptdriven:main
Conversation
Add unit tests and E2E tests that detect the bug where the Groq structured output code path in llm_invoke.py mutates the shared formatted_messages list in-place, corrupting fallback model messages.
- 5 unit tests covering both mutation vectors (list .insert() and dict content overwrite) plus cumulative corruption
- 3 E2E tests exercising real CSV loading and model selection with only litellm.completion mocked

All 8 tests fail on current code and will pass once the fix (copy.deepcopy at line 1966) is applied.

Fixes promptdriven#562
…utation corrupting fallback models (promptdriven#562) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This pull request adds comprehensive test coverage for issue #562, where Groq's structured output implementation mutates the shared formatted_messages list, causing fallback models to receive corrupted messages containing Groq-specific JSON schema instructions. The PR includes both unit tests and E2E tests that verify the fix.
Changes:
- Added copy.deepcopy() fix at line 1967 to prevent message mutation
- Added 5 unit tests in test_llm_invoke_groq_message_mutation.py
- Added 3 E2E tests in test_llm_invoke_groq_message_mutation_e2e.py
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pdd/llm_invoke.py | Added import copy and changed line 1967 to use copy.deepcopy(formatted_messages) instead of passing by reference |
| tests/test_llm_invoke_groq_message_mutation.py | Added 5 unit tests covering both mutation vectors (list insert and dict overwrite) with mocked dependencies |
| tests/test_llm_invoke_groq_message_mutation_e2e.py | Added 3 E2E tests exercising real code paths with only litellm.completion mocked |
gltanaka left a comment
Great work investigating and reproducing this bug! The root cause analysis is spot-on — the shared formatted_messages reference getting mutated by the Groq structured output path is a real issue, and copy.deepcopy is the right fix.
A few suggestions to make this even better:
1. Consider splitting into two PRs
This PR currently bundles three things: tests for #562, the deepcopy fix for #562, and the context window validation feature for #559. Splitting the context window validation into its own PR would make each easier to review and merge independently.
2. Use litellm's built-in context window lookup
Rather than maintaining a hardcoded MODEL_CONTEXT_LIMITS dict in token_counter.py, consider using litellm.get_max_tokens(model) — it already tracks context windows for hundreds of models across all providers, handles Bedrock/Vertex naming conventions, and stays current as litellm is updated. This would eliminate the need for the custom dict and the tricky default value question.
Similarly, litellm.token_counter(model, messages=messages) picks the correct tokenizer per model, which would be more accurate than always using tiktoken's cl100k_base.
The only special case you'd need to handle is Claude's 1M context override (since llm_invoke.py already adds the anthropic-beta: context-1m-2025-08-07 header).
3. Default context limit (128K → 1M)
The change of MODEL_CONTEXT_LIMITS["default"] from 128K to 1M effectively turns off validation for any model not in the dict. Using litellm.get_max_tokens() as suggested above would make this a non-issue, since you'd only need a fallback for truly unknown models (where skipping validation entirely is probably the right call anyway).
4. Test files could be trimmed
The two new test files (~770 lines) are thorough but quite large for a 1-line fix. Consider adding a few focused regression tests to the existing test_llm_invoke.py instead — it already has all the mock infrastructure set up.
Overall, nice detective work on the mutation bug and solid test coverage. The deepcopy fix itself is clean and correct. 👍
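As a sketch of what suggestion 2 above could look like: the validation helper below delegates limit lookup and token counting to injected callables so it runs standalone; in token_counter.py those would be litellm.get_max_tokens and litellm.token_counter. The function name, parameters, and numeric values here are illustrative, not the actual token_counter.py API.

```python
# Sketch of review suggestion 2: delegate context-window lookup to litellm
# instead of a hardcoded MODEL_CONTEXT_LIMITS dict. The lookups are passed
# in as callables so this sketch has no litellm dependency.
CLAUDE_1M_OVERRIDE = 1_000_000  # matches the anthropic-beta: context-1m-2025-08-07 header

def validate_context(model, messages, get_max_tokens, count_tokens,
                     use_claude_1m=False):
    """Return (fits, used, limit); skip validation for unknown models."""
    try:
        limit = get_max_tokens(model)
    except Exception:
        # Truly unknown model: skip validation entirely, per review point 3.
        return (True, None, None)
    if use_claude_1m and model.startswith("claude"):
        limit = CLAUDE_1M_OVERRIDE  # the one special case litellm can't know about
    used = count_tokens(model, messages)
    return (used <= limit, used, limit)

# Stub lookups standing in for litellm (values are illustrative):
fake_limits = {"groq/llama-3.1-8b-instant": 131072}
fits, used, limit = validate_context(
    "groq/llama-3.1-8b-instant",
    [{"role": "user", "content": "hi"}],
    get_max_tokens=lambda m: fake_limits[m],
    count_tokens=lambda m, msgs: sum(len(x["content"].split()) for x in msgs),
)
print(fits, used, limit)
```

The key design point is that the fallback path (unknown model) returns "fits" rather than guessing a default limit, which is what makes the 128K-vs-1M default question go away.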
…nvoke.py (promptdriven#562) Replace 770 lines across 2 standalone test files with 3 focused regression tests (~80 lines) in the existing test_llm_invoke.py, per PR promptdriven#591 review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds failing tests that detect the bug reported in #562: the Groq structured output path mutates the shared formatted_messages, corrupting fallback models.

Test Files
- tests/test_llm_invoke_groq_message_mutation.py (5 tests)
- tests/test_llm_invoke_groq_message_mutation_e2e.py (3 tests)

What This PR Contains
- .insert(0, ...) injects a Groq schema system message into the shared formatted_messages list
- messages_list[0]["content"] = schema + "\n\n" + ... overwrites the system message content in-place

Root Cause
llm_invoke.py:1966 passes formatted_messages by reference (not a copy) into litellm_kwargs. The Groq structured output workaround at lines 2125-2129 then mutates this shared reference in-place via .insert(0, ...) and dict content overwrite. Since all variables point to the same Python objects, every subsequent fallback model candidate receives corrupted messages containing Groq's JSON schema instruction.

Next Steps
- Apply copy.deepcopy(formatted_messages) at line 1966

Fixes #562
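The aliasing described in the root cause can be reproduced in isolation. This is a minimal sketch — the variable names mirror llm_invoke.py, but the schema string and workaround function are illustrative — showing both mutation vectors and why copy.deepcopy (not a shallow copy) is the fix:

```python
import copy

def groq_structured_output_workaround(messages_list):
    # Mimics the two mutation vectors from the root cause analysis:
    # 1) insert a schema system message at index 0
    # 2) overwrite the first message's content in place
    schema = "Respond with JSON matching this schema: {...}"  # illustrative
    messages_list.insert(0, {"role": "system", "content": schema})
    messages_list[0]["content"] = schema + "\n\n" + messages_list[0]["content"]

# Buggy: passing the list by reference lets the workaround corrupt the
# original, so every later fallback model sees the Groq schema message.
formatted_messages = [{"role": "user", "content": "Hello"}]
litellm_kwargs = {"messages": formatted_messages}
groq_structured_output_workaround(litellm_kwargs["messages"])
assert len(formatted_messages) == 2  # original list was mutated

# Fixed: deep-copy so the nested message dicts are independent too.
formatted_messages = [{"role": "user", "content": "Hello"}]
litellm_kwargs = {"messages": copy.deepcopy(formatted_messages)}
groq_structured_output_workaround(litellm_kwargs["messages"])
assert formatted_messages == [{"role": "user", "content": "Hello"}]

# Note: a shallow copy (list(formatted_messages)) would stop the insert
# from reaching the original, but the in-place content overwrite still
# would, because both lists would share the same message dicts.
print("original intact:", formatted_messages)
```

This is also why the tests cover both vectors separately: a fix that only defends against the list insert would still leave the dict-content overwrite corrupting fallbacks.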
Generated by PDD agentic bug workflow