Description
What version of Codex is running?
codex-cli 0.77.0
What subscription do you have?
no
Which model were you using?
gpt-5.2
What platform is your computer?
No response
What issue are you seeing?
Parallel tool_calls cause "tool_call_id missing response" error
Summary
When Codex sends multiple parallel tool_calls to the LLM API, some tool responses are lost, causing the subsequent API request to fail with:
An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'.
This error is reproducible and unrecoverable - once it occurs, the session cannot be resumed.
Environment
- Codex version: codex-cli 0.77.0
- OS: macOS Darwin 23.6.0
- Log file: ~/.codex/log/codex-tui.log
Steps to Reproduce
- Start a Codex session
- Give the AI a task that requires reading multiple files (e.g., "analyze the A2A example code")
- The AI will attempt to execute multiple sed or file-read commands in parallel
- The error occurs immediately after the parallel tool calls
Expected Behavior
All tool call responses should be collected and sent back to the API in the correct format.
Actual Behavior
Some tool responses are lost, causing the API to reject the next request due to format violation.
Evidence from Logs
Occurrence 1: 2025-12-23 06:48
2025-12-23T06:48:28.508945Z INFO ToolCall: shell_command {"command": "sed -n '1,220p' ...A2aNodeActionWithConfig.java"}
2025-12-23T06:48:28.509558Z INFO ToolCall: shell_command {"command": "sed -n '1,220p' ...A2AExample.java"}
2025-12-23T06:48:28.509615Z INFO ToolCall: shell_command {"command": "sed -n '1,260p' ...A2aRemoteAgent.java"}
2025-12-23T06:48:30.894742Z INFO Turn error: {"error":{"message":"An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_xNiBLtPTkR9xLP7r41K3wjzr"}}
Occurrence 2: 2025-12-23 08:09 (different session)
2025-12-23T08:09:38.257295Z INFO ToolCall: shell_command {"command": "sed -n '1,220p' ...A2AExample.java"}
2025-12-23T08:09:38.257437Z INFO ToolCall: shell_command {"command": "sed -n '1,240p' ...A2AExampleController.java"}
2025-12-23T08:09:38.257486Z INFO ToolCall: shell_command {"command": "sed -n '1,240p' ...README.md"}
2025-12-23T08:09:40.220236Z INFO Turn error: {"error":{"message":"An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_roXyhlwmpnd3ZrA7DMM2nq7C"}}
Pattern Summary
| Timestamp | Parallel Commands | Time to Error | Missing tool_call_id |
|---|---|---|---|
| 06:48:28 | 3 sed commands | ~2 seconds | call_xNiBLtPTkR9xLP7r41K3wjzr |
| 08:09:38 | 3 sed commands | ~2 seconds | call_roXyhlwmpnd3ZrA7DMM2nq7C |
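To collect the same evidence from a longer log, the missing IDs can be pulled out of the "Turn error" lines mechanically. The following is a minimal sketch, not part of Codex: `missing_tool_call_ids` is a hypothetical helper, and it assumes the error text keeps the wording shown in the excerpts above and that IDs always start with `call_`.

```rust
/// Hypothetical helper: returns the tool_call_ids listed after the
/// "did not have response messages:" marker in one log line, or an
/// empty vector if the marker is absent.
fn missing_tool_call_ids(line: &str) -> Vec<String> {
    const MARKER: &str = "did not have response messages:";
    let Some(pos) = line.find(MARKER) else {
        return Vec::new();
    };
    // Everything after the marker is split on characters that cannot be
    // part of an ID, then filtered down to tokens that look like call IDs.
    line[pos + MARKER.len()..]
        .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .filter(|token| token.starts_with("call_"))
        .map(str::to_owned)
        .collect()
}
```

Feeding each line of ~/.codex/log/codex-tui.log through this function and keeping the non-empty results should reproduce the last column of the table.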
Source Code Analysis
After analyzing the codex-rs source code, I've identified the following potential bug locations:
Key Files Involved
| File | Purpose |
|---|---|
| codex-rs/core/src/codex.rs | Main turn loop and drain_in_flight |
| codex-rs/core/src/stream_events_utils.rs | Tool call handling and response recording |
| codex-rs/core/src/context_manager/normalize.rs | History validation |
Bug Location 1: Race Condition in Turn Loop (codex.rs)
The main turn loop processes tool calls and collects responses:
ResponseEvent::OutputItemDone(item) => {
    // Tool execution queued as future
    if let Some(tool_future) = output_result.tool_future {
        in_flight.push_back(tool_future); // Queued but not awaited yet
    }
}
// ...
ResponseEvent::Completed { ... } => {
    // Stream completed - loop breaks BEFORE drain_in_flight
    break Ok(TurnRunResult { needs_follow_up, last_agent_message });
}
Issue: If ResponseEvent::Completed arrives before all tool futures have completed, the loop breaks while some futures are still pending. Because drain_in_flight is only called after the loop exits, a future that has not finished by then may never have its response recorded.
Bug Location 2: Tool Response Recording Order (stream_events_utils.rs)
// In handle_output_item_done:
Ok(Some(call)) => {
    // Tool CALL recorded immediately
    ctx.sess.record_conversation_items(&ctx.turn_context, std::slice::from_ref(&item)).await;
    // Tool OUTPUT queued as future - recorded LATER
    let tool_future: InFlightFuture = Box::pin(
        ctx.tool_runtime.clone().handle_tool_call(call, cancellation_token)
    );
    output.tool_future = Some(tool_future);
}
Issue: The tool call is recorded immediately, but its response is only recorded once the queued future has been drained. If the session is interrupted before drain_in_flight completes, the history contains the call without its response.
Bug Location 3: Session Resumption (codex.rs)
Issue: If rollout items aren't persisted atomically for parallel tool calls (call + all responses), reconstruction may have incomplete pairs.
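One way to picture "atomic" persistence is an all-or-nothing write of a call group. The sketch below is only an illustration under stated assumptions: `RolloutItem` and `persist_tool_batch` are hypothetical names, the tab-separated line format is a placeholder rather than the real rollout JSONL schema, and a single `write_all` is not by itself a crash-safe filesystem transaction.

```rust
use std::collections::HashSet;
use std::io::{self, Write};

/// Hypothetical rollout record; the real codex-rs rollout types differ.
enum RolloutItem {
    Call { id: String, command: String },
    Output { call_id: String, body: String },
}

/// Writes a group of parallel calls plus their outputs only if every call
/// has an output. The batch is serialized into one buffer first, so a
/// rejected batch leaves `sink` untouched.
fn persist_tool_batch<W: Write>(sink: &mut W, batch: &[RolloutItem]) -> io::Result<()> {
    // IDs that have at least one output in this batch.
    let answered: HashSet<&str> = batch
        .iter()
        .filter_map(|item| match item {
            RolloutItem::Output { call_id, .. } => Some(call_id.as_str()),
            RolloutItem::Call { .. } => None,
        })
        .collect();

    let mut buf = String::new();
    for item in batch {
        match item {
            RolloutItem::Call { id, command } => {
                if !answered.contains(id.as_str()) {
                    return Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        format!("tool call {id} has no output"),
                    ));
                }
                // Placeholder line format, not the real rollout schema.
                buf.push_str(&format!("call\t{id}\t{command}\n"));
            }
            RolloutItem::Output { call_id, body } => {
                buf.push_str(&format!("output\t{call_id}\t{body}\n"));
            }
        }
    }
    sink.write_all(buf.as_bytes())
}
```

With this shape, a reader of the persisted data would see either the complete group or nothing from it, which is the property reconstruction needs.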
Root Cause Hypothesis
The most likely root cause is the timing between drain_in_flight and the turn loop:
- Parallel tool calls are dispatched → added to in_flight: FuturesOrdered
- Turn loop continues processing stream events
- ResponseEvent::Completed arrives → loop breaks
- drain_in_flight called AFTER loop ends
- If any tool execution is still pending or fails silently → response not recorded
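The sequence above can be modeled with a small synchronous toy, which makes the resulting history shape easy to inspect. This is a sketch of the hypothesis only: `Event`, `HistoryItem`, `run_turn`, and `unanswered_calls` are simplified stand-ins rather than codex-rs types, there is no real concurrency, and the `outputs_recorded` parameter simply assumes that some number of drained outputs make it into the history.

```rust
use std::collections::{HashSet, VecDeque};

/// Simplified stand-ins for stream events; not the real codex-rs types.
enum Event {
    ToolCall(&'static str),
    Completed,
}

#[derive(Debug, PartialEq)]
enum HistoryItem {
    Call(&'static str),
    Output(&'static str),
}

/// Runs one modeled turn. `outputs_recorded` is how many in-flight
/// outputs end up recorded after the loop has ended.
fn run_turn(events: &[Event], outputs_recorded: usize) -> Vec<HistoryItem> {
    let mut history = Vec::new();
    let mut in_flight: VecDeque<&'static str> = VecDeque::new();

    for event in events {
        match event {
            Event::ToolCall(id) => {
                history.push(HistoryItem::Call(*id)); // call recorded immediately
                in_flight.push_back(*id); // output deferred
            }
            Event::Completed => break, // loop ends before any output is recorded
        }
    }

    // Modeled drain: only the first `outputs_recorded` outputs are recorded.
    for id in in_flight.into_iter().take(outputs_recorded) {
        history.push(HistoryItem::Output(id));
    }
    history
}

/// Returns the call IDs in `history` that have no matching output.
fn unanswered_calls(history: &[HistoryItem]) -> Vec<&'static str> {
    let answered: HashSet<&str> = history
        .iter()
        .filter_map(|item| match item {
            HistoryItem::Output(id) => Some(*id),
            HistoryItem::Call(_) => None,
        })
        .collect();
    history
        .iter()
        .filter_map(|item| match item {
            HistoryItem::Call(id) if !answered.contains(id) => Some(*id),
            _ => None,
        })
        .collect()
}
```

With three calls and only two recorded outputs, `unanswered_calls` reports one ID, which is the same shape as the single missing tool_call_id in each logged occurrence. Which of the three calls loses its response in the real system is not something this model can say.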
Impact
- Severity: High - completely blocks the session
- Workaround: Delete the corrupted session file and start a new session
rm ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl
- User Experience: Very poor - users lose all conversation context
Suggested Fixes
1. Ensure all in-flight futures complete before breaking
In codex.rs, drain in-flight futures before processing Completed:
ResponseEvent::Completed { ... } => {
    // BEFORE breaking, ensure all in-flight futures complete
    drain_in_flight(&mut in_flight, sess.clone(), turn_context.clone()).await?;
    should_emit_turn_diff = true;
    break Ok(TurnRunResult { needs_follow_up, last_agent_message });
}
2. Validate tool responses before API call
Before sending the next API request, validate that all tool_call_ids have corresponding responses:
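use std::collections::HashSet;

// Message and Error stand in for the real types. This sketch assumes
// Error::MissingToolResponses holds the missing IDs as Vec<String>.
fn validate_tool_responses(messages: &[Message]) -> Result<(), Error> {
    for (i, msg) in messages.iter().enumerate() {
        if let Some(tool_calls) = &msg.tool_calls {
            let expected_ids: HashSet<_> = tool_calls.iter().map(|tc| &tc.id).collect();
            // Every tool_call_id that appears later in the history
            let actual_ids: HashSet<_> = messages[i + 1..]
                .iter()
                .filter_map(|m| m.tool_call_id.as_ref())
                .collect();
            // Clone the IDs so the error does not borrow from `messages`
            let missing: Vec<String> = expected_ids
                .difference(&actual_ids)
                .map(|id| (*id).clone())
                .collect();
            if !missing.is_empty() {
                return Err(Error::MissingToolResponses(missing));
            }
        }
    }
    Ok(())
}
What steps can reproduce the bug?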
3. Prevent session resume with corrupted state
When resuming a session, validate message sequence integrity before sending to API.
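A resume-time check could also report how much of the history is still usable instead of only failing. The following is a rough sketch under assumptions: `Item` and `consistent_prefix_len` are hypothetical, the history is reduced to three simplified item kinds, and "consistent" is taken to mean that each assistant message's tool calls are all answered by tool messages directly following it.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified history entries; not the real codex-rs types.
enum Item {
    User(String),
    Assistant { tool_call_ids: Vec<String> },
    Tool { call_id: String },
}

/// Returns how many leading items form a history in which every assistant
/// tool call is answered by tool messages that directly follow it.
fn consistent_prefix_len(items: &[Item]) -> usize {
    let mut pending: HashSet<&str> = HashSet::new();
    let mut last_good = 0;
    for (i, item) in items.iter().enumerate() {
        match item {
            Item::Tool { call_id } => {
                // An output for a call that is not pending is inconsistent.
                if !pending.remove(call_id.as_str()) {
                    break;
                }
            }
            Item::Assistant { tool_call_ids } => {
                if !pending.is_empty() {
                    break; // previous calls were never answered
                }
                pending.extend(tool_call_ids.iter().map(String::as_str));
            }
            Item::User(_) => {
                if !pending.is_empty() {
                    break; // previous calls were never answered
                }
            }
        }
        if pending.is_empty() {
            last_good = i + 1;
        }
    }
    last_good
}
```

A resume path could refuse to continue when the returned length differs from `items.len()`, or offer to continue from the consistent prefix so that not all conversation context is lost.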
Checklist
- I have searched existing issues for duplicates
- I have included relevant log excerpts
- I have described the expected vs actual behavior
- I have provided steps to reproduce
- I have analyzed the source code to identify root cause
- I have suggested potential fixes
What is the expected behavior?
No response
Additional information
No response