Parallel tool_calls cause "tool_call_id missing response" error

### What version of Codex is running?

codex-cli 0.77.0

### What subscription do you have?

no

### Which model were you using?

gpt-5.2

### What platform is your computer?

_No response_

### What issue are you seeing?

# Parallel tool_calls cause "tool_call_id missing response" error

## Summary

When the model returns multiple parallel `tool_calls` in a single turn, Codex loses some of the tool responses, and the subsequent API request fails with:

```
An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'.
```

This error is **reproducible** and **unrecoverable**: once it occurs, the session cannot be resumed.

## Environment

- **Codex version**: codex-cli 0.77.0
- **OS**: macOS Darwin 23.6.0
- **Log file**: `~/.codex/log/codex-tui.log`

## Steps to Reproduce

1. Start a Codex session
2. Give the AI a task that requires reading multiple files (e.g., "analyze the A2A example code")
3. The AI will attempt to execute multiple `sed` or file-read commands in parallel
4. The error occurs immediately after the parallel tool calls

## Expected Behavior

All tool call responses should be collected and sent back to the API in the correct format.

## Actual Behavior

Some tool responses are lost, causing the API to reject the next request due to format violation.

## Evidence from Logs

### Occurrence 1: 2025-12-23 06:48

```log
2025-12-23T06:48:28.508945Z  INFO ToolCall: shell_command {"command": "sed -n '1,220p' ...A2aNodeActionWithConfig.java"}
2025-12-23T06:48:28.509558Z  INFO ToolCall: shell_command {"command": "sed -n '1,220p' ...A2AExample.java"}
2025-12-23T06:48:28.509615Z  INFO ToolCall: shell_command {"command": "sed -n '1,260p' ...A2aRemoteAgent.java"}
2025-12-23T06:48:30.894742Z  INFO Turn error: {"error":{"message":"An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_xNiBLtPTkR9xLP7r41K3wjzr"}}
```

### Occurrence 2: 2025-12-23 08:09 (different session)

```log
2025-12-23T08:09:38.257295Z  INFO ToolCall: shell_command {"command": "sed -n '1,220p' ...A2AExample.java"}
2025-12-23T08:09:38.257437Z  INFO ToolCall: shell_command {"command": "sed -n '1,240p' ...A2AExampleController.java"}
2025-12-23T08:09:38.257486Z  INFO ToolCall: shell_command {"command": "sed -n '1,240p' ...README.md"}
2025-12-23T08:09:40.220236Z  INFO Turn error: {"error":{"message":"An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_roXyhlwmpnd3ZrA7DMM2nq7C"}}
```

### Pattern Summary

| Timestamp | Parallel Commands | Time to Error | Missing tool_call_id |
|-----------|-------------------|---------------|----------------------|
| 06:48:28 | 3 sed commands | ~2 seconds | call_xNiBLtPTkR9xLP7r41K3wjzr |
| 08:09:38 | 3 sed commands | ~2 seconds | call_roXyhlwmpnd3ZrA7DMM2nq7C |
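
To triage more occurrences from `codex-tui.log`, the missing IDs can be pulled out of the error text. This is a rough sketch only: `missing_call_ids` is a hypothetical helper, and it assumes multiple IDs would be separated by commas or whitespace (both occurrences above list a single ID).

```rust
/// Extracts the `call_...` IDs listed after the marker phrase in the API
/// error message. Hypothetical helper for log triage, not part of codex-rs.
pub fn missing_call_ids(message: &str) -> Vec<String> {
    const MARKER: &str = "did not have response messages:";
    match message.split_once(MARKER) {
        Some((_, tail)) => tail
            // Assumption: several IDs are separated by commas or whitespace.
            .split(|c: char| c == ',' || c.is_whitespace())
            // Strip JSON residue such as quotes and closing braces.
            .map(|s| s.trim_matches(|c: char| !(c.is_ascii_alphanumeric() || c == '_')))
            .filter(|s| s.starts_with("call_"))
            .map(str::to_string)
            .collect(),
        // The marker phrase is absent: nothing to report.
        None => Vec::new(),
    }
}
```

The function accepts either the bare message or the whole JSON line from the log, since trailing quotes and braces are trimmed.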

---

## Source Code Analysis

After analyzing the codex-rs source code, I've identified the following potential bug locations:

### Key Files Involved

| File | Purpose |
|------|---------|
| `codex-rs/core/src/codex.rs` | Main turn loop and `drain_in_flight` |
| `codex-rs/core/src/stream_events_utils.rs` | Tool call handling and response recording |
| `codex-rs/core/src/context_manager/normalize.rs` | History validation |

### Bug Location 1: Race Condition in Turn Loop (`codex.rs`)

The main turn loop processes tool calls and collects responses:

```rust
ResponseEvent::OutputItemDone(item) => {
    // Tool execution queued as future
    if let Some(tool_future) = output_result.tool_future {
        in_flight.push_back(tool_future);  // Queued but not awaited yet
    }
}
// ...
ResponseEvent::Completed { ... } => {
    // Stream completed - loop breaks BEFORE drain_in_flight
    break Ok(TurnRunResult { needs_follow_up, last_agent_message });
}
```

**Issue**: If `ResponseEvent::Completed` arrives while tool futures are still pending, the loop breaks and `drain_in_flight` runs only afterwards. Any future that does not complete successfully in that window may never have its response recorded.

### Bug Location 2: Tool Response Recording Order (`stream_events_utils.rs`)

```rust
// In handle_output_item_done:
Ok(Some(call)) => {
    // Tool CALL recorded immediately
    ctx.sess.record_conversation_items(&ctx.turn_context, std::slice::from_ref(&item)).await;

    // Tool OUTPUT queued as future - recorded LATER
    let tool_future: InFlightFuture = Box::pin(
        ctx.tool_runtime.clone().handle_tool_call(call, cancellation_token)
    );
    output.tool_future = Some(tool_future);
}
```

**Issue**: The tool call is recorded immediately, but responses are queued as futures. If the session is interrupted before `drain_in_flight` completes, responses are lost.

### Bug Location 3: Session Resumption (`codex.rs`)

**Issue**: If rollout items aren't persisted atomically for parallel tool calls (call + all responses), reconstruction may have incomplete pairs.

### Root Cause Hypothesis

The most likely root cause is the timing between `drain_in_flight` and the turn loop:

1. Parallel tool calls are dispatched → added to `in_flight: FuturesOrdered`
2. Turn loop continues processing stream events
3. `ResponseEvent::Completed` arrives → loop breaks
4. `drain_in_flight` called AFTER loop ends
5. **If any tool execution is still pending or fails silently** → response not recorded

---

## Impact

- **Severity**: High - completely blocks the session
- **Workaround**: Delete the corrupted session file and start a new session
  ```bash
  rm ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl
  ```
- **User Experience**: Very poor - users lose all conversation context

---

## Suggested Fixes

### 1. Ensure all in-flight futures complete before breaking

In `codex.rs`, drain in-flight futures before processing `Completed`:

```rust
ResponseEvent::Completed { ... } => {
    // BEFORE breaking, ensure all in-flight futures complete
    drain_in_flight(&mut in_flight, sess.clone(), turn_context.clone()).await?;

    should_emit_turn_diff = true;
    break Ok(TurnRunResult { needs_follow_up, last_agent_message });
}
```
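
Draining alone may not be enough if a tool fails silently, so the drain step should also guarantee exactly one response per call. The sketch below illustrates that invariant with plain threads as a synchronous analogy; it is not the codex-rs async code, and `ToolOutput`, `InFlight`, `dispatch`, and `drain_all` are hypothetical names.

```rust
use std::collections::VecDeque;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a recorded tool response.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolOutput {
    pub call_id: String,
    pub content: String,
    pub success: bool,
}

/// Keeps the call ID next to the handle so a response can still be
/// produced when the worker fails.
pub struct InFlight {
    call_id: String,
    handle: JoinHandle<Result<String, String>>,
}

/// Starts a tool call on its own thread (analogous to queuing a future).
pub fn dispatch<F>(call_id: &str, work: F) -> InFlight
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    InFlight {
        call_id: call_id.to_string(),
        handle: thread::spawn(work),
    }
}

/// Waits for every queued call, in dispatch order, and returns one output
/// per call. Errors and panics become failure outputs instead of gaps.
pub fn drain_all(in_flight: &mut VecDeque<InFlight>) -> Vec<ToolOutput> {
    let mut outputs = Vec::with_capacity(in_flight.len());
    while let Some(entry) = in_flight.pop_front() {
        let (content, success) = match entry.handle.join() {
            Ok(Ok(stdout)) => (stdout, true),
            Ok(Err(err)) => (format!("tool failed: {}", err), false),
            Err(_) => ("tool panicked".to_string(), false),
        };
        outputs.push(ToolOutput {
            call_id: entry.call_id,
            content,
            success,
        });
    }
    outputs
}
```

With this shape, the number of outputs always equals the number of dispatched calls, so the history never contains a call without a matching response.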

### 2. Validate tool responses before API call

Before sending the next API request, validate that all `tool_call_id`s have corresponding responses:

```rust
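use std::collections::HashSet;

// `Message` and `Error` are assumed types for this sketch; adapt them to the
// real history item types.
fn validate_tool_responses(messages: &[Message]) -> Result<(), Error> {
    for (i, msg) in messages.iter().enumerate() {
        if let Some(tool_calls) = &msg.tool_calls {
            // IDs this assistant message expects responses for.
            let expected_ids: HashSet<&String> = tool_calls.iter().map(|tc| &tc.id).collect();
            // IDs answered by any later message in the history.
            let actual_ids: HashSet<&String> = messages[i + 1..]
                .iter()
                .filter_map(|m| m.tool_call_id.as_ref())
                .collect();

            // Copy the IDs so the error does not borrow from `messages`.
            let missing: Vec<String> = expected_ids
                .difference(&actual_ids)
                .map(|id| id.to_string())
                .collect();
            if !missing.is_empty() {
                return Err(Error::MissingToolResponses(missing));
            }
        }
    }
    Ok(())
}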
```




### What steps can reproduce the bug?

### 3. Prevent session resume with corrupted state

When resuming a session, validate message sequence integrity before sending to API.
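
One possible shape for this check, which also repairs the history instead of only rejecting it, is sketched below. `HistoryItem`, `unanswered_calls`, and `repair_history` are hypothetical names and do not correspond to the real rollout types. Placing the synthetic `"aborted"` output directly after its call is an assumption; an API that requires outputs to follow the whole group of calls would need a different placement.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of a persisted history item.
#[derive(Debug, Clone, PartialEq)]
pub enum HistoryItem {
    Message(String),
    ToolCall { call_id: String },
    ToolOutput { call_id: String, content: String },
}

/// Returns the IDs of tool calls that have no output anywhere in the history.
pub fn unanswered_calls(items: &[HistoryItem]) -> Vec<String> {
    let answered: HashSet<&str> = items
        .iter()
        .filter_map(|item| match item {
            HistoryItem::ToolOutput { call_id, .. } => Some(call_id.as_str()),
            _ => None,
        })
        .collect();
    items
        .iter()
        .filter_map(|item| match item {
            HistoryItem::ToolCall { call_id } if !answered.contains(call_id.as_str()) => {
                Some(call_id.clone())
            }
            _ => None,
        })
        .collect()
}

/// Returns a copy of the history in which every unanswered call is followed
/// by a synthetic "aborted" output, so each call has a response.
pub fn repair_history(items: &[HistoryItem]) -> Vec<HistoryItem> {
    let missing: HashSet<String> = unanswered_calls(items).into_iter().collect();
    let mut repaired = Vec::with_capacity(items.len() + missing.len());
    for item in items {
        repaired.push(item.clone());
        if let HistoryItem::ToolCall { call_id } = item {
            if missing.contains(call_id) {
                repaired.push(HistoryItem::ToolOutput {
                    call_id: call_id.clone(),
                    content: "aborted".to_string(),
                });
            }
        }
    }
    repaired
}
```

Running `unanswered_calls` on resume would at least surface the corrupted state with a clear message, and `repair_history` would let the session continue instead of forcing the user to delete the rollout file.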

---

## Checklist

- [x] I have searched existing issues for duplicates
- [x] I have included relevant log excerpts
- [x] I have described the expected vs actual behavior
- [x] I have provided steps to reproduce
- [x] I have analyzed the source code to identify root cause
- [x] I have suggested potential fixes

### What is the expected behavior?

_No response_

### Additional information

_No response_
