fix(ffi): drain stdout/stderr pumps before Wait terminal event (re-open of #685/#705) #714

Closed
G4614 wants to merge 1 commit into boxlite-ai:main from G4614:fix/ffi-exec-drain-race

G4614 commented Jun 10, 2026

short execs and reattach lose stdout: terminal event pushed before stream pumps drain

Bug

A short-lived box.exec like echo X && exit 0 returns with empty
stdout when read via the FFI / Python REST chain:

ex = await box.exec("sh", ["-c", "echo first-output && exit 0"], ...)
out = await collect_stream(ex.stdout())   # ""  ← lost
rc = await ex.wait()                      # exit_code = 0

The exec completed and its exit code is correct, but the stream
callback's stdout came back empty. In tight loops measured locally,
roughly 70% of execs lost their stdout.
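
The repro above calls collect_stream without showing it. A minimal sketch of such a helper, plus a way to compute a loss rate over repeated execs, might look like the following. It assumes ex.stdout() yields str chunks through an async iterator, which the PR does not confirm; fake_stdout and loss_rate are hypothetical names used only for illustration.

```python
import asyncio


async def collect_stream(stream):
    """Concatenate every chunk from an async iterator of str chunks.

    Assumption: the real ex.stdout() behaves like an async iterator.
    """
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


async def fake_stdout(chunks):
    """Hypothetical stand-in for ex.stdout(), used only for this sketch."""
    for chunk in chunks:
        yield chunk


def loss_rate(outputs, expected):
    """Fraction of runs whose collected stdout differs from the expected text."""
    lost = sum(1 for out in outputs if out != expected)
    return lost / len(outputs)
```

With this shape, a loss figure like the one quoted above is just loss_rate applied to the collected outputs of repeated runs.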

Root cause: execution_wait ran as an independent terminal task that
pushed the Wait event the moment wait_on_clone returned — with no
drain barrier. So Wait could land in the event queue ahead of
still-flushing stream chunks the stdout/stderr pumps were emitting.
The caller observed Wait, considered the exec done, and never pulled
the remaining chunks.
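
The ordering problem can be illustrated with a small asyncio model. This is a sketch of the race as described here, not the Rust code in execution.rs: racy_exec, read_until_wait, and the event tuples are hypothetical names, and the asyncio.sleep(0) stands in for a chunk that is still being flushed when the child exits.

```python
import asyncio


async def racy_exec(chunks):
    """Toy model of the pre-fix flow: Wait is pushed with no drain barrier."""
    events = asyncio.Queue()   # stands in for the FFI event queue
    exited = asyncio.Event()   # set when the simulated child exits

    async def stdout_pump():
        for chunk in chunks:
            await asyncio.sleep(0)            # chunk is still in flight
            events.put_nowait(("stdout", chunk))

    async def wait_task():
        await exited.wait()                   # models wait_on_clone returning
        events.put_nowait(("wait", 0))        # pushed immediately, no barrier

    pump = asyncio.create_task(stdout_pump())
    waiter = asyncio.create_task(wait_task())
    exited.set()                              # child exits before the pump flushes
    await asyncio.gather(pump, waiter)
    return [events.get_nowait() for _ in range(events.qsize())]


def read_until_wait(events):
    """A caller that treats Wait as 'done' and stops pulling events."""
    out = []
    for kind, payload in events:
        if kind == "wait":
            break
        if kind == "stdout":
            out.append(payload)
    return "".join(out)
```

In this model the Wait event is queued ahead of the stdout chunk, so a caller that stops at Wait reads an empty string even though the chunk does arrive later.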

Fix

Make exit_pump the sole owner of terminal-event dispatch:

  • execution_wait no longer spawns its own task; it registers a
    pending Wait into a shared list (pending_waits) and ensures
    exit_pump is spawned if it isn't already.
  • exit_pump's sequence becomes:
    1. wait_on_clone(process) — wait for child exit
    2. drain stream_done_rx (snapshot via std::mem::take)
    3. push Exit if exit_dispatch was registered
    4. push Wait per pending_waits registration
  • Queue order is always Stdout/Stderr* → Exit → Wait*, all from the
    same task. No inter-task sync, no race window.
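
The four-step sequence above can be modeled in a few lines of asyncio. This is an illustrative sketch under stated assumptions, not the PR's Rust implementation: exit_pump_model is a hypothetical name, the list of pump tasks stands in for stream_done_rx, and n_waits stands in for the number of entries in pending_waits.

```python
import asyncio


async def exit_pump_model(chunks, n_waits=2, exit_registered=True):
    """Toy model of the post-fix flow: one task owns terminal-event dispatch."""
    events = asyncio.Queue()   # stands in for the FFI event queue
    exited = asyncio.Event()   # set when the simulated child exits

    async def pump(kind, data):
        for chunk in data:
            await asyncio.sleep(0)            # chunk is still in flight
            events.put_nowait((kind, chunk))

    # Stand-in for stream_done_rx: one completion handle per stream pump.
    stream_done = [
        asyncio.create_task(pump("stdout", chunks)),
        asyncio.create_task(pump("stderr", [])),
    ]

    async def exit_pump():
        await exited.wait()                   # 1. wait for child exit
        await asyncio.gather(*stream_done)    # 2. drain barrier
        if exit_registered:
            events.put_nowait(("exit", 0))    # 3. push Exit if registered
        for _ in range(n_waits):
            events.put_nowait(("wait", 0))    # 4. push Wait per registration

    task = asyncio.create_task(exit_pump())
    exited.set()
    await task
    return [events.get_nowait() for _ in range(events.qsize())]
```

Because the terminal events are pushed only after every pump has finished, and from the same task, no stream chunk can appear after Exit or Wait in this model.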

A late boxlite_execution_wait call that races the exit_dispatched
flag is handled under the pending_waits lock: the late caller either
lands in the take()'d snapshot, or observes the flag already set and
spawns its own direct-push task. Either way its Wait is never lost.
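
The late-registration guarantee comes from doing the flag check and the list append under one lock. A minimal sketch of that pattern follows. WaitRegistry, register_wait, and take_snapshot are hypothetical names; only pending_waits and exit_dispatched come from the description above, and the sketch assumes the flag is set in the same critical section as the snapshot.

```python
import threading


class WaitRegistry:
    """Toy model of the pending_waits list guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending_waits = []
        self._exit_dispatched = False

    def register_wait(self, waiter):
        """Called by a wait request. Returns "queued" or "direct"."""
        with self._lock:
            if self._exit_dispatched:
                # Too late for the snapshot: the caller pushes Wait itself.
                return "direct"
            self._pending_waits.append(waiter)
            return "queued"

    def take_snapshot(self):
        """Called once by the exit pump after the streams have drained."""
        with self._lock:
            # Swap the list out (the analogue of std::mem::take) and set
            # the flag in the same critical section.
            snapshot, self._pending_waits = self._pending_waits, []
            self._exit_dispatched = True
            return snapshot
```

A waiter that registers before the snapshot is returned by take_snapshot; one that registers afterwards sees the flag and takes the direct path, so no registration falls between the two.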

Test plan — two-sided verified

Pins (both exist on main and in current PRs):

Stack: local e2e. libboxlite rebuilt (make dist:c), Python wheel
rebuilt (make dev:python), no runner change (FFI is SDK-side).

| Case | Pre-fix | Post-fix |
|---|---|---|
| test_p0_6_exec_stdout_race | ~70% loss (7/10 execs lost stdout) | 0% loss |
| test_reattach_after_original_completes | out='' on first exec (race kills original stdout, reattach can't recover) | out="first-output\n", reattach succeeds, exit codes match |

Two-sided check: starting from current main's execution.rs, both tests
fail with the loss documented above. After swapping in this PR's
execution.rs and rebuilding the wheel, both tests pass.

Note

Replaces closed PRs #685 and #705. The branch content (1860c81) is
unchanged; the previous PRs were closed by accidental force-pushes
during an unrelated cleanup of the #682 reorg. This PR re-opens the
exact same fix against the same branch.

🤖 Generated with Claude Code

The Go SDK's box.Exec would occasionally return with stdout chunks
still in flight — the user's callback got OnExit (or the wait gRPC
reply) before the matching OnStdout/OnStderr chunks landed on the
queue. From the caller's perspective, the exec had finished but its
stdout was silently truncated.

Root cause: execution_wait spawned an independent terminal task that
pushed the Wait event as soon as wait_on_clone returned — with no
drain barrier — so the wait reply could land in the event queue
ahead of still-flushing stream pumps.

Fix: make exit_pump the sole owner of terminal-event dispatch. Both
execution_wait and register_exit fan into it; exit_pump awaits all
stream pump receivers before pushing the terminal event. Queue order
becomes Stdout/Stderr* -> Exit -> Wait*, all from the same task.

Pin: test_p0_6_exec_stdout_race in boxlite-ai#678's e2e suite goes from ~70%
stdout-loss to 0%.

Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg
(now abandoned). Branch rebuilt against current main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Jun 10, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

G4614 commented Jun 10, 2026

Closed — #563 already covers stdout-drain at the level the failing test needs. I opened this without re-verifying #563's current scope. Sorry for the noise.

G4614 closed this Jun 10, 2026