fix(ffi): drain stdout/stderr pumps before Wait terminal event (re-open of #685/#705) #714
Closed
G4614 wants to merge 1 commit into
The Go SDK's box.Exec would occasionally return with stdout chunks still in flight: the user's callback got OnExit (or the wait gRPC reply) before the matching OnStdout/OnStderr chunks landed on the queue. From the caller's perspective, the exec had finished but its stdout was silently truncated.

Root cause: execution_wait spawned an independent terminal task that pushed the Wait event as soon as wait_on_clone returned, with no drain barrier, so the wait reply could land in the event queue ahead of still-flushing stream pumps.

Fix: make exit_pump the sole owner of terminal-event dispatch. Both execution_wait and register_exit fan into it; exit_pump awaits all stream pump receivers before pushing the terminal event. Queue order becomes Stdout/Stderr* -> Exit -> Wait*, all from the same task.

Pin: test_p0_6_exec_stdout_race in boxlite-ai#678's e2e suite goes from ~70% stdout-loss to 0%.

Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg (now abandoned). Branch rebuilt against current main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
short execs and reattach lose stdout: terminal event pushed before stream pumps drain
Bug
A short-lived `box.exec` like `echo X && exit 0` returns with empty stdout when read via the FFI / Python REST chain.
The exec completed (exit code is right) but the stream callback's
stdout came back empty. ~70% loss rate on tight loops measured locally.
Root cause: `execution_wait` ran as an independent terminal task that pushed the `Wait` event the moment `wait_on_clone` returned, with no drain barrier. So `Wait` could land in the event queue ahead of still-flushing stream chunks the stdout/stderr pumps were emitting. The caller observed `Wait`, considered the exec done, and never pulled the remaining chunks.
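The failure mode can be modelled without any concurrency. The sketch below is illustrative only: the `Event` enum and `collect_until_wait` are hypothetical names, not the real FFI types. It shows a caller that stops reading at the terminal event, fed first with the racy queue order and then with the drained order.

```rust
use std::collections::VecDeque;

// Hypothetical event type; the real FFI event enum is not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Stdout(String),
    Wait(i32),
}

/// Models a caller that treats `Wait` as "exec is done" and stops reading.
/// Returns the stdout collected so far and the exit code, if one was seen.
fn collect_until_wait(mut queue: VecDeque<Event>) -> (String, Option<i32>) {
    let mut out = String::new();
    while let Some(ev) = queue.pop_front() {
        match ev {
            Event::Stdout(chunk) => out.push_str(&chunk),
            // Anything still queued behind the terminal event is never read.
            Event::Wait(code) => return (out, Some(code)),
        }
    }
    (out, None)
}

fn main() {
    // Racy order: Wait was enqueued before the pump flushed its chunk.
    let racy = VecDeque::from(vec![Event::Wait(0), Event::Stdout("X\n".to_string())]);
    // Drained order: every chunk precedes the terminal event.
    let drained = VecDeque::from(vec![Event::Stdout("X\n".to_string()), Event::Wait(0)]);

    println!("racy:    {:?}", collect_until_wait(racy));
    println!("drained: {:?}", collect_until_wait(drained));
}
```

In the racy case the exit code is correct but the collected stdout is empty, which matches the symptom described above.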
Fix
Make `exit_pump` the sole owner of terminal-event dispatch:

- `execution_wait` no longer spawns its own task; it registers a pending Wait into a shared list (`pending_waits`) and ensures `exit_pump` is spawned if it isn't already.
- `exit_pump`'s sequence becomes:
  1. `wait_on_clone(process)`: wait for child exit
  2. await every `stream_done_rx` receiver (snapshot via `std::mem::take`)
  3. push `Exit` if `exit_dispatch` was registered
  4. push one `Wait` per `pending_waits` registration

Queue order becomes `Stdout/Stderr* → Exit → Wait*`, all from the same task. No inter-task sync, no race window.
Late `boxlite_execution_wait` racing the `exit_dispatched` flag is covered under the `pending_waits` lock: the late caller is either part of the `take()`'d snapshot, or observes the flag and spawns its own direct-push task. Never lost.
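A minimal sketch of that late-caller guarantee, assuming the flag and the pending list sit behind one mutex. The `WaitState` layout, the `Registration` enum, and both function names are hypothetical; only `exit_dispatched`, `pending_waits`, and the use of `std::mem::take` come from the description.

```rust
use std::sync::Mutex;

/// Hypothetical shared state: flag and list are guarded by the same lock.
#[derive(Default)]
struct WaitState {
    exit_dispatched: bool,
    pending_waits: Vec<u32>, // waiter ids, for illustration
}

#[derive(Debug, PartialEq)]
enum Registration {
    Queued,     // exit_pump will push Wait for this caller
    DirectPush, // exit already dispatched; caller pushes its own Wait
}

/// Caller side: check the flag and register in one critical section.
fn register_wait(state: &Mutex<WaitState>, id: u32) -> Registration {
    let mut s = state.lock().unwrap();
    if s.exit_dispatched {
        Registration::DirectPush
    } else {
        s.pending_waits.push(id);
        Registration::Queued
    }
}

/// exit_pump side: set the flag and snapshot the list in one critical section.
fn take_waits(state: &Mutex<WaitState>) -> Vec<u32> {
    let mut s = state.lock().unwrap();
    s.exit_dispatched = true;
    std::mem::take(&mut s.pending_waits)
}

fn main() {
    let state = Mutex::new(WaitState::default());
    let early = register_wait(&state, 1); // arrives before the snapshot
    let snapshot = take_waits(&state);
    let late = register_wait(&state, 2); // arrives after the snapshot
    println!("early={:?} snapshot={:?} late={:?}", early, snapshot, late);
}
```

Since both sides take the same lock, a waiter either lands in the snapshot or sees the flag already set; there is no interleaving in which it does neither.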
Test plan — two-sided verified
Pins (both exist on main and in current PRs):

- `scripts/test/e2e/cases/test_p0_6_exec_stdout_race.py` (from "test(e2e): SDK→API→Runner→VM regression suite" #678): 10 short execs, expects ≤5% stdout loss
- `scripts/test/e2e/cases/test_exec_attach.py::test_reattach_after_original_completes` (in "test(e2e): add cli-detach-recovery, exec-attach, volume-readonly cases" #710): short exec then reattach to confirm contract
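For a sense of what the ≤5% threshold means over 10 execs, here is a small arithmetic sketch. The actual test is a Python e2e case whose internals are not shown here, so `loss_rate` and the definition of "loss" as an empty stdout are assumptions for illustration.

```rust
/// Fraction of runs whose captured stdout came back empty (assumed definition of loss).
fn loss_rate(outputs: &[&str]) -> f64 {
    if outputs.is_empty() {
        return 0.0;
    }
    let lost = outputs.iter().filter(|o| o.is_empty()).count();
    lost as f64 / outputs.len() as f64
}

fn main() {
    let threshold = 0.05;

    // Ten runs, none lost: passes.
    let clean = ["X\n"; 10];
    println!("clean: rate={} pass={}", loss_rate(&clean), loss_rate(&clean) <= threshold);

    // Ten runs, one lost: 1/10 is already above a 5% threshold.
    let mut one_lost = ["X\n"; 10];
    one_lost[3] = "";
    println!("one lost: rate={} pass={}", loss_rate(&one_lost), loss_rate(&one_lost) <= threshold);
}
```

Under this reading, a single empty stdout in a 10-exec run is a 10% loss rate, so the pin effectively tolerates no loss at all.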
Stack: local e2e. libboxlite rebuilt (`make dist:c`), Python wheel rebuilt (`make dev:python`), no runner change (FFI is SDK-side).

Tests run: `test_p0_6_exec_stdout_race` and `test_reattach_after_original_completes`. For the reattach test:

- on main: `out=''` on first exec (race kills original stdout, reattach can't recover)
- with this PR: `out="first-output\n"`, reattach succeeds, exit codes match

Two-side: started from current main's `execution.rs` (both tests fail with documented loss). Injected this PR's `execution.rs`, rebuilt the wheel, ran tests: both pass.
Note
Replaces closed PRs #685 and #705. The branch content (1860c81) is
unchanged; previous PRs were closed by accidental force-pushes during
an unrelated #682 reorganize cleanup. This PR re-opens the exact same
fix against the same branch.
🤖 Generated with Claude Code