fix(exec): preserve UTF-8 codepoints split across attach chunks#665
fix(exec): preserve UTF-8 codepoints split across attach chunks#665ldarren wants to merge 3 commits into
Conversation
|
Looking for one thing? Review this PR in Change Stack to search files, summaries, diffs, and code without losing your place. Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughStateful per-stream UTF-8 decoding was added: a private Utf8StreamDecoder preserves incomplete trailing bytes across decode() calls and provides flush(); spawn_attach creates decoders for stdout/stderr, threads them into route_output, and flushes them on all attach exit paths. Tests cover decoder boundaries and integration behavior. ChangesUTF-8 Stream Decoding Fix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Darren Liew seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
35ad76c to
089026f
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/boxlite/src/portal/interfaces/exec.rs (1)
317-326:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFlush decoder holdover before breaking on attach stream error.
On Line 317, the error branch breaks without flushing
stdout_decoder/stderr_decoder, so trailing partial UTF-8 bytes can be lost on transport errors even though EOF path flushes them.Proposed fix
Some(Err(e)) => { tracing::debug!( execution_id = %execution_id, error = %e, message_count, "Attach stream error, breaking" ); + let stdout_tail = stdout_decoder.flush(); + if !stdout_tail.is_empty() { + let _ = stdout_tx.send(stdout_tail); + } + let stderr_tail = stderr_decoder.flush(); + if !stderr_tail.is_empty() { + let _ = stderr_tx.send(stderr_tail); + } let _ = stderr_tx.send(format!("Attach stream error: {}", e)); break; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/boxlite/src/portal/interfaces/exec.rs` around lines 317 - 326, The attach-stream error branch currently breaks without flushing decoder holdover, which can drop trailing partial UTF-8 bytes; before breaking in the Some(Err(e)) arm, call the flush/finish routine on both stdout_decoder and stderr_decoder to extract any remaining valid text and send those strings to their respective channels (e.g. stdout_tx and stderr_tx) prior to sending the "Attach stream error" message and breaking; ensure you still log execution_id and error as before and perform the decoder flushes in that branch (refer to symbols stdout_decoder, stderr_decoder, stdout_tx, stderr_tx, execution_id).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/boxlite/src/portal/interfaces/exec.rs`:
- Around line 317-326: The attach-stream error branch currently breaks without
flushing decoder holdover, which can drop trailing partial UTF-8 bytes; before
breaking in the Some(Err(e)) arm, call the flush/finish routine on both
stdout_decoder and stderr_decoder to extract any remaining valid text and send
those strings to their respective channels (e.g. stdout_tx and stderr_tx) prior
to sending the "Attach stream error" message and breaking; ensure you still log
execution_id and error as before and perform the decoder flushes in that branch
(refer to symbols stdout_decoder, stderr_decoder, stdout_tx, stderr_tx,
execution_id).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 913ba10c-5893-4b63-8039-8aabd271034c
📒 Files selected for processing (1)
src/boxlite/src/portal/interfaces/exec.rs
|
Hi @ldarren. Thanks for file this PR. Please sign the CLA. |
4845c59 to
06796bf
Compare
06796bf to
57796d9
Compare
|
the repro (replicated the current tests cover clean splits and isolated invalid bytes, so this slips through. it matters for non-tui exec output (binary dumps, latin-1) where invalid bytes are common and any adjacent split corrupts a real char. fix is to iterate the error path — emit one fn decode(&mut self, chunk: &[u8]) -> String {
let mut buf = std::mem::take(&mut self.holdover);
buf.extend_from_slice(chunk);
let mut out = String::new();
let mut rest: &[u8] = &buf;
loop {
match std::str::from_utf8(rest) {
Ok(s) => { out.push_str(s); break; }
Err(e) => {
let v = e.valid_up_to();
out.push_str(unsafe { std::str::from_utf8_unchecked(&rest[..v]) });
match e.error_len() {
Some(n) => { out.push('\u{FFFD}'); rest = &rest[v + n..]; } // definite error: 1 FFFD, keep scanning
None => { self.holdover = rest[v..].to_vec(); break; } // incomplete tail: hold for next chunk
}
}
}
}
out
}worth adding a test with an invalid byte immediately before a split codepoint to lock it in. |
DorianZheng
left a comment
There was a problem hiding this comment.
two optional cleanups on the decoder; neither blocks the fix.
| // (clean chunk boundaries), holdover is empty and we still need to | ||
| // own the bytes to splice in any new partial tail. | ||
| let mut buf = std::mem::take(&mut self.holdover); | ||
| buf.extend_from_slice(chunk); |
There was a problem hiding this comment.
the buf Vec is allocated and the whole chunk copied on every call — including the common case (empty holdover + fully-valid chunk) where it's pure overhead: that path then does from_utf8(&buf) + .to_string(), so two allocs and a full memcpy per frame. the comment calls the allocation unavoidable, but when holdover is empty you can decode chunk in place and only allocate a Vec for the ≤3-byte tail you actually hold:
let buf: Cow<[u8]> = if self.holdover.is_empty() {
Cow::Borrowed(chunk)
} else {
let mut b = std::mem::take(&mut self.holdover);
b.extend_from_slice(chunk);
Cow::Owned(b)
};on a busy tui stream that's one alloc + one chunk-copy saved per message. (composes with the error-path fix from the other comment.)
| output: ExecOutput, | ||
| stdout_tx: &mpsc::UnboundedSender<String>, | ||
| stderr_tx: &mpsc::UnboundedSender<String>, | ||
| stdout_decoder: &mut Utf8StreamDecoder, |
There was a problem hiding this comment.
route_output (5 args) and flush_decoders (4 args) thread the same four values — stdout_tx, stderr_tx, and the two decoders — and the attach loop re-stitches them at four call sites (one route + three exit-path flushes). each sender and its decoder are really one unit per stream; grouping the (tx, decoder) pairs so the loop holds one value per stream and calls .route(output) / .flush() would drop the arg threading and keep decode and flush together instead of split across two free fns. minor, but it's coupling that grows with every new exit path — this change already added three.
The streaming decoder's error path lossy-decoded the entire remaining tail in one shot, so a still-incomplete codepoint sitting immediately after an invalid byte was flattened into spurious U+FFFD instead of being held for the next chunk. This reintroduced the doubled-replacement artifact this PR fixes, gated behind a non-UTF-8 prefix — common in binary or latin-1 exec output, e.g. [FF E2] then [94 80] rendered "<FFFD><FFFD><FFFD><FFFD>" instead of "<FFFD>─". Walk the buffer instead: emit one U+FFFD per definitively invalid sequence and resume scanning after it, holding only a final incomplete-but-valid tail. Also take the empty-holdover fast path (borrow the chunk, skipping the per-message buffer alloc + copy). Adds a reproducer test (invalid byte before a 3- and 4-byte split); fails on the prior decoder, passes now.
route_outputdecoded eachExecOutputchunk independently withString::from_utf8_lossy, so a multi-byte UTF-8 char straddling a chunk boundary became twoU+FFFDreplacements (one per side of the cut). For TUI workloads (pi-coding-agent, htop, ncdu) this desyncs the renderer's column math vs. what the client terminal sees, producing visible double-render / ghost-line artifacts during streaming output.Fix
Hold the trailing partial codepoint across chunks via a per-stream
Utf8StreamDecoderconstructed insidespawn_attach. Flush any held bytes asU+FFFDon EOF to match priorfrom_utf8_lossysemantics for truncated tails.Tests
7 new unit tests in
exec.rscovering:U+FFFD)route_outputintegration shape (regression reproducer)Two-side verified: 5/7 fail without the fix, 7/7 pass with it. Full
make test:unit:rust(725+38) green.make test:unit:node(69) green against the rebuilt native module.Summary by CodeRabbit
Bug Fixes
Tests