Skip to content

fix(exec): preserve UTF-8 codepoints split across attach chunks#665

Open
ldarren wants to merge 3 commits into
boxlite-ai:mainfrom
ldarren:fix/utf8-stream-decode
Open

fix(exec): preserve UTF-8 codepoints split across attach chunks#665
ldarren wants to merge 3 commits into
boxlite-ai:mainfrom
ldarren:fix/utf8-stream-decode

Conversation

@ldarren

@ldarren ldarren commented Jun 6, 2026

Copy link
Copy Markdown

route_output decoded each ExecOutput chunk independently with String::from_utf8_lossy, so a multi-byte UTF-8 char straddling a chunk boundary became two U+FFFD replacements (one per side of the cut). For TUI workloads (pi-coding-agent, htop, ncdu) this desyncs the renderer's column math vs. what the client terminal sees, producing visible double-render / ghost-line artifacts during streaming output.

Fix

Hold the trailing partial codepoint across chunks via a per-stream Utf8StreamDecoder constructed inside spawn_attach. Flush any held bytes as U+FFFD on EOF to match prior from_utf8_lossy semantics for truncated tails.

Tests

7 new unit tests in exec.rs covering:

  • 3-byte and 4-byte codepoints split at every cut position
  • One-byte-at-a-time delivery
  • ASCII prefix + multi-byte tail
  • Definitively invalid bytes (still emit U+FFFD)
  • EOF flush
  • route_output integration shape (regression reproducer)

Two-side verified: 5/7 fail without the fix, 7/7 pass with it. Full make test:unit:rust (725+38) green. make test:unit:node (69) green against the rebuilt native module.

Summary by CodeRabbit

  • Bug Fixes

    • Output streaming preserves multi-byte Unicode across message boundaries, avoids spurious or duplicate replacement characters, and reliably drains held bytes on shutdown, error, or normal end for both stdout and stderr.
  • Tests

    • Expanded tests for boundary splits, invalid-byte replacement, EOF flushing, idempotent draining, and integration cases to prevent regressions.

@cla-assistant

cla-assistant Bot commented Jun 6, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown

Looking for one thing? Review this PR in Change Stack to search files, summaries, diffs, and code without losing your place.

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Stateful per-stream UTF-8 decoding was added: a private Utf8StreamDecoder preserves incomplete trailing bytes across decode() calls and provides flush(); spawn_attach creates decoders for stdout/stderr, threads them into route_output, and flushes them on all attach exit paths. Tests cover decoder boundaries and integration behavior.

Changes

UTF-8 Stream Decoding Fix

Layer / File(s) Summary
Utf8StreamDecoder Implementation
src/boxlite/src/portal/interfaces/exec.rs
Private Utf8StreamDecoder performs lossy UTF-8 decoding while holding 1–3 trailing bytes of incomplete multi-byte sequences across decode() calls and exposes flush() to emit replacement(s) at EOF.
spawn_attach and route_output Integration
src/boxlite/src/portal/interfaces/exec.rs
spawn_attach initializes independent decoders for stdout and stderr, passes &mut decoders into route_output for each ExecOutput chunk, replaces per-message from_utf8_lossy with streaming decoding, and flushes decoders on cancellation, error, or normal termination. Introduces flush_decoders helper to drain tails.
Decoder and Integration Tests
src/boxlite/src/portal/interfaces/exec.rs
Adds unit tests for decoder boundary cases (3-byte/4-byte splits, single-byte chunks, mixed ASCII + partial tails, invalid bytes, EOF flushing) and integration tests verifying route_output and flush_decoders avoid duplicate replacement characters and correctly drain stdout/stderr tails.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibble bytes split on the run,
Holding tails till the whole codepoint's spun.
I mark the lost with a careful sign,
Flush the rest so the output's fine.
Hop, decode, and let the terminal hum!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main fix: preserving UTF-8 codepoints that are split across streaming chunks, which is the core problem addressed by the implementation changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@cla-assistant

cla-assistant Bot commented Jun 6, 2026

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Darren Liew seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@ldarren ldarren force-pushed the fix/utf8-stream-decode branch from 35ad76c to 089026f Compare June 6, 2026 07:02

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/boxlite/src/portal/interfaces/exec.rs (1)

317-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Flush decoder holdover before breaking on attach stream error.

On Line 317, the error branch breaks without flushing stdout_decoder/stderr_decoder, so trailing partial UTF-8 bytes can be lost on transport errors even though EOF path flushes them.

Proposed fix
                             Some(Err(e)) => {
                                 tracing::debug!(
                                     execution_id = %execution_id,
                                     error = %e,
                                     message_count,
                                     "Attach stream error, breaking"
                                 );
+                                let stdout_tail = stdout_decoder.flush();
+                                if !stdout_tail.is_empty() {
+                                    let _ = stdout_tx.send(stdout_tail);
+                                }
+                                let stderr_tail = stderr_decoder.flush();
+                                if !stderr_tail.is_empty() {
+                                    let _ = stderr_tx.send(stderr_tail);
+                                }
                                 let _ = stderr_tx.send(format!("Attach stream error: {}", e));
                                 break;
                             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/boxlite/src/portal/interfaces/exec.rs` around lines 317 - 326, The
attach-stream error branch currently breaks without flushing decoder holdover,
which can drop trailing partial UTF-8 bytes; before breaking in the Some(Err(e))
arm, call the flush/finish routine on both stdout_decoder and stderr_decoder to
extract any remaining valid text and send those strings to their respective
channels (e.g. stdout_tx and stderr_tx) prior to sending the "Attach stream
error" message and breaking; ensure you still log execution_id and error as
before and perform the decoder flushes in that branch (refer to symbols
stdout_decoder, stderr_decoder, stdout_tx, stderr_tx, execution_id).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/boxlite/src/portal/interfaces/exec.rs`:
- Around line 317-326: The attach-stream error branch currently breaks without
flushing decoder holdover, which can drop trailing partial UTF-8 bytes; before
breaking in the Some(Err(e)) arm, call the flush/finish routine on both
stdout_decoder and stderr_decoder to extract any remaining valid text and send
those strings to their respective channels (e.g. stdout_tx and stderr_tx) prior
to sending the "Attach stream error" message and breaking; ensure you still log
execution_id and error as before and perform the decoder flushes in that branch
(refer to symbols stdout_decoder, stderr_decoder, stdout_tx, stderr_tx,
execution_id).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 913ba10c-5893-4b63-8039-8aabd271034c

📥 Commits

Reviewing files that changed from the base of the PR and between fe557e7 and 35ad76c.

📒 Files selected for processing (1)
  • src/boxlite/src/portal/interfaces/exec.rs

@DorianZheng

Copy link
Copy Markdown
Member

Hi @ldarren. Thanks for file this PR. Please sign the CLA.

@ldarren ldarren force-pushed the fix/utf8-stream-decode branch from 4845c59 to 06796bf Compare June 6, 2026 09:16
@ldarren ldarren force-pushed the fix/utf8-stream-decode branch from 06796bf to 57796d9 Compare June 6, 2026 09:18
@DorianZheng

Copy link
Copy Markdown
Member

decode() can still emit doubled U+FFFD when invalid bytes sit immediately before a codepoint that splits on the chunk boundary — the exact failure this PR targets, just gated behind a non-utf8 prefix.

the error_len() == Some(n) arm runs from_utf8_lossy over the whole remaining tail, so a trailing incomplete-but-valid sequence at the end of that tail gets flattened to U+FFFD instead of being held for the next chunk.

repro (replicated Utf8StreamDecoder::decode verbatim, <FFFD> = U+FFFD):

[0xFF, 0xE2] then [0x94, 0x80]      # garbage 0xFF, then "─" (E2 94 80) split across the boundary
  got:  <FFFD><FFFD><FFFD><FFFD>
  want: <FFFD>─

[0xFF,0xFF,0xF0,0x9F] then [0x91,0x8B]   # 2 garbage + "👋" split
  got:  <FFFD><FFFD><FFFD><FFFD><FFFD>
  want: <FFFD><FFFD>👋

[0xE2] then [0x94,0x80]             # clean split, no garbage — fine today
  got:  ─

the current tests cover clean splits and isolated invalid bytes, so this slips through. it matters for non-tui exec output (binary dumps, latin-1) where invalid bytes are common and any adjacent split corrupts a real char.

fix is to iterate the error path — emit one U+FFFD per definite error, hold only a final error_len() == None tail (also drops the now-dead tail.len() < 4 guard):

fn decode(&mut self, chunk: &[u8]) -> String {
    let mut buf = std::mem::take(&mut self.holdover);
    buf.extend_from_slice(chunk);
    let mut out = String::new();
    let mut rest: &[u8] = &buf;
    loop {
        match std::str::from_utf8(rest) {
            Ok(s) => { out.push_str(s); break; }
            Err(e) => {
                let v = e.valid_up_to();
                out.push_str(unsafe { std::str::from_utf8_unchecked(&rest[..v]) });
                match e.error_len() {
                    Some(n) => { out.push('\u{FFFD}'); rest = &rest[v + n..]; } // definite error: 1 FFFD, keep scanning
                    None    => { self.holdover = rest[v..].to_vec(); break; }   // incomplete tail: hold for next chunk
                }
            }
        }
    }
    out
}

worth adding a test with an invalid byte immediately before a split codepoint to lock it in.

@DorianZheng DorianZheng left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two optional cleanups on the decoder; neither blocks the fix.

// (clean chunk boundaries), holdover is empty and we still need to
// own the bytes to splice in any new partial tail.
let mut buf = std::mem::take(&mut self.holdover);
buf.extend_from_slice(chunk);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the buf Vec is allocated and the whole chunk copied on every call — including the common case (empty holdover + fully-valid chunk) where it's pure overhead: that path then does from_utf8(&buf) + .to_string(), so two allocs and a full memcpy per frame. the comment calls the allocation unavoidable, but when holdover is empty you can decode chunk in place and only allocate a Vec for the ≤3-byte tail you actually hold:

let buf: Cow<[u8]> = if self.holdover.is_empty() {
    Cow::Borrowed(chunk)
} else {
    let mut b = std::mem::take(&mut self.holdover);
    b.extend_from_slice(chunk);
    Cow::Owned(b)
};

on a busy tui stream that's one alloc + one chunk-copy saved per message. (composes with the error-path fix from the other comment.)

output: ExecOutput,
stdout_tx: &mpsc::UnboundedSender<String>,
stderr_tx: &mpsc::UnboundedSender<String>,
stdout_decoder: &mut Utf8StreamDecoder,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

route_output (5 args) and flush_decoders (4 args) thread the same four values — stdout_tx, stderr_tx, and the two decoders — and the attach loop re-stitches them at four call sites (one route + three exit-path flushes). each sender and its decoder are really one unit per stream; grouping the (tx, decoder) pairs so the loop holds one value per stream and calls .route(output) / .flush() would drop the arg threading and keep decode and flush together instead of split across two free fns. minor, but it's coupling that grows with every new exit path — this change already added three.

@DorianZheng DorianZheng added the e2e-test Triggers E2E integration tests on self-hosted runner label Jun 7, 2026
The streaming decoder's error path lossy-decoded the entire remaining
tail in one shot, so a still-incomplete codepoint sitting immediately
after an invalid byte was flattened into spurious U+FFFD instead of being
held for the next chunk. This reintroduced the doubled-replacement
artifact this PR fixes, gated behind a non-UTF-8 prefix — common in
binary or latin-1 exec output, e.g. [FF E2] then [94 80] rendered
"<FFFD><FFFD><FFFD><FFFD>" instead of "<FFFD>─".

Walk the buffer instead: emit one U+FFFD per definitively invalid
sequence and resume scanning after it, holding only a final
incomplete-but-valid tail. Also take the empty-holdover fast path
(borrow the chunk, skipping the per-message buffer alloc + copy).

Adds a reproducer test (invalid byte before a 3- and 4-byte split);
fails on the prior decoder, passes now.
@DorianZheng DorianZheng requested a review from a team as a code owner June 10, 2026 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

e2e-test Triggers E2E integration tests on self-hosted runner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants