fix(exec): preserve UTF-8 codepoints split across attach chunks by ldarren · Pull Request #665 · boxlite-ai/boxlite

ldarren · 2026-06-06T07:00:38Z

route_output decoded each ExecOutput chunk independently with String::from_utf8_lossy, so a multi-byte UTF-8 char straddling a chunk boundary became two U+FFFD replacements (one per side of the cut). For TUI workloads (pi-coding-agent, htop, ncdu) this desyncs the renderer's column math vs. what the client terminal sees, producing visible double-render / ghost-line artifacts during streaming output.

Fix

Hold the trailing partial codepoint across chunks via a per-stream Utf8StreamDecoder constructed inside spawn_attach. Flush any held bytes as U+FFFD on EOF to match prior from_utf8_lossy semantics for truncated tails.

Tests

7 new unit tests in exec.rs covering:

3-byte and 4-byte codepoints split at every cut position
One-byte-at-a-time delivery
ASCII prefix + multi-byte tail
Definitively invalid bytes (still emit U+FFFD)
EOF flush
route_output integration shape (regression reproducer)

Two-side verified: 5/7 fail without the fix, 7/7 pass with it. Full make test:unit:rust (725+38) green. make test:unit:node (69) green against the rebuilt native module.

Summary by CodeRabbit

Bug Fixes
- Output streaming preserves multi-byte Unicode across message boundaries, avoids spurious or duplicate replacement characters, and reliably drains held bytes on shutdown, error, or normal end for both stdout and stderr.
Tests
- Expanded tests for boundary splits, invalid-byte replacement, EOF flushing, idempotent draining, and integration cases to prevent regressions.

cla-assistant · 2026-06-06T07:00:51Z

All committers have signed the CLA.

coderabbitai · 2026-06-06T07:00:51Z

Looking for one thing? Review this PR in Change Stack to search files, summaries, diffs, and code without losing your place.

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Stateful per-stream UTF-8 decoding was added: a private Utf8StreamDecoder preserves incomplete trailing bytes across decode() calls and provides flush(); spawn_attach creates decoders for stdout/stderr, threads them into route_output, and flushes them on all attach exit paths. Tests cover decoder boundaries and integration behavior.

Changes

UTF-8 Stream Decoding Fix

Layer / File(s)	Summary
Utf8StreamDecoder Implementation `src/boxlite/src/portal/interfaces/exec.rs`	Private `Utf8StreamDecoder` performs lossy UTF-8 decoding while holding 1–3 trailing bytes of incomplete multi-byte sequences across `decode()` calls and exposes `flush()` to emit replacement(s) at EOF.
spawn_attach and route_output Integration `src/boxlite/src/portal/interfaces/exec.rs`	`spawn_attach` initializes independent decoders for stdout and stderr, passes `&mut` decoders into `route_output` for each ExecOutput chunk, replaces per-message `from_utf8_lossy` with streaming decoding, and flushes decoders on cancellation, error, or normal termination. Introduces `flush_decoders` helper to drain tails.
Decoder and Integration Tests `src/boxlite/src/portal/interfaces/exec.rs`	Adds unit tests for decoder boundary cases (3-byte/4-byte splits, single-byte chunks, mixed ASCII + partial tails, invalid bytes, EOF flushing) and integration tests verifying `route_output` and `flush_decoders` avoid duplicate replacement characters and correctly drain stdout/stderr tails.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibble bytes split on the run,
Holding tails till the whole codepoint's spun.
I mark the lost with a careful sign,
Flush the rest so the output's fine.
Hop, decode, and let the terminal hum!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly describes the main fix: preserving UTF-8 codepoints that are split across streaming chunks, which is the core problem addressed by the implementation changes.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

cla-assistant · 2026-06-06T07:00:51Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Darren Liew seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/boxlite/src/portal/interfaces/exec.rs (1)

317-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Flush decoder holdover before breaking on attach stream error.

On Line 317, the error branch breaks without flushing stdout_decoder/stderr_decoder, so trailing partial UTF-8 bytes can be lost on transport errors even though EOF path flushes them.

Proposed fix

                             Some(Err(e)) => {
                                 tracing::debug!(
                                     execution_id = %execution_id,
                                     error = %e,
                                     message_count,
                                     "Attach stream error, breaking"
                                 );
+                                let stdout_tail = stdout_decoder.flush();
+                                if !stdout_tail.is_empty() {
+                                    let _ = stdout_tx.send(stdout_tail);
+                                }
+                                let stderr_tail = stderr_decoder.flush();
+                                if !stderr_tail.is_empty() {
+                                    let _ = stderr_tx.send(stderr_tail);
+                                }
                                 let _ = stderr_tx.send(format!("Attach stream error: {}", e));
                                 break;
                             }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/boxlite/src/portal/interfaces/exec.rs` around lines 317 - 326, The
attach-stream error branch currently breaks without flushing decoder holdover,
which can drop trailing partial UTF-8 bytes; before breaking in the Some(Err(e))
arm, call the flush/finish routine on both stdout_decoder and stderr_decoder to
extract any remaining valid text and send those strings to their respective
channels (e.g. stdout_tx and stderr_tx) prior to sending the "Attach stream
error" message and breaking; ensure you still log execution_id and error as
before and perform the decoder flushes in that branch (refer to symbols
stdout_decoder, stderr_decoder, stdout_tx, stderr_tx, execution_id).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/boxlite/src/portal/interfaces/exec.rs`:
- Around line 317-326: The attach-stream error branch currently breaks without
flushing decoder holdover, which can drop trailing partial UTF-8 bytes; before
breaking in the Some(Err(e)) arm, call the flush/finish routine on both
stdout_decoder and stderr_decoder to extract any remaining valid text and send
those strings to their respective channels (e.g. stdout_tx and stderr_tx) prior
to sending the "Attach stream error" message and breaking; ensure you still log
execution_id and error as before and perform the decoder flushes in that branch
(refer to symbols stdout_decoder, stderr_decoder, stdout_tx, stderr_tx,
execution_id).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 913ba10c-5893-4b63-8039-8aabd271034c

📥 Commits

Reviewing files that changed from the base of the PR and between fe557e7 and 35ad76c.

📒 Files selected for processing (1)

src/boxlite/src/portal/interfaces/exec.rs

DorianZheng · 2026-06-06T07:17:26Z

Hi @ldarren. Thanks for file this PR. Please sign the CLA.

DorianZheng · 2026-06-07T06:49:44Z

decode() can still emit doubled U+FFFD when invalid bytes sit immediately before a codepoint that splits on the chunk boundary — the exact failure this PR targets, just gated behind a non-utf8 prefix.

the error_len() == Some(n) arm runs from_utf8_lossy over the whole remaining tail, so a trailing incomplete-but-valid sequence at the end of that tail gets flattened to U+FFFD instead of being held for the next chunk.

repro (replicated Utf8StreamDecoder::decode verbatim, <FFFD> = U+FFFD):

[0xFF, 0xE2] then [0x94, 0x80]      # garbage 0xFF, then "─" (E2 94 80) split across the boundary
  got:  <FFFD><FFFD><FFFD><FFFD>
  want: <FFFD>─

[0xFF,0xFF,0xF0,0x9F] then [0x91,0x8B]   # 2 garbage + "👋" split
  got:  <FFFD><FFFD><FFFD><FFFD><FFFD>
  want: <FFFD><FFFD>👋

[0xE2] then [0x94,0x80]             # clean split, no garbage — fine today
  got:  ─

the current tests cover clean splits and isolated invalid bytes, so this slips through. it matters for non-tui exec output (binary dumps, latin-1) where invalid bytes are common and any adjacent split corrupts a real char.

fix is to iterate the error path — emit one U+FFFD per definite error, hold only a final error_len() == None tail (also drops the now-dead tail.len() < 4 guard):

fn decode(&mut self, chunk: &[u8]) -> String {
    let mut buf = std::mem::take(&mut self.holdover);
    buf.extend_from_slice(chunk);
    let mut out = String::new();
    let mut rest: &[u8] = &buf;
    loop {
        match std::str::from_utf8(rest) {
            Ok(s) => { out.push_str(s); break; }
            Err(e) => {
                let v = e.valid_up_to();
                out.push_str(unsafe { std::str::from_utf8_unchecked(&rest[..v]) });
                match e.error_len() {
                    Some(n) => { out.push('\u{FFFD}'); rest = &rest[v + n..]; } // definite error: 1 FFFD, keep scanning
                    None    => { self.holdover = rest[v..].to_vec(); break; }   // incomplete tail: hold for next chunk
                }
            }
        }
    }
    out
}

worth adding a test with an invalid byte immediately before a split codepoint to lock it in.

DorianZheng

two optional cleanups on the decoder; neither blocks the fix.

DorianZheng · 2026-06-07T06:53:41Z

+        // (clean chunk boundaries), holdover is empty and we still need to
+        // own the bytes to splice in any new partial tail.
+        let mut buf = std::mem::take(&mut self.holdover);
+        buf.extend_from_slice(chunk);


the buf Vec is allocated and the whole chunk copied on every call — including the common case (empty holdover + fully-valid chunk) where it's pure overhead: that path then does from_utf8(&buf) + .to_string(), so two allocs and a full memcpy per frame. the comment calls the allocation unavoidable, but when holdover is empty you can decode chunk in place and only allocate a Vec for the ≤3-byte tail you actually hold:

let buf: Cow<[u8]> = if self.holdover.is_empty() { Cow::Borrowed(chunk) } else { let mut b = std::mem::take(&mut self.holdover); b.extend_from_slice(chunk); Cow::Owned(b) };

on a busy tui stream that's one alloc + one chunk-copy saved per message. (composes with the error-path fix from the other comment.)

DorianZheng · 2026-06-07T06:53:41Z

        output: ExecOutput,
        stdout_tx: &mpsc::UnboundedSender<String>,
        stderr_tx: &mpsc::UnboundedSender<String>,
+        stdout_decoder: &mut Utf8StreamDecoder,


route_output (5 args) and flush_decoders (4 args) thread the same four values — stdout_tx, stderr_tx, and the two decoders — and the attach loop re-stitches them at four call sites (one route + three exit-path flushes). each sender and its decoder are really one unit per stream; grouping the (tx, decoder) pairs so the loop holds one value per stream and calls .route(output) / .flush() would drop the arg threading and keep decode and flush together instead of split across two free fns. minor, but it's coupling that grows with every new exit path — this change already added three.

The streaming decoder's error path lossy-decoded the entire remaining tail in one shot, so a still-incomplete codepoint sitting immediately after an invalid byte was flattened into spurious U+FFFD instead of being held for the next chunk. This reintroduced the doubled-replacement artifact this PR fixes, gated behind a non-UTF-8 prefix — common in binary or latin-1 exec output, e.g. [FF E2] then [94 80] rendered "<FFFD><FFFD><FFFD><FFFD>" instead of "<FFFD>─". Walk the buffer instead: emit one U+FFFD per definitively invalid sequence and resume scanning after it, holding only a final incomplete-but-valid tail. Also take the empty-holdover fast path (borrow the chunk, skipping the per-message buffer alloc + copy). Adds a reproducer test (invalid byte before a 3- and 4-byte split); fails on the prior decoder, passes now.

ldarren force-pushed the fix/utf8-stream-decode branch from 35ad76c to 089026f Compare June 6, 2026 07:02

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

ldarren force-pushed the fix/utf8-stream-decode branch from 4845c59 to 06796bf Compare June 6, 2026 09:16

ldarren added 2 commits June 6, 2026 17:18

fix(exec): preserve UTF-8 codepoints split across attach chunks

5d3bdeb

fix(exec): flush UTF-8 decoder holdover on every attach exit path

57796d9

ldarren force-pushed the fix/utf8-stream-decode branch from 06796bf to 57796d9 Compare June 6, 2026 09:18

DorianZheng reviewed Jun 7, 2026

View reviewed changes

DorianZheng added the e2e-test Triggers E2E integration tests on self-hosted runner label Jun 7, 2026

DorianZheng requested a review from a team as a code owner June 10, 2026 10:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(exec): preserve UTF-8 codepoints split across attach chunks#665

fix(exec): preserve UTF-8 codepoints split across attach chunks#665
ldarren wants to merge 3 commits into
boxlite-ai:mainfrom
ldarren:fix/utf8-stream-decode

ldarren commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

cla-assistant Bot commented Jun 6, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

cla-assistant Bot commented Jun 6, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

DorianZheng commented Jun 6, 2026

Uh oh!

DorianZheng commented Jun 7, 2026

Uh oh!

DorianZheng left a comment

Uh oh!

DorianZheng Jun 7, 2026

Uh oh!

DorianZheng Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ldarren commented Jun 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix

Tests

Summary by CodeRabbit

Uh oh!

cla-assistant Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

cla-assistant Bot commented Jun 6, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

DorianZheng commented Jun 6, 2026

Uh oh!

DorianZheng commented Jun 7, 2026

Uh oh!

DorianZheng left a comment

Choose a reason for hiding this comment

Uh oh!

DorianZheng Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

DorianZheng Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ldarren commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

cla-assistant Bot commented Jun 6, 2026 •

edited

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading