fix: pre-encode XML to UTF-8 to avoid surrogate pair corruption in JSZip #3329

Merged
dolanmiu merged 6 commits into dolanmiu:master from Yuof:fix/utf8-encode-before-zip
Feb 15, 2026

Yuof (Contributor) commented Jan 29, 2026

Problem

When JSZip processes large XML content (> 16KB), it chunks strings using `substring()`, which operates on UTF-16 code units. A chunk boundary can therefore split the surrogate pair of a character above U+FFFF (like emoji or Material Design Icons).

Each surrogate then gets encoded as a separate 3-byte sequence, producing CESU-8-style bytes that are not valid UTF-8. This corrupts the docx file and causes XML parsing errors.
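
To make the byte-level difference concrete, here is a minimal sketch. The `encodeChunk` function is a hypothetical stand-in for a per-chunk encoder that combines surrogate pairs only when both halves are in the same chunk; it is not JSZip's actual code.

```typescript
// Hypothetical per-chunk encoder (an illustration, not JSZip's implementation).
// It combines a surrogate pair only when both halves are inside the same chunk.
function encodeChunk(s: string): number[] {
    const out: number[] = [];
    for (let i = 0; i < s.length; i++) {
        let c = s.charCodeAt(i);
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < s.length) {
            const next = s.charCodeAt(i + 1);
            if (next >= 0xdc00 && next <= 0xdfff) {
                c = 0x10000 + ((c - 0xd800) << 10) + (next - 0xdc00);
                i++; // the low surrogate has been consumed
            }
        }
        if (c < 0x80) {
            out.push(c);
        } else if (c < 0x800) {
            out.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
        } else if (c < 0x10000) {
            // A lone surrogate lands here and becomes a 3-byte sequence.
            out.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
        } else {
            out.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f),
                     0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
        }
    }
    return out;
}

const toHex = (bytes: number[]): string =>
    bytes.map((b) => (b < 16 ? "0" : "") + b.toString(16)).join(" ");

const emoji = "\uD83D\uDE00"; // U+1F600: one code point, two UTF-16 code units

// Both halves in one chunk: proper 4-byte UTF-8.
const whole = toHex(encodeChunk(emoji));
// Halves encoded in separate chunks: two 3-byte sequences.
const split = toHex(encodeChunk(emoji.substring(0, 1)).concat(encodeChunk(emoji.substring(1))));

console.log(whole); // f0 9f 98 80
console.log(split); // ed a0 bd ed b8 80
```

The second byte sequence is what a strict UTF-8 parser rejects, which matches the XML parsing errors described above.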

Root Cause

JSZip's `DataWorker` uses:

```javascript
DEFAULT_BLOCK_SIZE = 16 * 1024
data.substring(index, nextIndex)
```

The `Utf8EncodeWorker` then processes each chunk independently, without handling split surrogates. I've opened a fix PR to JSZip: Stuk/jszip#963
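
The chunking behavior can be reproduced in isolation. This is a rough sketch assuming a plain loop over `substring()` with the 16 * 1024 block size quoted above; `chunkBySubstring` is a hypothetical name, not a JSZip function.

```typescript
const BLOCK_SIZE = 16 * 1024; // same value as the DEFAULT_BLOCK_SIZE quoted above

// Hypothetical helper: slices a string into fixed-size blocks of UTF-16 code units.
function chunkBySubstring(data: string, blockSize: number): string[] {
    const chunks: string[] = [];
    for (let index = 0; index < data.length; index += blockSize) {
        const nextIndex = Math.min(index + blockSize, data.length);
        chunks.push(data.substring(index, nextIndex));
    }
    return chunks;
}

const isHighSurrogate = (unit: number): boolean => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit: number): boolean => unit >= 0xdc00 && unit <= 0xdfff;

// Pad so that the emoji's high surrogate is the last code unit of the first block.
const text = "a".repeat(BLOCK_SIZE - 1) + "\uD83D\uDE00" + "</w:t>";
const chunks = chunkBySubstring(text, BLOCK_SIZE);

const lastOfFirst = chunks[0].charCodeAt(chunks[0].length - 1);
const firstOfSecond = chunks[1].charCodeAt(0);

console.log(isHighSurrogate(lastOfFirst));  // true: first block ends mid-pair
console.log(isLowSurrogate(firstOfSecond)); // true: second block starts mid-pair
```

Any encoder that sees these two blocks separately has no way to rejoin the pair unless it carries state across blocks.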

Solution

This PR works around the issue by pre-encoding strings to UTF-8 using `TextEncoder` before passing them to `zip.file()`. Since we pass `Uint8Array` instead of strings, JSZip skips the problematic `Utf8EncodeWorker` entirely.
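
A minimal sketch of the workaround follows. The helper name `encodeUtf8` comes from this PR, but its body here is an assumption. `ZipLike` and `addXml` are hypothetical, introduced only so the snippet runs without JSZip installed; with a real JSZip instance the call would be `zip.file(path, bytes)`.

```typescript
// Assumed shape of the helper: encode the whole string in one pass,
// so no chunk boundary can fall inside a surrogate pair.
const encodeUtf8 = (text: string): Uint8Array => new TextEncoder().encode(text);

// Hypothetical stand-in for the part of the zip API used here.
interface ZipLike {
    file(path: string, data: Uint8Array): unknown;
}

// Hypothetical wrapper: always hand the zip bytes, never a string.
function addXml(zip: ZipLike, path: string, xml: string): void {
    zip.file(path, encodeUtf8(xml));
}

// In-memory fake so the example is self-contained.
const stored = new Map<string, Uint8Array>();
const fakeZip: ZipLike = { file: (path, data) => stored.set(path, data) };

// Place the emoji's high surrogate at code unit index 16 * 1024 - 1.
const xml = "<w:t>" + "a".repeat(16 * 1024 - 6) + "\uD83D\uDE00</w:t>";
addXml(fakeZip, "word/document.xml", xml);

// A fatal decoder throws on invalid UTF-8, so a clean round trip shows the bytes are valid.
const bytes = stored.get("word/document.xml")!;
const roundTripped = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
console.log(roundTripped === xml); // true
```

Binary entries such as images and fonts would bypass `encodeUtf8` and be passed through unchanged.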

Changes

  • Added `encodeUtf8()` helper function in `next-compiler.ts` and `from-docx.ts`
  • Modified `zip.file()` calls to use pre-encoded UTF-8 bytes for XML content

Testing

  • Build succeeds
  • TypeScript compiles without errors
  • This is a minimal, low-risk change that ensures UTF-8 encoding happens correctly before JSZip processing

Copilot AI review requested due to automatic review settings January 29, 2026 10:50

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses a bug in JSZip where string chunking can split UTF-16 surrogate pairs when processing large XML content (> 16KB), resulting in invalid CESU-8 encoding instead of proper UTF-8. The fix pre-encodes XML strings to UTF-8 using TextEncoder before passing them to JSZip, bypassing the problematic chunking behavior.

Changes:

  • Added encodeUtf8() helper function to pre-encode strings to UTF-8 bytes
  • Modified all zip.file() calls for XML content to use the pre-encoded UTF-8 bytes
  • Binary content (images, fonts) continues to be passed directly without encoding

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/patcher/from-docx.ts | Added encodeUtf8() helper and applied it to XML content in the patching workflow |
| src/export/packer/next-compiler.ts | Added encodeUtf8() helper and applied it to all XML file generation in the compiler |

codecov bot commented Feb 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (ec4fdc2) to head (0373b68).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##            master     #3329   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          291       291           
  Lines         8862      8864    +2     
  Branches      1435      1436    +1     
=========================================
+ Hits          8862      8864    +2     

dolanmiu (Owner) commented

Looks good. I fixed the tests and removed the TextEncoder fallback. All major browsers already support TextEncoder, so that's OK.

I also removed the Buffer dependency. An earlier PR removed Buffer, so it's best to stay consistent with that.

dolanmiu merged commit f9826ad into dolanmiu:master on Feb 15, 2026
8 checks passed