fix: pre-encode XML to UTF-8 to avoid surrogate pair corruption in JSZip by Yuof · Pull Request #3329 · dolanmiu/docx

Yuof · 2026-01-29T10:50:35Z

Problem

When JSZip processes large XML content (> 16KB), it chunks strings using `substring()`, which operates on UTF-16 code units. This can split surrogate pairs for characters above U+FFFF (like emoji or Material Design Icons).

Each surrogate then gets encoded as a separate 3-byte UTF-8 sequence, producing invalid CESU-8 instead of proper UTF-8. This corrupts the docx file and causes XML parsing errors.
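To make the failure mode concrete, here is a minimal sketch of how `substring()` can cut a surrogate pair in half. The sample string and the cut index are chosen for illustration only; in JSZip the cut would fall at a chunk boundary.

```javascript
// "😀" is U+1F600, stored in a JavaScript string as the surrogate pair D83D DE00.
const text = "a😀";

// Cutting at index 2 lands between the two surrogates.
const head = text.substring(0, 2); // "a" followed by a lone high surrogate
const tail = text.substring(2);    // a lone low surrogate

// Helper (illustrative): list the UTF-16 code units of a string as hex.
const hex = (s) =>
  Array.from({ length: s.length }, (_, i) => s.charCodeAt(i).toString(16));

console.log(text.length); // 3 (UTF-16 code units, not 2 characters)
console.log(hex(head));   // [ '61', 'd83d' ]
console.log(hex(tail));   // [ 'de00' ]
```

Neither half is a valid character on its own, so any encoder that sees the halves separately cannot produce the correct 4-byte UTF-8 sequence.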

Root Cause

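JSZip's `DataWorker` uses:
```javascript
DEFAULT_BLOCK_SIZE = 16 * 1024; // chunk size, counted in UTF-16 code units
// ...
data.substring(index, nextIndex); // can cut between the two halves of a surrogate pair
```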

The `Utf8EncodeWorker` then processes each chunk independently, without handling split surrogates. I've opened a fix PR to JSZip: Stuk/jszip#963
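The effect of encoding chunks independently can be sketched as follows. This is a simplified stand-in written for illustration, not JSZip's actual source: `encodeChunk` and `encodeInChunks` are hypothetical names, and a tiny block size is used in place of `16 * 1024` so the split is easy to see.

```javascript
// Simplified per-chunk UTF-8 encoder (hypothetical, not JSZip's code).
function encodeChunk(chunk) {
  const bytes = [];
  for (let i = 0; i < chunk.length; i++) {
    let c = chunk.charCodeAt(i);
    // Combine a surrogate pair only if both halves are inside this chunk.
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < chunk.length) {
      const next = chunk.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        c = 0x10000 + ((c - 0xd800) << 10) + (next - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else if (c < 0x10000) {
      // A lone surrogate falls through to this 3-byte branch.
      bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    } else {
      bytes.push(
        0xf0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3f),
        0x80 | ((c >> 6) & 0x3f),
        0x80 | (c & 0x3f),
      );
    }
  }
  return bytes;
}

// Slice the string by code units and encode each slice on its own.
function encodeInChunks(text, blockSize) {
  const bytes = [];
  for (let index = 0; index < text.length; index += blockSize) {
    bytes.push(...encodeChunk(text.substring(index, index + blockSize)));
  }
  return bytes;
}

const toHex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

const sample = "a😀";
console.log(toHex(encodeInChunks(sample, 16))); // 61 f0 9f 98 80
console.log(toHex(encodeInChunks(sample, 2)));  // 61 ed a0 bd ed b8 80
```

With the pair kept in one chunk the emoji becomes the 4-byte sequence `f0 9f 98 80`. When the boundary falls inside the pair, each surrogate is emitted as its own 3-byte sequence, which is the CESU-8 form that strict UTF-8 parsers reject.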

Solution

This PR works around the issue by pre-encoding strings to UTF-8 using `TextEncoder` before passing them to `zip.file()`. Since we pass `Uint8Array` instead of strings, JSZip skips the problematic `Utf8EncodeWorker` entirely.
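A minimal sketch of the workaround, assuming a helper shaped like the `encodeUtf8()` named in this PR (the body shown here is a guess at the simplest form, not a copy of the merged code). The test string is built so that an emoji straddles the 16 * 1024 code-unit boundary.

```javascript
// Assumed shape of the helper: encode the whole string in one pass.
const encodeUtf8 = (text) => new TextEncoder().encode(text);

// XML-like content longer than 16 KB with a surrogate pair across the boundary.
const BLOCK = 16 * 1024;
const xml = "x".repeat(BLOCK - 1) + "😀" + "</w:t>";

const bytes = encodeUtf8(xml);

// A fatal decoder throws on malformed UTF-8, so a clean decode
// indicates the bytes are well-formed.
const roundTrip = new TextDecoder("utf-8", { fatal: true }).decode(bytes);

console.log(bytes instanceof Uint8Array); // true
console.log(roundTrip === xml);           // true
```

The resulting `Uint8Array` would then be handed to JSZip in place of the string, for example `zip.file("word/document.xml", encodeUtf8(xml))`, where the part path is only illustrative.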

Changes

Added `encodeUtf8()` helper function in `next-compiler.ts` and `from-docx.ts`
Modified `zip.file()` calls to use pre-encoded UTF-8 bytes for XML content

Testing

Build succeeds
TypeScript compiles without errors
This is a minimal, low-risk change that ensures UTF-8 encoding happens correctly before JSZip processing

Copilot

Pull request overview

This PR addresses a bug in JSZip where string chunking can split UTF-16 surrogate pairs when processing large XML content (> 16KB), resulting in invalid CESU-8 encoding instead of proper UTF-8. The fix pre-encodes XML strings to UTF-8 using TextEncoder before passing them to JSZip, bypassing the problematic chunking behavior.

Changes:

Added encodeUtf8() helper function to pre-encode strings to UTF-8 bytes
Modified all zip.file() calls for XML content to use the pre-encoded UTF-8 bytes
Binary content (images, fonts) continues to be passed directly without encoding

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/patcher/from-docx.ts | Added `encodeUtf8()` helper and applied it to XML content in the patching workflow |
| src/export/packer/next-compiler.ts | Added `encodeUtf8()` helper and applied it to all XML file generation in the compiler |

codecov · 2026-02-15T20:33:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (ec4fdc2) to head (0373b68).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##            master     #3329   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          291       291           
  Lines         8862      8864    +2     
  Branches      1435      1436    +1     
=========================================
+ Hits          8862      8864    +2

dolanmiu · 2026-02-15T20:35:41Z

Looks good. I fixed the tests and removed the TextEncoder fallback. All major browsers already support TextEncoder, so that's fine.

I also removed the Buffer dependency. An earlier PR already removed Buffer, so it's best to stay consistent with that.

fix: pre-encode XML strings to UTF-8 to avoid JSZip surrogate split bug

0a800fd

Copilot AI review requested due to automatic review settings January 29, 2026 10:50

Copilot started reviewing on behalf of Yuof January 29, 2026 10:50

Copilot AI reviewed Jan 29, 2026

Roman and others added 4 commits January 29, 2026 13:45

Address review feedback: extract encodeUtf8 to shared utility and add tests

29329a7

Merge branch 'master' into fix/utf8-encode-before-zip

8bf19a5

Merge branch 'master' into fix/utf8-encode-before-zip

cc977b5

Remove buffer dependency and fallback

c71ceb1

Revert format

0373b68

dolanmiu approved these changes Feb 15, 2026

dolanmiu merged commit f9826ad into dolanmiu:master Feb 15, 2026
8 checks passed

dolanmiu merged 6 commits into dolanmiu:master from Yuof:fix/utf8-encode-before-zip
