fix: pre-encode XML to UTF-8 to avoid surrogate pair corruption in JSZip#3329
dolanmiu merged 6 commits into dolanmiu:master
Pull request overview
This PR addresses a bug in JSZip where string chunking can split UTF-16 surrogate pairs when processing large XML content (> 16KB), resulting in invalid CESU-8 encoding instead of proper UTF-8. The fix pre-encodes XML strings to UTF-8 using TextEncoder before passing them to JSZip, bypassing the problematic chunking behavior.
Changes:
- Added `encodeUtf8()` helper function to pre-encode strings to UTF-8 bytes
- Modified all `zip.file()` calls for XML content to use the pre-encoded UTF-8 bytes
- Binary content (images, fonts) continues to be passed directly without encoding
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/patcher/from-docx.ts | Added encodeUtf8() helper and applied it to XML content in the patching workflow |
| src/export/packer/next-compiler.ts | Added encodeUtf8() helper and applied it to all XML file generation in the compiler |
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Metric | master | #3329 | +/- |
|---|---|---|---|
| Coverage | 100.00% | 100.00% | |
| Files | 291 | 291 | |
| Lines | 8862 | 8864 | +2 |
| Branches | 1435 | 1436 | +1 |
| Hits | 8862 | 8864 | +2 |
Looks good. I fixed the tests and removed the TextEncoder fallback. All major browsers already support TextEncoder, so that's OK. I also removed the Buffer dependency: an earlier PR removed Buffer, so it's best to adhere to that.
Problem
When JSZip processes large XML content (> 16KB), it chunks strings using `substring()`, which operates on UTF-16 code units. This can split surrogate pairs for characters above U+FFFF (like emoji or Material Design Icons).
Each surrogate then gets encoded as a separate 3-byte UTF-8 sequence, producing invalid CESU-8 instead of proper UTF-8. This corrupts the docx file and causes XML parsing errors.
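The corruption can be reproduced in isolation. The sketch below is illustrative only and is not JSZip's code: `encodeChunk` and `encodeInChunks` are hypothetical names for a per-chunk encoder that joins surrogate pairs it can see within one chunk but emits three bytes for a lone surrogate, which is the behavior described above.

```typescript
// Hypothetical per-chunk encoder (not JSZip's implementation): pairs that sit
// inside one chunk are combined, a lone surrogate is written as 3 bytes.
function encodeChunk(chunk: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < chunk.length; i++) {
    let c = chunk.charCodeAt(i);
    // Combine a high surrogate with the low surrogate that follows it, if any.
    if ((c & 0xfc00) === 0xd800 && i + 1 < chunk.length) {
      const c2 = chunk.charCodeAt(i + 1);
      if ((c2 & 0xfc00) === 0xdc00) {
        c = 0x10000 + ((c - 0xd800) << 10) + (c2 - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else if (c < 0x10000) {
      bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    } else {
      bytes.push(
        0xf0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3f),
        0x80 | ((c >> 6) & 0x3f),
        0x80 | (c & 0x3f),
      );
    }
  }
  return bytes;
}

// Slice by UTF-16 code units, then encode every slice independently.
function encodeInChunks(text: string, blockSize: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < text.length; i += blockSize) {
    out.push(...encodeChunk(text.substring(i, i + blockSize)));
  }
  return out;
}

const toHex = (bytes: number[]): string =>
  bytes.map((b) => (b < 16 ? "0" : "") + b.toString(16)).join(" ");

const sample = "a\uD83D\uDE00"; // "a" followed by U+1F600 (3 UTF-16 code units)
const intact = toHex(encodeInChunks(sample, 16)); // pair stays in one chunk
const broken = toHex(encodeInChunks(sample, 2)); // boundary falls inside the pair

console.log(intact); // 61 f0 9f 98 80
console.log(broken); // 61 ed a0 bd ed b8 80
```

A block size of 2 stands in for the 16 KB boundary here; the same split happens whenever a boundary lands between the two halves of a pair.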
Root Cause
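JSZip's `DataWorker` uses:
```javascript
// Strings are read in blocks of 16 * 1024 UTF-16 code units
DEFAULT_BLOCK_SIZE = 16 * 1024
// substring() slices by code unit, so a block can end between the two halves of a surrogate pair
data.substring(index, nextIndex)
```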
The `Utf8EncodeWorker` then processes each chunk independently, without handling split surrogates. I've opened a fix PR to JSZip: Stuk/jszip#963
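For context, one way a chunker could avoid splitting pairs is to move the boundary whenever a block would end on a high surrogate. This is a minimal sketch of that general idea, not necessarily what Stuk/jszip#963 does; `chunkWithoutSplittingPairs` is a hypothetical name.

```typescript
// True for the first half of a UTF-16 surrogate pair (0xD800 to 0xDBFF).
function isHighSurrogate(codeUnit: number): boolean {
  return (codeUnit & 0xfc00) === 0xd800;
}

// Hypothetical chunker: blocks are blockSize code units long, except that a
// block ending on a high surrogate is extended by one unit so the following
// low surrogate stays in the same block.
function chunkWithoutSplittingPairs(data: string, blockSize: number): string[] {
  const chunks: string[] = [];
  let index = 0;
  while (index < data.length) {
    let nextIndex = Math.min(index + blockSize, data.length);
    if (nextIndex < data.length && isHighSurrogate(data.charCodeAt(nextIndex - 1))) {
      nextIndex++;
    }
    chunks.push(data.substring(index, nextIndex));
    index = nextIndex;
  }
  return chunks;
}

const pieces = chunkWithoutSplittingPairs("a\uD83D\uDE00b", 2);
console.log(pieces.length); // 2
```

The trade-off in this variant is that a block can be one code unit longer than `blockSize`, in exchange for every block being encodable on its own.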
Solution
This PR works around the issue by pre-encoding strings to UTF-8 using `TextEncoder` before passing them to `zip.file()`. Since we pass `Uint8Array` instead of strings, JSZip skips the problematic `Utf8EncodeWorker` entirely.
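A minimal sketch of the workaround, assuming the helper is a thin wrapper around `TextEncoder` (the actual `encodeUtf8()` in this PR may differ in detail). The XML snippet and the `word/document.xml` path are illustrative.

```typescript
// Assumed shape of the helper: encode the whole string in one pass, so no
// chunk boundary can ever fall inside a surrogate pair.
const encodeUtf8 = (content: string): Uint8Array => new TextEncoder().encode(content);

const xml = "<w:t>\uD83D\uDE00</w:t>"; // contains U+1F600
const xmlBytes = encodeUtf8(xml);

// With a JSZip instance `zip`, the bytes are passed instead of the string:
// zip.file("word/document.xml", xmlBytes);

console.log(xmlBytes.length); // 15
```

`TextEncoder` sees the full string, so U+1F600 becomes the single 4-byte sequence `f0 9f 98 80`, and JSZip receives bytes it has no reason to re-encode.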
Changes
`next-compiler.ts` and `from-docx.ts`
Testing