Broken multi-byte letters at the borders of 4096-byte chunks

The **handleEntry()** function in **open-office-extractor.js** reads streams in 4096-byte chunks:
`const chunk = readStream.read(0x1000);`

If the text in the *.docx file is not written in a Latin alphabet, most characters in the strings inside **<w:t>** tags are multi-byte in UTF-8. Sooner or later a chunk boundary falls in the middle of such a character and splits it in two. Each half is then decoded on its own as an invalid sequence, so the reconstructed text contains two or more Unicode "Replacement Characters" (**U+FFFD**, encoded in UTF-8 as the bytes **EF BF BD**) in place of the original letter.
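The effect can be reproduced without the library. The following is a minimal sketch that assumes each chunk is converted to a string on its own; the 5-byte split point is chosen only for illustration and stands in for the 4096-byte boundary:

```javascript
// Every Cyrillic letter below takes 2 bytes in UTF-8, so the word is 12 bytes long.
const word = 'риденс';
const wordBytes = Buffer.from(word, 'utf8');

// Cut in the middle of the third letter ("д" = D0 B4), as a chunk boundary would.
const firstChunk = wordBytes.subarray(0, 5);
const secondChunk = wordBytes.subarray(5);

// Decoding each chunk separately turns both halves of "д" into U+FFFD.
const broken = firstChunk.toString('utf8') + secondChunk.toString('utf8');
console.log(broken);
```

The dangling lead byte at the end of the first chunk and the orphaned continuation byte at the start of the second chunk are each replaced separately, which is why two replacement characters appear.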

Perhaps it would be reasonable to take the encoding from the XML declaration at the top of the *.xml files (*document.xml* and the like) and then treat the file as text in that encoding, not just as a stream of bytes. In other words, read in chunks of characters, not chunks of bytes.
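One possible way to do this in Node.js is the built-in `StringDecoder`, which holds back an incomplete multi-byte sequence until the next chunk arrives. This is only a sketch: `decodeChunks` is a hypothetical helper that is not part of the library, and the UTF-8 default is an assumption (the encoding could instead come from the XML declaration).

```javascript
const { StringDecoder } = require('node:string_decoder');

// Hypothetical helper: decodes a sequence of Buffers as one continuous text.
function decodeChunks(chunks, encoding = 'utf8') {
  const decoder = new StringDecoder(encoding);
  let text = '';
  for (const chunk of chunks) {
    // write() returns only complete characters and buffers any trailing partial one.
    text += decoder.write(chunk);
  }
  // end() flushes whatever is still buffered after the last chunk.
  return text + decoder.end();
}

const sample = 'сед ут амет риденс номинави';
const sampleBytes = Buffer.from(sample, 'utf8');

// Split at an odd offset so that a 2-byte letter straddles the two chunks.
const pieces = [sampleBytes.subarray(0, 5), sampleBytes.subarray(5)];
const decoded = decodeChunks(pieces);
console.log(decoded === sample);
```

A simpler alternative would be to collect all chunks and decode once after `Buffer.concat()`, at the cost of holding the whole entry in memory as bytes before converting it.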

For example, this Lorem Ipsum text uses Cyrillic letters, and the second occurrence of the words `сед ут амет риденс номинави` is turned into `сед ут амет ри��енс номинави`:
[Lorem_Ipsum__CYR.docx](https://github.com/morungos/node-word-extractor/files/13901957/Lorem_Ipsum__CYR.docx)

This behaviour is a problem for documents written in Greek, Cyrillic, Arabic, Hebrew, Hindi, Chinese, Korean, Japanese, and other non-Latin scripts.
