The handleEntry() function in the open-office-extractor.js file reads streams in 4096-byte chunks:
const chunk = readStream.read(0x1000);
If the text in the *.docx file is not in a Latin alphabet, most characters in the strings inside <w:t> tags are multi-byte. Sooner or later a chunk boundary falls in the middle of such a character. When the two halves are decoded separately, each half becomes a Unicode "Replacement Character" (U+FFFD, stored in UTF-8 as the bytes EF BF BD), so a split two-byte character shows up as 2 replacement characters in the reconstructed text content.
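The effect can be reproduced without a *.docx file. The snippet below is a minimal sketch, not the library's code: decodeChunkByChunk is a hypothetical helper, and the 5-byte chunk size is chosen only so that a boundary lands inside a Cyrillic letter of a short sample (the extractor itself uses 0x1000).

```javascript
// Hypothetical helper for illustration only: decodes every chunk on its own,
// which is what happens when each chunk of bytes is turned into a string separately.
function decodeChunkByChunk(buffer, chunkSize) {
  let text = '';
  for (let offset = 0; offset < buffer.length; offset += chunkSize) {
    text += buffer.subarray(offset, offset + chunkSize).toString('utf8');
  }
  return text;
}

// 'риденс' is 6 Cyrillic letters, 2 bytes each in UTF-8 (12 bytes in total).
const sample = Buffer.from('риденс', 'utf8');

// With 5-byte chunks the first boundary cuts the third letter in half.
const broken = decodeChunkByChunk(sample, 5);
console.log(broken); // the third letter is replaced by two U+FFFD characters
```

Any odd chunk offset inside a run of two-byte characters gives the same result, so with 4096-byte chunks the damage depends only on where the boundary happens to fall.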
Perhaps it would be reasonable to take the encoding from the XML declaration at the top of the *.xml files (document.xml and the like) and then treat the file as text in that encoding, not just as a stream of bytes: read in chunks of characters, not chunks of bytes.
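One possible way to do this in Node.js is the built-in string_decoder module, which holds back an incomplete trailing byte sequence until the next chunk arrives. This is a sketch under assumptions: decodeChunks is a hypothetical helper, and the encoding is assumed to be UTF-8 rather than read from the XML declaration.

```javascript
const { StringDecoder } = require('node:string_decoder');

// Hypothetical helper for illustration only: decodes a sequence of Buffer
// chunks as one continuous text, regardless of where the chunk boundaries fall.
function decodeChunks(chunks, encoding = 'utf8') {
  const decoder = new StringDecoder(encoding); // assumption: UTF-8 by default
  let text = '';
  for (const chunk of chunks) {
    // write() returns only complete characters and keeps any partial
    // trailing bytes buffered for the next call.
    text += decoder.write(chunk);
  }
  // end() flushes whatever is still buffered.
  return text + decoder.end();
}

// Split a Cyrillic phrase into 5-byte pieces so that letters get cut in half.
const phrase = 'сед ут амет риденс номинави';
const bytes = Buffer.from(phrase, 'utf8');
const pieces = [];
for (let offset = 0; offset < bytes.length; offset += 5) {
  pieces.push(bytes.subarray(offset, offset + 5));
}

console.log(decodeChunks(pieces) === phrase);
```

Calling readStream.setEncoding('utf8') on a Readable stream should have a similar effect, since the stream then does this buffering internally. Another option is to collect all chunks with Buffer.concat() and decode once at the end.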
For example, this Lorem Ipsum text uses Cyrillic letters, and the second occurrence of the words сед ут амет риденс номинави is turned into сед ут амет ри��енс номинави:
Lorem_Ipsum__CYR.docx
This behaviour would be a problem for documents written in Greek, Cyrillic, Arabic, Hebrew, Hindi, Chinese, Korean, Japanese, and other non-Latin scripts.