
Broken multi-byte letters at the borders of 4096-byte chunks #54

Description

@Oliverity

The handleEntry() function in open-office-extractor.js reads each stream in 4096-byte chunks:
const chunk = readStream.read(0x1000);

When the text in a *.docx file is not written in a Latin-based alphabet, most characters in the strings inside <w:t> tags are multi-byte. Sooner or later a chunk boundary falls in the middle of such a character and splits it in two. Each half is then decoded on its own, which leaves 2 Unicode "Replacement Characters" (U+FFFD, encoded in UTF-8 as the bytes EF BF BD) in the reconstructed text content.
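The effect can be reproduced in isolation. The snippet below is only a sketch: it does not use the extractor's code, and the 5-byte split point is an arbitrary stand-in for the 4096-byte boundary. It assumes the text is UTF-8 and that each chunk is converted to a string separately.

```javascript
// Hypothetical reproduction, not taken from open-office-extractor.js.
const bytes = Buffer.from('риденс', 'utf8'); // 12 bytes, 2 per Cyrillic letter

// Split in the middle of the third letter ("д" is the byte pair D0 B4).
const splitAt = 5;
const first = bytes.subarray(0, splitAt);  // ends with a lone lead byte D0
const second = bytes.subarray(splitAt);    // starts with a lone continuation byte B4

// Decoding each chunk on its own turns each orphaned half into U+FFFD.
const broken = first.toString('utf8') + second.toString('utf8');

console.log(broken);
console.log([...broken].filter((ch) => ch === '\uFFFD').length);
```

The first chunk ends with an incomplete sequence and the second starts with a stray continuation byte, so one letter becomes two replacement characters.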

Perhaps it would be reasonable to take the encoding from the XML declaration (the first tag) of each *.xml file (document.xml and the like) and then treat the file as text in that encoding, not just as a stream of bytes. Read in chunks of characters, not chunks of bytes.
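One possible shape for this, as a minimal sketch rather than a patch for handleEntry(): Node's built-in StringDecoder keeps the trailing bytes of an incomplete character and prepends them to the next chunk. The helper names sniffXmlEncoding and decodeChunks are hypothetical, and the encoding sniffing assumes an ASCII-compatible encoding (it would not detect UTF-16).

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: reads the encoding named in the XML declaration.
// Assumes an ASCII-compatible encoding; falls back to UTF-8 when the
// declaration is missing or names an encoding Node does not support.
function sniffXmlEncoding(head) {
  const match = /^<\?xml[^>]*\bencoding\s*=\s*["']([^"']+)["']/.exec(head.toString('latin1'));
  const name = match ? match[1].toLowerCase() : 'utf-8';
  return Buffer.isEncoding(name) ? name : 'utf-8';
}

// Hypothetical helper: decodes byte chunks without breaking characters
// that straddle a chunk boundary.
function decodeChunks(chunks) {
  if (chunks.length === 0) return '';
  const decoder = new StringDecoder(sniffXmlEncoding(chunks[0]));
  let text = '';
  for (const chunk of chunks) text += decoder.write(chunk); // holds back incomplete bytes
  return text + decoder.end(); // flushes whatever is left over
}

// Split a small sample in the middle of the letter "д".
const xml = Buffer.from('<?xml version="1.0" encoding="UTF-8"?><w:t>риденс</w:t>', 'utf8');
const cut = xml.indexOf(Buffer.from('д', 'utf8')) + 1;
const repaired = decodeChunks([xml.subarray(0, cut), xml.subarray(cut)]);

console.log(repaired);
```

Node also documents readable.setEncoding() as handling multi-byte characters split across chunks; whether that fits the way handleEntry() consumes its stream has not been checked here.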

For example, this Lorem Ipsum text uses Cyrillic letters, and the second occurrence of the words сед ут амет риденс номинави gets turned into сед ут амет ри��енс номинави:
Lorem_Ipsum__CYR.docx

This behaviour would be a problem for documents written in Greek, Cyrillic, Arabic, Hebrew, Hindi, Chinese, Korean, Japanese, and a number of other non-Latin scripts.
