Wrong BOM detection for strings starting with \x00\xfe\x00\xff #3587

@gyrrhe

I get an AssertionError when merging a PDF that contains both Latin and Chinese characters. It happens with a real PDF I work with; for this issue I've created a small PDF that triggers the same error. The mix of Latin and Chinese characters may be unrelated to the failure, but I suspect it is involved.

Environment

$ python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.5.0, crypt_provider=('cryptography', '43.0.0'), PIL=11.1.0

Code + PDF

import pypdf

reader = pypdf.PdfReader("minimal.pdf")
writer = pypdf.PdfWriter()
for page in reader.pages:
    writerPage = writer.add_blank_page(reader.pages[0].mediabox.width, reader.pages[0].mediabox.height)
    writerPage.merge_page(page)

minimal.pdf

I do not know why this particular set of characters triggers the error, but it is the smallest set I could find that does. Feel free to add the PDF to your tests.

Traceback

Traceback (most recent call last):
  File "/home/user/tmp/minimal.py", line 7, in <module>
    writerPage.merge_page(page)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/user/tmp/pypdf/_page.py", line 1068, in merge_page
    self._merge_page(page2, over=over, expand=expand)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tmp/pypdf/_page.py", line 1147, in _merge_page
    page2content.operations.insert(
    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tmp/pypdf/generic/_data_structures.py", line 1410, in operations
    self._parse_content_stream(BytesIO(self._data))
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tmp/pypdf/generic/_data_structures.py", line 1303, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
                    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tmp/pypdf/generic/_data_structures.py", line 1452, in read_object
    return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tmp/pypdf/generic/_data_structures.py", line 261, in read_from_stream
    arr.append(read_object(stream, pdf, forced_encoding))
               ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tmp/pypdf/generic/_data_structures.py", line 1450, in read_object
    return read_hex_string_from_stream(stream, forced_encoding)
  File "/home/user/tmp/pypdf/generic/_utils.py", line 35, in read_hex_string_from_stream
    return create_string_object(bytes(arr), forced_encoding)
  File "/home/user/tmp/pypdf/generic/_utils.py", line 168, in create_string_object
    retval = TextStringObject(string.decode("utf-16be"))
  File "/home/user/tmp/pypdf/generic/_base.py", line 673, in __new__
    assert org is not None, "mypy"
           ^^^^^^^^^^^^^^^
AssertionError: mypy
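
Going by the byte pattern in the title, one possible explanation can be illustrated without pypdf. This is only a sketch under assumptions: `raw` is a made-up byte string rather than data taken from minimal.pdf, and `looks_like_bom_text` is a hypothetical helper, not pypdf's API. It shows that bytes starting with \x00\xfe\x00\xff carry no BOM, yet decode as UTF-16BE into text whose first two characters have the same code points as the UTF-16BE BOM bytes.

```python
import codecs

# Hypothetical hex-string payload: U+00FE, U+00FF and "A" encoded as
# UTF-16BE without a BOM. Not extracted from minimal.pdf.
raw = b"\x00\xfe\x00\xff\x00A"

# The raw bytes do not start with either UTF-16 BOM.
has_bom = raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# Decoding as BOM-less UTF-16BE yields the characters U+00FE U+00FF "A".
decoded = raw.decode("utf-16be")


def looks_like_bom_text(text: str) -> bool:
    """Hypothetical check: does the decoded text start with characters
    whose code points match the bytes of a UTF-16 BOM?"""
    return text.startswith(("\xfe\xff", "\xff\xfe"))


print(has_bom)                       # BOM test on the raw bytes
print(looks_like_bom_text(decoded))  # BOM-like test on the decoded text
```

If a BOM check is applied to the decoded text instead of the original bytes, a string like this would be mistaken for one that carries a BOM, which would fit an assertion that expects the original bytes to be available.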

Typst code

The PDF I share above was created with the Typst typesetting system, version 0.14.2, from the following code:

#set document(date: none)
#set text(font: ("Noto Sans CJK SC"))

接司智对UQfiffI:'!'bSMEftN;Tpghlxo?Ldcrm,quva sine.取器效论证反结件以量说明讨用公高认带的起最准都你单和愿计兰它由,亲台本及集游术行具马扩充要读好您才蜂无击步方并急标是多哪该度去朋被小整任译形始成中之应种么识社织调放弃己基理次倒人预况有灵从经个康能建、手现同作慧队

划时但AA轮更力战付者关互AA文型强导工密们各关已(意络共关采利一体张关开关通确信关测关提关关模到立关语关关备这养入关压关第味程关)团群如另言动情;来规培关果没枪关过治自网。就参主地系德着解构做旦政重目感出切大协受令关想翻友关内诸比为把在相先可看机领

I don't think this bug comes from Typst creating a malformed PDF, but if you find that it does, I'll be happy to open an issue on their side.
