Decode error when trying to get drawings #2468

Closed
@anomam

Describe the bug (mandatory)

Starting with version 1.22.0, I'm seeing the following exception when calling page.get_drawings() on one of our PDF files.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<...>/pdf_test.py", line 67, in <module>
    main()
  File "<...>/pdf_test.py", line 60, in main
    page.get_cdrawings()
  File "<...>/lib/python3.9/site-packages/fitz/fitz.py", line 6612, in get_cdrawings
    val = _fitz.Page_get_cdrawings(self, extended, callback, method)
SystemError: <built-in function Page_get_cdrawings> returned a result with an error set

Earlier versions such as 1.21.1 do not raise any error on the same file.

To Reproduce (mandatory)

I'm a bit stuck here: unfortunately I cannot share the PDF in question because it is sensitive, and I have not managed to create a new PDF that reproduces the issue.

Is there any chance you could provide some guidance on how to isolate the drawing issue?

So far I have tried copying the failing drawing content stream into a new PDF using version 1.21.1, so that I could post it here, but the newly created PDF shows no issue with 1.22.0+.

Here is my script for copying the stream:

import fitz  # PyMuPDF

# fp is the path to the failing PDF (cannot be shared)
doc = fitz.open(fp)
page = doc[0]
xref_content = page.get_contents()
# >> in this case = [4]
stream = doc.xref_stream(xref_content[0])
# >> returning bytes: b' BT /F2 11.000 Tf ET\n1.000 g\n0.000 G\n/GS1 gs\n0.567 w\n<...>'
# the problem is with b'\xac' which can't be decoded with utf-8
page.get_cdrawings()
print(stream)

new_doc = fitz.open()
new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)
# create a dummy drawing to overwrite with the failing one
shape = new_page.new_shape()
shape.draw_line((10, 10), (15, 15))
shape.finish()
shape.commit()
# overwrite the dummy drawing with the failing one
new_xref = new_page.get_contents()[0]
new_doc.update_stream(new_xref, stream, compress=True)
new_doc.save("new_doc.pdf")
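
To narrow down where the bad bytes sit, a minimal sketch along these lines might help. It assumes the failure is triggered by bytes in the content stream that are not valid UTF-8, which I have not confirmed. `find_undecodable` is a hypothetical helper written for this report, not part of PyMuPDF; it only uses plain Python on the `bytes` returned by `doc.xref_stream(...)`.

```python
def find_undecodable(data: bytes, context: int = 20):
    """Yield (offset, byte value, surrounding bytes) for every position
    in `data` where UTF-8 decoding fails."""
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; if this succeeds, nothing else is invalid.
            data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as exc:
            bad = pos + exc.start  # exc.start is relative to the slice
            snippet = data[max(0, bad - context): bad + context]
            yield bad, data[bad], snippet
            pos = bad + 1  # resume scanning right after the offending byte


# Stand-in bytes for illustration; in practice pass the real stream,
# e.g. find_undecodable(doc.xref_stream(xref_content[0])).
sample = b"ab\x90cd\xacef"
for offset, value, snippet in find_undecodable(sample):
    print(f"offset {offset}: byte 0x{value:02x} in {snippet!r}")
```

The printed snippets show the operators around each undecodable byte, which may make it possible to cut the stream down to a small fragment that can be shared without exposing the rest of the document.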

Expected behavior (optional)

Since getting the drawings would pass for versions prior to 1.22.0, I would expect it to pass for newer versions as well.

Screenshots (optional)

Not sure if that can help, but here is a cropped screenshot of the drawing stream bytes:

[image: cropped screenshot of the drawing stream bytes]

Your configuration (mandatory)

  • Operating system, potentially version and bitness
  • Python version, bitness
  • PyMuPDF version, installation method (wheel or generated from source).

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).

3.9.13 (main, Sep  8 2022, 09:21:48)
[GCC 9.4.0]
 linux

PyMuPDF 1.22.0: Python bindings for the MuPDF 1.22.0 library.
Version date: 2023-04-14 00:00:01.
Built for Python 3.9 on linux (64-bit).

Installed via pip install pymupdf==1.22.0
