Decode error when trying to get drawings #2468

Closed
anomam opened this issue Jun 14, 2023 · 7 comments

anomam commented Jun 14, 2023

Describe the bug (mandatory)

Starting with version 1.22.0, I'm seeing the following exception when calling page.get_drawings() on one of our PDF files.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<...>/pdf_test.py", line 67, in <module>
    main()
  File "<...>/pdf_test.py", line 60, in main
    page.get_cdrawings()
  File "<...>/lib/python3.9/site-packages/fitz/fitz.py", line 6612, in get_cdrawings
    val = _fitz.Page_get_cdrawings(self, extended, callback, method)
SystemError: <built-in function Page_get_cdrawings> returned a result with an error set

But I do not get any error with previous versions like 1.21.1.

To Reproduce (mandatory)

I'm a bit stuck here as unfortunately I cannot share the PDF in question because it's sensitive, and I've been struggling to create a new PDF that would mimic the issue.

Is there any chance you could provide some guidance on how to isolate the drawing issue?

So far I have tried copying the failing drawing content stream to a new PDF using version 1.21.1, so that I could post it here, but the newly created PDF shows no issue with 1.22.0+.

Here is my script for copying the stream

import fitz  # PyMuPDF

doc = fitz.open(fp)  # fp: path to the failing PDF (not shared)
page = doc[0]
xref_content = page.get_contents()
# >> in this case = [4]
stream = doc.xref_stream(xref_content[0])
# >> returning bytes: b' BT /F2 11.000 Tf ET\n1.000 g\n0.000 G\n/GS1 gs\n0.567 w\n<...>'
# the problem is with b'\xac' which can't be decoded with utf-8
page.get_cdrawings()
print(stream)

new_doc = fitz.open()
new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)
# create a dummy drawing to overwrite with the failing one
shape = new_page.new_shape()
shape.draw_line((10, 10), (15, 15))
shape.finish()
shape.commit()
# overwrite the dummy drawing with the failing one
new_xref = new_page.get_contents()[0]
new_doc.update_stream(new_xref, stream, compress=True)
new_doc.save("new_doc.pdf")

Expected behavior (optional)

Since getting the drawings would pass for versions prior to 1.22.0, I would expect it to pass for newer versions as well.

Screenshots (optional)

Not sure if that can help, but here is a cropped screenshot of the drawing stream bytes:

[image: cropped screenshot of the drawing stream bytes]

Your configuration (mandatory)

  • Operating system, potentially version and bitness
  • Python version, bitness
  • PyMuPDF version, installation method (wheel or generated from source).

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).

3.9.13 (main, Sep  8 2022, 09:21:48)
[GCC 9.4.0]
 linux

PyMuPDF 1.22.0: Python bindings for the MuPDF 1.22.0 library.
Version date: 2023-04-14 00:00:01.
Built for Python 3.9 on linux (64-bit).

Installed via pip install pymupdf==1.22.0

Collaborator

JorjMcKie commented Jun 14, 2023

This is weird. Your screenshot points into the middle of a text object, and text objects are completely ignored by get_drawings().
Anyway, all of that happens inside C code, so I am afraid I can hunt down the problem only with a file at hand.
You can send it directly to my e-mail address to address confidentiality concerns.

In that version, a new dictionary key "layer" was introduced. Perhaps your PDF uses Optional Content layers, and one of the OCGs has a non-UTF-8 name?
That's all I can think of at the moment that deals with string representation in that method.

To check this, go to the "layers" tab of your PDF viewer, for example Adobe Acrobat.

JorjMcKie added the bug label Jun 14, 2023
JorjMcKie self-assigned this Jun 14, 2023
Author

anomam commented Jun 15, 2023

Thank you so much for the quick reply @JorjMcKie

Unfortunately I don't think it's related to the OCGs naming 😔 as I'm seeing the following (using 1.22.0):

>>> print(doc.get_ocgs())
{5: {'name': 'Layer 1', 'intent': [], 'on': True, 'usage': None}}

Now that I think about it, the drawings in the document don't contain any sensitive info since they're just SVG lines, so it's probably safe to share that part only. If I read the drawing content at xref=4 (using 1.22.0) I get:

>>> doc.xref_stream(4)
b' BT /F2 11.000 Tf ET\n1.000 g\n0.000 G\n/GS1 gs\n0.567 w\n0 Tr\n[] 0 d\n/GS1 gs\n/GS1 gs\n/GS1 gs\n/GS1 gs BT /F2 8.500 Tf ET\n q  Q q 0.000 g  0 Tr BT 56.693 791.720 Td  <5363616C6174697665204F79204C7464> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 56.693 781.520 Td  <4F74746F204272616E6474696E20706F6C6B7520342042203334> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 56.693 771.320 Td  <30303635302048656C73696E6B69> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 56.693 761.120 Td  <46494E4C414E44> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F3 11.000 Tf ET\nq 0.000 g  0 Tr BT 323.154 803.644 Td  (\x00I\x00N\x00V\x00O\x00I\x00C\x00E) Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nq 0.000 g  0 Tr BT 524.698 803.644 Td  <31283129> Tj ET Q\n/GS1 gs\n1.000 w\n0.302 0.302 0.302 RG\n323.654 794.174 m 524.198 794.174 l S\n2 J\n525.198 794.174 m 543.756 794.174 l S\n0.000 G\n0.567 w\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 323.154 779.005 Td  <496E766F696365206E756D626572> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 409.383 779.005 Td  <313532> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 323.154 767.005 Td  <5265666572656E6365206E756D626572> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 409.383 767.005 Td  <31353230> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 323.154 755.005 Td  <496E766F6963652064617465> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 409.383 755.005 Td  <30372E30372E32303139> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 323.154 743.005 Td  <4475652064617465> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F3 10.000 Tf ET\nq 0.000 g  0 Tr BT 409.383 743.005 Td  (\x000\x007\x00.\x000\x008\x00.\x002\x000\x001\x009) Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 323.154 731.005 Td  <5061796D656E74207465726D73> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F3 10.000 Tf ET\nq 
0.000 g  0 Tr BT 409.383 731.005 Td  (\x003\x001\x00 \x00d\x00a\x00y\x00s\x00 \x00n\x00e\x00t) Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 323.154 719.005 Td  <50656E616C747920696E746572657374> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 409.383 719.005 Td  <31322C30302025> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 10.000 Tf ET\n q  Q q 0.000 g  0 Tr BT 56.693 719.504 Td  <436F67656E74204C61627320496E632E> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 56.693 707.504 Td  <54454E4F4841204C41422032302D3233> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 56.693 695.504 Td  <4461696B616E79616D612D63686F2C20536869627579612D6B75> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 56.693 683.504 Td  <3135302D3030333420> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 101.443 683.504 Td  <546F6B796F> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 56.693 671.504 Td  <4A4150414E> Tj ET Q\n/GS1 gs\nq\n0.000 w\n28.346 600.945 m \n28.346 601.945 l \n583.941 601.945 l \n583.941 600.945 l \n h W n \n1.000 w\n0.000 0.000 0.000 RG\n0 j\n0 J\n28.346 601.445 m \n583.941 601.445 l \nS\n\nQ\n0.283 w\n0.000 G\n2 j\n2 J\n/GS1 gs\n0.567 w\n\n\nq\n1.0000 0.0000 0.0000 1.0000 0.0000 168.4055 cm\nBT /F2 11.000 Tf ET\n1.000 g\n0.000 G\n/GS1 gs\n0.567 w\n0 Tr\n[] 0 d\n/GS1 gs BT /F3 8.500 Tf ET\n q  Q q 0.000 g  0 Tr BT 42.520 -17.388 Td  (\x00D\x00u\x00e\x00 \x00d\x00a\x00t\x00e) Tj ET Q\nBT /F2 12.000 Tf ET\n q  Q q 0.000 g  0 Tr BT 42.520 -30.738 Td  <30372E30382E32303139> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F3 8.500 Tf ET\n q  Q q 0.000 g  0 Tr BT 141.732 -17.388 Td  (\x00R\x00e\x00f\x00e\x00r\x00e\x00n\x00c\x00e\x00 \x00n\x00u\x00m\x00b\x00e\x00r) Tj ET Q\nBT /F2 12.000 Tf ET\n q  Q q 0.000 g  0 Tr BT 141.732 -30.738 Td  <31353230> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F3 12.000 Tf ET\nq 0.000 g  0 Tr BT 458.010 -25.638 Td  (\x00T\x00o\x00t\x00a\x00l) Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 12.000 Tf ET\nq 0.000 g  0 Tr BT 501.367 -25.638 Td  <80392C3430302E3030> Tj 
ET Q\n/GS1 gs\n1.000 w\n0.800 0.800 0.800 RG\n28.846 -43.548 m 127.059 -43.548 l S\n2 J\n128.059 -43.548 m 330.339 -43.548 l S\n331.339 -43.548 m 486.694 -43.548 l S\n487.694 -43.548 m 583.441 -43.548 l S\n0.000 G\n/GS1 gs\n0.567 w\n/GS1 gs\n/GS1 gs BT /F2 11.000 Tf ET\nBT /F3 9.500 Tf ET\nq 0.000 g  0 Tr BT 42.520 -58.267 Td  (\x00B\x00I\x00C\x00 \x00/\x00 \x00S\x00W\x00I\x00F\x00T) Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F3 9.500 Tf ET\nq 0.000 g  0 Tr BT 106.076 -58.267 Td  (\x00B\x00a\x00n\x00k\x00 \x00A\x00c\x00c\x00o\x00u\x00n\x00t\x00 \x00N\x00u\x00m\x00b\x00e\x00r\x00 \x00/\x00 \x00I\x00B\x00A\x00N) Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 9.500 Tf ET\nq 0.000 g  0 Tr BT 42.520 -69.667 Td  <48454C5346494848> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 9.500 Tf ET\nq 0.000 g  0 Tr BT 106.076 -69.667 Td  <46493238203430353520303031322033383833203038> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 8.000 Tf ET\n q  Q q 0.000 g  0 Tr BT 315.148 -56.917 Td  <5363616C6174697665204F79204C7464> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 315.148 -66.517 Td  <4F74746F204272616E6474696E20706F6C6B7520342042203334> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 315.148 -76.117 Td  <30303635302048656C73696E6B69> Tj ET Q\n q  Q q 0.000 g  0 Tr BT 315.148 -85.717 Td  <46494E4C414E44> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\nBT /F2 8.000 Tf ET\nq 0.000 g  0 Tr BT 447.575 -56.917 Td  <54656C3A202B333538203434203233372033333331> Tj ET Q\nq 0.000 g  0 Tr BT 447.575 -66.517 Td  <696E766F6963696E67407363616C61746976652E636F6D> Tj ET Q\nq 0.000 g  0 Tr BT 447.575 -76.117 Td  <427573696E6573732049443A20323734363439352D36> Tj ET Q\nq 0.000 g  0 Tr BT 447.575 -85.717 Td  <564154206E756D6265723A2046493237343634393536> Tj ET Q\n/GS1 gs\n1.000 w\n0.800 0.800 0.800 RG\n306.144 -44.548 m 306.144 -105.910 l S\n0.000 G\n/GS1 gs\n0.567 w\nBT /F2 11.000 Tf ET\n/GS1 gs\nBT /F2 10.000 Tf ET\nq 0.000 g  0 Tr BT 156.024 -126.249 Td  
<343238343035353030313233383833303830303934303030303030303030303030303030303030303030303031353230313930383037> Tj ET Q\n/GS1 gs\nBT /F2 11.000 Tf ET\n/GS1 gs\nq 297.638 0 0 28.346 157.325 -157.095 cm /I1 Do Q\n/GS1 gs\nq\n0.000 w\n28.346 -0.028 m \n28.346 -1.028 l \n583.941 -1.028 l \n583.941 -0.028 l \n h W n \n1.000 w\n0.000 0.000 0.000 RG\n0 j\n0 J\n583.941 -0.528 m \n28.346 -0.528 l \nS\n\nQ\n0.283 w\n0.000 G\n2 j\n2 J\n/GS1 gs\n0.567 w\nQ 2 J\n0.567 w\nBT /F2 11.000 Tf ET\n1.000 g\n0.000 G\n/GS1 gs\n0.567 w\n0 Tr\n[] 0 d\n/GS1 gs\n\n/GS1 gs\n0.000 G\n1.000 g\n/GS1 gs\n0.567 w\n0 Tr\n[] 0 d\n/GS1 gs\n\n/OC /ZI1 BDC \nBT /F2 11.000 Tf ET\n1.000 g\n0.000 G\n/GS1 gs\n0.567 w\n0 Tr\n[] 0 d\n/GS1 gs\nBT /F2 9.000 Tf ET\nq 0.000 g  0 Tr BT 39.685 587.626 Td  <496E766F69636520666F7220776F726B20706572666F726D6564206265747765656E2030312F30362F3230313920616E642033302F30362F323031392E> Tj ET Q\n/GS1 gs\n/GS1 gs /GS1 gs\nBT /F3 9.000 Tf ET\nq 0.000 g  0 Tr BT 63.141 568.772 Td  (\x00D\x00e\x00s\x00c\x00r\x00i\x00p\x00t\x00i\x00o\x00n) Tj ET Q\n/GS1 gs\nBT /F2 9.000 Tf ET\nBT /F3 9.000 Tf ET\nq 0.000 g  0 Tr BT 430.669 568.772 Td  (\x00U\x00n\x00i\x00t\x00 \x00p\x00r\x00i\x00c\x00e\x00  \xac) Tj ET Q\n/GS1 gs\nBT /F2 9.000 Tf ET\nBT /F3 9.000 Tf ET\nq 0.000 g  0 Tr BT 492.398 568.772 Td  (\x00Q\x00t\x00y) Tj ET Q\n/GS1 gs\nBT /F2 9.000 Tf ET\nBT /F3 9.000 Tf ET\nq 0.000 g  0 Tr BT 520.819 568.772 Td  (\x00T\x00o\x00t\x00a\x00l\x00  \xac) Tj ET Q\n/GS1 gs\nBT /F2 9.000 Tf ET\nBT /F2 8.000 Tf ET\nq 0.302 0.302 0.302 rg  0 Tr BT 45.354 555.337 Td  <312E> Tj ET Q\n/GS1 gs\nBT /F2 9.000 Tf ET\nq 0.000 g  0 Tr BT 63.141 555.137 Td  <486F75726C7920636F6E73756C74696E67> Tj ET Q\n/GS1 gs\nq 0.000 g  0 Tr BT 453.790 555.137 Td  <3130302E3030> Tj ET Q\n/GS1 gs\nq 0.000 g  0 Tr BT 492.398 555.137 Td  <39342068> Tj ET Q\n/GS1 gs\nq 0.000 g  0 Tr BT 520.819 555.137 Td  <392C3430302E3030> Tj ET Q\n/GS1 gs\n0.302 0.302 0.302 RG\n1.000 w\n39.685 540.399 m 555.595 540.399 l S\n0.567 w\n0.000 
G\n/GS1 gs BT /F3 9.000 Tf ET\nq 0.000 g  0 Tr BT 39.685 523.295 Td  (\x00 ) Tj ET Q\n/GS1 gs\nBT /F2 9.000 Tf ET\nq 0.000 g  0 Tr BT 442.049 523.295 Td  <546F74616C20746F207061792080> Tj ET Q\n/GS1 gs\nBT /F3 9.000 Tf ET\nq 0.000 g  0 Tr BT 521.071 523.295 Td  (\x009\x00,\x004\x000\x000\x00.\x000\x000) Tj ET Q\n/GS1 gs\nEMC\n'

which is what I was sharing earlier in my cropped screenshot. These are bytes, and indeed if you try to do .decode('utf-8') on it it will fail.
Maybe that can help you make more hypotheses that I can test on my end? Thanks again for all the help 🙏
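
To make the point about `.decode('utf-8')` concrete, here is a minimal sketch using two string operands copied from the dump above. It shows where UTF-8 decoding stops, and that the same bytes read cleanly under other encodings. The choice of encodings is an assumption made for illustration only (the real mapping depends on how the fonts /F2 and /F3 are defined in the PDF), and the variable names are hypothetical.

```python
# Two string operands copied from the stream dump above.
raw_f3 = b"\x00T\x00o\x00t\x00a\x00l\x00  \xac"          # drawn with font /F3
raw_f2 = bytes.fromhex("546F74616C20746F207061792080")  # drawn with font /F2

# Decoding as UTF-8 fails at the first byte that is not valid UTF-8.
try:
    raw_f3.decode("utf-8")
    bad_offset = None
except UnicodeDecodeError as exc:
    bad_offset = exc.start  # index of the offending byte (0xAC)

# ASSUMPTION: /F3 uses 2-byte codes that coincide with UTF-16BE, and
# /F2 uses a WinAnsi-like single-byte encoding (cp1252). Neither is
# confirmed by the thread; the font dictionaries would decide this.
text_f3 = raw_f3.decode("utf-16-be")
text_f2 = raw_f2.decode("cp1252")

print(bad_offset, repr(text_f3), repr(text_f2))
```

Under those assumptions both operands end in a euro sign, which would explain the stray `\xac` and `80` bytes: they are glyph codes, not UTF-8 text.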

@JorjMcKie
Collaborator

It's normal for binary data not to be convertible to strings, so that is no help.
Also, I keep seeing text objects (things wrapped by BT / ET tokens). These guys have no business inside a drawing.
Therefore, I currently don't believe we are looking at a drawings object. In addition, vector graphics do not have their own xref. They are atomic PDF commands like "draw a line", "draw a curve", "draw a rectangle", ... and they may occur anywhere in any /Contents stream.
So we are still chasing a black cat in the dark here.
Bottom line: I do need some reproducing file.
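
The distinction made in this comment, text objects between BT / ET versus path commands scattered through the same /Contents stream, can be illustrated with a rough sketch on an excerpt of the stream posted earlier. This is not how PyMuPDF or MuPDF parse content streams: it is a naive regular-expression approach that would break on strings containing "BT" or "ET", and the helper names are hypothetical.

```python
import re

# Excerpt of the content stream posted above: two text objects
# followed by two stroked line segments.
stream = (
    b"BT /F2 11.000 Tf ET\n"
    b"q 0.000 g  0 Tr BT 524.698 803.644 Td  <31283129> Tj ET Q\n"
    b"1.000 w\n0.302 0.302 0.302 RG\n"
    b"323.654 794.174 m 524.198 794.174 l S\n"
    b"2 J\n"
    b"525.198 794.174 m 543.756 794.174 l S\n"
)

def strip_text_objects(data: bytes) -> bytes:
    """Drop everything between BT and ET (naive, illustration only)."""
    return re.sub(rb"\bBT\b.*?\bET\b", b"", data, flags=re.DOTALL)

def stroked_lines(data: bytes) -> list:
    """Find 'x0 y0 m x1 y1 l S' sequences: move-to, line-to, stroke."""
    num = rb"(-?\d+(?:\.\d+)?)"
    pattern = re.compile(
        num + rb"\s+" + num + rb"\s+m\s+" + num + rb"\s+" + num + rb"\s+l\s+S\b"
    )
    return [tuple(float(v) for v in m.groups()) for m in pattern.finditer(data)]

graphics_only = strip_text_objects(stream)
lines = stroked_lines(graphics_only)
print(lines)
```

The two tuples found are the endpoints of the two line segments in the excerpt. They live in the page's content stream next to the text, not in an object of their own.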

Author

anomam commented Jun 16, 2023

Thanks for your reply! I just sent you an email containing the troublesome PDF after removing any sensitive data. I hope that will be helpful. Please let me know if you have any additional questions.

Collaborator

JorjMcKie commented Jun 16, 2023

Thanks for the file - that did help!
In the end, the error is connected to PDF layers. Starting with 1.22.0, the layer name is part of the returned dictionaries, but it was not correctly interpreted by the base software that actually reads the content stream and extracts and returns the layer name.
So it may happen that the character string points to invalid data.
This will be fixed in the next version.
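
For readers wondering what a safeguard against an invalid name string could look like, here is a hedged Python sketch. The actual fix lives in MuPDF and in PyMuPDF's C layer and is not shown in this thread, so the function name `safe_layer_name` and the choice of replacement-character decoding are assumptions made purely for illustration.

```python
def safe_layer_name(raw: bytes) -> str:
    """Return a printable layer name even if `raw` is not valid UTF-8.

    Hypothetical helper: invalid bytes become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")

good = safe_layer_name(b"Layer 1")
bad = safe_layer_name(b"\x90abc")  # 0x90 is the byte from the traceback
print(repr(good), repr(bad))
```

The design choice here is to degrade the name rather than fail the whole call, so one bad layer name does not prevent the remaining drawings from being returned.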

JorjMcKie added a commit that referenced this issue Jun 16, 2023
Detail descriptions:

Fixing #2468:
MuPDF now correctly provides the OC layer name. In PyMuPDF, a safeguard against invalid layer name strings has been implemented.

Fixing #2365:
Combined "fill" and "stroke" paths ("fs") now correctly report dictionary keys from the sub-paths.

Fixing #2391:
Checkbox "True" values were inconsistent between getting and setting. This value is now always set to "Yes".

Fixing #2400:
Fixed by an internal MuPDF fix.

Fixing #2404:
Fixed by an internal MuPDF fix.

Fixing #2430:
We falsely reduced the reference count of `Py_None` object when creating the dictionary `Font.infos`. This has been corrected.

Other changes:

* Support for "cloudy"  annotation borders

* Consistent setting / unsetting of RadioButtons within the same RB group. However, the RB group must be a PDF object: radio buttons with JavaScript that simulates that behaviour are not supported.

* Adobe Photoshop images are now supported as input (Pixmaps and Documents).

* The /Locked key in OCProperties is now supported for getting / setting.

* Document method `set_layer_ui_config()` now also supports the OCG name as argument (was just the sequence number previously).
Author

anomam commented Jun 19, 2023

Awesome, thanks again!

julian-smith-artifex-com pushed a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Jun 19, 2023
julian-smith-artifex-com pushed a commit that referenced this issue Jun 20, 2023
@julian-smith-artifex-com
Collaborator

Fixed in 1.22.5.
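
Based only on the versions mentioned in this thread (works in 1.21.1, fails starting with 1.22.0, fixed in 1.22.5), a small check like the following could flag an affected installation. It is a sketch: the helper names are hypothetical, and it assumes every release from 1.22.0 up to but not including 1.22.5 is affected, which the thread does not state explicitly.

```python
def parse_version(text: str) -> tuple:
    """Turn a string such as '1.22.0' into a comparable tuple (1, 22, 0)."""
    return tuple(int(part) for part in text.split(".")[:3])

def is_affected(version: str) -> bool:
    """True if `version` falls in the assumed affected range."""
    return (1, 22, 0) <= parse_version(version) < (1, 22, 5)

for v in ("1.21.1", "1.22.0", "1.22.5"):
    print(v, is_affected(v))
```

The version string would come from the installed package, for example from the output requested under "Your configuration" above.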
