save(garbage=3) breaks get_pixmap() with side effect #2596

Max-ES · 2023-08-17T13:59:49Z

Problem

After saving the fitz.Document with pdf.save("filename.pdf", garbage=3) the embedded images of the document I can see in the pixmap of the pages are mixed up: pixmap = pdf[0].get_pixmap().
fitz.Document.save(garbage=3) and fitz.Document.write(garbage=3) appear to have side effects that change the document in place.
When I set garbage to 2 it works as expected.

Observation

Checked checkboxes become unchecked boxes and the other way around, some checkboxes become logos etc..

Code Example

pdf = fitz.Document("filename.pdf")
# pixmap of the first page looks just fine
pix0 = pdf[0].get_pixmap()

# save with "garbage=3" breaks the pdf in memory (the saved pdf on disk is still fine)
pdf.save("debug.pdf", garbage=3)

# the pixmap is now broken - images in the pixmap are mixed up
pix1 = pdf[0].get_pixmap()  # <---------- PROBLEM

# read the saved pdf again
new_pdf = fitz.Document("debug.pdf")
# the pixmap of the saved and read pdf file is not broken!
pix2 = new_pdf[0].get_pixmap()

Sadly I can't provide the affected internal documents.

Version

PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.9 on darwin (64-bit).

Installed with pip install PyMuPdf
I also tested it on 1.22.5 on macOS with the same result.

The text was updated successfully, but these errors were encountered:

JorjMcKie · 2023-08-17T16:19:09Z

This is sad indeed. Please do provide some reproducing file, or otherwise we won't be able to accept the report.

JorjMcKie · 2023-08-18T16:14:39Z

Other missing information pieces pertain to you platform and PyMuPDF version.
Please make sure to provide all this.

Max-ES · 2023-08-18T17:13:52Z

Thanks for the quick response. I updated the description. I will have to check if there is a way to provide you with the documents - this might take some time.

JorjMcKie · 2023-08-18T17:15:33Z

Thanks for the quick response. I updated the description. I will have to check if there is a way to provide you with the documents - this might take some time.

You can use my private e-mail address. This way there will be no public visibility.

JorjMcKie · 2023-08-23T13:42:22Z

I just received the example file - thanks a lot.

I did a few tests and found that this probably is no bug. Instead, a few precautions should be taken to ensure properly created pixmaps in this situation.
A few preliminary comments on what is happening on Document.save():

The base library maintains a cache in memory to improve performance. Large objects are stored there - first and foremost images and pixmaps.
Depending on the save options, the Document in memory is being changed. For any positive garbage value, PDF objects may be deleted or be updated. The four positive garbage levels are like "onion skins": a larger value implies all actions of the lower levels.
- Garbage 1 deletes unreferenced objects. This is BTW the minimum value to use after redactions have been applied: otherwise, content invalidated by redactions will still be physically present and continue to be potential data leaks.
- Garbage 2 compacts the XREF table.
- Garbage 3 will merge objects with equal definitions. That may lead to renumbering all objects in terms of their XREF numbers.
- Garbage 4 finally also searches for duplicate binary streams in the Document and consolidates them.
With this in mind, the observed behavior can be explained. When you used pdf[0] instead of assigning a persistent Page object for taking the Pixmap, you already took half the way to prevent the worst. When reusing the same Page object before and after the save, a page = pdf.reload_page(page) would do the same job.
- A complete safeguard would be to also empty the internal cache of the base library.

So wrapping up, the following sequence of instructions should prevent the problem:

pix1 = page.get_pixmap()
doc.save(... with garbage>0 or other changes ...)

# the following instructions will make sure that
# the pixmap is made from updated data.
fitz.TOOLS.store_shrink(100)  # clears the internal cache
page = doc.reload_page(page)  # reload the page preserving dependent objects

pix2 = page.get_pixmap()

So as a general rule, you should

Clear the cache before rendering pages after saving the document.
Reload a page object before rendering after saving the document.
- In the next version, we will include cache clearing in the Document.reload_page(), such that this instruction will be sufficient.

Max-ES · 2023-08-30T06:06:49Z

Thank you for your work!
I understand, that a garbage value > 0 changes the loaded page by optimizing it.

But the mixing up of the images in the page still seems like a bug to me.
When the xref numbers are regenerated with garbage=3, I would expect the pixmap to look the same, without reloading the page, especially because the saved document looks fine. (The saved document gets optimized and the loaded page gets 'mangled' in the process.)

If I understood it correctly the problems are the cached images that are renumbered during saving and falsely referenced after that which leads to the switched appearances.

Updating the cached images with the new indices during saving would fix the issue, wouldn't it?

JorjMcKie · 2023-09-06T08:12:49Z

Updating the cached images with the new indices during saving would fix the issue, wouldn't it?

May well be. But that is not what a cache is for: its sense in life is to save work, not the opposite.
The adequate way of doing this is (1) empty the cache, (2) reload the page. If then another Pixmap of the page is required, it will automatically be based on current page contents and not on invalidated cache pieces.

When continuing to use a page object after the PDF has been saved with garbage options may change the MuPDF cache such that e.g. rendering the page after the save will produce wrong appearance or even garbled appearance. The user needs to reload the page (page = doc.reload(page)) before using it further, especially for rendering. The reload method now always empties the cache.

The underlying fz_page was being kept alive so reload_page()'s call of fz_load_page() was returning it from the document's internal list of pages instead of regenerating it.

julian-smith-artifex-com · 2023-10-13T09:31:18Z

Fixed in 1.23.5.

JorjMcKie added the example required label Aug 18, 2023

JorjMcKie self-assigned this Aug 23, 2023

JorjMcKie added enhancement and removed example required labels Aug 23, 2023

JorjMcKie added Fixed in next release and removed Fixed in next release labels Oct 3, 2023

JorjMcKie closed this as completed in c38bb5b Oct 5, 2023

JorjMcKie pushed a commit that referenced this issue Oct 5, 2023

src/__init__.py: address #2596 in rebased.

7658e42

The underlying fz_page was being kept alive so reload_page()'s call of fz_load_page() was returning it from the document's internal list of pages instead of regenerating it.

JorjMcKie reopened this Oct 5, 2023

JorjMcKie added the Fixed in next release label Oct 5, 2023

julian-smith-artifex-com removed the Fixed in next release label Oct 13, 2023

julian-smith-artifex-com closed this as completed Oct 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

save(garbage=3) breaks get_pixmap() with side effect #2596

save(garbage=3) breaks get_pixmap() with side effect #2596

Max-ES commented Aug 17, 2023 •

edited

Loading

JorjMcKie commented Aug 17, 2023

JorjMcKie commented Aug 18, 2023

Max-ES commented Aug 18, 2023

JorjMcKie commented Aug 18, 2023

JorjMcKie commented Aug 23, 2023

Max-ES commented Aug 30, 2023

JorjMcKie commented Sep 6, 2023

julian-smith-artifex-com commented Oct 13, 2023

save(garbage=3) breaks get_pixmap() with side effect #2596

save(garbage=3) breaks get_pixmap() with side effect #2596

Comments

Max-ES commented Aug 17, 2023 • edited Loading

Problem

Observation

Code Example

Version

JorjMcKie commented Aug 17, 2023

JorjMcKie commented Aug 18, 2023

Max-ES commented Aug 18, 2023

JorjMcKie commented Aug 18, 2023

JorjMcKie commented Aug 23, 2023

Max-ES commented Aug 30, 2023

JorjMcKie commented Sep 6, 2023

julian-smith-artifex-com commented Oct 13, 2023

Max-ES commented Aug 17, 2023 •

edited

Loading