Different image format/data from Page.get_text("dict") and Fitz.get_page_images() #2290

chatumao · 2023-03-16T16:02:24Z


Describe the bug (mandatory)

I am trying to match (inline) images found via Page.get_text("dict") with the ones obtained by Fitz.get_page_images(), in order to assign the image name to the object obtained by the first method. I have a PDF document that appears to contain a single PNG image: I checked with a PDF editor, and from what I can read in the PDF source there is only one image (not inline, but an object). Creating a PIL image from the data returned by the first method gives me a JPEG image; the data from the other method yields a PNG. The underlying binary data also differs.

To Reproduce (mandatory)

#!/usr/bin/env python3

import io

import fitz  # PyMuPDF
from PIL import Image


def run():
    # Open the PDF from memory, as in the original report.
    with open("your_pdf_path_here", "rb") as f:
        doc = fitz.open("pdf", io.BytesIO(f.read()))
    for idx, page in enumerate(doc.pages()):
        text_dict = page.get_text("dict")
        img_list_xref = doc.get_page_images(idx, full=True)
        # Image extracted from the PDF object via its xref.
        img_found = doc.extract_image(img_list_xref[0][0])
        # Image block from get_text("dict"); index 3 is specific to the test file.
        img_inline = text_dict["blocks"][3]
        img_inline_pil = Image.open(io.BytesIO(img_inline["image"]))
        img_found_pil = Image.open(io.BytesIO(img_found["image"]))
        # The two formats differ for the affected file.
        print(img_inline_pil.format, img_found_pil.format)


if __name__ == "__main__":
    run()
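
To compare what the two calls return without relying on PIL, the leading bytes of each buffer can be checked directly. This is a minimal sketch that assumes only the standard PNG and JPEG signatures; `sniff_image_format` is a hypothetical helper and not part of PyMuPDF:

```python
# Hypothetical helper (not part of PyMuPDF): identify an image buffer
# by its leading signature bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_format(data: bytes) -> str:
    """Return "png", "jpeg", or "unknown" based on the first bytes of data."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return "unknown"
```

Applied to `img_inline["image"]` and `img_found["image"]` from the script above, a mismatch like the one reported would show up as two different return values. Both dictionaries should also carry an `"ext"` key that can be compared in the same way.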

Your configuration (mandatory)

Manjaro 64bit
Python 3.10.7 64bit
PyMuPDF 1.21.1 from wheel

Additional context (optional)

I would like to share the file with you, but cannot do so publicly.


JorjMcKie · 2023-03-16T16:33:18Z

Please provide data to reproduce problems.
That is probably not a bug. There are a number of situations where MuPDF falls back to creating a PNG.

"from the other method it yields a PNG type"

Which other method?

JorjMcKie · 2023-03-16T16:50:28Z

Ah, I did not read it thoroughly enough: you can send the file to my e-mail, if that is ok.

chatumao · 2023-03-16T18:10:05Z

I sent the file to jorj dot x dot mckie at outlook dot de.
Thanks for looking into it.

JorjMcKie · 2023-03-17T10:32:21Z

Found the problem, which will be fixed in the next version.

Added a check for compressed buffer existence after creating the image from pdf_obj. We were falsely assuming that a PNG image had to be created if the raw (compressed) stream could not be interpreted as an image. This assumption was wrong (at least) in the case where two compression filters existed.
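
To check whether a given image object falls into the two-filter case, its /Filter entry can be inspected. This is a minimal sketch, assuming the entry is available as PDF source text such as `/DCTDecode` or `[/FlateDecode /DCTDecode]` (for instance, the second item of the tuple returned by `doc.xref_get_key(xref, "Filter")`); `parse_filters` and `has_filter_chain` are hypothetical helpers, not PyMuPDF functions:

```python
import re


def parse_filters(filter_value: str) -> list[str]:
    """Extract filter names from a PDF /Filter value given as source text.

    Accepts a single name ("/DCTDecode") or an array
    ("[/FlateDecode /DCTDecode]"). A value without any name token
    (for example "null") yields an empty list.
    """
    return re.findall(r"/([A-Za-z0-9]+)", filter_value)


def has_filter_chain(filter_value: str) -> bool:
    """Return True if the stream is encoded with two or more filters."""
    return len(parse_filters(filter_value)) >= 2
```

An image whose entry parses to two names, such as `["FlateDecode", "DCTDecode"]`, is the kind of stream described above, so it is a candidate for the mismatch on versions before the fix.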

JorjMcKie · 2023-03-17T12:17:14Z

Fixed by commit cee1dda

chatumao · 2023-03-17T14:51:39Z

Thanks a lot! :)

JorjMcKie added bug Fixed in next release labels Mar 17, 2023

JorjMcKie mentioned this issue Mar 17, 2023

Harald #2292

Closed

julian-smith-artifex-com removed the Fixed in next release label Apr 14, 2023

julian-smith-artifex-com closed this as completed Apr 14, 2023
