Different image format/data from Page.get_text("dict") and Fitz.get_page_images() #2290

Closed
chatumao opened this issue Mar 16, 2023 · 6 comments
@chatumao

Please provide all mandatory information!

Describe the bug (mandatory)

I am trying to match the (inline) images found via Page.get_text("dict") with the ones returned by Fitz.get_page_images(), so that I can assign the image name to the object obtained by the first method. I have a PDF document that appears to contain a single PNG image: I checked with a PDF editor, and from what I can read in the PDF source there is only one image (not inline, but a separate object). Creating a PIL image from the data returned by the first method gives me a JPEG image type; from the other method it yields a PNG type. The underlying binary data also differs.

To Reproduce (mandatory)

#!/usr/bin/env python3

import io

import fitz  # PyMuPDF
from PIL import Image


def run():
    # Read the PDF into memory and open it from the buffer.
    with open("your_pdf_path_here", "rb") as f:
        idata = f.read()
    my_fitz = fitz.open("pdf", io.BytesIO(idata))
    for idx, page in enumerate(my_fitz.pages()):
        # Method 1: image block from the text extraction.
        # Index 3 is hard-coded for the image block of the file in question.
        img_list = page.get_text("dict")
        img_inline = img_list["blocks"][3]
        # Method 2: image extracted via its xref.
        img_list_xref = my_fitz.get_page_images(idx, full=True)
        img_found = my_fitz.extract_image(img_list_xref[0][0])
        img_inline_pil = Image.open(io.BytesIO(img_inline["image"]))
        img_found_pil = Image.open(io.BytesIO(img_found["image"]))
        # Show the image type PIL detects for each method.
        print(img_inline_pil.format, img_found_pil.format)


if __name__ == "__main__":
    run()
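
To compare the two buffers without going through PIL, the leading bytes of each can be checked against the well-known PNG and JPEG signatures. The following is only a minimal sketch: `sniff_image_format` and `same_encoding` are hypothetical helpers, not part of PyMuPDF, and they recognize nothing beyond these two formats.

```python
# Hypothetical helpers (not a PyMuPDF API): identify an image buffer
# by its leading "magic" bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file signature
JPEG_SOI = b"\xff\xd8\xff"            # JPEG start-of-image marker plus next marker prefix


def sniff_image_format(data: bytes) -> str:
    """Return "png", "jpeg" or "unknown" for the given image bytes."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SOI):
        return "jpeg"
    return "unknown"


def same_encoding(a: bytes, b: bytes) -> bool:
    """True if both buffers are recognized and carry the same format."""
    fmt_a, fmt_b = sniff_image_format(a), sniff_image_format(b)
    return fmt_a != "unknown" and fmt_a == fmt_b
```

In the script above this would be called as `sniff_image_format(img_inline["image"])` and `sniff_image_format(img_found["image"])`.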

Your configuration (mandatory)

  • Manjaro 64-bit
  • Python 3.10.7 64-bit
  • PyMuPDF 1.21.1, installed from wheel

Additional context (optional)

I would like to share the file with you, but cannot do so publicly.
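
For the goal described above (attaching an image name to each image block), comparing encoded bytes is fragile, because the same image may be delivered in different encodings. One possible alternative is to match on position. The sketch below assumes a PyMuPDF version in which Page.get_image_info() accepts xrefs=True and reports an "xref" key next to "bbox"; the function names and the tolerance value are hypothetical choices for illustration.

```python
def match_image_blocks_to_xrefs(page, tol=1e-3):
    """Pair each image block of page.get_text("dict") with a PDF xref.

    `page` is expected to behave like a PyMuPDF Page. An xref of 0 means
    no PDF object could be associated with the block (for example, a
    true inline image).
    """
    # Image blocks have "type" == 1 in the dict output.
    blocks = [b for b in page.get_text("dict")["blocks"] if b["type"] == 1]
    # Assumption: with xrefs=True each entry carries "xref" besides "bbox".
    infos = page.get_image_info(xrefs=True)

    def close(r1, r2):
        # Compare two (x0, y0, x1, y1) rectangles coordinate by coordinate.
        return all(abs(a - b) <= tol for a, b in zip(r1, r2))

    matches = []
    for block in blocks:
        xref = next(
            (info["xref"] for info in infos if close(info["bbox"], block["bbox"])),
            0,
        )
        matches.append((block, xref))
    return matches


def image_names_by_xref(doc, pno):
    """Map xref -> image name for page number `pno`."""
    # With full=True each item is (xref, smask, width, height, bpc,
    # colorspace, alt. colorspace, name, filter, referencer).
    return {item[0]: item[7] for item in doc.get_page_images(pno, full=True)}
```

Combining the two results gives the name for every block whose xref is non-zero; blocks with xref 0 would still need separate handling.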

@JorjMcKie
Collaborator

Please provide data to reproduce the problem.
That is probably not a bug. There are a number of situations where MuPDF falls back to creating a PNG.

from the other method it yields a PNG type

Which other method?

@JorjMcKie
Collaborator

JorjMcKie commented Mar 16, 2023

Ah, I did not read it thoroughly enough: you can send the file to my e-mail, if that is ok.

@chatumao
Author

I sent the file to jorj dot x dot mckie at outlook dot de.
Thanks for looking into it.

@JorjMcKie
Collaborator

Found the problem, which will be fixed in the next version.

JorjMcKie referenced this issue Mar 17, 2023
Added a check for the existence of a compressed buffer after creating the image from pdf_obj.
We were falsely assuming that a PNG image had to be created if the raw (compressed) stream could not be interpreted as an image.
This assumption was wrong (at least) in the case where two compression filters existed.
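
The case named in the commit message, an image stream wrapped in more than one compression filter, can be spotted by looking at the /Filter entry of the image object. The following is a rough sketch rather than what the fix itself does. It assumes that Document.xref_get_key(xref, "Filter") returns a (type, value) pair whose value is a name such as /DCTDecode, an array such as [/FlateDecode/DCTDecode], or "null"; `parse_filters`, `image_filters` and `has_filter_chain` are hypothetical helper names.

```python
import re
from typing import List


def parse_filters(value: str) -> List[str]:
    """Extract PDF filter names from a /Filter value string.

    Accepts a single name ("/DCTDecode"), an array
    ("[/FlateDecode /DCTDecode]"), or "null" when no filter is set.
    An indirect reference such as "5 0 R" is not resolved and yields [].
    """
    # PDF names start with "/" and end at whitespace or a delimiter.
    return re.findall(r"/([^\s/\[\]<>()]+)", value)


def image_filters(doc, xref: int) -> List[str]:
    """Return the filters of object `xref` in the order they are listed."""
    # Assumption: xref_get_key returns a (type, value) tuple of strings.
    _type, value = doc.xref_get_key(xref, "Filter")
    return parse_filters(value)


def has_filter_chain(doc, xref: int) -> bool:
    """True if the stream is wrapped in two or more filters."""
    return len(image_filters(doc, xref)) > 1
```

Running `has_filter_chain(doc, xref)` over the xrefs returned by get_page_images() would flag the images where the raw stream alone is not directly usable as an image file.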
@JorjMcKie
Collaborator

Fixed by commit cee1dda

JorjMcKie mentioned this issue Mar 17, 2023
@chatumao
Author

Thanks a lot! :)
