
RuntimeError: image is too high for a long paged pdf file when trying get_pixmap() #1995

Closed
gooseillo opened this issue Oct 27, 2022 · 3 comments

@gooseillo

gooseillo commented Oct 27, 2022

Please provide all mandatory information!

Describe the bug (mandatory)

I have a PDF whose pages are each 3480 pixels long. Calling get_pixmap() on each page raises the following error:
RuntimeError: image is too high

To Reproduce (mandatory)

import fitz  # PyMuPDF

pdf_doc = fitz.open("poster-P060210766.pdf")
pdf_image_dpi=200
pdf_doc_img = fitz.open()
for ppi,pdf_page in enumerate(pdf_doc.pages()):
    print(ppi)
    pdf_pix_map = pdf_page.get_pixmap()  # raises RuntimeError: image is too high
    pdf_page_img = pdf_doc_img.new_page(width=pdf_page.rect.width, height=pdf_page.rect.height)
    xref = pdf_page_img.insert_image(rect=pdf_page.rect, pixmap=pdf_pix_map)
pdf_doc.close()

Screenshots (optional)


Your configuration (mandatory)

  • Windows 10
  • Jupyter notebook in VSCode
  • Python 3.8.0, Fitz 0.0.1.dev2, PyMuPDF 1.20.2

Additional context (optional)

I split a very long PDF into pages that are 3480 px long (the AWS Textract limit for page size). If I split it into shorter pages, I'll have more pages, which will be more expensive to process (AWS charges $$/page).

PDF File-
https://drive.google.com/file/d/13uld7nQ5u8-oxvBIyr3ZykGVPphHeda4/view?usp=sharing

@JorjMcKie
Collaborator

There is an in-built limit of 65536 pixels to image width and height in MuPDF (not PyMuPDF).
Your pages display images beyond that limit (in the range of 90,000+ in height).

MuPDF will be updated to accept larger values, up to at least the next power of 2 = 131,072. However, this will not happen in the release to be published shortly, but in the version following that one. Sorry, bad timing.

Maybe you can use a way to shrink the images before inserting them. I know no details of your process, but maybe there is a way to use the thumbnail() method of PIL.Image once you detect a too large image dimension.
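A minimal sketch of that shrinking idea, assuming the oversized image is available as a standalone file (for example the original TIFF). The helper names `fit_within` and `shrink_image_file` and the default limit of 65536 are illustrative and not part of PyMuPDF or Pillow. Pillow is imported lazily so the size calculation can be used on its own.

```python
def fit_within(width, height, limit=65536):
    """Return (new_width, new_height) scaled down so that neither side
    exceeds `limit`, keeping the aspect ratio. Sizes already within the
    limit are returned unchanged."""
    longest = max(width, height)
    if longest <= limit:
        return width, height
    # Integer arithmetic avoids floating point rounding surprises.
    new_width = max(1, width * limit // longest)
    new_height = max(1, height * limit // longest)
    return new_width, new_height


def shrink_image_file(src_path, dst_path, limit=65536):
    """Hypothetical helper: shrink an image file in place of the original
    so that both dimensions fit within `limit`."""
    from PIL import Image  # Pillow, imported only when this helper is called

    with Image.open(src_path) as im:
        # thumbnail() resizes in place, keeps the aspect ratio,
        # and never enlarges the image.
        im.thumbnail((limit, limit))
        im.save(dst_path)
```

For an image of 3000 x 92000 pixels (the width here is made up; only the roughly 92k height comes from this thread), `fit_within(3000, 92000)` gives a size whose height is exactly 65536. Note that Pillow has its own safeguard against very large images (`Image.MAX_IMAGE_PIXELS`), which may need attention before such a file can be opened.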

JorjMcKie added enhancement and removed bug labels Oct 27, 2022
@gooseillo
Author

gooseillo commented Oct 27, 2022

Can you tell me what you used to see the pixel size of the PDF page? I'm struggling to find that.

As you can see, the max length of the page is 3480.


I started off with an initial TIFF (very long), converted it to a PDF, and then used Fitz to split it into pages that are 3480 px long using the following code. I wonder how you got the 90,000+ pixels?

import fitz

def pdf_split(pdf_file):

    src = fitz.open(pdf_file)
    doc = fitz.open()  # empty output PDF

    for spage in src:  # for each page in input
        xref = 0  # force initial page copy to output
        r = spage.rect  # input page rectangle
        d = fitz.Rect(
            spage.cropbox_position, spage.cropbox_position  # CropBox displacement if not
        )  # starting at (0, 0)

        MAX_DIMENSION = 3480

        x0 = r.x0
        y0 = r.y0
        x1 = r.x1
        y1 = r.y1

        length = y1
        isHorizontal = False

        rect_list = []

        if x1 > y1:
            length = x1
            isHorizontal = True

        while length > 0:
            if isHorizontal:
                rect_list.append(fitz.Rect(x0, y0, x0 + MAX_DIMENSION, y1))
                x0 += MAX_DIMENSION
            else:
                rect_list.append(fitz.Rect(x0, y0, x1, y0 + MAX_DIMENSION))
                y0 += MAX_DIMENSION
            
            length -= MAX_DIMENSION

        for rx in rect_list:  # run thru rect list
            # print(rx)
            rx += d  # add the CropBox displacement
            page = doc.new_page(
                -1, width=rx.width, height=rx.height  # new output page with rx dimensions
            )
            xref = page.show_pdf_page(
                page.rect,  # fill all new page with the image
                src,  # input document
                spage.number,  # input page number
                clip=rx,  # which part to use of input page
                reuse_xref=xref,
            )  # copy input page once only

    # that's it, save output file
    doc.save(
        "poster-" + src.name, garbage=4, deflate=True  # eliminate duplicate objects
    )  # compress stuff where possible

@JorjMcKie
Collaborator

JorjMcKie commented Oct 27, 2022

It is not the size of the page, but the size of the image displayed by the page. The image height of 92 k-pixels is the problem.
MuPDF currently refuses to process images which have a width or height of more than 2**16 pixels.
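One way to see the image dimensions (as opposed to the page dimensions) is `Page.get_images()`, which returns one tuple per image with the xref first and the pixel width and height as the third and fourth items. The sketch below is only an illustration of that check: the names `oversized_images` and `report_oversized` are made up, and the 2**16 threshold is taken from the comment above rather than from MuPDF's source.

```python
MUPDF_IMAGE_LIMIT = 2**16  # limit quoted in this thread, not read from MuPDF


def oversized_images(image_infos, limit=MUPDF_IMAGE_LIMIT):
    """Filter tuples shaped like those from Page.get_images():
    item 0 is the xref, item 2 the width, item 3 the height (pixels).
    Returns (xref, width, height) for images exceeding `limit`."""
    return [
        (info[0], info[2], info[3])
        for info in image_infos
        if info[2] > limit or info[3] > limit
    ]


def report_oversized(pdf_path, limit=MUPDF_IMAGE_LIMIT):
    """Hypothetical helper: print every image in the PDF that is too large."""
    import fitz  # PyMuPDF, imported only when this helper is called

    doc = fitz.open(pdf_path)
    for page in doc:
        infos = page.get_images(full=True)
        for xref, width, height in oversized_images(infos, limit):
            print(f"page {page.number}: image xref {xref} is {width} x {height} px")
    doc.close()
```

Because the split pages all show clipped parts of the same source image, each of them would be expected to report the same full-size image, which is why a page that is only 3480 long can still hit the limit.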
