RuntimeError: image is too high for a long paged pdf file when trying get_pixmap() #1995

gooseillo · 2022-10-27T02:04:55Z

Please provide all mandatory information!

Describe the bug (mandatory)

I have a PDF in which each page is 3480 pixels long. For each page I call get_pixmap(), which fails with the following error:
RuntimeError: image is too high

To Reproduce (mandatory)

import fitz  # PyMuPDF

pdf_doc = fitz.open("poster-P060210766.pdf")
pdf_image_dpi = 200  # defined but not passed to get_pixmap() below
pdf_doc_img = fitz.open()  # empty output PDF
for ppi, pdf_page in enumerate(pdf_doc.pages()):
    print(ppi)
    pdf_pix_map = pdf_page.get_pixmap()  # raises "RuntimeError: image is too high"
    pdf_page_img = pdf_doc_img.new_page(width=pdf_page.rect.width, height=pdf_page.rect.height)
    xref = pdf_page_img.insert_image(rect=pdf_page.rect, pixmap=pdf_pix_map)
pdf_doc.close()

Screenshots (optional)

Your configuration (mandatory)

Windows 10
Jupyter notebook in VSCode
Python 3.8.0, Fitz 0.0.1.dev2, PyMuPDF 1.20.2

Additional context (optional)

I split a very long PDF into pages that are 3480 px long (the AWS Textract limit for page size). If I split it into shorter pages, I will have more pages, which will be more expensive to process (AWS charges per page).

PDF file:
https://drive.google.com/file/d/13uld7nQ5u8-oxvBIyr3ZykGVPphHeda4/view?usp=sharing


JorjMcKie · 2022-10-27T11:16:03Z

There is an in-built limit of 65536 pixels to image width and height in MuPDF (not PyMuPDF).
Your pages display images beyond that limit (in the range of 90,000+ in height).

MuPDF will be updated to accept larger values, up to at least the next power of 2 = 131,072. However, this will not happen in the release to be published shortly, but in the one after it. Sorry, bad timing.

Maybe you can shrink the images before inserting them. I do not know the details of your process, but you could perhaps use the thumbnail() method of PIL.Image once you detect an image dimension that is too large.
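
The shrinking suggestion above could look roughly like the sketch below. It is not code from this thread: the helper names fit_within and shrink_image_file are hypothetical, the limit of 65536 is the figure quoted above, and whether a dimension of exactly 65536 is accepted is an assumption. Pillow is imported only inside the function that needs it.

```python
MUPDF_MAX_DIM = 65536  # limit quoted in this thread; exact boundary behaviour is an assumption


def fit_within(width, height, limit=MUPDF_MAX_DIM):
    """Return (width, height) scaled down so neither side exceeds limit.

    The aspect ratio is kept (up to integer rounding) and images that
    already fit are returned unchanged.
    """
    largest = max(width, height)
    if largest <= limit:
        return width, height
    # Integer arithmetic avoids floating-point rounding surprises.
    new_width = max(1, width * limit // largest)
    new_height = max(1, height * limit // largest)
    return new_width, new_height


def shrink_image_file(src_path, dst_path, limit=MUPDF_MAX_DIM):
    """Hypothetical helper: shrink an image file with PIL.Image.thumbnail()."""
    from PIL import Image  # Pillow, imported lazily so fit_within() works without it

    # Note: Pillow may warn about or refuse very large images unless
    # Image.MAX_IMAGE_PIXELS is adjusted.
    with Image.open(src_path) as img:
        img.thumbnail((limit, limit))  # resizes in place, keeps aspect ratio, never enlarges
        img.save(dst_path)
```

For an image of 4000 x 92000 pixels (illustrative numbers, not measured from the attached file), fit_within(4000, 92000) gives the size the image would need to be reduced to before insertion.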

gooseillo · 2022-10-27T12:45:44Z

Can you tell me what you used to see the pixel size of the PDF page? I'm struggling to find that.

As you can see, the maximum length of a page is 3480.

I started off with an initial TIFF (very long), converted it to a PDF, and then used Fitz to split it into pages that are 3480 px long using the following code. I wonder how you got the 90,000+ pixels?

import fitz

def pdf_split(pdf_file):

    src = fitz.open(pdf_file)
    doc = fitz.open()  # empty output PDF

    for spage in src:  # for each page in input
        xref = 0  # force initial page copy to output
        r = spage.rect  # input page rectangle
        d = fitz.Rect(
            spage.cropbox_position, spage.cropbox_position  # CropBox displacement if not
        )  # starting at (0, 0)

        MAX_DIMENSION = 3480

        x0 = r.x0
        y0 = r.y0
        x1 = r.x1
        y1 = r.y1

        length = y1
        isHorizontal = False

        rect_list = []

        if x1 > y1:
            length = x1
            isHorizontal = True

        while length > 0:
            if isHorizontal:
                rect_list.append(fitz.Rect(x0, y0, x0 + MAX_DIMENSION, y1))
                x0 += MAX_DIMENSION
            else:
                rect_list.append(fitz.Rect(x0, y0, x1, y0 + MAX_DIMENSION))
                y0 += MAX_DIMENSION
            
            length -= MAX_DIMENSION

        for rx in rect_list:  # run thru rect list
            # print(rx)
            rx += d  # add the CropBox displacement
            page = doc.new_page(
                -1, width=rx.width, height=rx.height  # new output page with rx dimensions
            )
            xref = page.show_pdf_page(
                page.rect,  # fill all new page with the image
                src,  # input document
                spage.number,  # input page number
                clip=rx,  # which part to use of input page
                reuse_xref=xref,
            )  # copy input page once only

    # that's it, save output file
    doc.save(
        "poster-" + src.name, garbage=4, deflate=True  # eliminate duplicate objects
    )  # compress stuff where possible
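
As a side note on the strip computation in the loop above: it starts from length = y1 (or x1) rather than the distance from y0, and the last rectangle can extend past the page edge. A minimal sketch of the same tiling with the final strip clamped is shown below. The function name strip_bounds is hypothetical, and this only covers the arithmetic, not the show_pdf_page() calls.

```python
def strip_bounds(start, end, max_len):
    """Return (lo, hi) pairs covering [start, end] in steps of max_len.

    The last strip is clamped to end, so it may be shorter than max_len.
    """
    bounds = []
    lo = start
    while lo < end:
        hi = min(lo + max_len, end)  # never run past the page edge
        bounds.append((lo, hi))
        lo = hi
    return bounds
```

For a vertical page, each (lo, hi) pair would correspond to fitz.Rect(x0, lo, x1, hi). This does not change the behaviour discussed in this issue, since the error comes from the embedded image rather than from the page size.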

JorjMcKie · 2022-10-27T12:56:57Z

It is not the size of the page, but the size of the image displayed by the page. The image height of 92 k-pixels is the problem.
MuPDF currently refuses to process images which have a width or height of more than 2**16 pixels.
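
One way to see the dimensions of the images a page references is Page.get_images(), whose entries start with (xref, smask, width, height, ...). The sketch below is an assumption about how such a check could be done, not necessarily how the 92k figure in this thread was obtained. The names oversized and report are hypothetical.

```python
MUPDF_MAX_DIM = 2**16  # limit quoted in this thread


def oversized(image_entries, limit=MUPDF_MAX_DIM):
    """Pick (xref, width, height) for entries whose width or height exceeds limit.

    image_entries are tuples shaped like Page.get_images() items:
    (xref, smask, width, height, ...).
    """
    return [
        (entry[0], entry[2], entry[3])
        for entry in image_entries
        if entry[2] > limit or entry[3] > limit
    ]


def report(pdf_path):
    """Print every image in the PDF that exceeds the limit, page by page."""
    import fitz  # PyMuPDF, imported lazily so oversized() works without it

    doc = fitz.open(pdf_path)
    for page in doc:
        for xref, width, height in oversized(page.get_images(full=True)):
            print(f"page {page.number}: image xref {xref} is {width} x {height} px")
    doc.close()
```

Calling report("poster-P060210766.pdf") on the split file would list the images that are larger than the limit, independent of the 3480-unit page height.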

gooseillo added the bug label Oct 27, 2022

gooseillo assigned JorjMcKie Oct 27, 2022

julian-smith-artifex-com assigned julian-smith-artifex-com and unassigned JorjMcKie Oct 27, 2022

JorjMcKie added enhancement and removed bug labels Oct 27, 2022

JorjMcKie mentioned this issue Nov 28, 2022

Image with Filter "[/FlateDecode/JPXDecode]" not extracted #2087

Closed

JorjMcKie added the Fixed in next release label Dec 14, 2022

julian-smith-artifex-com removed the Fixed in next release label Apr 14, 2023

julian-smith-artifex-com closed this as completed Apr 14, 2023

JorjMcKie mentioned this issue Jul 30, 2023

get_pixmap - RuntimeError: Private data too large to pack into display list node #2563

Closed
