.pagesplit() not working with iOS Quartz produced pdfs #491

Open
TarunChakitha opened this issue Aug 3, 2024 · 3 comments

Comments

TarunChakitha commented Aug 3, 2024

Hi @jcupitt,

I am trying to split a many-page image into a list of N separate images.

Code:

import pyvips

file_path = "/filesharemnt/testpdf.pdf"
DPI = float(150)
multi_page_image = pyvips.Image.pdfload(file_path, n = -1, dpi=DPI)

total_pages = multi_page_image.get_n_pages()
print("total_pages",total_pages)

fields = multi_page_image.get_fields()
for field in fields:
    print(f"{field}: {multi_page_image.get(field)}")

individual_pages = multi_page_image.pagesplit()
print("\nlen(individual_pages) =", len(individual_pages))

output:

total_pages 925
width: 1275
height: 1622346
bands: 4
format: uchar
coding: none
interpretation: srgb
xoffset: 0
yoffset: 0
xres: 5.905511811023622
yres: 5.905511811023622
filename: /filesharemnt/testpdf.pdf
vips-loader: pdfload
page-height: 1650
pdf-n_pages: 925
n-pages: 925
pdf-producer: iOS Version 15.5 (Build 19F77) Quartz PDFContext; modified using iText® 5.4.1 ©2000-2012 1T3XT BVBA (AGPL-version)

len(individual_pages) = 1

Expected:

  • individual_pages should be a list of 925 individual pages.

Actual:

  • individual_pages has only 1 element, which is the same as multi_page_image but with a temp filename.

I noticed that this happens with PDFs that have the producer shown in the output above. The other PDFs I tested have a different producer, and pagesplit() works for them.

OS details:
I have only tested this in Docker containers (Debian 11 and Ubuntu).

lsb_release -a:

Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

uname -a:

Linux SandboxHost-638582921772039215 5.10.102.2-microsoft-standard #1 SMP Mon Mar 7 17:36:34 UTC 2022 x86_64 GNU/Linux

Python version 3.10.14
pyvips version: 2.2.3

Could you please help?

Member

jcupitt commented Aug 3, 2024

Hello @TarunChakitha,

It's because your image doesn't split neatly into pages. You have an image height of 1622346 and a page height of 1650, but 1622346 / 1650 is 983.24, not 925. I would guess that at least one of the pages in your document is a different size.
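
The arithmetic above can be checked directly. This is a minimal sketch using the numbers from the reported output; `splits_evenly` is a hypothetical helper written for illustration, not a pyvips function, and it only tests whether the total height equals the page count times the page height.

```python
def splits_evenly(height: int, page_height: int, n_pages: int) -> bool:
    """True when `height` is exactly `n_pages` stacked pages of `page_height`."""
    return page_height > 0 and height == page_height * n_pages


# Values taken from the fields printed in the issue.
height, page_height, n_pages = 1622346, 1650, 925

# Whole pages that fit, and the leftover rows that do not fill a page.
full_pages, leftover = divmod(height, page_height)

print(splits_evenly(height, page_height, n_pages))
print(full_pages, leftover)
```

If the check fails, the stacked image cannot be cut into equal strips, which matches the single-element result reported above.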

You will probably have to process this one page at a time, perhaps (untested):

doc = pyvips.Image.pdfload(file_path)
n_pages = doc.get("n-pages")

# page= selects which page to load; n= would set how many pages to load
pages = [pyvips.Image.pdfload(file_path, page=i, dpi=DPI)
         for i in range(n_pages)]

It's a little slower than loading once and then splitting, unfortunately.

@TarunChakitha
Author

Is there no workaround other than the looping method? The loop was my first approach, but for some reason the Azure Function hosting this code exited with code 137 after 6 or 7 iterations. That also happens with PDFs whose pages are all the same size, although those are non-digital (scanned-image) PDFs.

Member

jcupitt commented Aug 4, 2024

You could open a page at a time and try to find which pages differ in size.
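
One way to do that, as an untested sketch: open each page separately, record its dimensions, and report the pages that differ from the most common size. `page_sizes` and `odd_pages` are hypothetical helper names. The sketch assumes that reading `width` and `height` from a freshly opened page is cheap because no pixels have been rendered yet; that has not been measured here.

```python
from collections import Counter


def page_sizes(file_path, dpi=150.0):
    """Return a list of (width, height) tuples, one per PDF page."""
    import pyvips  # imported lazily so odd_pages() works without libvips

    n_pages = pyvips.Image.pdfload(file_path).get("n-pages")
    sizes = []
    for i in range(n_pages):
        # page=i opens only page i of the document
        page = pyvips.Image.pdfload(file_path, page=i, dpi=dpi)
        sizes.append((page.width, page.height))
    return sizes


def odd_pages(sizes):
    """Return indices of pages whose size differs from the most common size."""
    if not sizes:
        return []
    common_size, _count = Counter(sizes).most_common(1)[0]
    return [i for i, size in enumerate(sizes) if size != common_size]
```

Calling `odd_pages(page_sizes(file_path, dpi=DPI))` would then list the zero-based page numbers to inspect.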

You could also try opening pages in sequential mode, and using a loop rather than a list comprehension. And it depends what you plan to do with the pages once you've loaded them.
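
A rough sketch of that suggestion, untested and under assumptions: exit code 137 usually means the process was killed with SIGKILL, often because it ran out of memory, so the aim is to hold only one page at a time. `iter_pages` and `save_pages` are hypothetical names, and writing each page to a PNG file is only a stand-in for whatever processing is actually needed.

```python
import os


def iter_pages(file_path, dpi=150.0):
    """Yield (index, page) one page at a time instead of building a list."""
    import pyvips  # imported lazily; needs libvips built with PDF support

    n_pages = pyvips.Image.pdfload(file_path).get("n-pages")
    for i in range(n_pages):
        # access="sequential" asks libvips to stream the page top to bottom
        yield i, pyvips.Image.pdfload(
            file_path, page=i, dpi=dpi, access="sequential"
        )


def save_pages(file_path, out_dir, dpi=150.0):
    """Write every page to out_dir as a PNG, keeping one page alive at a time."""
    for i, page in iter_pages(file_path, dpi):
        # Assumed processing step: a single top-to-bottom write per page.
        page.write_to_file(os.path.join(out_dir, f"page-{i:04d}.png"))
```

Because `iter_pages` is a generator, each page object can be released before the next one is opened, unlike a list comprehension that keeps all of them referenced.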
