RuntimeError: source object number out of range #856

Closed
jayjb opened this issue Jan 22, 2021 · 6 comments

@jayjb

jayjb commented Jan 22, 2021

Describe the bug (mandatory)

When trying to combine two PDFs, I receive the following exception: RuntimeError: source object number out of range

doc.insertPDF(existing_doc)
  File "/dev-test-scripts/pdfenv/lib/python2.7/site-packages/fitz/fitz.py", line 4093, in insertPDF
    val = _fitz.Document_insertPDF(self, docsrc, from_page, to_page, start_at, rotate, links, annots, show_progress, final, _gmap)
RuntimeError: source object number out of range

To Reproduce (mandatory)

I simply load two PDFs and try to combine them. It works for 99% of PDFs, but for a few I hit this issue, and I am trying to figure out whether there is a way to detect that there will be a problem before trying to merge.

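import fitz  # PyMuPDF

# Read both PDFs into memory, closing the file handles afterwards
with open(other_pdf_name, "rb") as f:
    o_doc = f.read()
with open(existing_pdf_name, "rb") as f:
    u_doc = f.read()

other_doc = fitz.Document('pdf', o_doc)
existing_doc = fitz.Document('pdf', u_doc)

other_doc.insertPDF(existing_doc)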

I also tried loading the PDFs directly using fitz.Document(<filename>), but got the same result. I could even send you the PDF.

Expected behavior (optional)

I expect the method to be successful.

Your configuration (mandatory)

  • macOS Catalina 10.15.7
  • Python 2.7 (I'll try it on Python 3 to confirm)
  • Wheel from the releases page, version 1.18.3

Let me know if you want me to send the PDF that is breaking. Perhaps there is a method I could use to determine that the PDF is incompatible.

@JorjMcKie
Collaborator

JorjMcKie commented Jan 22, 2021

Is that PDF large?
Before we exchange files, could you try and clean the suspicious PDF, e.g. via mutool clean -ggggz file.pdf cleaned.pdf? Or equivalent save("cleaned.pdf", garbage=4, deflate=True)?
I expect the PDF to have issues. The message indicates that it references objects via nnn 0 R where nnn is larger than the highest number in the xref table.

Should this be confirmed, there are obvious ways to "immunize" your script ...
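For readers who want a rough pre-check for the condition described above, the following is a minimal sketch, not anything PyMuPDF provides. It assumes the trailer's /Size entry and the nnn 0 R references are visible as plain text in the file, which is not the case when they sit inside compressed object streams, and arbitrary binary stream data may produce false positives. The function name out_of_range_refs is hypothetical.

```python
import re

# Indirect references such as b"12 0 R" (object number, generation, keyword R)
_REF = re.compile(rb"(?<![\w.])(\d+)\s+(\d+)\s+R(?!\w)")
# The /Size entry of a trailer or xref-stream dictionary, e.g. b"/Size 7"
_SIZE = re.compile(rb"/Size\s+(\d+)")


def out_of_range_refs(pdf_bytes):
    """Return sorted object numbers that are referenced but are >= /Size.

    /Size is one greater than the highest object number, so valid
    numbers are 0 .. /Size - 1. This is a text-level heuristic only.
    """
    sizes = [int(m.group(1)) for m in _SIZE.finditer(pdf_bytes)]
    if not sizes:
        return []  # no trailer found, so nothing can be judged
    xref_len = max(sizes)  # incremental updates may add several trailers
    refs = {int(m.group(1)) for m in _REF.finditer(pdf_bytes)}
    return sorted(n for n in refs if n >= xref_len)
```

An empty result does not prove that the file is healthy; it only means this simple scan found nothing suspicious.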

@jayjb
Author

jayjb commented Jan 22, 2021

Hi @JorjMcKie,

Spot on! That worked like a charm. I had tried mutool clean before, but without cranking up the intensity, which is what seems to have done the trick. To replicate that in Python code, I went the following route, because I didn't see a .clean method:

from StringIO import StringIO  # Python 2 module

prechecked_existing_doc = fitz.Document('pdf', u_doc)

# Repair any issues (hopefully) before we hit them
output = StringIO()
output.write(prechecked_existing_doc.write(clean=True, garbage=4))
new_contents = output.getvalue()
output.close()
existing_doc = fitz.Document('pdf', new_contents)

This keeps everything in memory, so nothing has to be written out to a file and read back.
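Since doc.write() already returns the PDF as a bytes object, the StringIO round trip is probably not needed. The following is a minimal sketch of the same in-memory repair under a few assumptions: the helper name clean_pdf_bytes is hypothetical, the options follow the garbage=4, deflate=True suggestion from earlier in the thread, and the getattr fallback is there because newer PyMuPDF releases name the method tobytes() while 1.18.x uses write().

```python
def clean_pdf_bytes(pdf_bytes):
    """Re-save a PDF in memory with garbage collection and return the new bytes."""
    import fitz  # PyMuPDF, imported lazily so this module loads without it

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        # Assumption: use tobytes() where it exists, else the older write()
        to_bytes = getattr(doc, "tobytes", None) or doc.write
        return to_bytes(garbage=4, deflate=True)
    finally:
        doc.close()
```

The result could then be reopened with fitz.open(stream=clean_pdf_bytes(u_doc), filetype="pdf") before merging.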

Thanks for the quick response! It's been awesome working with this library.

@JorjMcKie
Collaborator

Good to hear that.

> I didn't see a .clean method.

The cleaning is implicitly done if you use garbage option 3 or higher in doc.save or doc.write. This performs a scan through all xrefs to see (1) whether any are sitting around without ever being used by an nnn 0 R reference, and (2) whether there are any duplicates (identical except for the xref number itself). This usually also leads to an overall renumbering of all objects.
Option 4, in addition to the above, also checks for identical (binary) stream contents (object types like images, fonts, ...). It can therefore take significantly longer (large objects, sometimes recompression is necessary before comparisons make sense, etc.), but correspondingly has potential for large file size savings.
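The three steps described above (drop unreferenced objects, merge duplicates, renumber) can be illustrated with a toy model. This is only a sketch of the idea and not MuPDF's implementation: objects are modelled as a hypothetical mapping from xref number to a (payload, referenced xrefs) pair, and a plain payload comparison stands in for both the object comparison of option 3 and the stream comparison of option 4.

```python
def collect_garbage(objects, root):
    """objects maps xref -> (payload, [referenced xrefs]).

    Returns a compacted copy with survivors renumbered 1..n.
    """
    # (1) keep only objects reachable from the root
    reachable, stack = set(), [root]
    while stack:
        x = stack.pop()
        if x in reachable or x not in objects:
            continue
        reachable.add(x)
        stack.extend(objects[x][1])

    # (2) map every object to the first one with identical content
    canonical, seen = {}, {}
    for x in sorted(reachable):
        payload, refs = objects[x]
        canonical[x] = seen.setdefault((payload, tuple(refs)), x)

    # (3) renumber the survivors and rewrite their references;
    #     references to objects that do not exist are dropped here
    survivors = sorted(set(canonical.values()))
    new_num = {old: i for i, old in enumerate(survivors, start=1)}
    return {
        new_num[x]: (
            objects[x][0],
            [new_num[canonical[r]] for r in objects[x][1] if r in canonical],
        )
        for x in survivors
    }
```

In this model an orphaned object disappears, two identical image objects collapse into one, and the remaining numbers are contiguous again.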

@JorjMcKie
Collaborator

Maybe it is also worthwhile to explain why insert_pdf must stop working if the above error situation occurs.
For each page copied from the source:

  • all references to objects in the source PDF are collected (a recursive process)
  • while copying over that page, each identified page-referenced object is put into an internal array (its xref only) to prevent multiple copies of the same object.

So if a non-existing source xref were accepted here, a damaged source page would become part of the target PDF ...
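The copy process described above can be sketched as a toy model. It is not the library's actual code: the source PDF is modelled as a hypothetical mapping from xref number to the list of xrefs that object references, already_copied plays the role of the internal array, and the error message is borrowed from the traceback at the top of this issue purely for illustration.

```python
def copy_page_objects(source, page_xref, xref_len, already_copied):
    """Collect the xrefs that must be copied for one page.

    source maps xref -> list of referenced xrefs; xref_len is the length
    of the source xref table; already_copied is a set shared across pages
    so that no object is copied twice.
    """
    copied_now = []
    stack = [page_xref]
    while stack:
        x = stack.pop()
        if not 0 < x < xref_len:
            # A reference beyond the xref table: stop rather than
            # produce a damaged page in the target
            raise RuntimeError("source object number out of range")
        if x in already_copied:
            continue  # shared object, copied for an earlier page
        already_copied.add(x)
        copied_now.append(x)
        stack.extend(source.get(x, []))
    return copied_now
```

Calling this for a page whose object graph contains a reference at or beyond xref_len raises instead of returning a partial list, which mirrors the reason given above for aborting the insert.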

@JorjMcKie
Collaborator

close issue?

@jayjb
Author

jayjb commented Jan 22, 2021

Thanks for the explanation @JorjMcKie! It's very helpful.

@jayjb jayjb closed this as completed Jan 22, 2021
2 participants