Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROB: reading PDF with multiple forms is really slow #2643

Closed
farjasju opened this issue May 14, 2024 · 5 comments · Fixed by #2656
Closed

ROB: reading PDF with multiple forms is really slow #2643

farjasju opened this issue May 14, 2024 · 5 comments · Fixed by #2656
Labels
workflow-forms From a users perspective, forms is the affected feature/workflow

Comments

@farjasju
Copy link
Contributor

When reading a specific PDF file with a lot of forms, the get_fields() seems to loop for a very long time (one hour and a half on my setup).

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-173-generic-x86_64-with-glibc2.29

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('pycryptodome', '3.20.0'), PIL=10.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
with open('inheritance.pdf', 'rb') as f:
    doc = PdfReader(f)
    doc.get_fields()

The PDF file (coming from https://www.nj.gov/treasury/taxation/pdf/other_forms/inheritance/itrbk.pdf):
inheritance.pdf

Traceback

The get_fields() function seems to make a lot of recursive _check_kids() calls

Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1115, 0, 133206044573808), IndirectObject(1865, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1107, 0, 133206044573808), IndirectObject(1866, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1107, 0, 133206044573808), IndirectObject(1866, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1071, 0, 133206044573808), IndirectObject(1864, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1115, 0, 133206044573808), IndirectObject(1865, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1107, 0, 133206044573808), IndirectObject(1866, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1107, 0, 133206044573808), IndirectObject(1866, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1115, 0, 133206044573808), IndirectObject(1865, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1107, 0, 133206044573808), IndirectObject(1866, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(942, 0, 133206044573808), IndirectObject(922, 0, 133206044573808)]
Recursing down 5 kids: [IndirectObject(1120, 0, 133206044573808), IndirectObject(1867, 0, 133206044573808)]
@stefan6419846 stefan6419846 added the workflow-forms From a users perspective, forms is the affected feature/workflow label May 14, 2024
@stefan6419846
Copy link
Collaborator

Thanks for the report. That surely does not look right.

@farjasju
Copy link
Contributor Author

If ever useful: another (smaller) file where the same problem happens: https://www.courts.ca.gov/documents/fl410.pdf

@pubpub-zz
Copy link
Collaborator

@farjasju
You should have a look at the document with PdfBox with debug function to look at the pdf structure. There is a lot of fields that exists in the document (fields can be identified in the field /T. You may see a lot of fields named 0, 1 : they are different from each others : the qualified name are differents.
There might be some optimisation. also can you check the document does not includes some (infinite)loops (where some protection may be required)

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented May 15, 2024

I confirm that the file is "corrupted" with some objects being a grand child of itself:

67820872-daee-48e8-aa91-0c83441be86b

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 19, 2024
@pubpub-zz pubpub-zz changed the title Performance bug: reading PDF with multiple forms is really slow ROB: reading PDF with multiple forms is really slow May 19, 2024
@pubpub-zz
Copy link
Collaborator

@MartinThoma The issue could be an infinite loop, may be a candidate to security.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
workflow-forms From a users perspective, forms is the affected feature/workflow
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants