KeyError: '/Root' due to invalid start of xref #1756

Closed
owurman opened this issue Mar 29, 2023 · 8 comments · Fixed by #1784
Labels
is-robustness-issue From a users perspective, this is about robustness key-error Could be a bug, but also a robustness issue

Comments

@owurman
Contributor

owurman commented Mar 29, 2023

I was trying to get the pages for the attached PDF but received a KeyError: '/Root'. The file appears to be encrypted to me, but pdf.is_encrypted is False.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-10.16-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.7.0

Code + PDF

import pypdf

reader = pypdf.PdfReader("641-Attachment-B-Pediatric-Cardiac-Arrest-8-1-2019.pdf")
assert not reader.is_encrypted  # passes: the file is not flagged as encrypted
len(reader.pages)  # raises KeyError: '/Root' (see traceback below)

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
641-Attachment-B-Pediatric-Cardiac-Arrest-8-1-2019.pdf

It's a public document so it should be fine to add to your tests.

Traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypdf/_page.py", line 2155, in __len__
    return self.length_function()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pypdf/_reader.py", line 452, in _get_num_pages
    self._flatten()
  File "pypdf/_reader.py", line 1186, in _flatten
    catalog = self.trailer[TK.ROOT].get_object()
              ~~~~~~~~~~~~^^^^^^^^^
  File "pypdf/generic/_data_structures.py", line 291, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/Root'
@MartinThoma added the is-bug and is-robustness-issue labels, then removed the is-bug label, Mar 30, 2023
@MartinThoma
Member

A PDF file should look like this:

[image]

with:

"A trailer giving the location of the cross-reference table and of certain special objects within the body of the file"

The trailer of that file is empty, thus the error.
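
To see this kind of damage without going through pypdf, one option is to check where the file's last `startxref` entry points. The following is a minimal stdlib-only sketch under stated assumptions, not pypdf's own logic: `check_startxref` is a hypothetical helper name, and it only tests whether the recorded offset lands on an `xref` keyword or on an object header (as it would for a cross-reference stream).

```python
import re


def check_startxref(data: bytes) -> dict:
    """Report the offset recorded after the last `startxref` keyword and
    whether a cross-reference section appears to start at that offset."""
    # A reader normally uses the last `startxref` in the file.
    pos = data.rfind(b"startxref")
    if pos == -1:
        return {"startxref": None, "points_at_xref": False}

    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        return {"startxref": None, "points_at_xref": False}

    offset = int(match.group(1))
    target = data[offset:offset + 32]
    # A classic table starts with "xref"; a cross-reference stream
    # starts with an object header such as "12 0 obj".
    points_at_xref = (
        target.startswith(b"xref")
        or re.match(rb"\d+\s+\d+\s+obj", target) is not None
    )
    return {"startxref": offset, "points_at_xref": points_at_xref}
```

Calling it as `check_startxref(open(path, "rb").read())` and getting `points_at_xref` as `False` would suggest the reader cannot locate the trailer from the recorded offset.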

@MartinThoma
Member

This command fixes it:

gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress 641-Attachment-B-Pediatric-Cardiac-Arrest-8-1-2019.pdf

See https://superuser.com/q/278562/64857
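
If the repair needs to happen from Python before the file is handed to pypdf, the same Ghostscript invocation could be wrapped with `subprocess`. This is a sketch that assumes a `gs` executable is on `PATH`; the function names `build_gs_repair_command` and `repair_with_ghostscript` are made up for illustration, and the flags are exactly those from the command above.

```python
import subprocess
from pathlib import Path
from typing import List


def build_gs_repair_command(src: Path, dst: Path) -> List[str]:
    """Build the Ghostscript command that rewrites `src` into `dst`."""
    return [
        "gs",
        "-o", str(dst),
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        str(src),
    ]


def repair_with_ghostscript(src: Path, dst: Path) -> Path:
    """Run Ghostscript and return the path of the rewritten PDF.

    Raises subprocess.CalledProcessError if gs exits with an error,
    and FileNotFoundError if gs is not installed.
    """
    subprocess.run(build_gs_repair_command(src, dst), check=True)
    return dst
```

The rewritten file at `dst` can then be opened with `pypdf.PdfReader` as usual.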

@owurman
Contributor Author

owurman commented Mar 30, 2023

Thanks. Is it a reasonable ask that a better message be given if the trailer is missing? I'm guessing that actually repairing the PDF as ghostscript does is beyond the scope of what you want pypdf to do...
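
Until the library reports this more clearly, calling code can translate the error itself. A minimal sketch, assuming the failure surfaces as `KeyError: '/Root'` exactly as in the traceback above; `BrokenPdfError` and `page_count` are hypothetical names, not part of pypdf.

```python
class BrokenPdfError(Exception):
    """Hypothetical error type for PDFs whose trailer lacks /Root."""


def page_count(reader) -> int:
    """Return len(reader.pages), turning the bare KeyError into a clearer error.

    `reader` is expected to behave like pypdf.PdfReader (it needs a
    `pages` attribute that supports len()).
    """
    try:
        return len(reader.pages)
    except KeyError as exc:
        if exc.args and exc.args[0] == "/Root":
            raise BrokenPdfError(
                "PDF trailer has no /Root entry; the file is probably "
                "damaged (missing or empty trailer) rather than encrypted"
            ) from exc
        # Any other missing key is not ours to interpret.
        raise
```

With pypdf installed, this would be used as `page_count(pypdf.PdfReader(path))`.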

@MartinThoma
Member

The problem is not this specific case. Sure, we can (and regularly do) add robustness improvements. It's just a never-ending story: there is an infinite number of ways the standard can be broken.

@MartinThoma
Member

I was hoping that we could use techniques similar to those that web browsers / Beautiful Soup use for HTML to deal with that problem. I just haven't had the time to look into it so far.
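
As a rough illustration of what such a lenient approach could look like for PDF (this is not a description of how pypdf, browsers, or Beautiful Soup work), a reader could ignore a broken cross-reference table and rebuild object offsets by scanning the raw bytes for object headers. `rebuild_offsets` is a hypothetical helper; it assumes object headers of the form `N G obj` and does not handle objects stored inside object streams.

```python
import re
from typing import Dict, Tuple

# Matches headers like "12 0 obj"; the lookbehind avoids starting
# in the middle of a longer number.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def rebuild_offsets(data: bytes) -> Dict[Tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of its header.

    If an object is defined more than once, the last definition wins,
    which matches how later updates in a file override earlier ones.
    """
    offsets: Dict[Tuple[int, int], int] = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        offsets[key] = match.start()
    return offsets
```

A scan like this can produce false positives (for example, header-like bytes inside a binary stream), so it is only a fallback for when the recorded cross-reference data is unusable.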

@pubpub-zz
Collaborator

@owurman
I've prepared a PR to improve robustness if you want to try it.

@MartinThoma
Member

The robustness improvement was just added to main and will be released this weekend with pypdf>3.7.1.

@MartinThoma changed the title from "KeyError: '/Root', possibly a red herring for an encryption detection bug" to "KeyError: '/Root' due to invalid start of xref" on Apr 14, 2023
@MartinThoma
Member

@owurman If you want I can add you as a contributor to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@MartinThoma MartinThoma added the key-error Could be a bug, but also a robustness issue label Aug 19, 2023