#6 Using PdfReader causes a crash #2875

Closed
Avgor46 opened this issue Sep 27, 2024 · 1 comment · Fixed by #2880

Avgor46 commented Sep 27, 2024

Hi!

I've found a KeyError in PdfReader. The necessary information is provided below.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-56-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.0, crypt_provider=('cryptography', '3.1'), PIL=none

commit 762fc1f

Code + PDF

This is a minimal, complete example that shows the issue:

#! /usr/bin/env python3

import pypdf
from pypdf.errors import EmptyFileError, PdfReadError, PdfStreamError
import sys

def TestOneInput(fname):
    try:
        pdf_reader = pypdf.PdfReader(fname)
        for page_number, page in enumerate(pdf_reader.pages):
            page.extract_text()
    except (EmptyFileError, PdfReadError, PdfStreamError):
        pass

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(1)
    TestOneInput(sys.argv[1])

PoC

crash-6620e8b1abfe3da639b654595da859b87f985748.pdf

Traceback

This is the complete stderr I see:

incorrect startxref pointer(2)
parsing for Object Streams
Traceback (most recent call last):
  File "/fuzz/./poc.py", line 18, in <module>
    TestOneInput(sys.argv[1])
  File "/fuzz/./poc.py", line 10, in TestOneInput
    for page_number, page in enumerate(pdf_reader.pages):
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 2468, in __iter__
    for i in range(len(self)):
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 2393, in __len__
    return self.length_function()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 353, in get_num_pages
    self._flatten(self._readonly)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1151, in _flatten
    pages = catalog["/Pages"].get_object()  # type: ignore
  File "/usr/local/lib/python3.9/dist-packages/pypdf/generic/_data_structures.py", line 471, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Pages'

pubpub-zz (Collaborator) commented

Your PDF's trailer references object (1) as the root, whereas the real root object is (2). I've found a way to repair it.
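
To illustrate the mismatch described above, here is a minimal sketch of one way a reader could recover when the trailer points at the wrong root. It is not pypdf's code and not necessarily how the fix works: plain dicts stand in for parsed PDF objects, `find_catalog` is a hypothetical helper, and the sample objects are made up rather than taken from the crash file.

```python
from typing import Dict, Optional

# Assumption: plain dicts keyed by object number stand in for parsed PDF objects.
PdfObjects = Dict[int, dict]


def find_catalog(objects: PdfObjects, trailer_root: int) -> Optional[int]:
    """Return the object number of a usable catalog, or None if there is none.

    Hypothetical helper for illustration only; not part of pypdf.
    """

    def is_catalog(obj: Optional[dict]) -> bool:
        # A usable catalog declares /Type /Catalog and carries a /Pages entry.
        return (
            isinstance(obj, dict)
            and obj.get("/Type") == "/Catalog"
            and "/Pages" in obj
        )

    # Trust the trailer when it really points at a catalog.
    if is_catalog(objects.get(trailer_root)):
        return trailer_root

    # Otherwise scan all objects, lowest object number first.
    for number in sorted(objects):
        if is_catalog(objects[number]):
            return number
    return None


# Made-up objects: the trailer names object 1, but the catalog is object 2.
objects = {
    1: {"/Type": "/Font"},
    2: {"/Type": "/Catalog", "/Pages": 3},
    3: {"/Type": "/Pages", "/Kids": [], "/Count": 0},
}
print(find_catalog(objects, trailer_root=1))
```

Looking up "/Pages" directly on object 1 in this model would raise the same kind of KeyError as in the traceback, while the fallback scan lands on object 2.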
