Closed
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-images: From a users perspective, image handling is the affected feature/workflow
Description
I am trying to extract images from PDF files, but for certain PDFs PIL raises a 'not enough image data' exception. The affected files render correctly in Atril Document Viewer, and image extraction works with pdfimages from poppler-utils.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.5.0-kali3-amd64-x86_64-with-glibc2.37
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.2, crypt_provider=('cryptography', '38.0.4'), PIL=10.0.0
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
import sys

for filename in sys.argv[1:]:
    reader = PdfReader(filename)
    for i, page in enumerate(reader.pages):
        for j, image in enumerate(page.images):
            print("Writing %d-%d: %s (%d)..." % (i, j, image.name, len(image.data)))
            with open(image.name, "wb") as fp:
                fp.write(image.data)
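As a possible stopgap while the bug is open, the loop could skip images that fail to decode instead of aborting the whole run. The sketch below is not part of pypdf; `iter_readable` is a hypothetical helper, and it assumes that `page.images` supports `len()` and integer indexing, which the `yield self[i]` frame in the traceback suggests but does not guarantee.

```python
from typing import Any, Iterator, Tuple


def iter_readable(items: Any) -> Iterator[Tuple[int, Any]]:
    """Yield (index, item) pairs, skipping items whose lookup raises ValueError.

    Hypothetical helper: assumes `items` supports len() and integer indexing.
    """
    for index in range(len(items)):
        try:
            # The item is built on lookup, so a decoding error surfaces here.
            item = items[index]
        except ValueError as exc:
            print("Skipping item %d: %s" % (index, exc))
            continue
        yield index, item
```

Under those assumptions, the inner loop would become `for j, image in iter_readable(page.images):`. This only hides the failure for the affected images; it does not recover their data.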
Traceback
This is the complete traceback I see:
Traceback (most recent call last):
  File "/home/user/pypdf/pypdf_test.py", line 7, in <module>
    for j, image in enumerate(page.images):
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_page.py", line 2727, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_page.py", line 2723, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_page.py", line 557, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/filters.py", line 785, in _xobj_to_image
    img, image_format, extension, _ = _handle_flate(
                                      ^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_xobj_image_helpers.py", line 172, in _handle_flate
    img = Image.frombytes(mode, size, data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2952, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 805, in frombytes
    raise ValueError(msg)
ValueError: not enough image data
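The last frames show `Image.frombytes(mode, size, data)` rejecting the buffer, which points to `data` being shorter than `mode` and `size` imply. The sketch below shows how that mismatch could be checked before calling PIL. `expected_raw_size` and `is_truncated` are hypothetical helpers, not pypdf or PIL functions, and the byte counts assume uncompressed raw data with 8 bits per component (1 bit per pixel with byte-padded rows for mode "1").

```python
from typing import Tuple

# Assumed bytes per pixel for raw 8-bit data in common PIL modes.
_BYTES_PER_PIXEL = {"L": 1, "P": 1, "RGB": 3, "RGBA": 4, "CMYK": 4}


def expected_raw_size(mode: str, width: int, height: int) -> int:
    """Return the number of raw bytes assumed for an image of this mode and size."""
    if mode == "1":
        # 1 bit per pixel, each row padded to a whole byte.
        return ((width + 7) // 8) * height
    return _BYTES_PER_PIXEL[mode] * width * height


def is_truncated(mode: str, size: Tuple[int, int], data: bytes) -> bool:
    """Return True if `data` is shorter than the mode and size require."""
    width, height = size
    return len(data) < expected_raw_size(mode, width, height)
```

Logging `mode`, `size`, `len(data)`, and the expected size for the failing image would show whether the stream is genuinely short or whether the mode or size was derived incorrectly.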