Example Code
from langchain_community.document_loaders import PDFPlumberLoader
from PIL import ImageFile
import pickle

# Configuring Libraries
ImageFile.LOAD_TRUNCATED_IMAGES = True

# Main Program Code
files = ["./docs/cayenne.pdf",
         "./docs/cullinan.pdf",
         "./docs/aventador.pdf",
         "./docs/performante.pdf"]

loaders = []
for file in files:
    loaders.append(PDFPlumberLoader(file, extract_images=True))

docs = []
for loader in loaders:
    docs.extend(loader.load())

with open("./docs_processed/pdfplumber_docs.txt", "wb") as file:
    pickle.dump(docs, file)
Error Message and Stack Trace (if applicable)
Traceback (most recent call last):
  File "/home/work/Local Work Files/langchain-doc-loaders/process_docs.py", line 21, in <module>
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_community/document_loaders/pdf.py", line 644, in load
    return parser.parse(blob)
           ^^^^^^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_core/document_loaders/base.py", line 126, in parse
    return list(self.lazy_parse(blob))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 397, in lazy_parse
    + self._extract_images_from_page(page),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 428, in _extract_images_from_page
    if img["stream"]["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'name'
Description
I'm using LangChain's PDFPlumberLoader with extract_images=True to extract image information from a set of car brochure PDFs. Loading fails with the AttributeError shown in the traceback, raised on line 428 of langchain_community/document_loaders/parsers/pdf.py: for at least one image, img["stream"]["Filter"] is a list rather than a single object with a .name attribute.
I was able to work around this locally by deleting ".name" from the conditional on line 428. I'm opening this issue so I can create a PR with the fix.
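One possible direction for the PR, sketched below, is to normalize the Filter entry before comparing it, since a PDF stream may declare either a single filter name or an array of them. This is only a rough sketch and not the library's code: `FakeName`, `filter_names`, `uses_lossless_filter`, and the contents of `LOSSLESS_FILTERS` are all hypothetical stand-ins, and treating an image as lossless only when every filter in the chain is lossless is my own assumption.

```python
from dataclasses import dataclass

# Illustrative stand-in for _PDF_FILTER_WITHOUT_LOSS; the real list may differ.
LOSSLESS_FILTERS = {"FlateDecode", "LZWDecode", "RunLengthDecode"}


@dataclass
class FakeName:
    """Hypothetical stand-in for a PDF name object that exposes `.name`."""
    name: str


def filter_names(filter_obj):
    """Return the filter names as a list, given one name object or a list of them."""
    if filter_obj is None:
        return []
    items = filter_obj if isinstance(filter_obj, (list, tuple)) else [filter_obj]
    # Fall back to the item itself when it has no `.name` (e.g. a plain string).
    return [getattr(item, "name", item) for item in items]


def uses_lossless_filter(filter_obj):
    """Assumption: lossless only if the chain is non-empty and every filter is lossless."""
    names = filter_names(filter_obj)
    return bool(names) and all(name in LOSSLESS_FILTERS for name in names)
```

With a helper like this, the conditional on line 428 could test something like `uses_lossless_filter(img["stream"]["Filter"])` instead of reading `.name` directly, so both the single-name and the list case take a defined branch.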
System Info
System Information
Package Information
Optional packages not installed
Other Dependencies