PDFPlumberParser error: 'list' object has no attribute "name" #26528

Open
5 tasks done
ZaraP-NSTARX opened this issue Sep 16, 2024 · 0 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.document_loaders import PDFPlumberLoader
from PIL import ImageFile
import pickle

# Configuring Libraries

ImageFile.LOAD_TRUNCATED_IMAGES = True

# Main Program Code

files = ["./docs/cayenne.pdf",
         "./docs/cullinan.pdf",
         "./docs/aventador.pdf",
         "./docs/performante.pdf"]

loaders = []
for file in files:
    loaders.append(PDFPlumberLoader(file, extract_images=True))

docs = []
for loader in loaders:
    docs.extend(loader.load())

with open("./docs_processed/pdfplumber_docs.txt", "wb") as file:
    pickle.dump(docs, file)

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/home/work/Local Work Files/langchain-doc-loaders/process_docs.py", line 21, in <module>
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_community/document_loaders/pdf.py", line 644, in load
    return parser.parse(blob)
           ^^^^^^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_core/document_loaders/base.py", line 126, in parse
    return list(self.lazy_parse(blob))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 397, in lazy_parse
    + self._extract_images_from_page(page),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.pyenv/versions/docloader-venv/lib/python3.12/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 428, in _extract_images_from_page
    if img["stream"]["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'name'
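
The traceback does not show which of the four PDFs triggers the error. A minimal sketch for narrowing that down, assuming it is acceptable to load each file separately and keep going after a failure. The names load_each and load_one are hypothetical helpers for illustration, not part of LangChain:

```python
from typing import Callable, Dict, List, Tuple


def load_each(
    paths: List[str],
    load_one: Callable[[str], list],
) -> Tuple[list, Dict[str, str]]:
    """Load every path separately; collect documents and per-file errors."""
    docs: list = []
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            docs.extend(load_one(path))
        except AttributeError as exc:
            # Record the failing file instead of aborting the whole batch.
            failures[path] = str(exc)
    return docs, failures
```

With the script above, load_one could be something like lambda p: PDFPlumberLoader(p, extract_images=True).load(), and the keys of failures would then name the PDFs that hit the AttributeError.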

Description

I'm using LangChain's PDFPlumber loader (PDFPlumberLoader with extract_images = True) to extract image information from a set of car brochure PDFs. Loading fails with an AttributeError raised on line 428 of langchain_community/document_loaders/parsers/pdf.py, inside _extract_images_from_page.

I was able to fix this issue by deleting ".name" from the conditional on line 428. I'm opening this issue so I can create a PR with the fix.
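
For context: in the PDF format, a stream's Filter entry may be either a single name or an array of names, which would explain why img["stream"]["Filter"] is a list for some images. Deleting ".name" avoids the exception, but a possible alternative is to normalize the value before the membership test. The sketch below is only an illustration of that idea, not the code in langchain_community. LOSSLESS_FILTERS is an illustrative subset rather than the real _PDF_FILTER_WITHOUT_LOSS constant, and NameLiteral, filter_names and is_lossless are hypothetical names:

```python
from typing import Any, List

# Illustrative subset only (assumption); the real _PDF_FILTER_WITHOUT_LOSS
# constant in langchain_community may contain different names.
LOSSLESS_FILTERS = {"FlateDecode", "LZWDecode", "RunLengthDecode"}


class NameLiteral:
    """Hypothetical stand-in for a PDF name object that exposes `.name`."""

    def __init__(self, name: str) -> None:
        self.name = name


def filter_names(filter_value: Any) -> List[str]:
    """Normalize a Filter entry (one name or a list of names) to strings."""
    if filter_value is None:
        return []
    if isinstance(filter_value, (list, tuple)):
        items = list(filter_value)
    else:
        items = [filter_value]
    # Use .name when present, otherwise fall back to the value itself.
    return [str(getattr(item, "name", item)) for item in items]


def is_lossless(filter_value: Any) -> bool:
    """True only if at least one filter is present and all are lossless."""
    names = filter_names(filter_value)
    return bool(names) and all(name in LOSSLESS_FILTERS for name in names)
```

Treating a filter chain as lossless only when every filter in it is lossless is a design choice made for this sketch; the actual fix may reasonably handle chained filters differently.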

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Thu, 12 Sep 2024 17:21:02 +0000
Python Version: 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805]

Package Information

langchain_core: 0.2.40
langchain: 0.3.0
langchain_community: 0.3.0
langsmith: 0.1.120
langchain_text_splitters: 0.3.0
langchain_unstructured: 0.1.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.5
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.1
pydantic-settings: 2.5.2
PyYAML: 6.0.2
requests: 2.32.3
SQLAlchemy: 2.0.34
tenacity: 8.5.0
typing-extensions: 4.12.2
unstructured-client: 0.24.1
unstructured[all-docs]: Installed. No version info available.

ZaraP-NSTARX added a commit to ZaraP-NSTARX/langchain that referenced this issue Sep 16, 2024
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Sep 16, 2024