
Name objects after inline images are considered binary data #3172

@yaseerapure

Description


I'm trying to read a PDF with pypdf, but it raises the error below even though my PDF file is not corrupted. When I downgrade pypdf from 5.3.0 to 5.1.0, the error goes away.
PdfReadError: Unexpected end of stream

Environment

Ubuntu 20.0

Code + PDF

This is a minimal, complete example that shows the issue:

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Directory containing the PDF files to load
DATA_PATH = 'data/'

def load_pdf_files(data):
    # Load every *.pdf in the directory; PyPDFLoader uses pypdf to read each file
    loader = DirectoryLoader(data, glob='*.pdf', loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

documents = load_pdf_files(data=DATA_PATH)
print("length of documents", len(documents))

This is the PDF file I'm using:
https://www.academia.edu/32752835/The_GALE_ENCYCLOPEDIA_of_MEDICINE_SECOND_EDITION
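To narrow this down without LangChain in the way, it may help to call pypdf directly and record which pages fail during text extraction. The following is only a minimal sketch: `failing_pages` and `check_pdf` are hypothetical helper names, and it assumes the error is raised from `extract_text()` on individual pages rather than when the file is opened.

```python
def failing_pages(pages):
    """Return (page_index, error_type, message) for every page whose
    text extraction raises an exception."""
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:
            # Kept broad on purpose so that PdfReadError and any other
            # failure are both recorded with their type name.
            failures.append((index, type(exc).__name__, str(exc)))
    return failures


def check_pdf(path):
    """Open a PDF with pypdf and report the pages that fail to extract."""
    # Imported here so failing_pages() can be tried without pypdf installed.
    import pypdf

    print("pypdf version:", pypdf.__version__)
    reader = pypdf.PdfReader(path)
    return failing_pages(reader.pages)
```

Calling `check_pdf()` with the path to the downloaded file, once under 5.3.0 and once under 5.1.0, should show whether the same pages fail in both versions and which page indices to attach to the report.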

Metadata

Labels

is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
