community: Huridocs pdf loader more functionality added #25543

Open

wants to merge 22 commits into master

Changes from 15 commits

Commits (22)
eb88fec  openai proxy added to base embeddings (Jul 23, 2024)
500cd93  Update pyproject.toml (PabloKarpacho, Jul 23, 2024)
eca0b5b  Revert "Update pyproject.toml" (Jul 23, 2024)
e82736d  Merge branch 'master' into PabloKarpacho/master (baskaryan, Jul 28, 2024)
d846558  fmt (baskaryan, Jul 28, 2024)
29daebb  fix (baskaryan, Jul 28, 2024)
b4c67ee  Merge branch 'langchain-ai:master' into master (PabloKarpacho, Aug 13, 2024)
36ed59b  Huridocs integration added to langchain community (Aug 13, 2024)
2bf3080  Update pdf.py (PabloKarpacho, Aug 15, 2024)
5888005  Add methods for Huridocs pdf loader (ali6parmak, Aug 19, 2024)
49f538d  Merge remote-tracking branch 'origin' into document-layout-analysis (ali6parmak, Aug 20, 2024)
e911653  Format HuridocsPDFLoader (ali6parmak, Aug 20, 2024)
aac0a4d  Merge branch 'master' into document-layout-analysis (ali6parmak, Aug 22, 2024)
4e0eb32  Merge branch 'master' into document-layout-analysis (ali6parmak, Aug 22, 2024)
49d3e89  Merge branch 'master' into document-layout-analysis (ali6parmak, Aug 22, 2024)
44f73b0  Reformat HuridocsPDFLoader (ali6parmak, Aug 22, 2024)
b8bb3e5  Merge branch 'master' into document-layout-analysis (ali6parmak, Aug 22, 2024)
5c785f6  Merge branch 'master' into document-layout-analysis (ali6parmak, Aug 23, 2024)
79fe9bd  Merge branch 'master' into document-layout-analysis (ali6parmak, Aug 25, 2024)
a07730e  Merge branch 'master' into document-layout-analysis (ali6parmak, Aug 27, 2024)
ea0e944  Add documentation for HuridocsPDFLoader (ali6parmak, Aug 29, 2024)
252b9b0  Merge branch 'master' into document-layout-analysis (ali6parmak, Sep 16, 2024)
139 changes: 139 additions & 0 deletions libs/community/langchain_community/document_loaders/pdf.py
@@ -945,5 +945,144 @@
yield from self.parser.parse(blob)


class HuridocsPDFLoader(BasePDFLoader):
"""Load a PDF with Huridocs"""

def __init__(
self,
file_path: str,
server_url: str,
fast: Optional[bool] = False,
) -> None:
"""
Initialize the object for PDF file processing with
Huridocs pdf-document-layout-analysis.

This constructor initializes a HuridocsPDFLoader object to be used
for parsing files using the pdf-document-layout-analysis API.
Loader uses VGT layout model.
Parameters:
-----------
file_path : str
The path to the file that needs to be parsed.
server_url: str
The path to pdf-document-layout-analysis self-hosted API server.

Types of the Segments:
---------
1: "Caption"
2: "Footnote"
3: "Formula"
4: "List item"
5: "Page footer"
6: "Page header"
7: "Picture"
8: "Section header"
9: "Table"
10: "Text"
11: "Title"


Examples:
---------
>>> pdf_loader = HuridocsPDFLoader(
... file_path="path/to/file",
... server_url="path/to/sef-hosted/api"
... )

pdf_analysis = pdf_loader.analyze_pdf()
table_of_contents = pdf_loader.get_table_of_contents()
pdf_loader.get_visualization(/path/to/output/pdf)
pdf_content = pdf_loader.get_text()
"""
self.server_url = server_url
self.fast = fast

try:
response = requests.get(self.server_url)
response.raise_for_status()
except requests.exceptions.HTTPError as err:
raise err

super().__init__(file_path)

    def analyze_pdf(self) -> List[dict]:
with open(self.file_path, "rb") as f:
files = {"file": f}
try:
data = {"fast": self.fast}
response = requests.post(f"{self.server_url}/", files=files, data=data)
response.raise_for_status()
except requests.exceptions.HTTPError as err:
raise err

response_data = response.json()

return response_data

def get_table_of_contents(self) -> str:
with open(self.file_path, "rb") as f:
files = {"file": f}
try:
data = {"fast": self.fast}
response = requests.post(f"{self.server_url}/toc", files=files, data=data)
response.raise_for_status()
except requests.exceptions.HTTPError as err:
raise err

response_data = response.json()

return response_data

    def get_visualization(self, output_destination_path: str) -> None:
with open(self.file_path, "rb") as f:
files = {"file": f}
try:
data = {"fast": self.fast}
response = requests.post(f"{self.server_url}/visualize", files=files, data=data)
response.raise_for_status()
with open(output_destination_path, "wb") as file:
for chunk in response.iter_content(chunk_size=8192):
file.write(chunk)

except requests.exceptions.HTTPError as err:
raise err

def get_text(self, types: str = "all") -> str:
with open(self.file_path, "rb") as f:
files = {"file": f}
try:
data = {"fast": self.fast, "types": types}
response = requests.post(f"{self.server_url}/text", files=files, data=data)
response.raise_for_status()
except requests.exceptions.HTTPError as err:
raise err

response_data = response.json()

return response_data

def load(self) -> List[Document]:
"""Load data into Document objects."""
return list(self.lazy_load())

def lazy_load(
self,
) -> Iterator[Document]:
"""Lazy load given path as pages."""
elements = self.analyze_pdf()

for el in elements:
yield Document(
page_content=el["text"],
metadata={
"coordinates": (el["left"], el["top"], el["width"], el["height"]),
"page_number": el["page_number"],
"page_width": el["page_width"],
"page_height": el["page_height"],
"type": el["type"],
},
)

# Legacy: only for backwards compatibility. Use PyPDFLoader instead
PagedPDFSplitter = PyPDFLoader
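
For reference, below is a minimal usage sketch of the loader added by this PR. It assumes the branch is installed and a self-hosted pdf-document-layout-analysis server is already running; the file path and server URL are placeholders, not values taken from the PR.

# Minimal usage sketch (placeholder paths and URL; assumes the server is reachable).
from langchain_community.document_loaders.pdf import HuridocsPDFLoader

loader = HuridocsPDFLoader(
    file_path="example.pdf",             # placeholder PDF path
    server_url="http://localhost:5060",  # placeholder self-hosted server URL
    fast=False,                          # use the full (non-fast) analysis mode
)

# lazy_load() yields one Document per detected segment, carrying layout metadata.
for doc in loader.lazy_load():
    print(doc.metadata["type"], doc.metadata["page_number"], doc.page_content[:60])

# The extra helper methods can also be called directly:
segments = loader.analyze_pdf()            # raw segment dictionaries from the server
toc = loader.get_table_of_contents()       # table-of-contents data
text = loader.get_text()                   # extracted text (types="all" by default)
loader.get_visualization("annotated.pdf")  # saves the server's visualization PDF locally

Calling load() instead of lazy_load() returns the same Documents eagerly, since it simply wraps lazy_load() in a list.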