unstructured, community, initialize langchain-unstructured package #22779

Merged on Jul 24, 2024 (115 commits)
Changes from 1 commit
292f984
change inheritance structure for unstructured loaders
Coniferish Jun 10, 2024
3145a55
fix types
Coniferish Jun 10, 2024
1d435c2
implement using the sdk for making requests to the api
Coniferish Jun 11, 2024
4a83a93
linting
Coniferish Jun 11, 2024
48ae68a
parameterize test
Coniferish Jun 13, 2024
4696e34
fix type hint and extract out _get_content helper function
Coniferish Jun 13, 2024
5bf9951
move functions to bottom of the file
Coniferish Jun 13, 2024
7946520
refactor _get_content and make file_path required arg in get_elements…
Coniferish Jun 13, 2024
e89845d
add test, fix default/accepted types (remove None default), remove un…
Coniferish Jun 11, 2024
1bc26a9
change UnstructuredFileIOLoader's file type hint to include Sequence …
Coniferish Jun 13, 2024
51c68a4
First pass at implementing UnstructuredAPIFileLoader.lazy_load()
Coniferish Jun 17, 2024
4a10c33
Make _post_process_elements an @abstractmethod and implement in child…
Coniferish Jun 18, 2024
f97543e
move test and fix metadata for api loaders
Coniferish Jun 18, 2024
9820cc4
remove type hint for partitioning sequences by UnstructuredFileIOLoader
Coniferish Jun 19, 2024
562f5aa
add lazy_load method to UnstructuredAPIFileIOLoader, split parameteri…
Coniferish Jun 19, 2024
bf33d72
add to Document metadata
Coniferish Jun 20, 2024
70af048
update links and Loader docstrings
Coniferish Jun 20, 2024
a172c1f
update jupyter notebook and docs
Coniferish Jun 21, 2024
960a877
linting and formatting
Coniferish Jun 21, 2024
5865649
Merge branch 'master' into jj/sdk
Coniferish Jun 21, 2024
6931a13
address unstructured.mdx comment
Coniferish Jun 21, 2024
0e30d2d
Address mode='paged'
Coniferish Jun 24, 2024
7ae58bb
Merge branch 'master' into jj/sdk
Coniferish Jun 24, 2024
d067719
Merge branch 'master' into jj/sdk
Coniferish Jun 25, 2024
ca930a3
Merge branch 'master' into jj/sdk
Coniferish Jun 26, 2024
fc1869e
Merge branch 'master' into jj/sdk
Coniferish Jun 28, 2024
963786f
Merge branch 'master' into jj/sdk
Coniferish Jul 1, 2024
22cda6b
Merge branch 'master' into jj/sdk
Coniferish Jul 1, 2024
a762de3
Merge branch 'master' into jj/sdk
Coniferish Jul 1, 2024
a098943
undo linting changes to unrelated docs
Coniferish Jul 1, 2024
b002838
fix notebook merge and mode bug
Coniferish Jul 2, 2024
3bc9143
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
0994404
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
cea7419
minor fix
Coniferish Jul 2, 2024
fb0bac5
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
992f485
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
231f421
linting
Coniferish Jul 2, 2024
0f7c1b3
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
336cded
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
e3f0800
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
f50a10b
alphabetize loaders
Coniferish Jul 2, 2024
88dd95d
add UnstructuredBaseLoader to package and import tests
Coniferish Jul 2, 2024
94bcde9
init partner package
Coniferish Jul 2, 2024
57d7b16
replicate SDK Loaders in partners
Coniferish Jul 2, 2024
bf18f52
update all references to api url
Coniferish Jul 2, 2024
c706390
update docstring and remove unused class
Coniferish Jul 3, 2024
0a4e47b
undo changes from rebase attempt
Coniferish Jul 3, 2024
ad5e874
Merge branch 'langchain-ai:master' into jj/sdk
Coniferish Jul 3, 2024
9fd5f7b
restore API loaders so they don't use the unstrd client, improve test…
Coniferish Jul 3, 2024
5349c8b
Merge branch 'master' into jj/sdk
Coniferish Jul 8, 2024
1e8f135
Merge branch 'master' into jj/sdk
Coniferish Jul 8, 2024
860f311
Merge branch 'master' into jj/sdk
Coniferish Jul 9, 2024
e625686
deprecate API Loaders and update docs to use SDK Loaders
Coniferish Jul 9, 2024
5b12875
fix docstring
Coniferish Jul 9, 2024
cd77607
Merge branch 'master' into jj/sdk
Coniferish Jul 10, 2024
38825ea
update test assertions and remove 'mode' param from SDK loaders
Coniferish Jul 10, 2024
3b5a039
add tests for mode
Coniferish Jul 11, 2024
0607648
Merge branch 'master' into jj/sdk
Coniferish Jul 11, 2024
82aca5c
remove comments
Coniferish Jul 11, 2024
6496f8c
address comments about private class, documentation, etc.
Coniferish Jul 14, 2024
295914a
remove libs/partners/unstructured/docs/document_loaders.ipynb
Coniferish Jul 15, 2024
0551990
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
6798635
linting
Coniferish Jul 15, 2024
befd745
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
b1a33e3
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
2f4f1ef
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
579a985
poetry lock --no-update
Coniferish Jul 15, 2024
7bc0b67
change all references back to UnstructuredBaseLoader
Coniferish Jul 15, 2024
028245a
Implement UnstructuredLoader and update unit tests after making metho…
Coniferish Jul 17, 2024
d5c6112
update tests
Coniferish Jul 18, 2024
1ad6d77
wip
Coniferish Jul 18, 2024
6eece75
refactor to simplify interface and misc.
Coniferish Jul 19, 2024
30e28a8
fix classes to pass tests
Coniferish Jul 19, 2024
ce8b66b
update docs and make file_path a positional arg
Coniferish Jul 22, 2024
9c32799
remove UnstructuredBaseLoader as a public class and references to SDK…
Coniferish Jul 22, 2024
a10781a
add deprecation decorators, undo some refactoring, and add type hints…
Coniferish Jul 22, 2024
dcec3c8
Merge branch 'master' into jj/sdk
Coniferish Jul 22, 2024
c5b087e
update README
Coniferish Jul 22, 2024
39e48db
linting and type hinting
Coniferish Jul 22, 2024
e3d5d74
add SDK example to docs
Coniferish Jul 22, 2024
890f84c
Merge branch 'master' into jj/sdk
Coniferish Jul 22, 2024
174dfd6
add unstructured to list of providers
Coniferish Jul 22, 2024
165454c
address comments
Coniferish Jul 23, 2024
f538f2f
Merge branch 'langchain-ai:master' into jj/sdk
Coniferish Jul 23, 2024
c60324c
revert files
Coniferish Jul 23, 2024
01d7ab4
refactor to simplify diff
Coniferish Jul 23, 2024
afb4679
Merge branch 'master' into jj/sdk
Coniferish Jul 23, 2024
e34670e
Merge branch 'master' into jj/sdk
efriis Jul 23, 2024
d4f6673
x
efriis Jul 24, 2024
532f9bc
add return values and address CI errors
Coniferish Jul 24, 2024
f2af016
Merge branch 'master' into jj/sdk
Coniferish Jul 24, 2024
b98eab7
format
efriis Jul 24, 2024
fabc70c
Merge branch 'master' into jj/sdk
efriis Jul 24, 2024
b9ff6b7
x
efriis Jul 24, 2024
02c99c4
x
efriis Jul 24, 2024
96a0812
docs: add tables for search and code interpreter tools (#24586)
isahers1 Jul 24, 2024
f21772f
cli: remove snapshot flag from pytest defaults (#24622)
efriis Jul 24, 2024
dbf2dab
milvus: release 0.1.3 (#24624)
efriis Jul 24, 2024
490e2b3
partners[milvus]: add dynamic field (#24544)
zc277584121 Jul 24, 2024
26c60a9
cli: release 0.0.26 (#24623)
efriis Jul 24, 2024
f7064bc
change client to UnstructuredClient, add os.getenv(), and update jupy…
Coniferish Jul 24, 2024
c124358
linting
Coniferish Jul 24, 2024
4a0c8ec
fix TypeAlias import
Coniferish Jul 24, 2024
9aa868c
Merge branch 'master' into jj/sdk
efriis Jul 24, 2024
35e58d3
x
efriis Jul 24, 2024
638ea5f
linting
Coniferish Jul 24, 2024
536ac18
x
efriis Jul 24, 2024
84138b5
Merge branch 'jj/sdk' of github.com:Coniferish/langchain into jj/sdk
efriis Jul 24, 2024
50c28b5
Merge branch 'master' into jj/sdk
efriis Jul 24, 2024
d384b96
x
efriis Jul 24, 2024
0d8331f
x
efriis Jul 24, 2024
9be2081
x
efriis Jul 24, 2024
9579813
x
efriis Jul 24, 2024
e73c148
x
efriis Jul 24, 2024
ec98d2a
x
efriis Jul 24, 2024
@@ -1,5 +1,6 @@
 """Loader that uses unstructured to load files."""
 from abc import ABC, abstractmethod
+import json
 from pathlib import Path
 from typing import (
     IO,
@@ -28,6 +29,10 @@ def __init__(
         **unstructured_kwargs: Any,
     ):
         """Initialize with file path."""
+
+        # `single` - elements are combined into one (default)
+        # `elements` - maintain individual elements
+        # `paged` - elements are combined by page
         _valid_modes = {"single", "elements", "paged"}

         if mode not in _valid_modes:
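
The hunk above is cut off at the membership check. A minimal sketch of how such a check might be completed, assuming the loader rejects unknown modes with a `ValueError` (the helper name `validate_mode` and the error wording are hypothetical, not taken from the PR):

```python
# Hypothetical stand-in that mirrors the `_valid_modes` set shown in the diff.
VALID_MODES = {"single", "elements", "paged"}


def validate_mode(mode: str) -> str:
    """Return `mode` unchanged if it is supported, otherwise raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError(
            f"Got {mode!r} for `mode`, but should be one of {sorted(VALID_MODES)}"
        )
    return mode
```

Validating in `__init__` means a typo in `mode` fails at construction time, before any file is partitioned.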
@@ -45,7 +50,7 @@ def _get_elements(self) -> List:

     @abstractmethod
     def _get_metadata(self) -> dict:
-        """Get metadata."""
+        """Get file_path metadata if available."""

     def _post_process_elements(self, elements: list) -> list:
         """Applies post processing functions to extracted unstructured elements.
@@ -74,13 +79,13 @@ def lazy_load(self) -> Iterator[Document]:
             text_dict: Dict[int, str] = {}
             meta_dict: Dict[int, Dict] = {}

-            for idx, element in enumerate(elements):
+            for element in elements:
                 metadata = self._get_metadata()
                 if hasattr(element, "metadata"):
                     metadata.update(element.metadata.to_dict())
                 page_number = metadata.get("page_number", 1)

-                # Check if this page_number already exists in docs_dict
+                # Check if this page_number already exists in text_dict
                 if page_number not in text_dict:
                     # If not, create new entry with initial text and metadata
                     text_dict[page_number] = str(element) + "\n\n"
@@ -219,6 +224,53 @@ def __init__(

         super().__init__(mode=mode, **unstructured_kwargs)

+    def lazy_load(self) -> Iterator[Document]:
+        """Load file."""
+        # This method overrides the UnstructuredBaseLoader method because that one
+        # expects `Element` objects instead of JSON, which is what the SDK returns.
+        elements_json = self._get_elements()
+        self._post_process_elements(elements_json)
+
+        if self.mode == "elements":
+            for element in elements_json:
+                metadata = self._get_metadata()
+                # NOTE(MthwRobinson) - the attribute check is for backward compatibility
+                # with unstructured<0.4.9. The metadata attribute was added in 0.4.9.
+                if element.get("metadata") is not None:
+                    metadata.update(element["metadata"])
+                if element.get("category"):
+                    # `category` is a string, so assign it directly; passing a
+                    # plain string to dict.update() raises ValueError.
+                    metadata["category"] = element["category"]
+                yield Document(page_content=element.get("text"), metadata=metadata)
+        elif self.mode == "paged":
+            text_dict: Dict[int, str] = {}
+            meta_dict: Dict[int, Dict] = {}
+
+            for element in elements_json:
+                metadata = self._get_metadata()
+                if element.get("metadata") is not None:
+                    metadata.update(element.get("metadata"))
+                page_number = metadata.get("page_number", 1)
+
+                # Check if this page_number already exists in text_dict
+                if page_number not in text_dict:
+                    # If not, create new entry with initial text and metadata
+                    text_dict[page_number] = str(element.get("text")) + "\n\n"
+                    meta_dict[page_number] = metadata
+                else:
+                    # If exists, append to text and update the metadata
+                    text_dict[page_number] += str(element.get("text")) + "\n\n"
+                    meta_dict[page_number].update(metadata)
+
+            # Convert the dict to a list of Document objects
+            for key in text_dict.keys():
+                yield Document(page_content=text_dict[key], metadata=meta_dict[key])
+        elif self.mode == "single":
+            metadata = self._get_metadata()
+            text = "\n\n".join([el.get("text") for el in elements_json])
+            yield Document(page_content=text, metadata=metadata)
+        else:
+            raise ValueError(f"mode of {self.mode} not supported.")
+
     def _get_metadata(self) -> dict:
         return {"source": self.file_path}
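
To make the three modes concrete, here is a self-contained sketch of the same grouping logic applied to plain JSON-style dictionaries. It assumes each element dict carries `text`, `category`, and `metadata` keys as in the diff above. `Doc` is a hypothetical stand-in for LangChain's `Document`, and `to_docs` and the sample data are illustrative names only, not part of the PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_docs(elements: List[Dict[str, Any]], mode: str, source: str) -> Iterator[Doc]:
    """Turn element dicts into documents according to `mode`."""
    if mode == "elements":
        # One document per element, carrying the element's own metadata.
        for el in elements:
            meta = {"source": source, **(el.get("metadata") or {})}
            if el.get("category"):
                meta["category"] = el["category"]
            yield Doc(el.get("text") or "", meta)
    elif mode == "paged":
        # One document per page; elements without a page number go to page 1.
        pages: Dict[int, Doc] = {}
        for el in elements:
            meta = {"source": source, **(el.get("metadata") or {})}
            page = meta.get("page_number", 1)
            text = (el.get("text") or "") + "\n\n"
            if page not in pages:
                pages[page] = Doc(text, meta)
            else:
                pages[page].page_content += text
                pages[page].metadata.update(meta)
        yield from pages.values()
    elif mode == "single":
        # One document for the whole file.
        joined = "\n\n".join(el.get("text") or "" for el in elements)
        yield Doc(joined, {"source": source})
    else:
        raise ValueError(f"mode of {mode} not supported.")


sample = [
    {"text": "Title", "category": "Title", "metadata": {"page_number": 1}},
    {"text": "Body", "category": "NarrativeText", "metadata": {"page_number": 1}},
    {"text": "Appendix", "category": "NarrativeText", "metadata": {"page_number": 2}},
]
by_element = list(to_docs(sample, "elements", "example.pdf"))
by_page = list(to_docs(sample, "paged", "example.pdf"))
whole = list(to_docs(sample, "single", "example.pdf"))
```

With this sample input, `elements` mode yields three documents, `paged` yields two (one per page), and `single` yields one.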

@@ -403,13 +455,11 @@ def get_elements_from_api(

     try:
         import unstructured_client  # noqa:F401
-        from unstructured.staging.base import elements_from_json  # noqa:F401
     except ImportError:
         raise ImportError(
-            "unstructured_client and/or unstructured package not found, please install it with "
-            "`pip install unstructured-client` or `pip install unstructured`."
+            "unstructured_client package not found, please install it with "
+            "`pip install unstructured-client`."
         )
-    from unstructured.staging.base import elements_from_json
     from unstructured_client.models import operations, shared

     content = _get_content(file=file, file_path=file_path)
@@ -426,7 +476,7 @@ def get_elements_from_api(
     response = client.general.partition(req)

     if response.status_code == 200:
-        return elements_from_json(text=response.raw_response.text)
+        return json.loads(response.raw_response.text)
     else:
         raise ValueError(
             f"Received unexpected status code {response.status_code} from the API.",
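
The last hunk replaces `elements_from_json` with `json.loads`, so the function returns plain dictionaries rather than `Element` objects. A minimal sketch of that response handling, using a fake response object in place of the real SDK client (`parse_partition_response` and `fake_ok` are hypothetical names; only the `status_code` and `raw_response.text` attributes are taken from the diff):

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, List


def parse_partition_response(response: Any) -> List[Dict[str, Any]]:
    """Decode a partition response body into a list of element dicts."""
    if response.status_code == 200:
        return json.loads(response.raw_response.text)
    raise ValueError(
        f"Received unexpected status code {response.status_code} from the API."
    )


# Fake response for illustration; a real one would come from the SDK client.
fake_ok = SimpleNamespace(
    status_code=200,
    raw_response=SimpleNamespace(text='[{"type": "Title", "text": "Hello"}]'),
)
parsed = parse_partition_response(fake_ok)
```

Returning dictionaries removes the need to import `unstructured` just to decode the API response, which matches the narrower `ImportError` message in the previous hunk.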