unstructured, community, initialize langchain-unstructured package #22779

Merged
merged 115 commits into from
Jul 24, 2024
Changes from 20 commits
Commits
292f984
change inheritance structure for unstructured loaders
Coniferish Jun 10, 2024
3145a55
fix types
Coniferish Jun 10, 2024
1d435c2
implement using the sdk for making requests to the api
Coniferish Jun 11, 2024
4a83a93
linting
Coniferish Jun 11, 2024
48ae68a
parameterize test
Coniferish Jun 13, 2024
4696e34
fix type hint and extract out _get_content helper function
Coniferish Jun 13, 2024
5bf9951
move functions to bottom of the file
Coniferish Jun 13, 2024
7946520
refactor _get_content and make file_path required arg in get_elements…
Coniferish Jun 13, 2024
e89845d
add test, fix default/accepted types (remove None default), remove un…
Coniferish Jun 11, 2024
1bc26a9
change UnstructuredFileIOLoader's file type hint to include Sequence …
Coniferish Jun 13, 2024
51c68a4
First pass at implementing UnstructuredAPIFileLoader.lazy_load()
Coniferish Jun 17, 2024
4a10c33
Make _post_process_elements an @abstractmethod and implement in child…
Coniferish Jun 18, 2024
f97543e
move test and fix metadata for api loaders
Coniferish Jun 18, 2024
9820cc4
remove type hint for partitioning sequences by UnstructuredFileIOLoader
Coniferish Jun 19, 2024
562f5aa
add lazy_load method to UnstructuredAPIFileIOLoader, split parameteri…
Coniferish Jun 19, 2024
bf33d72
add to Document metadata
Coniferish Jun 20, 2024
70af048
update links and Loader docstrings
Coniferish Jun 20, 2024
a172c1f
update jupyter notebook and docs
Coniferish Jun 21, 2024
960a877
linting and formatting
Coniferish Jun 21, 2024
5865649
Merge branch 'master' into jj/sdk
Coniferish Jun 21, 2024
6931a13
address unstructured.mdx comment
Coniferish Jun 21, 2024
0e30d2d
Address mode='paged'
Coniferish Jun 24, 2024
7ae58bb
Merge branch 'master' into jj/sdk
Coniferish Jun 24, 2024
d067719
Merge branch 'master' into jj/sdk
Coniferish Jun 25, 2024
ca930a3
Merge branch 'master' into jj/sdk
Coniferish Jun 26, 2024
fc1869e
Merge branch 'master' into jj/sdk
Coniferish Jun 28, 2024
963786f
Merge branch 'master' into jj/sdk
Coniferish Jul 1, 2024
22cda6b
Merge branch 'master' into jj/sdk
Coniferish Jul 1, 2024
a762de3
Merge branch 'master' into jj/sdk
Coniferish Jul 1, 2024
a098943
undo linting changes to unrelated docs
Coniferish Jul 1, 2024
b002838
fix notebook merge and mode bug
Coniferish Jul 2, 2024
3bc9143
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
0994404
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
cea7419
minor fix
Coniferish Jul 2, 2024
fb0bac5
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
992f485
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
231f421
linting
Coniferish Jul 2, 2024
0f7c1b3
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
336cded
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
e3f0800
Merge branch 'master' into jj/sdk
Coniferish Jul 2, 2024
f50a10b
alphabetize loaders
Coniferish Jul 2, 2024
88dd95d
add UnstructuredBaseLoader to package and import tests
Coniferish Jul 2, 2024
94bcde9
init partner package
Coniferish Jul 2, 2024
57d7b16
replicate SDK Loaders in partners
Coniferish Jul 2, 2024
bf18f52
update all references to api url
Coniferish Jul 2, 2024
c706390
update docstring and remove unused class
Coniferish Jul 3, 2024
0a4e47b
undo changes from rebase attempt
Coniferish Jul 3, 2024
ad5e874
Merge branch 'langchain-ai:master' into jj/sdk
Coniferish Jul 3, 2024
9fd5f7b
restore API loaders so they don't use the unstrd client, improve test…
Coniferish Jul 3, 2024
5349c8b
Merge branch 'master' into jj/sdk
Coniferish Jul 8, 2024
1e8f135
Merge branch 'master' into jj/sdk
Coniferish Jul 8, 2024
860f311
Merge branch 'master' into jj/sdk
Coniferish Jul 9, 2024
e625686
deprecate API Loaders and update docs to use SDK Loaders
Coniferish Jul 9, 2024
5b12875
fix docstring
Coniferish Jul 9, 2024
cd77607
Merge branch 'master' into jj/sdk
Coniferish Jul 10, 2024
38825ea
update test assertions and remove 'mode' param from SDK loaders
Coniferish Jul 10, 2024
3b5a039
add tests for mode
Coniferish Jul 11, 2024
0607648
Merge branch 'master' into jj/sdk
Coniferish Jul 11, 2024
82aca5c
remove comments
Coniferish Jul 11, 2024
6496f8c
address comments about private class, documentation, etc.
Coniferish Jul 14, 2024
295914a
remove libs/partners/unstructured/docs/document_loaders.ipynb
Coniferish Jul 15, 2024
0551990
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
6798635
linting
Coniferish Jul 15, 2024
befd745
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
b1a33e3
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
2f4f1ef
Merge branch 'master' into jj/sdk
Coniferish Jul 15, 2024
579a985
poetry lock --no-update
Coniferish Jul 15, 2024
7bc0b67
change all references back to UnstructuredBaseLoader
Coniferish Jul 15, 2024
028245a
Implement UnstructuredLoader and update unit tests after making metho…
Coniferish Jul 17, 2024
d5c6112
update tests
Coniferish Jul 18, 2024
1ad6d77
wip
Coniferish Jul 18, 2024
6eece75
refactor to simplify interface and misc.
Coniferish Jul 19, 2024
30e28a8
fix classes to pass tests
Coniferish Jul 19, 2024
ce8b66b
update docs and make file_path a positional arg
Coniferish Jul 22, 2024
9c32799
remove UnstructuredBaseLoader as a public class and references to SDK…
Coniferish Jul 22, 2024
a10781a
add deprecation decorators, undo some refactoring, and add type hints…
Coniferish Jul 22, 2024
dcec3c8
Merge branch 'master' into jj/sdk
Coniferish Jul 22, 2024
c5b087e
update README
Coniferish Jul 22, 2024
39e48db
linting and type hinting
Coniferish Jul 22, 2024
e3d5d74
add SDK example to docs
Coniferish Jul 22, 2024
890f84c
Merge branch 'master' into jj/sdk
Coniferish Jul 22, 2024
174dfd6
add unstructured to list of providers
Coniferish Jul 22, 2024
165454c
address comments
Coniferish Jul 23, 2024
f538f2f
Merge branch 'langchain-ai:master' into jj/sdk
Coniferish Jul 23, 2024
c60324c
revert files
Coniferish Jul 23, 2024
01d7ab4
refactor to simplify diff
Coniferish Jul 23, 2024
afb4679
Merge branch 'master' into jj/sdk
Coniferish Jul 23, 2024
e34670e
Merge branch 'master' into jj/sdk
efriis Jul 23, 2024
d4f6673
x
efriis Jul 24, 2024
532f9bc
add return values and address CI errors
Coniferish Jul 24, 2024
f2af016
Merge branch 'master' into jj/sdk
Coniferish Jul 24, 2024
b98eab7
format
efriis Jul 24, 2024
fabc70c
Merge branch 'master' into jj/sdk
efriis Jul 24, 2024
b9ff6b7
x
efriis Jul 24, 2024
02c99c4
x
efriis Jul 24, 2024
96a0812
docs: add tables for search and code interpreter tools (#24586)
isahers1 Jul 24, 2024
f21772f
cli: remove snapshot flag from pytest defaults (#24622)
efriis Jul 24, 2024
dbf2dab
milvus: release 0.1.3 (#24624)
efriis Jul 24, 2024
490e2b3
partners[milvus]: add dynamic field (#24544)
zc277584121 Jul 24, 2024
26c60a9
cli: release 0.0.26 (#24623)
efriis Jul 24, 2024
f7064bc
change client to UnstructuredClient, add os.getenv(), and update jupy…
Coniferish Jul 24, 2024
c124358
linting
Coniferish Jul 24, 2024
4a0c8ec
fix TypeAlias import
Coniferish Jul 24, 2024
9aa868c
Merge branch 'master' into jj/sdk
efriis Jul 24, 2024
35e58d3
x
efriis Jul 24, 2024
638ea5f
linting
Coniferish Jul 24, 2024
536ac18
x
efriis Jul 24, 2024
84138b5
Merge branch 'jj/sdk' of github.com:Coniferish/langchain into jj/sdk
efriis Jul 24, 2024
50c28b5
Merge branch 'master' into jj/sdk
efriis Jul 24, 2024
d384b96
x
efriis Jul 24, 2024
0d8331f
x
efriis Jul 24, 2024
9be2081
x
efriis Jul 24, 2024
9579813
x
efriis Jul 24, 2024
e73c148
x
efriis Jul 24, 2024
ec98d2a
x
efriis Jul 24, 2024
24 changes: 12 additions & 12 deletions cookbook/self_query_hotel_search.ipynb
@@ -355,15 +355,15 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"attribute_info[-2][\"description\"] += (\n",
-" f\". Valid values are {sorted(latest_price['starrating'].value_counts().index.tolist())}\"\n",
-")\n",
-"attribute_info[3][\"description\"] += (\n",
-" f\". Valid values are {sorted(latest_price['maxoccupancy'].value_counts().index.tolist())}\"\n",
-")\n",
-"attribute_info[-3][\"description\"] += (\n",
-" f\". Valid values are {sorted(latest_price['country'].value_counts().index.tolist())}\"\n",
-")"
+"attribute_info[-2][\n",
+" \"description\"\n",
+"] += f\". Valid values are {sorted(latest_price['starrating'].value_counts().index.tolist())}\"\n",
+"attribute_info[3][\n",
+" \"description\"\n",
+"] += f\". Valid values are {sorted(latest_price['maxoccupancy'].value_counts().index.tolist())}\"\n",
+"attribute_info[-3][\n",
+" \"description\"\n",
+"] += f\". Valid values are {sorted(latest_price['country'].value_counts().index.tolist())}\""
 ]
 },
 {
@@ -688,9 +688,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"attribute_info[-3][\"description\"] += (\n",
-" \". NOTE: Only use the 'eq' operator if a specific country is mentioned. If a region is mentioned, include all relevant countries in filter.\"\n",
-")\n",
+"attribute_info[-3][\n",
+" \"description\"\n",
+"] += \". NOTE: Only use the 'eq' operator if a specific country is mentioned. If a region is mentioned, include all relevant countries in filter.\"\n",
 "chain = load_query_constructor_runnable(\n",
 " ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0),\n",
 " doc_contents,\n",
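Both notebook hunks above look like formatter-only changes: the parenthesized right-hand side and the split-subscript layout should parse to the same statement. The following is a minimal sketch for checking that with the standard-library `ast` module. The two snippets are simplified stand-ins for the notebook cells (the name `values` is a placeholder), not the cell text itself.

```python
import ast

# Simplified stand-ins for the two layouts in the hunks above
# (assumed for illustration; not the exact notebook source).
PAREN_STYLE = '''
attribute_info[-2]["description"] += (
    f". Valid values are {values}"
)
'''

SUBSCRIPT_STYLE = '''
attribute_info[-2][
    "description"
] += f". Valid values are {values}"
'''


def same_ast(left: str, right: str) -> bool:
    """Return True when two source snippets parse to identical syntax trees."""
    # ast.dump leaves out line and column attributes by default,
    # so pure layout differences do not affect the comparison.
    return ast.dump(ast.parse(left)) == ast.dump(ast.parse(right))


if __name__ == "__main__":
    print(same_ast(PAREN_STYLE, SUBSCRIPT_STYLE))
```

If the comparison holds for the real cell contents as well, the 12 additions and 12 deletions in this file carry no behavioral change and can be reviewed as formatting churn.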
129 changes: 69 additions & 60 deletions docs/docs/integrations/document_loaders/unstructured_file.ipynb

Large diffs are not rendered by default.

16 changes: 8 additions & 8 deletions docs/docs/integrations/providers/unstructured.mdx
@@ -8,10 +8,15 @@ ecosystem within LangChain.

 ## Installation and Setup

-If you are using a loader that runs locally, use the following steps to get `unstructured` and
-its dependencies running locally.
+If you are using a loader that runs locally, use the following steps to get `unstructured` and its
+dependencies running.

-- Install the Python SDK with `pip install unstructured`.
+- For the smallest installation footprint and to take advantage of features not available in the
+open-source `unstructured package`, install the Python SDK with `pip install unstructured-client`.
+- Unstructured's documentation for the sdk can be found here:
+https://docs.unstructured.io/api-reference/api-services/sdk
+
+- Install the open-source python package with `pip install unstructured`.
 - You can install document specific dependencies with extras, i.e. `pip install "unstructured[docx]"`.
 - To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
 - Install the following system dependencies if they are not already available on your system.
@@ -22,11 +27,6 @@ its dependencies running locally.
 - `libreoffice` (MS Office docs)
 - `pandoc` (EPUBs)

-If you want to get up and running with less set up, you can
-simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
-`UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.
-
-
 The `Unstructured API` requires API keys to make requests.
 You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
 Checkout the README [here](https://github.com/Unstructured-IO/unstructured-api) here to get started making API calls.
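The setup text in this hunk describes two install paths: the `unstructured-client` SDK, which is used with the key-protected Unstructured API, and the open-source `unstructured` package for local processing. The following is a rough sketch for checking which path an environment supports. It assumes the import names `unstructured_client` and `unstructured`, and that the key is kept in an `UNSTRUCTURED_API_KEY` environment variable. The variable name and the helper `describe_setup` are illustrative assumptions, not taken from this diff.

```python
import importlib.util
import os
from typing import Dict, Mapping, Optional


def describe_setup(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which Unstructured install paths look usable in this environment."""
    env = os.environ if env is None else env
    return {
        # find_spec returns None when a top-level module is not installed.
        "sdk_installed": importlib.util.find_spec("unstructured_client") is not None,
        "local_installed": importlib.util.find_spec("unstructured") is not None,
        # Assumed variable name; only the hosted API path needs a key.
        "api_key_set": bool(env.get("UNSTRUCTURED_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in describe_setup().items():
        print(f"{name}: {ok}")
```

Passing a plain dict as `env` keeps the check testable without touching the real process environment.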
12 changes: 6 additions & 6 deletions docs/docs/integrations/vectorstores/azure_cosmos_db.ipynb
@@ -92,13 +92,13 @@
 "# Set up the OpenAI Environment Variables\n",
 "os.environ[\"OPENAI_API_TYPE\"] = \"azure\"\n",
 "os.environ[\"OPENAI_API_VERSION\"] = \"2023-05-15\"\n",
-"os.environ[\"OPENAI_API_BASE\"] = (\n",
-" \"YOUR_OPEN_AI_ENDPOINT\" # https://example.openai.azure.com/\n",
-")\n",
+"os.environ[\n",
+" \"OPENAI_API_BASE\"\n",
+"] = \"YOUR_OPEN_AI_ENDPOINT\" # https://example.openai.azure.com/\n",
 "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_OPENAI_API_KEY\"\n",
-"os.environ[\"OPENAI_EMBEDDINGS_DEPLOYMENT\"] = (\n",
-" \"smart-agent-embedding-ada\" # the deployment name for the embedding model\n",
-")\n",
+"os.environ[\n",
+" \"OPENAI_EMBEDDINGS_DEPLOYMENT\"\n",
+"] = \"smart-agent-embedding-ada\" # the deployment name for the embedding model\n",
 "os.environ[\"OPENAI_EMBEDDINGS_MODEL_NAME\"] = \"text-embedding-ada-002\" # the model name"
 ]
 },
6 changes: 3 additions & 3 deletions docs/docs/integrations/vectorstores/documentdb.ipynb
@@ -105,9 +105,9 @@
 "\n",
 "# Set up the OpenAI Environment Variables\n",
 "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n",
-"os.environ[\"OPENAI_EMBEDDINGS_DEPLOYMENT\"] = (\n",
-" \"smart-agent-embedding-ada\" # the deployment name for the embedding model\n",
-")\n",
+"os.environ[\n",
+" \"OPENAI_EMBEDDINGS_DEPLOYMENT\"\n",
+"] = \"smart-agent-embedding-ada\" # the deployment name for the embedding model\n",
 "os.environ[\"OPENAI_EMBEDDINGS_MODEL_NAME\"] = \"text-embedding-ada-002\" # the model name"
 ]
 },
6 changes: 3 additions & 3 deletions docs/scripts/arxiv_references.py
@@ -176,9 +176,9 @@ def search_code_for_arxiv_references(code_dir: Path) -> dict[str, set[str]]:
             else:
                 module_name_and_member_reduced.add(module_name_and_member)
         if module_name_and_member_reduced:
-            arxiv_id2module_name_and_members_reduced[arxiv_id] = (
-                module_name_and_member_reduced
-            )
+            arxiv_id2module_name_and_members_reduced[
+                arxiv_id
+            ] = module_name_and_member_reduced
         if removed_modules:
             logger.warning(
                 f"{arxiv_id}: Removed the following modules with 2+ -part namespaces: {removed_modules}."