feat: Sample ingest project with S3 connector (#218)
cragwolfe authored Feb 14, 2023
1 parent 6d1d50d commit ab542ca
Showing 26 changed files with 2,253 additions and 24 deletions.
3 changes: 3 additions & 0 deletions .coveragerc
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
[run]
omit =
unstructured/ingest/*
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,8 @@ jobs:
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr
make test
make check-coverage
make install-ingest-s3
./test_unstructured_ingest/test-ingest.sh
changelog:
runs-on: ubuntu-latest
Expand Down
51 changes: 49 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,54 @@ dmypy.json
# Pyre type checker
.pyre/

# VSCode
.vscode/
# ingest outputs
/structured-output
# ingest temporary files
/tmp-ingest*

## https://github.com/github/gitignore/blob/main/Global/Emacs.gitignore (partial)

*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc
auto-save-list
tramp
.\#*

## https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets

# Local History for Visual Studio Code
.history/

# Built Visual Studio Code Extensions
*.vsix

## https://github.com/github/gitignore/blob/main/Global/Vim.gitignore
# Swap
[._]*.s[a-v][a-z]
!*.svg # comment out if you don't need vector files
[._]*.sw[a-p]
[._]s[a-rt-v][a-z]
[._]ss[a-gi-z]
[._]sw[a-p]

# Session
Session.vim
Sessionx.vim

# Temporary
.netrwhist
# Auto-generated tag files
tags
# Persistent undo
[._]*.un~

.DS_Store
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
## 0.4.9-dev0

* Added ingest modules and S3 connector

## 0.4.8

* Modified XML and HTML parsers not to load comments.
Expand Down
36 changes: 36 additions & 0 deletions Ingest.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# Batch Processing Documents

Several classes are provided in the Unstructured library
to enable efficient batch processing of documents.

## The Abstractions

```mermaid
sequenceDiagram
    participant MainProcess
    participant DocReader (connector)
    participant DocProcessor
    participant StructuredDocWriter (connector)
    MainProcess->>DocReader (connector): Initialize / Authorize
    DocReader (connector)->>MainProcess: All doc metadata (no file content)
    loop Single doc at a time (allows for multiprocessing)
        MainProcess->>DocProcessor: Raw document metadata (no file content)
        DocProcessor->>DocReader (connector): Request document
        DocReader (connector)->>DocProcessor: Single document payload
        Note over DocProcessor: Process through Unstructured
        DocProcessor->>StructuredDocWriter (connector): Write Structured Data
        Note over StructuredDocWriter (connector): <br /> Optionally store version info, filename, etc.
        DocProcessor->>MainProcess: Structured Data (only JSON in V0)
    end
    Note over MainProcess: Optional - process structured data from all docs
```

## Sample Connector: S3

See the sample project [examples/ingest/s3-small-batch/main.py](examples/ingest/s3-small-batch/main.py), which processes all the documents under a given S3 URL with 2 parallel processes, writing the structured JSON output to `structured-output/`.

You can try it out with

PYTHONPATH=. python examples/ingest/s3-small-batch/main.py

The abstractions in the above diagram are honored in this project (though ABCs are not yet written), with the exception of the StructuredDocWriter, which may be added more formally at a later time.
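Though the ABCs are not yet written, the connector contract implied by the diagram above can be sketched as follows. This is a hypothetical illustration only: `BaseConnector`, `InMemoryConnector`, and the method names are assumptions drawn from the diagram and the sample project, not part of the library's API.

```python
from abc import ABC, abstractmethod


class BaseConnector(ABC):
    """Hypothetical base class for a DocReader connector (illustrative only)."""

    @abstractmethod
    def initialize(self):
        """Check connections / authorize, and load doc metadata (no file content)."""

    @abstractmethod
    def fetch_docs(self):
        """Return a list of lazily-downloading doc objects (metadata only)."""

    @abstractmethod
    def cleanup(self):
        """Remove temporary files and close connections."""


class InMemoryConnector(BaseConnector):
    """Toy connector whose 'documents' are just strings held in memory."""

    def __init__(self, docs):
        self._docs = docs

    def initialize(self):
        pass  # nothing to authorize for in-memory docs

    def fetch_docs(self):
        return list(self._docs)

    def cleanup(self):
        pass  # no temporary files to remove


connector = InMemoryConnector(["doc-a", "doc-b"])
connector.initialize()
print(connector.fetch_docs())  # -> ['doc-a', 'doc-b']
connector.cleanup()
```

A real connector (such as the S3 one below) would fetch metadata in `initialize` and download file contents lazily from `fetch_docs` results.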
6 changes: 6 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,11 @@ install-dev:
install-build:
pip install -r requirements/build.txt

## install-ingest-s3: install requirements for the s3 connector
.PHONY: install-ingest-s3
install-ingest-s3:
pip install -r requirements/ingest-s3.txt

.PHONY: install-unstructured-inference
install-unstructured-inference:
pip install -r requirements/local-inference.txt
Expand Down Expand Up @@ -78,6 +83,7 @@ pip-compile:
# NOTE(robinson) - doc/requirements.txt is where the GitHub action for building
# sphinx docs looks for additional requirements
cp requirements/build.txt docs/requirements.txt
pip-compile --upgrade requirements/ingest-s3.in requirements/base.txt --output-file requirements/ingest-s3.txt

## install-project-local: install unstructured into your local python environment
.PHONY: install-project-local
Expand Down
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -276,3 +276,4 @@ information on how to report security vulnerabilities.
|-|-|
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |
| [Documentation](https://unstructured-io.github.io/unstructured) | Full API documentation |
| [Batch Processing](Ingest.md) | Ingesting batches of documents through Unstructured |
2 changes: 1 addition & 1 deletion docs/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.14
# via requests
zipp==3.12.1
zipp==3.13.0
# via importlib-metadata

# The following packages are considered to be unsafe in a requirements file:
Expand Down
54 changes: 54 additions & 0 deletions examples/ingest/s3-small-batch/main.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
import multiprocessing as mp

from unstructured.ingest.connector.s3_connector import S3Connector, SimpleS3Config
from unstructured.ingest.doc_processor.generalized import process_document


class MainProcess:

    def __init__(self, doc_connector, doc_processor_fn, num_processes):
        # initialize the reader and writer
        self.doc_connector = doc_connector
        self.doc_processor_fn = doc_processor_fn
        self.num_processes = num_processes

    def initialize(self):
        """Slower initialization things: check connections, load things into memory, etc."""
        self.doc_connector.initialize()

    def cleanup(self):
        self.doc_connector.cleanup()

    def run(self):
        self.initialize()

        # fetch the list of lazily-downloading IngestDoc objects
        docs = self.doc_connector.fetch_docs()

        # Debugging tip: uncomment the line below and comment out the
        # mp.Pool block to remain in a single process
        # self.doc_processor_fn(docs[0])

        with mp.Pool(processes=self.num_processes) as pool:
            pool.map(self.doc_processor_fn, docs)

        self.cleanup()

    @staticmethod
    def main():
        doc_connector = S3Connector(
            config=SimpleS3Config(
                s3_url="s3://utic-dev-tech-fixtures/small-pdf-set/",
                output_dir="structured-output",
                # set to False to use your AWS creds (not needed for this public s3 url)
                anonymous=True,
            ),
        )
        MainProcess(
            doc_connector=doc_connector,
            doc_processor_fn=process_document,
            num_processes=2,
        ).run()


if __name__ == "__main__":
    MainProcess.main()
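One detail worth noting about the script above: `mp.Pool.map` ships the mapped function to worker processes by pickling, so `doc_processor_fn` must be a module-level callable (as `process_document` is), not a lambda or a nested function. A minimal sketch of that pattern, with illustrative stand-in names that are not part of the library:

```python
import multiprocessing as mp


def toy_process_document(doc):
    # Illustrative stand-in for process_document: must live at module
    # level so worker processes can unpickle a reference to it.
    return doc.upper()


def run_pool(num_processes=2):
    # Same shape as MainProcess.run: a pool maps the processor over docs.
    with mp.Pool(processes=num_processes) as pool:
        return pool.map(toy_process_document, ["a.pdf", "b.pdf"])


if __name__ == "__main__":
    print(run_pool())  # -> ['A.PDF', 'B.PDF']
```

`pool.map` preserves input order, so results line up with the fetched doc list even though processing is parallel.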
4 changes: 2 additions & 2 deletions requirements/base.txt
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.2.1
argilla==1.3.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
Expand Down Expand Up @@ -50,7 +50,7 @@ numpy==1.23.5
# via
# argilla
# pandas
openpyxl==3.1.0
openpyxl==3.1.1
# via unstructured (setup.py)
packaging==23.0
# via argilla
Expand Down
2 changes: 1 addition & 1 deletion requirements/build.txt
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.14
# via requests
zipp==3.12.1
zipp==3.13.0
# via importlib-metadata

# The following packages are considered to be unsafe in a requirements file:
Expand Down
13 changes: 8 additions & 5 deletions requirements/dev.txt
Original file line number Diff line number Diff line change
Expand Up @@ -59,15 +59,15 @@ importlib-metadata==6.0.0
# nbconvert
importlib-resources==5.10.2
# via jsonschema
ipykernel==6.21.1
ipykernel==6.21.2
# via
# ipywidgets
# jupyter
# jupyter-console
# nbclassic
# notebook
# qtconsole
ipython==8.9.0
ipython==8.10.0
# via
# -r requirements/dev.in
# ipykernel
Expand Down Expand Up @@ -107,13 +107,14 @@ jupyter-client==8.0.2
# nbclient
# notebook
# qtconsole
jupyter-console==6.4.4
jupyter-console==6.5.1
# via jupyter
jupyter-core==5.2.0
# via
# -r requirements/dev.in
# ipykernel
# jupyter-client
# jupyter-console
# jupyter-server
# nbclassic
# nbclient
Expand Down Expand Up @@ -223,14 +224,15 @@ python-dateutil==2.8.2
# via
# arrow
# jupyter-client
python-json-logger==2.0.4
python-json-logger==2.0.5
# via jupyter-events
pyyaml==6.0
# via jupyter-events
pyzmq==25.0.0
# via
# ipykernel
# jupyter-client
# jupyter-console
# jupyter-server
# nbclassic
# notebook
Expand Down Expand Up @@ -291,6 +293,7 @@ traitlets==5.9.0
# ipython
# ipywidgets
# jupyter-client
# jupyter-console
# jupyter-core
# jupyter-events
# jupyter-server
Expand Down Expand Up @@ -319,7 +322,7 @@ wheel==0.38.4
# pip-tools
widgetsnbextension==4.0.5
# via ipywidgets
zipp==3.12.1
zipp==3.13.0
# via
# importlib-metadata
# importlib-resources
Expand Down
6 changes: 3 additions & 3 deletions requirements/huggingface.txt
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.2.1
argilla==1.3.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
Expand Down Expand Up @@ -63,7 +63,7 @@ numpy==1.23.5
# argilla
# pandas
# transformers
openpyxl==3.1.0
openpyxl==3.1.1
# via unstructured (setup.py)
packaging==23.0
# via
Expand Down Expand Up @@ -131,7 +131,7 @@ tqdm==4.64.1
# nltk
# sacremoses
# transformers
transformers==4.26.0
transformers==4.26.1
# via unstructured (setup.py)
typing-extensions==4.4.0
# via
Expand Down