chore: Remove PDF parsing code and dependencies (Unstructured-IO#75)
Remove PDF parsing code and dependencies.
mallorih authored Nov 21, 2022
1 parent baa15d0 commit 53fcf4e
Showing 20 changed files with 69 additions and 539 deletions.
15 changes: 11 additions & 4 deletions .github/workflows/ci.yml
@@ -25,8 +25,11 @@ jobs:
path: |
.venv
nltk_data
key: ${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements/*.txt') }}

key: unstructured-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements/*.txt') }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
@@ -43,9 +46,13 @@ jobs:
id: virtualenv-cache
with:
path: .venv
key: ${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements/*.txt') }}
key: unstructured-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements/*.txt') }}
# NOTE(robinson) - This is a fallback in case the lint job does not find the cache.
# We can take this out when we implement the fix in CORE-99
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
@@ -77,7 +84,7 @@ jobs:
path: |
.venv
nltk_data
key: ${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements/*.txt') }}
key: unstructured-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements/*.txt') }}
# NOTE(robinson) - This is a fallback in case the lint job does not find the cache.
# We can take this out when we implement the fix in CORE-99
- name: Setup virtual environment (no cache hit)
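The only change to the three `key:` lines in this workflow is the new `unstructured-` prefix. A minimal Python sketch of the effect, assuming the cache is looked up by exact key match: the prefixed key can never equal a key saved before this commit, so older cache entries are not restored. The helper names and input values below are hypothetical, and `requirements_digest` only approximates what `hashFiles` does.

```python
import hashlib


def requirements_digest(file_contents):
    """Stand-in for hashFiles(): hash each file, then hash the digests together."""
    combined = hashlib.sha256()
    for content in file_contents:
        combined.update(hashlib.sha256(content).digest())
    return combined.hexdigest()


def cache_key(runner_os, python_version, digest, prefix=None):
    """Join the key parts with '-' in the same order as the workflow expression."""
    parts = [runner_os, python_version, digest]
    if prefix:
        parts.insert(0, prefix)
    return "-".join(parts)


# Hypothetical inputs; the real values come from the runner and the repository.
digest = requirements_digest([b"nltk\n", b"pytest\n"])
old_key = cache_key("Linux", "3.8", digest)
new_key = cache_key("Linux", "3.8", digest, prefix="unstructured")

print(old_key == new_key)  # False: the old cache entry is not an exact match
```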
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,8 @@
## 0.2.6-dev1
## 0.3.0-dev1

* Removing the local PDF parsing code and any dependencies and tests.

## 0.2.6

* Small change to how _read is placed within the inheritance structure since it doesn't really apply to pdf
* Add partitioning brick for calling the document image analysis API
19 changes: 2 additions & 17 deletions Makefile
@@ -17,10 +17,10 @@ install-base: install-base-pip-packages install-nltk-models

## install: installs all test, dev, and experimental requirements
.PHONY: install
install: install-base-pip-packages install-dev install-detectron2 install-nltk-models install-test
install: install-base-pip-packages install-dev install-nltk-models install-test

.PHONY: install-ci
install-ci: install-base-pip-packages install-pdf install-test install-nltk-models install-huggingface
install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface

.PHONY: install-base-pip-packages
install-base-pip-packages:
@@ -32,18 +32,6 @@ install-huggingface:
python3 -m pip install pip==${PIP_VERSION}
pip install -r requirements/huggingface.txt

.PHONY: install-pdf
install-pdf:
python3 -m pip install pip==${PIP_VERSION}
pip install -r requirements/pdf.txt
@echo "\n\n========================================================================"
@echo " WARNING: PDF parsing capabilities in unstructured is still experimental"
@echo "========================================================================\n\n"

.PHONY: install-detectron2
install-detectron2: install-pdf
pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"

.PHONE: install-nltk-models
install-nltk-models:
python -c "import nltk; nltk.download('punkt')"
@@ -67,12 +55,9 @@ pip-compile:
pip-compile -o requirements/base.txt
# Extra requirements for huggingface staging functions
pip-compile --extra huggingface -o requirements/huggingface.txt
# Extra requirements for parsing PDF files
pip-compile --extra pdf -o requirements/pdf.txt
# NOTE(robinson) - We want the dependencies for detectron2 in the requirements.txt, but not
# the detectron2 repo itself. If detectron2 is in the requirements.txt file, an order of
# operations issue related to the torch library causes the install to fail
sed 's/^detectron2 @/# detectron2 @/g' requirements/pdf.txt
pip-compile requirements/dev.in
pip-compile requirements/test.in
pip-compile requirements/build.in
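The `sed` expression in the `pip-compile` target comments out the `detectron2 @ ...` requirement line while leaving its dependencies in place. A rough Python equivalent of that substitution is sketched below; the function name and the sample requirements text (including the `torch` pin) are hypothetical and only illustrate the pattern.

```python
import re


def comment_out_detectron2(requirements_text):
    """Prefix any line starting with 'detectron2 @' with '# ', like the sed expression."""
    return re.sub(
        r"^detectron2 @",
        "# detectron2 @",
        requirements_text,
        flags=re.MULTILINE,  # make ^ match at the start of every line, as sed does
    )


# Hypothetical requirements content for illustration only.
sample = (
    "detectron2 @ git+https://github.com/facebookresearch/detectron2.git@v0.6\n"
    "torch==1.13.0\n"
)
result = comment_out_detectron2(sample)
print(result)
```

Note that `sed` without the `-i` flag writes the result to standard output and leaves the input file unchanged.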
9 changes: 4 additions & 5 deletions README.md
@@ -88,17 +88,16 @@ titles and narrative text.

### PDF Parsing

You can use the following workflow to parse PDF documents. Note, PDF parsing is currently
expiremental and will be refined in the coming months.
You can use the following workflow to parse PDF documents.

```python
from unstructured.documents.pdf import PDFDocument
from unstructured.nlp.partition import partition_pdf

doc = PDFDocument.from_file("example-docs/layout-parser-paper.pdf")
elements = partition_pdf("example-docs/layout-parser-paper.pdf")
print(doc)
```

At this point, `print(doc)` will print out a string representation of the PDF file. The
At this point, `print(elements)` will print out a string representation of the PDF file. The
first page of output looks like the following:

```
8 changes: 4 additions & 4 deletions docs/requirements.txt
@@ -8,7 +8,7 @@ alabaster==0.7.12
# via sphinx
babel==2.10.3
# via sphinx
certifi==2022.9.14
certifi==2022.9.24
# via requests
charset-normalizer==2.1.1
# via requests
@@ -38,11 +38,11 @@ requests==2.28.1
# via sphinx
snowballstemmer==2.2.0
# via sphinx
sphinx==5.2.3
sphinx==5.3.0
# via
# -r requirements/build.in
# sphinx-rtd-theme
sphinx-rtd-theme==1.0.0
sphinx-rtd-theme==1.1.1
# via -r requirements/build.in
sphinxcontrib-applehelp==1.0.2
# via sphinx
@@ -58,5 +58,5 @@ sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.12
# via requests
zipp==3.9.0
zipp==3.10.0
# via importlib-metadata
20 changes: 1 addition & 19 deletions docs/source/installing.rst
@@ -3,8 +3,7 @@ Installation

You can install the library by cloning the repo and running ``make install`` from the
root directory. Developers can run ``make install-local`` to install the dev and test
requirements alongside the base requirements. Specific parsing capabilities may require
extra dependencies, as documented below. If you want a minimal installation without any
requirements alongside the base requirements. If you want a minimal installation without any
parser specific dependencies, run ``make install-base``.

Logging
@@ -37,23 +36,6 @@ that with:
$ brew install libxml2
$ brew install libxslt
================
PDF Dependencies
================

Currently, PDF parsing capabilities rely on the
`Detectron2 <https://github.com/facebookresearch/detectron2>`_ object detection model. The
``make install-local`` command installs all of the dependencies for Detectron2. If you
need to parse PDFs and Detectron2 is not already installed, you can install it with
``make install-detectron2``.

Also ensure that you have ``poppler`` installed on your system. On a Mac, you can run:

.. code:: console
$ brew install poppler
========================
Huggingface Dependencies
========================
8 changes: 6 additions & 2 deletions requirements/build.txt
@@ -20,6 +20,8 @@ idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==5.0.0
# via sphinx
jinja2==3.1.2
# via sphinx
markupsafe==2.1.1
@@ -38,10 +40,10 @@ snowballstemmer==2.2.0
# via sphinx
sphinx==5.3.0
# via
# -r build.in
# -r requirements/build.in
# sphinx-rtd-theme
sphinx-rtd-theme==1.1.1
# via -r build.in
# via -r requirements/build.in
sphinxcontrib-applehelp==1.0.2
# via sphinx
sphinxcontrib-devhelp==1.0.2
@@ -56,3 +58,5 @@ sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.12
# via requests
zipp==3.10.0
# via importlib-metadata
24 changes: 21 additions & 3 deletions requirements/dev.txt
@@ -4,6 +4,10 @@
#
# pip-compile requirements/dev.in
#
appnope==0.1.3
# via
# ipykernel
# ipython
argon2-cffi==21.3.0
# via notebook
argon2-cffi-bindings==21.2.0
@@ -36,6 +40,10 @@ executing==1.0.0
# via stack-data
fastjsonschema==2.16.2
# via nbformat
importlib-metadata==5.0.0
# via nbconvert
importlib-resources==5.10.0
# via jsonschema
ipykernel==6.15.3
# via
# ipywidgets
@@ -45,7 +53,7 @@ ipykernel==6.15.3
# qtconsole
ipython==8.6.0
# via
# -r dev.in
# -r requirements/dev.in
# ipykernel
# ipywidgets
# jupyter-console
@@ -64,7 +72,7 @@ jinja2==3.1.2
jsonschema==4.16.0
# via nbformat
jupyter==1.0.0
# via -r dev.in
# via -r requirements/dev.in
jupyter-client==7.3.5
# via
# ipykernel
@@ -133,7 +141,9 @@ pexpect==4.8.0
pickleshare==0.7.5
# via ipython
pip-tools==6.10.0
# via -r dev.in
# via -r requirements/dev.in
pkgutil-resolve-name==1.3.10
# via jsonschema
prometheus-client==0.14.1
# via notebook
prompt-toolkit==3.0.31
@@ -187,6 +197,10 @@ terminado==0.15.0
# via notebook
tinycss2==1.1.1
# via nbconvert
tomli==2.0.1
# via
# build
# pep517
tornado==6.2
# via
# ipykernel
@@ -216,6 +230,10 @@ wheel==0.37.1
# via pip-tools
widgetsnbextension==4.0.3
# via ipywidgets
zipp==3.10.0
# via
# importlib-metadata
# importlib-resources

# The following packages are considered to be unsafe in a requirements file:
# pip