docs: add a quick start page to the readme and docs (Unstructured-IO#240)

* added quick start section to the readme

* added quick start to docs

* parenthetical on extra deps

* typo

* fix typo

* fixed mixed tabs/spaces
MthwRobinson authored Feb 17, 2023
1 parent 601f250 commit 7472e1b
Showing 2 changed files with 77 additions and 7 deletions.
43 changes: 40 additions & 3 deletions README.md
@@ -49,11 +49,48 @@ about. Bricks in the library fall into three categories:
- :performing_arts: ***Staging bricks*** that format data for downstream tasks, such as ML inference
and data labeling.
<br></br>
## :eight_pointed_black_star: Installation
## :eight_pointed_black_star: Quick Start

Use the following instructions to get up and running with `unstructured` and test your
installation.

- Install the Python SDK with `pip install unstructured[local-inference]`
    - If you do not need to process PDFs or images, you can run `pip install unstructured`
- Install the following system dependencies if they are not already available on your system.
  Depending on what document types you're parsing, you may not need all of these.
    - `libmagic-dev` (filetype detection)
    - `poppler-utils` (images and PDFs)
    - `tesseract-ocr` (images and PDFs)
    - `libreoffice` (MS Office docs)
- Run the following to install NLTK dependencies. `unstructured` will handle this automatically
  soon.
    - `python -c "import nltk; nltk.download('punkt')"`
    - `python -c "import nltk; nltk.download('averaged_perceptron_tagger')"`
- If you are parsing PDFs, run the following to install the `detectron2` model, which
  `unstructured` uses for layout detection:
    - `pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"`
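
Before running the examples, it can help to confirm that the system dependencies listed above are visible from Python. The following is a minimal sketch using only the standard library; it is not part of `unstructured`. The executable names (`pdftoppm`, `tesseract`, `soffice`) are assumptions about what each package typically puts on `PATH` on Debian/Ubuntu, and the helper names are hypothetical.

```python
import ctypes.util
import shutil

# Assumed mapping from an executable name to the package that usually
# provides it (typical for Debian/Ubuntu); adjust for your platform.
TOOLS = {
    "pdftoppm": "poppler-utils",
    "tesseract": "tesseract-ocr",
    "soffice": "libreoffice",
}


def find_missing_tools(tools=TOOLS, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return [package for exe, package in tools.items() if which(exe) is None]


def has_libmagic(find_library=ctypes.util.find_library):
    """Return True if a shared library named 'magic' can be located."""
    return find_library("magic") is not None


def report():
    """Print which of the optional system dependencies appear to be missing."""
    missing = find_missing_tools()
    if not has_libmagic():
        missing.append("libmagic")
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

Calling `report()` prints the packages that could not be located. A missing entry only matters if you plan to parse the corresponding document types.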

At this point, you should be able to run the following code:

To install the library, run `pip install unstructured`.
```python
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/fake-email.eml")
```

And if you installed with `local-inference`, you should be able to run this as well:

```python
from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")
```
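
To check that a `partition` call returned something sensible, one option is to count the returned elements by type. The sketch below assumes only that `partition` returns an iterable of element objects, as in the snippets above. `summarize` and `partition_and_summarize` are hypothetical helper names, not part of the library.

```python
from collections import Counter


def summarize(elements):
    """Count elements by their class name."""
    return Counter(type(element).__name__ for element in elements)


def partition_and_summarize(filename):
    """Partition a file and return per-type element counts.

    Requires `unstructured` to be installed; the import is done lazily so
    that `summarize` can be used on its own.
    """
    from unstructured.partition.auto import partition

    elements = partition(filename=filename)
    return summarize(elements)
```

For example, `print(partition_and_summarize("example-docs/fake-email.eml"))` should print a `Counter` keyed by element class name if the installation succeeded.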


## :coffee: Installation Instructions for Local Development

## :coffee: Getting Started
The following instructions are intended to help you get up and running with `unstructured`
locally if you are planning to contribute to the project.

* Using `pyenv` to manage virtualenvs is recommended but not necessary
* Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.
41 changes: 37 additions & 4 deletions docs/source/installing.rst
@@ -1,10 +1,43 @@
Installation
============

You can install the library by cloning the repo and running ``make install`` from the
root directory. Developers can run ``make install-local`` to install the dev and test
requirements alongside the base requirements. If you want a minimal installation without any
parser specific dependencies, run ``make install-base``.
Quick Start
-----------

Use the following instructions to get up and running with ``unstructured`` and test your
installation.

* Install the Python SDK with ``pip install unstructured[local-inference]``

  * If you do not need to process PDFs or images, you can run ``pip install unstructured``

* Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.

  * ``libmagic-dev`` (filetype detection)
  * ``poppler-utils`` (images and PDFs)
  * ``tesseract-ocr`` (images and PDFs)
  * ``libreoffice`` (MS Office docs)

* Run the following to install NLTK dependencies. ``unstructured`` will handle this automatically soon.

  * ``python -c "import nltk; nltk.download('punkt')"``
  * ``python -c "import nltk; nltk.download('averaged_perceptron_tagger')"``

* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:

  * ``pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"``

At this point, you should be able to run the following code:

.. code:: python

  from unstructured.partition.auto import partition

  elements = partition(filename="example-docs/fake-email.eml")

And if you installed with ``local-inference``, you should be able to run this as well:

.. code:: python

  from unstructured.partition.auto import partition

  elements = partition("example-docs/layout-parser-paper.pdf")

Installation with ``conda`` on Windows