docs: add a quick start page to the readme and docs (Unstructured-IO#240)

* added quick start section to the readme
* added quick start to docs
* parenthetical on extra deps
* typo
* fix typo
* fixed mixed tabs/spaces
siddartha-RE · Feb 17, 2023 · 7472e1b
1 parent 601f250
commit 7472e1b

Showing 2 changed files with 77 additions and 7 deletions.
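
The quick start added in this commit lists four system packages and notes that not all of them are needed for every document type. As a rough way to see which ones appear to be present before calling `partition`, a check along the following lines may help. It is only a sketch: the executable names (`pdftoppm`, `tesseract`, `soffice`) and the `missing_system_deps` helper are assumptions made for illustration and are not part of `unstructured` or of this commit.

```python
import shutil
from ctypes.util import find_library

# Assumed mapping from the packages named in the quick start to an executable
# each one typically provides. These binary names are not stated in the commit.
ASSUMED_BINARIES = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}


def missing_system_deps(which=shutil.which, find_lib=find_library):
    """Return the quick-start packages that do not appear to be installed."""
    missing = [pkg for pkg, exe in ASSUMED_BINARIES.items() if which(exe) is None]
    # libmagic is a shared library rather than a command, so look it up by name.
    if find_lib("magic") is None:
        missing.append("libmagic-dev")
    return missing


if __name__ == "__main__":
    missing = missing_system_deps()
    print("missing:", ", ".join(missing) if missing else "none")
```

The lookup functions are passed in as parameters so the check can be exercised without touching the real system. An empty result only means the assumed executables and library were found, not that every document type will parse.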
diff --git a/README.md b/README.md
@@ -49,11 +49,48 @@ about. Bricks in the library fall into three categories:
 - :performing_arts: ***Staging bricks*** that format data for downstream tasks, such as ML inference
   and data labeling.
 <br></br>
-## :eight_pointed_black_star: Installation
+## :eight_pointed_black_star: Quick Start
+
+Use the following instructions to get up and running with `unstructured` and test your
+installation.
+
+- Install the Python SDK with `pip install unstructured[local-inference]`
+    - If you do not need to process PDFs or images, you can run `pip install unstructured`
+- Install the following system dependencies if they are not already available on your system.
+  Depending on what document types you're parsing, you may not need all of these.
+    - `libmagic-dev` (filetype detection)
+    - `poppler-utils` (images and PDFs)
+    - `tesseract-ocr` (images and PDFs)
+    - `libreoffice` (MS Office docs)
+- Run the following to install NLTK dependencies. `unstructured` will handle this automatically
+  soon.
+    - `python -c "import nltk; nltk.download('punkt')"`
+    - `python -c "import nltk; nltk.download('averaged_perceptron_tagger')"`
+- If you are parsing PDFs, run the following to install the `detectron2` model, which
+  `unstructured` uses for layout detection:
+    - `pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"`
+
+At this point, you should be able to run the following code:
 
-To install the library, run `pip install unstructured`.
+```python
+from unstructured.partition.auto import partition
+
+elements = partition(filename="example-docs/fake-email.eml")
+```
+
+And if you installed with `local-inference`, you should be able to run this as well:
+
+```python
+from unstructured.partition.auto import partition
+
+elements = partition("example-docs/layout-parser-paper.pdf")
+```
+
+
+## :coffee: Installation Instructions for Local Development
 
-## :coffee: Getting Started
+The following instructions are intended to help you get up and running with `unstructured`
+locally if you are planning to contribute to the project.
 
 * Using `pyenv` to manage virtualenv's is recommended but not necessary
 	* Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.

diff --git a/docs/source/installing.rst b/docs/source/installing.rst
@@ -1,10 +1,43 @@
 Installation
 ============
 
-You can install the library by cloning the repo and running ``make install`` from the
-root directory. Developers can run ``make install-local`` to install the dev and test
-requirements alongside the base requirements. If you want a minimal installation without any
-parser specific dependencies, run ``make install-base``.
+Quick Start
+-----------
+
+Use the following instructions to get up and running with ``unstructured`` and test your
+installation.
+
+* Install the Python SDK with ``pip install unstructured[local-inference]``
+  * If you do not need to process PDFs or images, you can run ``pip install unstructured``
+
+* Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
+	* ``libmagic-dev`` (filetype detection)
+	* ``poppler-utils`` (images and PDFs)
+	* ``tesseract-ocr`` (images and PDFs)
+	* ``libreoffice`` (MS Office docs)
+
+* Run the following to install NLTK dependencies. ``unstructured`` will handle this automatically soon.
+	* ``python -c "import nltk; nltk.download('punkt')"``
+	* ``python -c "import nltk; nltk.download('averaged_perceptron_tagger')"``
+
+* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
+	* ``pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"``
+
+At this point, you should be able to run the following code:
+
+.. code:: python
+
+  from unstructured.partition.auto import partition
+
+  elements = partition(filename="example-docs/fake-email.eml")
+
+And if you installed with ``local-inference``, you should be able to run this as well:
+
+.. code:: python
+
+  from unstructured.partition.auto import partition
+
+  elements = partition("example-docs/layout-parser-paper.pdf")
 
 
 Installation with ``conda`` on Windows