Update documentation

elshimone · Jan 21, 2023 · 3452165
1 parent efaed1c

Showing 4 changed files with 1,542 additions and 38 deletions.
diff --git a/README.md b/README.md
@@ -29,7 +29,12 @@
 
 -------------------------------------------------------------------------------------------------------------------------------------------------------
 
-paperetl is an ETL library for processing medical and scientific papers. It supports the following sources:
+paperetl is an ETL library for processing medical and scientific papers.
+
+![architecture](https://raw.githubusercontent.com/neuml/paperetl/master/images/architecture.png#gh-light-mode-only)
+![architecture](https://raw.githubusercontent.com/neuml/paperetl/master/images/architecture-dark.png#gh-dark-mode-only)
+
+paperetl supports the following sources:
 
 - File formats:
     - PDF
@@ -48,13 +53,17 @@ paperetl supports the following output options for storing articles:
 
 The easiest way to install is via pip and PyPI
 
-    pip install paperetl
+```
+pip install paperetl
+```
 
 Python 3.7+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.
 
 paperetl can also be installed directly from GitHub to access the latest, unreleased features.
 
-    pip install git+https://github.com/neuml/paperetl
+```
+pip install git+https://github.com/neuml/paperetl
+```
 
 ### Additional dependencies
 
@@ -70,7 +79,7 @@ A Dockerfile with commands to install paperetl, all dependencies and scripts is
 
 Clone this git repository and run the following to build and run the Docker image.
 
-```bash
+```
 docker build -t paperetl -f docker/Dockerfile .
 docker run --name paperetl --rm -it paperetl
 ```
@@ -83,8 +92,6 @@ This will bring up a paperetl command shell. Standard Docker commands can be use
 
 | Notebook  |  Description |
 |:----------|:-------------|
-| [CORD-19 Article Entry Dates](https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates) | Generates CORD-19 entry-dates.csv file |
-| [CORD-19 ETL](https://www.kaggle.com/davidmezzetti/cord-19-etl) | Builds an article.sqlite database for CORD-19 data |
 
 ### Load Articles into SQLite
 
@@ -94,27 +101,57 @@ The following example shows how to use paperetl to load a set of medical/scienti
 
 2. Build the database
 
-    ```bash
+    ```
     python -m paperetl.file paperetl/data paperetl/models paperetl/models
     ```
 
 Once complete, there will be an articles.sqlite file in paperetl/models
 
-### Load CORD-19 into SQLite
+### Load into Elasticsearch
+
+Elasticsearch is also a supported datastore as shown below. This example assumes Elasticsearch is running locally, change the URL to a remote server as appropriate.
+
+```
+python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models
+```
+
+Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
+
+### Convert articles to JSON/YAML
+
+paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.
+
+JSON:
+
+```
+python -m paperetl.file paperetl/data json://paperetl/json paperetl/models
+```
+
+YAML:
+
+```
+python -m paperetl.file paperetl/data yaml://paperetl/yaml paperetl/models
+```
+
+Converted files will be stored in paperetl/(json|yaml)
+
+### Load CORD-19
+
+_Note: The final version of CORD-19 was released on 2022-06-22. But this is still a large, valuable set of medical documents._
 
 The following example shows how to use paperetl to load the CORD-19 dataset into a SQLite database.
 
 1. Download and extract the dataset from [Allen Institute for AI CORD-19 Release Page](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html).
 
-    ```bash
+    ```
     scripts/getcord19.sh cord19/data
     ```
 
     The script above retrieves and unpacks the latest copy of CORD-19 into a directory named `cord19/data`. An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01) which defaults to the latest date.
 
 2. Generate entry-dates.csv for current version of the dataset
 
-    ```bash
+    ```
     python -m paperetl.cord19.entry cord19/data
     ```
 
@@ -123,36 +160,12 @@ The following example shows how to use paperetl to load the CORD-19 dataset into
 
 3. Build database
 
-    ```bash
+    ```
     python -m paperetl.cord19 cord19/data cord19/models
     ```
 
-Once complete, there will be an articles.sqlite file in cord19/models
-
-### Load into Elasticsearch
-
-Both of the examples above also support storing data in Elasticsearch with the following changes. These examples assume Elasticsearch is running locally, change the URL to a remote server as appropriate.
-
-Articles:
-
-    python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models
-
-CORD-19:
+    Once complete, there will be an articles.sqlite file in cord19/models. As with earlier examples, the data can also be loaded into Elasticsearch.
 
+    ```
     python -m paperetl.cord19 cord19/data http://localhost:9200
-
-Once complete, there will be an articles index in elasticsearch with the metadata and full text stored.
-
-### Convert articles to JSON/YAML
-
-paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.
-
-JSON:
-
-    python -m paperetl.file paperetl/data json://paperetl/json paperetl/models
-
-YAML:
-
-    python -m paperetl.file paperetl/data yaml://paperetl/yaml paperetl/models
-
-Converted files will be stored in paperetl/(json|yaml)
+    ```
diff --git a/images/architecture-dark.png b/images/architecture-dark.png